28th IEEE International Conference on Data Engineering - ICDE 2012

28th IEEE International Conference on Data Engineering (ICDE)

Washington, DC • April 1-5, 2012

Conference Program

COVER PHOTOS: Copyright © 2012 by Tasos Kementsietsidis

Table of Contents

Message from the ICDE 2012 Program Chairs and the General Chair

Conference Organization

Conference Venue

Program at a Glance

Session Contents

Keynotes

Seminars

Panels

Awards

Abstracts

Co-Located Workshops

Local Information

Message from the ICDE 2012 Program Chairs and the General Chair

Since 1984, ICDE has established itself as a premier forum in the area of data management, providing a unique opportunity for database researchers, users, practitioners, and developers to exchange new ideas. The 28th IEEE International Conference on Data Engineering takes place in the city of Washington, United States, from April 1 to 5, 2012. We are proud to present its program in these proceedings.

Each of the main days of the conference starts out with a keynote by a distinguished scientist: Serge Abiteboul from INRIA in France on April 2; Surajit Chaudhuri from Microsoft Research in the United States on April 3; and Peter Druschel from the Max Planck Institute for Software Systems in Germany on April 4.

We thank all the authors who submitted their work to ICDE for making the conference happen. We received 413 paper submissions for the research track, 22 submissions for the industrial track, and 68 demo proposals. The program committee was organized into fifteen topic-based tracks. Each track was headed by a vice-chair who formed a committee to evaluate the papers assigned to that track. This resulted in a program committee consisting of 188 members for the research tracks, 12 members for the industrial track, and 30 members for the demo track. The evaluation process consisted of three distinct phases: initial reviews of the papers by PC members with some initial discussions, author responses to these reviews, and then further discussion by the PC and fine-tuning of the reviews.

The research program features 100 papers, the industrial program 9 papers, and the demonstration program 28 demos. The conference program also includes 6 seminar tutorials and one panel. As a feature of ICDE conferences in recent years, all papers are presented at a poster session. Accompanying the main conference are seven workshops.

The success of ICDE 2012 is a result of collegial teamwork from many individuals who worked tirelessly to make the conference a success. We thank Nico Bruno and Ken Ross who served as Industrial Chairs; Christof Bornhoevd, Richard Goodwin, and Mirek Riedewald who served as Demo Chairs; Aryya Gangopadhyay who served as Seminar Chair; Michael Gertz and Alex Tuzhilin who served as Panel Chairs; Anupam Joshi and Sharad Mehrotra who served as Workshop Chairs; and also the organizers of the accompanying workshops. We also express our deep appreciation of the outstanding work put in over many months by the organization team: Nabil Adam, Alex Brodsky and Vijay Atluri served as general (vice-)chairs, Carlotta Domeniconi and Huzefa Rangwala were the Local Organization and Sponsorship Chairs, Hui Xiong served as Finance Chair, Soon Ae Chun served as Publicity Chair, Anastasios Kementsietsidis and Marcos Vaz Salles as Proceedings Chairs, and Micah Sherr as Web Chair. We thank Carmen Saliba and Alkenia Winston from the IEEE Computer Society’s Conference Support Services for helping secure the various necessary contracts in a timely manner, and Beth Grohnke of GMU’s Office of Event Management for helping with many local arrangement issues. The Best Paper Award Committee included Minos Garofalakis (chair), Anthony Tung, and Ugur Cetintemel. Without the contributions of all of these excellent conference officers, this conference would not have been a success. We are also thankful to the many student volunteers from George Mason University.

We also thank the Microsoft CMT Team and the IEEE Conference Publications Team for their assistance and quick replies to our multitude of requests.

We also gratefully acknowledge the financial support of our sponsors: Microsoft and the National Science Foundation as Platinum Sponsors, EMC and Greenplum as Gold Sponsors, HP and IBM Research as Silver Sponsors, and Google as a Bronze Sponsor.

Finally, we thank all the authors, presenters, and participants of the conference. We hope that all of you enjoy the conference!

ICDE 2012 PC Chairs 

Johannes Gehrke (Cornell University, USA) 

Beng Chin Ooi (National University of Singapore, Singapore) 

Evaggelia Pitoura (University of Ioannina, Greece) 

ICDE 2012 General Chair 

X. Sean Wang (Fudan University, China) 


Conference Organization

Organizing Committee

General Chairs

X. Sean Wang (Fudan University)

Nabil R. Adam (US DHS S&T, Rutgers University)

General Vice Chairs

Alex Brodsky (George Mason University)

Vijay Atluri (Rutgers University)

Program Chairs

Johannes Gehrke (Cornell University)

Beng Chin Ooi (National University of Singapore)

Evaggelia Pitoura (University of Ioannina)

Industrial Program Chairs

Nicolas Bruno (Microsoft Research)

Liang-Jie Zhang (IBM Research)

Kenneth Ross (Columbia University)

Seminar/Tutorial Chair

Aryya Gangopadhyay (UMBC)



Workshop Chairs

Sharad Mehrotra (Univ. of California, Irvine)

Anupam Joshi (UMBC)

Panel Chairs

Alex Tuzhilin (New York University)

Michael Gertz (University of Heidelberg)

Poster Chairs

Jaideep Vaidya (Rutgers University)

Zachary Ives (University of Pennsylvania)

Demo Chairs

Christof Bornhoevd (SAP Research)

Richard Goodwin (IBM)

Mirek Riedewald (Northeastern University)

Proceedings Chairs

Anastasios Kementsietsidis (IBM)

Marcos Vaz Salles (University of Copenhagen)

Local Organization Chairs and Sponsorship Chairs

Carlotta Domeniconi (George Mason University)

Huzefa Rangwala (George Mason University)

Finance Chair

Hui Xiong (Rutgers University)

Publicity Chair

Soon Ae Chun (City University of New York)

Web Chair

Micah Sherr (Georgetown University)

Program Committee

Program Committee Area Vice Chairs 

Cloud, data warehousing, and large data 

Volker Markl (TU Berlin, Germany)

Data Integration, metadata management, interoperability

Erhard Rahm (Univ. of Leipzig, Germany)


Data mining and knowledge discovery 

Anthony Tung (National University of Singapore, Singapore)

Distributed, peer-to-peer, grid, and mobile data management

Aoying Zhou (East China Normal University, China)

Indexing and storage

Lei Chen (University of Science and Technology, Hong Kong)

Privacy and security

Elena Ferrari (University of Insubria, Italy)

Query processing and query optimization

Kaushik Chakrabarti (Microsoft Research, USA)

Scientific data and data visualization

Zachary Ives (University of Pennsylvania, USA)

Semistructured data, XML

Ioana Manolescu (INRIA, France)

Social networks, web, and personal information management

Aris Gionis (Yahoo! Research, Spain)

Spatial, temporal, and multimedia data

Heng Tao Shen (University of Queensland, Australia)

Streams, sensor networks, and complex events processing

Ugur Cetintemel (Brown University, USA)

Systems, performance, and transaction management

Bettina Kemme (McGill University, Canada)

Text, graphs, and search

Venkatesh Ganti (Google)

Uncertain and probabilistic data

Minos Garofalakis (Technical University of Crete, Greece)

Research Program Committee Members 

Yanif Ahmad, Johns Hopkins University

Aris Anagnostopoulos, Sapienza University of Rome

Walid Aref, Purdue University

Ismail Ari, Ozyegin University

Soeren Auer, Leipzig School of Media



Shivnath Babu, Duke University 

Roger Barga, Microsoft

Zohra Bellahsene, University of Montpellier II

Elisa Bertino, Purdue University

Claudio Bettini, University of Milan

Michael Bohlen, University of Zurich

Paolo Boldi, University of Milan

Francesco Bonchi, Yahoo! Research

Peter Boncz, CWI

Angela Bonifati, ICAR-CNR, Italy

Vinayak Borkar, University of California, Irvine

Christof Bornhoevd, SAP

Randal Burns, Johns Hopkins University

Andrea Cali, University of Oxford

Selcuk Candan, Arizona State University

Barbara Carminati, University of Insubria, Italy

Deepayan Chakrabarti, Yahoo! Research

Chee Yong Chan, National University of Singapore

Badrish Chandramouli, Microsoft

Gang Chen, Zhejiang University, China

Shimin Chen, Intel Labs Pittsburgh

Su Chen, National University of Singapore

Yi Chen, Arizona State University

Reynold Cheng, University of Hong-Kong

Sarah Cohen-Boulakia, LRI Orsay

Gao Cong, Nanyang Technological University, Singapore

Stefan Conrad, University of Dortmund

Mariano Consens, University of Toronto

Graham Cormode, AT&T Research

Isabel Cruz, University of Illinois at Chicago

Bin Cui, Beijing University, China

Alfredo Cuzzocrea, ICAR-CNR and University of Calabria, Italy

Colazzo Dario, University Paris Sud

Gautam Das, University of Texas-Arlington

Anish Das Sarma, Google

Khuzaima Daudjee, University of Waterloo

Antonios Deligiannakis, Technical University of Crete

Stefan Dessloch, University of Kaiserslautern

Zhiming Ding, Institute of Software, Chinese Academy of Science

Jens Dittrich, Universitaet Saarland

Anhai Doan, University of Wisconsin

Eduard Dragut, Purdue University

Sameh Elnikety, Microsoft

Vuk Ercegovac, IBM Almaden

Wenfei Fan, University of Edinburgh

Alan Fekete, University of Sydney



Alvaro Fernandes, University of Manchester

Johann Christoph Freytag, University of Berlin

Avigdor Gal, Technion

Helena Galhardas, Instituto Superior Tecnico, Portugal

Tingjian Ge, University of Kentucky

Bugra Gedik, IBM

Floris Geerts, University of Edinburgh

Sreenivas Gollapudi, Microsoft Research

Le Gruenwald, University of Oklahoma

Torsten Grust, University of Tuebingen

Amarnath Gupta, San Diego Supercomputing Center

Peter Haas, IBM Almaden

Jeff Hammerbacher, Cloudera

Wook-Shin Han, Korean National University

Oktie Hassanzadeh, University of Toronto

Magnus Lie Hetland, NTNU, Norway

Vagelis Hristidis, Florida International University

Zi Huang, University of Queensland

Seung-won Hwang, POSTECH, Korea

Stratos Idreos, CWI

Yoshiharu Ishikawa, Nagoya University

Ryan Johnson, University of Toronto

Theodore Johnson, AT&T Research

Panos Kalnis, King Abdullah University of Science and Technology (KAUST)

Murat Kantarcioglu, University of Texas-Dallas

Panagiotis Karras, National University of Singapore

Alfons Kemper, TU Muenchen

Eamonn Keogh, University of California, Riverside

Christoph Koch, EPFL

George Kollios, Boston University

Nick Koudas, University of Toronto

Tim Kraska, University of California Berkeley

Wang-Chien Lee, Penn State University

Ulf Leser, Humboldt University Berlin

Jure Leskovec, Stanford

Guiping Li, Renmin University of China

Feifei Li, Florida State University

Guoliang Li, Tsinghua University

Ninghui Li, Purdue University

Xiang Lian, Hong Kong University of Science and Technology

Xueming Lin, University of South Wales

Kun Liu, Yahoo! Labs

Ling Liu, Georgia Tech

Eric Lo, Hong Kong Polytechnic University

Boon Thau Loo, University of Pennsylvania

Alexander Losup, TU Delft



Hua Lu, Aalborg University 

Bertram Ludaescher, University of California Davis 

Bradley Malin, Vanderbilt University

Nikos Mamoulis, The University of Hong Kong

Stefan Manegold, CWI

Sebastian Maneth, NICTA, Australia

Ioana Manolescu, Inria

Alexandra Meliou, University of Washington

Paolo Missier, Newcastle University

Mohamed F. Mokbel, University of Minnesota

Mirella Moro, Universidade Federal de Minas Gerais, Brazil

Vivek Narasayya, Microsoft Research

Thomas Neumann, Technical University Munich

Silvia Nittel, University of Maine

Dan Olteanu, Oxford University

Tamer Ozsu, University of Waterloo

Thanasis Papaioannou, EPFL

Marta Patino-Martinez, Technical University of Madrid

Glenn Paulley, Sybase

Dino Pedreschi, University of Pisa

Jian Pei, Simon Fraser University

Peter Pietzuch, Imperial College London

Neoklis Polyzotis, University of California Santa Cruz

Rachel Pottinger, UBC

Sunil Prabhakar, Purdue University

Weining Qian, East China Normal University

Christoph Quix, RWTH Aachen

Ravi Ramamurthy, Microsoft

Vijayshankar Raman, IBM

Vibhor Rastogi, Yahoo! Research

Indrakshi Ray, Colorado State University

Christopher Re, University of Wisconsin-Madison

Matthias Renz, Ludwig-Maximilians-University Munich

Marcos Vaz Salles, University of Copenhagen

Jagan Sankaranarayanan, NEC Labs America

Ralf Schenkel, Saarland University

Heiko Schuldt, University of Basel

Sudipta Sengupta, Microsoft Research

Jayavel Shanmugasundaram, Google

Jie Shao, University of Melbourne

Jialie Shen, Singapore Management University

Elaine Shi, UC Berkeley

Kyuseok Shim, Seoul National University

Pavel Shvaiko, Informatica Trentina

Claudio Silva, University of Utah

Mauro Sozio, Max Planck Institute for Computer Science, Germany


Divesh Srivastava, AT&T Research 

Jessica Staddon, Google 

S Sudarshan, IIT Bombay 

Torsten Suel, Polytechnic Institute of NYU

Kian-Lee Tan, National University of Singapore

Yufei Tao, Chinese University of Hong Kong

James Terwilliger, Microsoft

Evimaria Terzi, Boston University

Jens Teubner, ETH Zurich

Hannu Toivonen, University of Helsinki

Panayiotis Tsaparas, Microsoft Research

Antti Ukkonen, Yahoo! Research

Shivakumar Vaithyanathan, IBM Almaden

Vasilis Vassalos, Athens University of Economics and Business

Yannis Velegrakis, University of Trento

Quang Hieu Vu, EBTIC

Daisy Zhe Wang, University of Florida

Guoren Wang, Northeastern University of China

Haixun Wang, Microsoft Research

Jianyong Wang, Tsinghua University

Junhu Wang, Griffith University, Australia

Wei Wang, UNC

Kyu-Young Whang, KAIST

Andrew Witkowski, Oracle

Raymond Wong, Hong Kong University of Science and Technology

Sai Wu, National University of Singapore

Tianyi Wu, Microsoft

Xiaokui Xiao, Nanyang Technological University, Singapore

Dong Xin, Google

Jianliang Xu, Hong Kong Baptist University

Linhao Xu, IBM Research

Xifeng Yan, UCSB

Bin Yang, Max-Planck-Institut für Informatik

Jun Yang, Duke University

Linjun Yang, Microsoft Research Asia

Ke Yi, Hong-Kong University of Science and Technology

Ge Yu, Northeastern University, China

Hwanjo Yu, POSTECH

Carlo Zaniolo, UCLA

Dongxiang Zhang, National University of Singapore

Rui Zhang, University of Melbourne

Zhenjie Zhang, NUS

Minqi Zhou, East China Normal University

Xiangmin Zhou, CSIRO

Freida Zhu, Singapore Management University




Industrial Program Committee Members 

Bishwaranjan Bhattacharjee, IBM 

Philippe Bonnet, IT University of Copenhagen 

John Cieslewicz, Aster Data 

Amol Deshpande, University of Maryland

Cesar Galindo-Legaria, Microsoft

Leo Giakoumakis, Microsoft

Masaru Kitsuregawa, The University of Tokyo

Harumi Kuno, HP

Jun Rao, LinkedIn

Rajeev Rastogi, Yahoo!

Florian Waas, EMC

Mohammed Zait, Oracle

Demo Program Committee Members 

Sihem Amer-Yahia, Qatar Computing Research Institute

Arvind Arasu, Microsoft Research

Sunil Arvindam, SAP Research, India

Magdalena Balazinska, University of Washington

Fabio Casati, University of Trento, Italy

Malu Castellanos, HP Labs, USA

Mariano Cilia, Intel Corporation, Argentina

Brian F Cooper, Google

Adina Crainiceanu, US Naval Academy

Abhinandan Das, Google

Alin Dobra, University of Florida

Javier Garcia-Garcia, UNAM University, Mexico

Pablo Guerrero, TU Darmstadt, Germany

Melanie Herschel, Tubingen University

Christian Konig, Microsoft Research

Georgia Koutrika, IBM Almaden Research Center

Wolfgang Lehner, TU Dresden, Germany

Feifei Li, Florida State University

Ashwin Machanavajjhala, Yahoo Research

Thomas Neumann, TU Munchen

Dan Olteanu, University of Oxford

Carlos Ordonez, University of Houston

Peter Pietzuch, Imperial College London

Lin Qiao, IBM Almaden

Berthold Reinwald, IBM Almaden, USA

Vladislav Shkapenyuk, ATT Research

Adam Silberstein, Yahoo Research

Alkis Simitsis, HP Labs


Ioana R Stanoi, IBM Almaden

Ming-Chuan Wu, Microsoft, USA

External Reviewers 

Albert Angel

Pantelis Aravogliadis

Vassilis Athitsos

Evandrino Barros

Nicole Bidoit

Nicolas Bonvin

Daniele Braga

Lorenz Buehmann

Ruichu Cai

Xin Cao

Bogdan Cautis

Yi-Ling Chen

Songting Chen

Shiwen Cheng

Fei Chiang

Byron Choi

Juan Da Cruz Pinto

Maria Daltayanni

Mahashweta Das

David DeHaan

Bolin Ding

Marius Dumitru

Santiago Ezcurra

Ju Fan

Wei Feng

Chuancong Gao

Shen Ge

Haris Georgiadis

Christan Grant

Nitin Gupta

Yeye He

Arvid Heise

Haibo Hu

Heng Huang

Lili Jiang

Xin Jin

Alekh Jindal

Matti Järvisalo

Abhijith Kashyap

Asterios Katsifodimos

Batya Kenig

Arijit Khan

Julien Leblay

Jae-Gil Lee

Aurelien Lemay

Jianxin Li

Nan Li

Xingjie Liu

Shuai Ma

Vincenzo Maltese

Bruno Martins

Michael Mathioudakis

Manuel Mayr

Giansalvatore Mecca

Gengxin Miao

Pablo Michelis

Nabeel Mohamed

Miyuki Nakano

Akash Nanavati

Axel Ngonga

Anisoara Nica

Bart Niechweij

Tomasz Nykiel

Matteo Palmonari

Panagiotis Papadimitriou

Charalampos Papamanthou

Xu Pu

Jianzhong Qi

Hongda Ren

Astrid Rheinlaender

Daniele Riboni

Jan Rittinger

Senjuti Basu Roy

Eduardo Ruiz

Michael Rys

Tomer Sagi

Simonas Saltenis

Carlo Sartiani

Jörg Schad

Stefan Schuh

Pierre Senellart

Chih-Ya Shen

Reza Sherkat

Kelvin Sim

Guojie Son

Claus Stadler

Johannes Starlinger

Yizhou Sun

Andrej Taliun

Takayuki Tamura

Nan Tang

Saravanan Thirumuruganathan

Andreas Thor

Xinmei Tian

Masashi Toyoda

Frederico Ulliana

Jörg Unbehauen

Jiannan Wang

Gerhard Weikum

Zeyi Wen

Raymond Chi-Wing Wong

Yinghui Wu

Mao Ye

Peifeng Ying

Man Lung Yiu

Wenyuan Yu

Ning Zhang

Qijun Zhu

Bo Zong


Conference Venue

The conference will take place in the Renaissance Arlington Capital View Hotel located at 2800 South Potomac Avenue, Arlington, Virginia 22202 USA. If using a GPS navigator, you may try to search for the address 2899 Jefferson Davis Highway, Arlington, VA 22202 as an alternative address for locating the destination.



The nearest Metro station is Crystal City (Blue and Yellow Lines).

To take the Metro to the hotel, get off at the Crystal City Metro Station. A complimentary hotel shuttle runs to and from the Crystal City Metro station every 20 minutes between 7am and 11pm. (Call 1-703-413-1300 if there is a problem.)


Complimentary hotel shuttle to and from Reagan (DCA) airport every (20) thirty minutes between 5am-11pm. Pick up and drop off at Terminal A (hotel shuttle area) or Gates 5 and 9 on Level 1 of Terminal B & C.

[Map labels: Nation’s Capital (Washington, DC); ICDE Hotel; Historic Old Towne (Alexandria, VA)]



The conference will be held on the second floor of the hotel.

[Floor plan labels: Registration; Internet Room]

Program at a Glance 



8AM Breakfast each day (Prefunction Area)

Sunday, April 1

9AM - 5:30PM Workshops: DGSS (Studio F), SMDB (Studio B), STIR (Studio D), and DESWEB (Studio E)

10AM - 10:30AM Coffee Break

Noon - 2PM Lunch Break (on your own)

3:30PM - 4PM Coffee Break

5:30PM - 8PM Reception (Salon 4567)

Monday, April 2

9AM - 10AM Keynote 1 (Salon 4567): Serge Abiteboul

10AM - 10:30AM Coffee Break

10:30AM - Noon

Session 1 (Studio F): Privacy

Session 2 (Studio B): Web 2.0 Applications

Session 3 (Studio C): Storage Management

Session 4 (Studio D): Data Streams Processing

Seminar 1 (Salon 123)

Demo Group 1 (Studio E)

Noon - 2PM Business Lunch and Award Ceremony (Salon 4567)

2PM - 3:30PM

Session 5 (Studio F): Graphs

Session 6 (Studio B): Uncertain and Probabilistic Databases

Session 7 (Studio C): Data Integration and Extraction

Session 8 (Studio D): Spatio-Temporal Data Management

3:30PM - 4PM Coffee Break

4PM - 5:30PM

Query Processing

Session 10 (Studio B): Location Aware Data Processing

Session 11 (Studio C): Map-Reduce based Data Processing

Social Media

7:30PM - 9PM NSF ICDE 2012 Career Panel (Salon 123)

Tuesday, April 3

9AM - 10AM Keynote: Surajit Chaudhuri

10AM - 10:30AM Coffee Break

10:30AM - Noon

P2P and Distributed Processing

XML and RDF Data Management

Performance

Industrial Session 1 (Studio D): Support for Large Scale Data Analytics

Noon - 2PM Funders session with lunch (Salon 4567)

2PM - 3:30PM

Session 16 (Studio F): Data Extraction and Quality

Top-K Processing

Evolving Platforms for New Applications (Studio C)

Seminar 5 (Studio 123)

Panel (Studio D): The Future of Scientific Data Bases

3:30PM - 4PM Coffee Break

4PM - 5:30PM Posters (Salon 4567)

5:30PM Cruise and Banquet (Bus leaves hotel)

Wednesday, April 4

9AM - 10AM Keynote: Peter Druschel

10AM - 10:30AM Coffee Break

10:30AM - Noon

Similarity

Text and Strings

Query Processing II

Indexing, Updates and Processing (Studio D)

Noon - 2PM Lunch (Salon 4567)

2PM - 3:30PM

Data Mining

Scientific Data, Analysis and Visualization

Similarity Search and Detection

3:30PM - 4PM Coffee Break

4PM - 5:30PM

Sensors Network and Trajectory

Error Reduction and Data Security

Thursday, April 5

Workshops: DMC (Studio B), GDM (Studio D), and SDMSM (Studio F)

10AM - 10:30AM Coffee Break

Noon - 2PM Lunch Break

3:30PM - 4PM Coffee Break

Session Contents 

Sunday, April 1

8AM - 9AM Breakfast (Prefunction)

9AM - 5:30PM Workshops

Studio F: Data-Driven Decision Guidance and Support Systems (DGSS)

Studio B: Self-Managing Database Systems (SMDB)

Studio D: Spatio Temporal Data Integration and Retrieval (STIR)

Studio E: Data Engineering Meets the Semantic Web (DESWEB)

5:30PM - 8PM Conference Reception (Salon 4567)

Monday, April 2


9AM - 10AM Keynote 1 (Salon 4567): Serge Abiteboul — Viewing 

the Web as a Distributed Knowledge Base 

Session Chair: Evaggelia Pitoura 



10AM - 10:30AM Coffee break 

10:30AM - Noon Sessions 1-4, Seminar 1, Demo Group 1 


Session 1: Privacy (Studio F) 

Session Chair: Murat Kantarcioglu 

Privacy in Social Networks: How Risky is Your Social Graph? 

Cuneyt Gurcan Akcora (University of Insubria)

Barbara Carminati (University of Insubria)

Elena Ferrari (University of Insubria)

Differentially Private Spatial Decompositions

Graham Cormode (AT&T Labs – Research)

Cecilia Procopiuc (AT&T Labs – Research)

Entong Shen (North Carolina State University)

Divesh Srivastava (AT&T Labs – Research)

Ting Yu (North Carolina State University)

Differentially Private Histogram Publication

Jia Xu (Northeastern University, China)

Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

Xiaokui Xiao (Nanyang Technological University)

Yin Yang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

Ge Yu (Northeastern University, China)

Privacy-Preserving and Content-Protecting Location Based Queries

Russell Paulet (Victoria University)

Md. Golam Kaosar (Victoria University)

Xun Yi (Victoria University)

Elisa Bertino (Purdue University)

Session 2: Web 2.0 Applications (Studio B) 

Session Chair: Kyuseok Shim 

GeoFeed: A Location-Aware News Feed 

Jie Bao (University of Minnesota at Twin Cities)

Mohamed F. Mokbel (University of Minnesota at Twin Cities)

Chi-Yin Chow (City University of Hong Kong)

Entity Search Strategies for Mashup Applications

Stefan Endrullis (University of Leipzig)

Andreas Thor (University of Leipzig)

Erhard Rahm (University of Leipzig)

CI-Rank: Ranking Keyword Search Results Based on Collective Importance

Xiaohui Yu (York University & Shandong University)

Huxia Shi (York University)

Temporal Analytics on Big Data for Web Advertising

Badrish Chandramouli (Microsoft Research)

Jonathan Goldstein (Microsoft Corporation)

Songyun Duan (IBM T. J. Watson Research Center)

Session 3: Storage Management (Studio C) 

Session Chair: Alfons Kemper 

Lookup Tables: Fine-Grained Partitioning for Distributed Databases

Aubrey L. Tatarowicz (MIT)

Carlo Curino (MIT)

Evan P. C. Jones (MIT)

Sam Madden (MIT)

Temporal Support for Persistent Stored Modules

Richard T. Snodgrass (University of Arizona)

Dengfeng Gao (IBM Silicon Valley Lab)

Rui Zhang (University of Arizona)

Stephen W. Thomas (Queen’s University, Kingston)

Energy Efficient Storage Management Cooperated with Large Data Intensive Applications

Norifumi Nishikawa (The University of Tokyo)

Miyuki Nakano (The University of Tokyo)

Masaru Kitsuregawa (The University of Tokyo)

ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression

Eric R. Schendel (North Carolina State University)

Ye Jin (North Carolina State University)

Neil Shah (North Carolina State University)

Jackie Chen (Sandia National Laboratory)

C.S. Chang (Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA)

Seung-Hoe Ku (New York University)

Stephane Ethier (Princeton Plasma Physics Laboratory)

Scott Klasky (Oak Ridge National Laboratory)

Robert Latham (Argonne National Laboratory)

Robert Ross (Argonne National Laboratory)

Nagiza F. Samatova (North Carolina State University & Oak Ridge National Laboratory)


Session 4: Data Streams Processing (Studio D) 

Session Chair: Bugra Gedik 

Physically Independent Stream Merging 

Badrish Chandramouli (Microsoft Research)

David Maier (Portland State University)

Jonathan Goldstein (Microsoft Corporation)

On Computing Correlated Aggregates over a Data Stream

Srikanta Tirthapura (Iowa State University)

David P. Woodruff (IBM Almaden Research Center)

Accuracy-Aware Uncertain Stream Databases

Tingjian Ge (University of Kentucky)

Fujun Liu (University of Kentucky)

On Discovery of Traveling Companions from Streaming Trajectories

Lu-An Tang (UIUC)

Yu Zheng (MSRA)

Jing Yuan (MSRA)

Jiawei Han (UIUC)

Alice Leung (BBN)

Chih-Chieh Hung (Yahoo!)

Wen-Chih Peng (NCTU)


Seminar 1: Data Management Issues on the Semantic Web (Salon 123)

Oktie Hassanzadeh (University of Toronto & IBM Research)

Anastasios Kementsietsidis (IBM Research)

Yannis Velegrakis (University of Trento)


Demo Group 1 (Studio E)

SMIX Live – A Self-Managing Index Infrastructure for Dynamic Workloads

Thomas Kissinger (Dresden University of Technology)

Hannes Voigt (Dresden University of Technology)

Wolfgang Lehner (Dresden University of Technology)

Multi-Query Stream Processing on FPGAs

Mohammad Sadoghi (University of Toronto)

Rohan Palaniappan (University of Toronto)

Rija Javed (University of Toronto)

Naif Tarafdar (University of Toronto)

Harsh Singh (University of Toronto)

Hans-Arno Jacobsen (University of Toronto)

EUDEMON: A System for Online Video Frame Copy Detection by Earth Mover Distance

Jia Xu (Northeastern University, China)

Qiushi Bai (Northeastern University, China)

Yu Gu (Northeastern University, China)

Anthony Tung (National University of Singapore)

Guoren Wang (Northeastern University, China)

Ge Yu (Northeastern University, China)

Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

A Dataset Search Engine for the Research Document Corpus

Meiyu Lu (National Univ. of Singapore)

Srinivas Bangalore (AT&T Research Labs)

Graham Cormode (AT&T Labs – Research)

Marios Hadjieleftheriou (AT&T Labs – Research)

AskFuzzy: Attractive Visual Fuzzy Query Builder

Keivan Kianmehr (University of Western Ontario)

Negar Koochakzadeh (University of Calgary)

Reda Alhajj (University of Calgary)

F2DB: The Flash-Forward Database System

Ulrike Fischer (Dresden University of Technology)

Frank Rosenthal (Dresden University of Technology)

Wolfgang Lehner (Dresden University of Technology)

Provenance-Based Debugging and Drill-Down in Data-Oriented Workflows

Robert Ikeda (Stanford University)

Junsang Cho (Stanford University)

Charlie Fang (Stanford University)

Semih Salihoglu (Stanford University)

Satoshi Torikai (Stanford University)

Jennifer Widom (Stanford University)

Noon – 2PM Business Lunch & Award Ceremony (Salon 4567) 



2PM - 3:30PM Sessions 5-8, Seminar 2, Demo Group 2 


Session 5: Graphs (Studio F) 

Session Chair: Sameh Elnikety 

Iterative Graph Feature Mining for Graph Indexing 

Dayu Yuan (Penn State University)

Prasenjit Mitra (Penn State University)

Huiwen Yu (Penn State University)

C. Lee Giles (Penn State University)

An Efficient Graph Indexing Method

Xiaoli Wang (National University of Singapore)

Xiaofeng Ding (Huazhong University of Science and Technology)

Anthony K.H. Tung (National University of Singapore)

Shanshan Ying (National University of Singapore)

Hai Jin (Huazhong University of Science and Technology)

PRAGUE: Towards Blending Practical Visual Subgraph Query Formulation and Query Processing

Changjiu Jin (Nanyang Technological University)

Sourav S Bhowmick (Nanyang Technological Univ)

Byron Choi (Hong Kong Baptist University)

Shuigeng Zhou (Fudan University)

Ego-centric Graph Pattern Census

Walaa Eldin Moustafa (University of Maryland, College Park)

Amol Deshpande (University of Maryland, College Park)

Lise Getoor (University of Maryland, College Park)

Session 6: Uncertain and Probabilistic Databases (Studio B)

Session Chair: Elena Ferrari

Searching Uncertain Data Represented by Non-Axis Parallel Gaussian Mixture Models

Katrin Haegler (University of Munich)

Frank Fiedler (University of Munich)

Christian Boehm (University of Munich)

Aggregate Query Answering on Possibilistic Data with Cardinality Constraints

Graham Cormode (AT&T Labs – Research)

Entong Shen (North Carolina State University)

Ting Yu (North Carolina State University)

Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data

Yongxin Tong (Hong Kong University of Science and Engineering)

Lei Chen (Hong Kong University of Science and Engineering)

Bolin Ding (University of Illinois at Urbana-Champaign)

Ranking Query Results in Probabilistic Databases: Complexity and Efficient Algorithms

Dan Olteanu (University of Oxford)

Hongkai Wen (University of Oxford)

Session 7: Data Integration and Extraction (Studio C) 

Session Chair: Daisy Zhe Wang 

Joint Entity Resolution 

Steven Whang (Stanford University)

Hector Garcia-Molina (Stanford University)

A Self-Configuring Schema Matching System

Eric Peukert (SAP Research Dresden)

Julian Eberius (Dresden University of Technology)

Erhard Rahm (University of Leipzig)

Incremental Detection of Inconsistencies in Distributed Data

Wenfei Fan (University of Edinburgh)

Jianzhong Li (Harbin Institute of Technology)

Nan Tang (University of Edinburgh & Qatar Computing Research Institute)

Wenyuan Yu (University of Edinburgh)

Recomputing Materialized Instances after Changes to Mappings and Data

Todd J. Green (University of California, Davis)

Zachary G. Ives (University of Pennsylvania)

Session 8: Spatio-Temporal Data Management (Studio D)

Session Chair: Lei Chen

SWST: A Disk Based Index for Sliding Window Spatio-Temporal Data

Manish Singh (University of Michigan, Ann Arbor)

Qiang Zhu (University of Michigan, Dearborn)

H.V. Jagadish (University of Michigan, Ann Arbor)

Querying Uncertain Spatio-Temporal Data 

Tobias Emrich (Ludwig-Maximilians-Universität München)

Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)

Nikos Mamoulis (University of Hong Kong)

Matthias Renz (Ludwig-Maximilians-Universität München)

Andreas Züfle (Ludwig-Maximilians-Universität München)

The Min-dist Location Selection Query

Jianzhong Qi (University of Melbourne)

Rui Zhang (University of Melbourne)

Lars Kulik (University of Melbourne)

Dan Lin (Missouri University of Science and Technology)

Yuan Xue (University of Melbourne)

Bi-level Locality Sensitive Hashing for K-Nearest Neighbor Computation

Jia Pan (UNC Chapel Hill)

Dinesh Manocha (UNC Chapel Hill)


Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data

Emmanuel Müller (Karlsruhe Institute of Technology)

Stephan Günnemann (RWTH Aachen University)

Ines Färber (RWTH Aachen University)

Thomas Seidl (RWTH Aachen University)


M³: Stream Processing on Main-Memory MapReduce

Ahmed M. Aly (Purdue University)

Asmaa Sallam (Purdue University)

Bala M. Gnanasekaran (Purdue University)

Long-Van Nguyen-Dinh (Purdue University)

Walid G. Aref (Purdue University)

Mourad Ouzzani (Qatar Computing Research Institute)

Arif Ghafoor (Purdue University)

A Deep Embedding of Queries into Ruby

Torsten Grust (University of Tübingen)

Manuel Mayr (University of Tübingen)

3:30PM - 4PM Coffee Break 


Asking the Right Questions in Crowd Data Sourcing 

Rubi Boim (Tel-Aviv University)

Ohad Greenshpan (Tel-Aviv University)

Tova Milo (Tel-Aviv University)

Slava Novgorodov (Tel-Aviv University)

Neoklis Polyzotis (University of California, Santa Cruz)

Wang-Chiew Tan (University of California, Santa Cruz)

LotusX: A Position-Aware XML Graphical Search System with Auto-Completion

Chunbin Lin (Renmin University of China)

Jiaheng Lu (Renmin University of China)

Tok Wang Ling (National University of Singapore)

Bogdan Cautis (Télécom ParisTech)

Efficient Top-k Keyword Search in Graphs with Polynomial Delay

Mehdi Kargar (York University)

Aijun An (York University)

TEDAS: a Twitter Based Event Detection and Analysis System

Rui Li (University of Illinois at Urbana-Champaign)

Kin Hou Lei (Brigham Young University)

Ravi Khadiwala (University of Illinois at Urbana-Champaign)

Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign)

AutoDict: Automated Dictionary Discovery

Fei Chiang (University of Toronto)

Periklis Andritsos (University of Toronto)

Erkang Zhu (University of Toronto)

Renee Miller (University of Toronto)

4PM - 5:30PM Sessions 9-12, Seminar 3, Demo Group 3 

Session 9: Query Processing (Studio F) 

Session Chair: Walid G. Aref 

Learning-based Query Performance Modeling and Prediction

Mert Akdere (Brown University)

Ugur Cetintemel (Brown University)

Matteo Riondato (Brown University)

Eli Upfal (Brown University)

Stanley B. Zdonik (Brown University)

Parametric Plan Caching Using Density-Based Clustering 

Gunes Aluc (University of Waterloo)

David E. DeHaan (Sybase, an SAP Company)

Ivan T. Bowman (Sybase, an SAP Company)

Effective and Robust Pruning for Top-Down Join Enumeration Algorithms

Pit Fender (Mannheim University)

Guido Moerkotte (Mannheim University)

Thomas Neumann (Technical University of Munich)

Viktor Leis (Technical University of Munich)

Towards Preference-aware Relational Databases

Anastasios Arvanitis (National Technical University of Athens)

Georgia Koutrika (IBM Almaden Research Center)

Session 10: Location Aware Data Processing (Studio B)

Session Chair: Oktie Hassanzadeh

A Foundation for Efficient Indoor Distance-Aware Query Processing

Hua Lu (Aalborg University)

Xin Cao (Nanyang Technological University)

Christian S. Jensen (Aarhus University)

LARS: A Location-Aware Recommender System

Justin J. Levandoski (Microsoft Research)

Mohamed Sarwat (University of Minnesota)

Ahmed Eldawy (University of Minnesota)

Mohamed F. Mokbel (University of Minnesota)

Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme

Miao Qiao (The Chinese University of Hong Kong)

Hong Cheng (The Chinese University of Hong Kong)

Lijun Chang (The Chinese University of Hong Kong)

Jeffrey Xu Yu (The Chinese University of Hong Kong)

Desks: Direction-Aware Spatial Keyword Search

Guoliang Li (Tsinghua University)

Jianhua Feng (Tsinghua University)

Jing Xu (Tsinghua University)


Session 11: Map-Reduce based Data Processing (Studio C)

Session Chair: Minqi Zhou

Extending Map-Reduce for Efficient Predicate-Based Sampling

Raman Grover (University of California, Irvine)

Michael Carey (University of California, Irvine)

Fuzzy Joins Using MapReduce

Foto Afrati (National Technical University of Athens)

Anish Das Sarma (Google, Inc. - work initiated at Yahoo! Research)

David Menestrina (Google, Inc.)

Aditya Parameswaran (Stanford University)

Jeffrey D. Ullman (Stanford University)

Parallel Top-K Similarity Join Algorithms Using MapReduce

Younghoon Kim (Seoul National University)

Kyuseok Shim (Seoul National University)

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Benjamin Gufler (Technische Universität München)

Nikolaus Augsten (Free University of Bolzano-Bozen)

Angelika Reiser (Technische Universität München)

Alfons Kemper (Technische Universität München)

Session 12: Social Media (Studio D) 

Session Chair: Zack Ives 

Community Detection with Edge Content in Social Media Networks

Guo-Jun Qi (University of Illinois at Urbana-Champaign)

Charu C. Aggarwal (IBM T. J. Watson Research Center)

Thomas S. Huang (University of Illinois at Urbana-Champaign)

Cross Domain Search by Exploiting Wikipedia

Chen Liu (National University of Singapore)

Sai Wu (National University of Singapore)

Shouxu Jiang (Harbin Institute of Technology)

Anthony K.H. Tung (National University of Singapore)

Provenance-based Indexing Support in Micro-blog Platforms

Junjie Yao (Peking University)

Bin Cui (Peking University)

Zijun Xue (Peking University)

Qingyun Liu (Peking University)

Learning Stochastic Models of Information Flow 

Luke Dickens (Imperial College London)

Ian Molloy (IBM T. J. Watson Research Center)

Jorge Lobo (IBM T. J. Watson Research Center)

Pau-Chen Cheng (IBM T. J. Watson Research Center)

Alessandra Russo (Imperial College London)


Detecting Clones, Copying and Reuse on the Web

Xin Luna Dong (AT&T Labs–Research)

Divesh Srivastava (AT&T Labs–Research)


Trust & Share: Trusted Information Sharing in Online Social Networks

Barbara Carminati (University of Insubria)

Elena Ferrari (University of Insubria)

Jacopo Girardi (University of Insubria)

Evaluation of Clusterings – Metrics and Visual Support

Elke Achtert (Ludwig-Maximilians-Universität München)

Sascha Goldhofer (Ludwig-Maximilians-Universität München)

Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)

Erich Schubert (Ludwig-Maximilians-Universität München)

Arthur Zimek (Ludwig-Maximilians-Universität München)

Horton: Online Query Execution Engine For Large Distributed Graphs

Mohamed Sarwat (University of Minnesota)

Sameh Elnikety (Microsoft Research)

Yuxiong He (Microsoft Research)

Gabriel Kliot (Microsoft Research)

MXQuery With Hardware Acceleration

Jens Teubner (ETH Zurich)

Peter Fischer (University of Freiburg)

Data³ – A Kinect Interface for OLAP using Complex Event Processing

Steffen Hirte (Ilmenau University of Technology)

Andreas Seifert (Ilmenau University of Technology)

Stephan Baumann (Ilmenau University of Technology)

Daniel Klan (Ilmenau University of Technology)

Kai-Uwe Sattler (Ilmenau University of Technology)


Analyzing Query Optimization Process: Portraits of Join 


Anisoara Nica (Sybase, an SAP Company)

Ian Charlesworth (University of Waterloo)

Maysum Panju (University of Waterloo)

DPCube: Releasing Differentially Private Data Cubes for Health Information

Yonghui Xiao (Emory University)

James Gardner (Digital Reasoning Systems Inc.)

Li Xiong (Emory University)

7:30PM - 9PM NSF ICDE 2012 Career Panel (Salon 123) 

Panel Moderator: Philip Bernstein (Microsoft Research) 

Panelists: Alexandros Labrinidis (CS, UPitt), James M. Kang (NGA), Srinivasan Parthasarathy (CS, OSU), and Yuanyuan Tian (IBM Research)

TUESDAY, APRIL 3


9AM - 10AM Keynote 2 (Salon 4567): Surajit Chaudhuri — How 

Different Is Big Data? 

Session Chair: Beng Chin Ooi 

10AM - 10:30AM Coffee Break 

10:30AM - Noon Sessions 13-15, Industrial Session 1, Seminar 4, 

Demo Group 4 

Session 13: P2P and Distributed Processing (Studio F)

Session Chair: Guoliang Li

BestPeer++: A Peer-to-Peer based Large-scale Data Processing

Gang Chen (NetEase.com Inc. & Zhejiang University)

Tianlei Hu (NetEase.com Inc. & Zhejiang University)

Dawei Jiang (National University of Singapore)

Peng Lu (National University of Singapore)

Kian-Lee Tan (National University of Singapore)

Hoang Tam Vo (National University of Singapore)

Sai Wu (BestPeer Pte. Ltd. & National University of Singapore)

Effective Data Density Estimation in Ring-based P2P Networks

Minqi Zhou (East China Normal University)

Heng Tao Shen (The University of Queensland)

Xiaofang Zhou (The University of Queensland)

Weining Qian (East China Normal University)

Aoying Zhou (East China Normal University)

Processing of Rank Joins in Highly Distributed Systems

Christos Doulkeridis (Norwegian University of Science and Technology (NTNU))

Akrivi Vlachou (Norwegian University of Science and Technology (NTNU))

Kjetil Nørvåg (Norwegian University of Science and Technology (NTNU))

Yannis Kotidis (Athens University of Economics and Business (AUEB))

Neoklis Polyzotis (UC Santa Cruz (UCSC))

Load Balancing for MapReduce-based Entity Resolution

Lars Kolb (University of Leipzig)

Andreas Thor (University of Leipzig)

Erhard Rahm (University of Leipzig)

Session 14: XML and RDF Data Management (Studio B)

Session Chair: Dan Olteanu

Mapping XML to a Wide Sparse Table

Liang Jeff Chen (UCSD)

Philip A. Bernstein (Microsoft Corp.)

Peter Carlin (Microsoft Corp.)

Dimitrije Filipovic (Microsoft Corp.)

Michael Rys (Microsoft Corp.)

Nikita Shamgunov (Facebook Inc.)

James F. Terwilliger (Microsoft Corp.)

Milos Todic (Microsoft Corp.)

Sasa Tomasevic (Microsoft Corp.)

Dragan Tomic (Microsoft Corp.)

Querying XML Data: As You Shape It

Curtis E. Dyreson (Utah State University)

Sourav S. Bhowmick (Nanyang Technological University)

Branch Code: A Labeling Scheme for Efficient Query Answering on Trees

Yanghua Xiao (Fudan University)

Ji Hong (Fudan University)

Wanyun Cui (Fudan University)

Zhenying He (Fudan University)

Wei Wang (Fudan University)

Guodong Feng (Fudan University)

Scalable Multi-Query Optimization for SPARQL

Wangchao Le (University of Utah)

Anastasios Kementsietsidis (IBM T. J. Watson Research Center)

Songyun Duan (IBM T. J. Watson Research Center)

Feifei Li (University of Utah)

Session 15: Performance (Studio C) 

Session Chair: Eric Lo 

GSLPI: a Cost-based Query Progress Indicator 

Jiexing Li (University of Wisconsin-Madison)

Rimma V. Nehme (Microsoft Jim Gray Systems Lab)

Jeffrey Naughton (University of Wisconsin-Madison)

Micro-Specialization in DBMSes

Rui Zhang (University of Arizona)

Richard T. Snodgrass (University of Arizona)

Saumya Debray (University of Arizona)

Towards Multi-Tenant Performance SLOs

Willis Lang (University of Wisconsin-Madison)

Srinath Shankar (Microsoft Jim Gray Systems Lab)

Jignesh M. Patel (University of Wisconsin-Madison)

Ajay Kalhan (Microsoft Corp.)

Multi-Version Concurrency via Timestamp Range Conflict Management

David Lomet (Microsoft Research)

Alan Fekete (University of Sydney)

Rui Wang (Microsoft Research)

Peter Ward (University of Sydney)

Industrial Session 1: Support for Large Scale Data Analytics (Studio D)

Session Chair: Arbee L.P. Chen

Exploiting Common Subexpressions for Cloud Query Processing

Yasin N. Silva (Arizona State University)

Per-Ake Larson (Microsoft Research)

Jingren Zhou (Microsoft Corp.)

Vectorwise: a Vectorized Analytical DBMS 

Marcin Zukowski (Actian Netherlands)

Mark van de Wiel (Actian Corp.)

Peter Boncz (CWI)

Scalable and Numerically Stable Descriptive Statistics in SystemML

Yuanyuan Tian (IBM Almaden Research Center)

Shirish Tatikonda (IBM Almaden Research Center)

Berthold Reinwald (IBM Almaden Research Center)


Mining Knowledge from Data: An Information Network Analysis Approach

Jiawei Han (University of Illinois at Urbana-Champaign)

Yizhou Sun (University of Illinois at Urbana-Champaign)

Xifeng Yan (University of California at Santa Barbara)

Philip S. Yu (University of Illinois at Chicago)


Nyaya: a System Supporting the Uniform Management of 

Large Sets of Semantic Data 

Roberto De Virgilio (Università Roma Tre)

Giorgio Orsi (University of Oxford)

Letizia Tanca (Politecnico di Milano)

Riccardo Torlone (Università Roma Tre)

R2DB: A System for Querying and Visualizing Weighted RDF Graphs

Songling Liu (Arizona State University)

Juan Cedeno (Arizona State University)

Selcuk Candan (Arizona State University)

Maria Luisa Sapino (University of Turin)

Shengyu Huang (Arizona State University)

Xinsheng Li (Arizona State University)

Project Daytona: Data Analytics as a Cloud Service

Roger Barga (Microsoft)

Jaliya Ekanayake (Microsoft Research)

Wei Lu (Microsoft Research)

Interactive User Feedback in Ontology Matching Using Signature Vector

Isabel Cruz (University of Illinois at Chicago)

Cosmin Stroe (University of Illinois at Chicago)

Matteo Palmonari (University of Milano-Bicocca)

DObjects+: Enabling Privacy-Preserving Data Federation Services

Pawel Jurczyk (Google Inc.)

Li Xiong (Emory University)

Slawomir Goryczka (Emory University)


Dragoon: An Information Accountability System for High-Performance Databases

Kyriacos Pavlou (University of Arizona)

Richard Snodgrass (University of Arizona)

Intuitive Interaction With Encrypted Query Execution in DataStorm

Ken Smith (MITRE)

Ameet Kini (MITRE)

William Wang (MITRE)

Chris Wolf (MITRE)

M. David Allen (MITRE)

Andrew Sillers (MITRE)

Noon - 2PM Funders Session and Lunch (Salon 4567) 

Panel Organizer: Frank Olken (Consultant) 

Panelists: Le Gruenwald (National Science Foundation), 

Ceren Sust (Department of Energy), and Olga Brazhnik 

(National Institutes of Health) 

2PM - 3:30PM Sessions 16-17, Industrial Session 2, Seminar 5, Panel, 

Demo Group 1 

Session 16: Data Extraction and Quality (Studio F) 

Session Chair: Anish Das Sarma 

Automatic Extraction of Structured Web Data with Domain Knowledge

Nora Derouiche (Télécom ParisTech – CNRS LTCI)

Bogdan Cautis (Télécom ParisTech – CNRS LTCI)

Talel Abdessalem (Télécom ParisTech – CNRS LTCI)

Discovering Conservation Rules

Lukasz Golab (University of Waterloo)

Howard Karloff (AT&T Labs–Research)

Flip Korn (AT&T Labs–Research)

Barna Saha (AT&T Labs–Research)

Divesh Srivastava (AT&T Labs–Research)

Answering Why-not Questions on Top-k Queries

Zhian He (Hong Kong Polytechnic University)

Eric Lo (Hong Kong Polytechnic University)

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

Dong Deng (Tsinghua University)

Guoliang Li (Tsinghua University)

Jianhua Feng (Tsinghua University)

Session 17: Top-K Processing (Studio B)

Session Chair: Tingjian Ge

On Top-k Structural Similarity Search

Pei Lee (University of British Columbia)

Laks V.S. Lakshmanan (University of British Columbia)

Jeffrey Xu Yu (Chinese University of Hong Kong)

Relevance Matters: Capitalizing on Less (Top-k Matching in Publish/Subscribe)

Mohammad Sadoghi (University of Toronto)

Hans-Arno Jacobsen (University of Toronto)

Efficiently Monitoring Top-k Pairs over Sliding Windows

Zhitao Shen (UNSW)

Muhammad Aamir Cheema (UNSW)

Xuemin Lin (UNSW & ECNU)

Wenjie Zhang (UNSW)

Haixun Wang (Microsoft Research Asia)

Processing and Notifying Range Top-k Subscriptions

Albert Yu (Duke University)

Pankaj K. Agarwal (Duke University)

Jun Yang (Duke University)

Industrial Session 2: Evolving Platforms for New Applications (Studio C)

Session Chair: Rui Zhang

Earlybird: Real-Time Search at Twitter

Michael Busch (Twitter)

Krishna Gade (Twitter)

Brian Larson (Twitter)

Patrick Lok (Twitter)

Samuel Luckenbill (Twitter)

Jimmy Lin (Twitter)

Data Infrastructure at LinkedIn

LinkedIn Data Infrastructure Team

The Credit Suisse Meta-data Warehouse

Claudio Jossen (Credit Suisse AG)

Lukas Blunschi (ETH Zurich)

Magdalini Mori (Credit Suisse AG)

Donald Kossmann (ETH Zurich)

Kurt Stockinger (Credit Suisse AG)


Panel: The Future of Scientific Data Bases (Studio D) 

Panel Moderator: Michael Stonebraker (MIT) 

Panelists: Anastasia Ailamaki (EPFL), Jeremy Kepner 

(MIT), and Alex Szalay (Johns Hopkins University) 



Emerging Graph Queries In Linked Data 

Arijit Khan (University of California, Santa Barbara)

Yinghui Wu (University of California, Santa Barbara)

Xifeng Yan (University of California, Santa Barbara)


See “Demo Group 1” listing above

4PM - 5:30PM Poster Session, all papers (Salon 4567) 

5:30PM Departure for cruise and conference banquet 

WEDNESDAY, APRIL 4


9AM - 10AM Keynote 3 (Salon 4567): Peter Druschel — 

Accountability and Trust in Cooperative 

Information Systems 

Session Chair: Johannes Gehrke 

10AM - 10:30AM Coffee Break 

10:30AM - Noon Sessions 18-20, Industrial Session 3, Seminar 6, 

Demo Group 2 

Session 18: Similarity (Studio F) 

Session Chair: Matthias Renz 

Efficient Exact Similarity Searches using Multiple Token Orderings

Jongik Kim (Chonbuk National University)

Hongrae Lee (Google Inc.)

Efficient Graph Similarity Joins with Edit Distance Constraints

Xiang Zhao (The University of New South Wales & NICTA)

Chuan Xiao (The University of New South Wales)

Xuemin Lin (The University of New South Wales & East China Normal University)

Wei Wang (The University of New South Wales)

Parameter-Free Determination of Distance Thresholds for Metric Distance Constraints

Shaoxu Song (Tsinghua University)

Lei Chen (The Hong Kong University of Science and Technology)

Hong Cheng (The Chinese University of Hong Kong)

Random Error Reduction in Similarity Search on Time Series: A Statistical Approach

Wush Chi-Hsuan Wu (Academia Sinica)

Mi-Yen Yeh (Academia Sinica)

Jian Pei (Simon Fraser University)

Session 19: Text and Strings (Studio B) 

Session Chair: Feifei Li 

Optimizing Statistical Information Extraction Programs Over Evolving Text

Fei Chen (HP Labs China)

Xixuan Feng (University of Wisconsin-Madison)

Christopher Re (University of Wisconsin-Madison)

Min Wang (HP Labs China)

Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach

Chong Sun (University of Wisconsin-Madison)

Jeffrey F. Naughton (University of Wisconsin-Madison)

Siddharth Barman (University of Wisconsin-Madison)

On Text Clustering with Side Information

Charu C. Aggarwal (IBM T. J. Watson Research Center)

Yuchen Zhao (University of Illinois at Chicago)

Philip S. Yu (University of Illinois at Chicago)


Fast SLCA and ELCA Computation for XML Keyword Queries based on Set Intersection

Junfeng Zhou (Yanshan University)

Zhifeng Bao (National University of Singapore)

Wei Wang (The University of New South Wales)

Tok Wang Ling (National University of Singapore)

Ziyang Chen (Yanshan University)

Xudong Lin (Yanshan University)

Jingfeng Guo (Yanshan University)

Session 20: Query Processing II (Studio C) 

Session Chair: Volker Markl 

Optimization of Massive Pattern Queries by Dynamic Configuration Morphing

Nikolay Laptev (University of California, Los Angeles)

Carlo Zaniolo (University of California, Los Angeles)

Three-level Processing of Multiple Aggregate Continuous Queries

Shenoda Guirguis (University of Pittsburgh)

Mohamed A. Sharaf (The University of Queensland)

Panos K. Chrysanthis (University of Pittsburgh)

Alexandros Labrinidis (University of Pittsburgh)

Accelerating Range Queries For Brain Simulations

Farhan Tauheed (EPFL)

Laurynas Biveinis (Aalborg University)

Thomas Heinis (EPFL)

Felix Schürmann (EPFL)

Henry Markram (EPFL)

Anastasia Ailamaki (EPFL)

Keyword Query Reformulation on Structured Data

Junjie Yao (Peking University)

Bin Cui (Peking University)

Liansheng Hua (Peking University)

Yuxin Huang (Peking University)

Industrial Session 3: Indexing, Updates and Processing (Studio D)

Efficient Support of XQuery Update Facility in XML Enabled RDBMS

Zhen Hua Liu (Oracle)

Hui Chang (Oracle)

Balasubramanyam Sthanikam (Oracle)

Making Unstructured Data SPARQL Using Semantic Indexing in Oracle Database

Souripriya Das (Oracle)

Seema Sundara (Oracle)

Matthew Perry (Oracle)

Jagannathan Srinivasan (Oracle)

Jayanta Banerjee (Oracle)

Aravind Yalamanchi (Oracle)

A meta-language for MDX queries in eLog Business Solution

Sonia Bergamaschi (University of Modena and Reggio Emilia)

Matteo Interlandi (University of Modena and Reggio Emilia)

Mario Longo (eBilling S.p.A.)

Laura Po (University of Modena and Reggio Emilia)

Maurizio Vincini (University of Modena and Reggio Emilia)


Boolean Matrix Decomposition Problem: Theory, Variations and Applications to Data Engineering

Jaideep Vaidya (Rutgers University)



Noon - 2PM Lunch (Provided by Conference with Salon 4567)

2PM - 3:30PM Sessions 21-23, Demo Group 3 

Session 21: Data Mining (Studio F) 

Session Chair: Anthony Tung 

Predicting Approximate Protein-DNA Binding Cores Using Association Rule Mining

Po-Yuen Wong (The Chinese University of Hong Kong)

Tak-Ming Chan (The Chinese University of Hong Kong)

Man-Hon Wong (The Chinese University of Hong Kong)

Kwong-Sak Leung (The Chinese University of Hong Kong)

Upgrading Uncompetitive Products Economically

Hua Lu (Aalborg University)

Christian S. Jensen (Aarhus University)


Attribute-Based Subsequence Matching and Mining

Yu Peng (The Hong Kong University of Science and Technology)

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)

Liangliang Ye (The Hong Kong University of Science and Technology)

Philip S. Yu (University of Illinois at Chicago)

Integrating Frequent Pattern Mining from Multiple Data Domains for Classification

Dhaval Patel (National University of Singapore)

Wynne Hsu (National University of Singapore)

Mong Li Lee (National University of Singapore)

Session 22: Scientific Data, Analysis and Visualization (Studio B)

Session Chair: Christopher Re

Efficient Versioning for Scientific Array Databases

Adam Seering (MIT CSAIL)

Philippe Cudre-Mauroux (University of Fribourg)

Samuel Madden (MIT CSAIL)

Michael Stonebraker (MIT CSAIL)

Multidimensional Analysis of Atypical Events in Cyber-Physical Data

Lu-An Tang (UIUC)

Xiao Yu (UIUC)

Sangkyum Kim (UIUC)

Jiawei Han (UIUC)

Wen-Chih Peng (National Chiao Tung University)

Yizhou Sun (UIUC)

Hector Gonzalez (Google)

Sebastian Seith (Morning Star)

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

Fabian Keller (Karlsruhe Institute of Technology)


Klemens Böhm (Karlsruhe Institute of Technology)

Extracting Analyzing and Visualizing Triangle K-Core Motifs within Networks

Yang Zhang (The Ohio State University)

Srinivasan Parthasarathy (The Ohio State University)

Session 23: Similarity Search and Detection (Studio D) 

Session Chair: Xuemin Lin 

Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document


Min Soo Kim (KAIST)

Kyu-Young Whang (KAIST)

Yang-Sae Moon (Kangwon National University)

Adaptive Windows for Duplicate Detection

Uwe Draisbach (Hasso-Plattner-Institute)

Felix Naumann (Hasso-Plattner-Institute)

Sascha Szott (Zuse Institute)

Oliver Wonneberg (R. Lindner GmbH & Co. KG)

Efficient Dual-Resolution Layer Indexing for Top-k Queries

Jongwuk Lee (Pohang University of Science and Technology (POSTECH))

Hyunsouk Cho (Pohang University of Science and Technology (POSTECH))

Seung-won Hwang (Pohang University of Science and Technology (POSTECH))

Evaluating Probabilistic Queries over Uncertain Matching

Reynold Cheng (The University of Hong Kong)

Jian Gong (The University of Hong Kong)

David W. Cheung (The University of Hong Kong)

Jiefeng Cheng (Shenzhen Institute of Advanced Technology)



4PM - 5:30PM Sessions 24-25, Demo Group 4 

Session 24: Sensors Network and Trajectory (Studio B)

Session Chair: Flip Korn

Detecting Outliers in Sensor Networks using the Geometric Approach

Sabbas Burdakis (Technical University of Crete)

Antonios Deligiannakis (Technical University of Crete)

Efficient Threshold Monitoring for Distributed Probabilistic Data

Mingwang Tang (University of Utah)

Feifei Li (University of Utah)

Jeff M. Phillips (University of Utah)

Jeffrey Jestes (University of Utah)

Incorporating Duration Information for Trajectory Classification

Dhaval Patel (National University of Singapore)

Chang Sheng (DBS Bank)

Wynne Hsu (National University of Singapore)

Mong Li Lee (National University of Singapore)


Reducing Uncertainty of Low-Sampling-Rate Trajectories

Kai Zheng (The University of Queensland)

Yu Zheng (Microsoft Research Asia)

Xing Xie (Microsoft Research Asia)

Xiaofang Zhou (The University of Queensland)

Session 25: Error Reduction and Data Security (Studio D)

Session Chair: Graham Cormode

Efficient Similarity Search over Encrypted Data

Mehmet Kuzu (The University of Texas at Dallas)

Mohammad Saiful Islam (The University of Texas at Dallas)

Murat Kantarcioglu (The University of Texas at Dallas)

Obfuscating the Topical Intention in Enterprise Text Search

HweeHwa Pang (Singapore Management University)

Xiaokui Xiao (Nanyang Technological University)

Jialie Shen (Singapore Management University)

Correlation Support for Risk Evaluation in Databases

Katrin Eisenreich (SAP Research)

Jochen Adamek (Technische Universität Berlin)

Philipp Rösch (SAP Research)

Volker Markl (Technische Universität Berlin)

Gregor Hackenbroich (SAP Research)

A Game-Theoretic Approach for High-Assurance of Data Trustworthiness in Sensor Networks

Hyo-Sang Lim (Purdue University & Computer and Telecommunications Engineering Division, South Korea)

Gabriel Ghinita (University of Massachusetts at Boston)

Elisa Bertino (Purdue University)

Murat Kantarcioglu (University of Texas at Dallas)


THURSDAY, APRIL 5

9AM - 5:30PM Workshops 

Studio B: Data Management in the Cloud (DMC)

Studio D: Graph Data Management: Techniques and Applications (GDM)

Studio F: Secure Data Management on Smartphones and Mobiles (SDMSM)

Keynotes 

Keynote 1: Monday, April 2

Viewing the Web as a Distributed Knowledge Base

Serge Abiteboul (Professor at Collège de France and Senior Researcher at INRIA Saclay)

ABSTRACT: Information of interest may be found on the Web in a variety of forms, in many systems, and with different access protocols. A typical user may have information on many devices (smartphone, laptop, TV box, etc.), many systems (mailers, blogs, Web sites, etc.), many social networks (Facebook, Picasa, etc.). This same user may have access to more information from family, friends, associations, companies, and organizations. Today, the control and management of the diversity of data and tasks in this setting are beyond the skills of casual users. Facing similar issues, companies see the cost of managing and integrating information skyrocketing.

We are interested here in the management of such data. Our focus is not on harvesting all the data of a particular user or of a group of users and then managing it in a centralized manner. Instead, we are concerned with the management of Web data in place in a distributed manner, with a possibly large number of autonomous, heterogeneous systems collaborating to support certain tasks.

Our thesis is that managing the richness and diversity of user-centric data residing on the Web can be tamed using a holistic approach based on a distributed knowledge base. All Web information is represented as logical facts, and Web data management tasks as logical rules. We discuss Webdamlog, a variant of datalog for distributed data management that we use for this purpose. The automatic reasoning provided by its inference engine, operating over the Web knowledge base, greatly benefits a variety of complex data management tasks that currently require intense work and deep expertise.

This work is part of the Webdam European project, http://webdam.inria.fr/. 

Bio: Serge Abiteboul, Telecom Paris, PhD computer science, USC Los Angeles, and
Thèse d’Etat, University of Paris Sud. He has held professor positions at Stanford
and Ecole Polytechnique. He is one of the co-authors of Foundations of Databases,
and, recently, of Web Data Management. He co-founded a start-up named Xyleme in 2000.
He received the 1998 ACM SIGMOD Innovation Award. He has been program chair of a
number of conferences including ACM PODS-95, ICALP-94, ICDT-90, ECDL-99, VLDB-09,
ICDE-11, and a track of WWW-12. He was awarded an ERC Advanced Grant in 2008, namely
Webdam, on Foundations of Web Data Management. He has been a member of the French
Academy of Sciences since 2008.

Keynote 2: Tuesday, April 3

How Different is Big Data?

Surajit Chaudhuri (Microsoft Corp)

Talk Abstract: One buzzword that has been popular in the last couple of years is Big
Data. In simplest terms, Big Data symbolizes the aspiration to build platforms and
tools to ingest, store and analyze data that can be voluminous, diverse, and possibly
fast changing. In this talk, I will try to reflect on a few of the technical problems
presented by the exploration of Big Data. Some of these challenges in data analytics
have been addressed by our community in the past in a more traditional relational
database context but only with mixed results. I will review these quests and study
some of the key lessons learned. At the same time, significant developments such as
the emergence of cloud infrastructure and availability of data rich web services hold
the potential for transforming our industry. I will discuss the unique opportunities
they present for Big Data Analytics.

Biographical Sketch: Surajit Chaudhuri is a Distinguished Scientist at Microsoft
Research. His current areas of interest are enterprise data analytics,
self-manageability and multi-tenant technology for cloud database services. Surajit is
an ACM Fellow, a recipient of the ACM SIGMOD Edgar F. Codd Innovations Award, ACM
SIGMOD Contributions Award, and a VLDB 10 year Best Paper Award. Surajit received his
Ph.D. from Stanford University and B.Tech from the Indian Institute of Technology,
Kharagpur.


Keynote 3: Wednesday, April 4

Accountability and Trust in Cooperative Information Systems

Peter Druschel (Max Planck Institute for Software Systems (MPI-SWS),
Kaiserslautern and Saarbrücken, Germany)

Cooperation and trust play an increasingly important role in today’s information
systems. For instance, peer-to-peer systems like BitTorrent, Sopcast and Skype are
powered by resource contributions from participating users; federated systems like
the Internet have to respect the interests, policies and laws of participating
organizations and countries; in the Cloud, users entrust their data and computation
to third-party infrastructure.

In this talk, we consider accountability as a way to facilitate transparency and trust
in cooperative systems. We look at practical techniques to account for the integrity
of distributed, cooperative computations, and look at some of the difficulties and
open problems in accountability.

This talk describes joint work with Paarijaat Aditya, Ioannis Avramopoulos, Michael
Backes, Andreas Haeberlen, Petr Kuznetsov, Yin Lin, Bruce Maggs, Jennifer Rexford,
Rodrigo Rodrigues, Dominique Unruh, Bill Wishon and Mingchen Zhao.

Bio: Peter Druschel is the founding director of the Max Planck Institute for Software
Systems (MPI-SWS) in Germany. Previously, he was a Professor of Computer Science and
Electrical and Computer Engineering at Rice University in Houston, Texas. He received
the Dipl.-Ing. (FH) in Data Systems Engineering from Fachhochschule Munich, Germany in
1986 and the Ph.D. degree in Computer Science from the University of Arizona in 1994.
His research interests include distributed systems and operating systems. He is the
recipient of an NSF CAREER Award, Alfred P. Sloan Fellowship and the ACM SIGOPS Mark
Weiser Award, and a member of Academia Europaea and the German Academy of Sciences
Leopoldina.


Seminars

Seminar 1: Data Management Issues on the Semantic Web


Oktie Hassanzadeh is a Research Staff Member at IBM T.J. Watson Research Center. His
research interests are in the areas of data cleaning and integration, Web data
management and online data analytics. He received the IBM PhD fellowship in 2010, and
is a recipient of the 2010 Yahoo! Key Scientific Challenges award. He is a two-time
recipient of the first prize at the Triplification Challenge, an annual contest that
awards prizes to the most promising projects in the area of Linked Data. He is a
graduate of the University of Toronto (M.Sc., Ph.D.) and Sharif University of
Technology (B.Sc.).

Dr. Anastasios Kementsietsidis is a Research Staff Member at IBM T.J. Watson Research
Center at Hawthorne, NY. Anastasios has a PhD in computer science from the University
of Toronto. He is currently interested in various aspects of RDF data management
(including querying, storing and benchmarking RDF data). In the past, he worked (and
is still interested in continuing working) on data integration, cleaning, provenance
and annotation, security, as well as (distributed) query evaluation and optimization
on relational or semi-structured data. He has several publications in the leading
database conferences, including a best paper award in ICDE 2007, a best demo award in
EDBT 2006, and his CIKM 2009 paper was a runner-up for a best paper award. He has
served on the program committee of several leading conferences and workshops.

Yannis Velegrakis is a faculty member of the Department of Information Engineering and
Computer Science of the University of Trento. He holds a PhD degree in Computer
Science from the University of Toronto. His research areas of expertise include
information integration, mappings across heterogeneous data sources, interoperability,
keyword searching, semantic web, social applications, and large-scale data management.
Prior to joining the University of Trento, he held a researcher position at AT&T
Research Labs in the US. He has also spent time as a visitor at the University of
California, Santa Cruz, the IBM Almaden Research Center, and the Center of Advanced
Studies of the IBM Toronto Lab. He was a member of the committee for the CIMI cultural
profile of the ANSI/NISO Z39.50 standard. He has served on many program committees of
national and international conferences and as a reviewer for numerous international
journals. He is a general co-chair for VLDB 2013 and a PC co-chair for WebDB 2012. He
has also been a general co-chair for DESWEB 2010 and 2011 and for SWAE2007. He holds
2 US patents and has been a Marie Curie Fellow for the period 2006-2008.

Seminar 2: Discovering Multiple Clustering Solutions: Grouping Objects in
Different Views of the Data

Emmanuel Müller is a senior researcher at the Institute for Program Structures and
Data Organization at the Karlsruhe Institute of Technology (KIT), Germany. In the past
years, he was a research assistant in computer science at the data management and data
exploration group at RWTH Aachen University, Germany. His research interests cover
efficient data mining in high dimensional data, detection of clusters in subspace
projections and outlier detection. Leading the open-source initiative OpenSubspace, he
provides a general contribution to the research community, especially by a repeatable
and comparable evaluation study on recent data mining approaches. Dr. Müller received
his Diplom (MSc) in 2007 and his PhD in 2010 from RWTH Aachen University. He is an
active member of program committees such as SDM, ECML PKDD, and recent
MultiClust-Workshops.

Stephan Günnemann is a PhD student and research 

assistant in computer science at the data management and 

data exploration group at RWTH Aachen University, 

Germany. His research interests include the mining of non-redundant

and multiple clustering solutions for high

dimensional and structured data. He contributes to the open 

source initiative OpenSubspace for the evaluation and 

exploration of subspace clustering algorithms. Stephan 

Günnemann received his Diplom (MSc) in 2008 from RWTH 

Aachen University. 


Ines Färber is a PhD student and research assistant in computer science at the data
management and data exploration group at RWTH Aachen University, Germany. Her research
interests include mining of alternative and multi-view clustering solutions for high
dimensional data. She contributes to the OpenSubspace initiative for evaluation and
exploration of multiple clustering solutions. Ines Färber received her Diplom (MSc) in
2009 from RWTH Aachen University.

Thomas Seidl is a professor for computer science and head of the data management and
data exploration group at RWTH Aachen University, Germany. His research interests
include data mining and database technology for multimedia and spatio-temporal
databases in engineering, communication and life science applications. Prof. Seidl
received his Diplom (MSc) in 1992 from TU Muenchen and his PhD (1997) and venia
legendi (2001) from LMU Muenchen. He is an active member of several program committees
including ACM SIGKDD, IEEE ICDE, SDM, recent MultiClust-Workshops and others. He is a
member of the editorial board of The VLDB Journal as associate editor.

Seminar 3: Emerging Graph Queries in Linked Data

Arijit Khan is a PhD student of the Department of Computer Science, University of
California, Santa Barbara (UCSB). He is currently working with Professor Xifeng Yan in
Graph Mining. Arijit received his Bachelor's degree in Computer Science and
Engineering from Jadavpur University, India in 2008. He is the recipient of the
prestigious CITRIX GO-TO fellowship award for the academic year 2008-2009 and P1
fellowship award for the Spring Quarter in 2009-10 from the Department of Computer
Science, UCSB. He was also awarded gold medals by Tata Consultancy Services Ltd for
being the best student of the Department of Computer Science & Engineering, Jadavpur
University, for 2008-2009. He published papers in SIGMOD’10 and ’11.

Yinghui Wu is a research scientist of the Department of Computer Science, University
of California, Santa Barbara (UCSB). He is currently working with Professor Xifeng Yan
in graph data management. Yinghui received his PhD from the University of Edinburgh,
UK in 2010. His research interests lie in the area of database theory and graph
database management, with emphasis on graph database models and query languages. He
published papers in SIGMOD, VLDB, ICDE and ICDT.

Xifeng Yan is an assistant professor at the University of California at Santa Barbara.
He holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D.
degree in Computer Science from the University of Illinois at Urbana-Champaign in
2006. He was a research staff member at the IBM T. J. Watson Research Center between
2006 and 2008. He has been working on modeling, managing, and mining large-scale
graphs in bioinformatics, social networks, information networks, and computer systems.
His works were extensively referenced, with over 5,000 citations per Google Scholar.
He received the NSF CAREER Award, IBM Invention Achievement Award, ACM-SIGMOD
Dissertation Runner-Up Award, and IEEE ICDM 10-year Highest Impact Paper Award.

Seminar 4: Boolean Matrix Decomposition Problem: Theory, Variations and
Applications to Data Engineering

Dr. Jaideep Vaidya is an Associate Professor of Computer Information Systems at
Rutgers University. He received his Master's and Ph.D. from Purdue University and his
Bachelor's degree from the University of Mumbai. His research interests are in Data
Mining, Data Management, Privacy, and Security. He has published over 60 papers in
international conferences and archival journals, and has received three best paper
awards from the premier conferences in data mining, databases, and digital government.
He is also the recipient of an NSF Career Award and a Rutgers Board of Trustees
Research Fellowship for Scholarly Excellence.

Seminar 5: Mining Knowledge from Data: An Information Network Analysis Approach

Jiawei Han is Abel Bliss Professor in Engineering, in the Department of Computer
Science at the University of Illinois. He has been researching into data mining,
information network analysis, and database systems, with over 600 publications. He
served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from
Data (TKDD) and on the editorial boards of several other journals. Jiawei has received
IBM Faculty Awards, HP Innovation Awards, ACM SIGKDD Innovation Award (2004), IEEE
Computer Society Technical Achievement Award (2005), and IEEE Computer Society W.
Wallace McDowell Award (2009), and Daniel C. Drucker Eminent Faculty Award (2011). He
is a Fellow of ACM and a Fellow of IEEE. He is currently the Director of Information
Network Academic Research Center (INARC) supported by the Network Science-Collaborative
Technology Alliance (NS-CTA) program of U.S. Army Research Lab. His book with Micheline
Kamber and Jian Pei, "Data Mining: Concepts and Techniques" (Morgan Kaufmann)

student C. has 

Alliance of the 

been used worldwide as 

(NS-CTA) 

Drucker a textbook. Eminent Faculty Award (2011). He is a Fellow of ACM 

Department program of of U.S. Computer Army Research Science Lab. & Engineering, His book with Jadavpur 

and a Fellow of <strong>IEEE</strong>. He is currently the Director of Information Network Academic Micheline 

Research Center (INARC) University, 

Kamber supported Yizhou by for 

and Sun the 2008-2009. 

Jian Network is a Pei, Ph.D. Science-Collaborative He published papers 

“Data candidate Mining: at the Concepts University Technology in SIGMOD’10 

of and Illinois Techniques” 

and Alliance ’11. (NS-CTA) program of 

(Morgan at U.S. Urbana-Champaign. Army Research Lab. 

Kaufmann) has Her His 

been principal book 

used research with Micheline 

worldwide interest as is a in textbook. 

Kamber and Jian Pei, "Data Mining: Concepts and Techniques" (Morgan Kaufmann) has 

large-scale information and social networks, and more 

been used worldwide as a textbook. Yinghui 

generally 

Wu 

in data 

is a 

mining, 

research 

database 

scientist 

systems, 

of the 

applied 

Department of 

Computer statistics, machine Science, learning, University information of California, retrieval, Santa and Barbara 

Yizhou Sun is a Ph.D. candidate at the University of Illinois 

YizHOu (UCSB). network sun He science, is a currently Ph.D. with a candidate focus working on modeling with at Professor the novel University problems Xifeng of Yan Illinois at 

at Urbana-Champaign. Her principal research interest is in 

Urbana-Champaign. large-scale in and graph proposing information data management. scalable and Her social algorithms principal networks, Yinghui for research and large-scale, got more his PhD interest real- from is the in largescale 

generally University world information applications. in data of Edinburgh, mining, and Yizhou database social UK has systems, in networks, over 2010. 30 applied His publications and research more in interests generally in data 

mining, statistics, lie book in 

database chapters, the machine area learning, journals, systems, 

of database information and applied major theory retrieval, conferences statistics, 

and and graph such machine 

database as learning, in- 

network management, 

formation 

SIGKDD, science, 

retrieval, 

SIGMOD, with a emphasis focus 

and 

VLDB, on modeling 

network 

NIPS on graph and novel 

science, 

so database on, problems and 

with 

tutorials models and 

and a focus on model- 

query on 

proposing 

"mining languages. scalable 

heterogeneous He algorithms published information 

for large-scale, papers networks" in real- SIGMOD, in VLDB, 

ingworld novel applications. 

premier 

problems Yizhou 

conferences. 

and has proposing over 30 publications scalable in 

ICDE and ICDT. 

algorithms for largescale, 

book real-world chapters, journals, applications. and major conferences Yizhou has such over as 30 publications in 

SIGKDD, SIGMOD, VLDB, NIPS and so on, and tutorials 

book on "mining chapters, heterogeneous journals, information and major networks" conferences in such as SIGKDD, 

Xifeng Yan is an assistant professor at the University of California at Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. He has been working on modeling, managing, and mining large-scale graphs in bioinformatics, social networks, information networks, and computer systems. His works were extensively referenced, with over 5,000 citations per Google Scholar. He received NSF CAREER Award, IBM Invention Achievement Award, ACM-SIGMOD Dissertation Runner-Up Award, and IEEE ICDM 10-year Highest Impact Paper Award.



Philip S. Yu received his Ph.D. degree in E.E. from Stanford University. He is a Professor in Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information Technology. Dr. Yu spent most of his career at IBM, where he was manager of the Software Tools and Techniques group at the Watson Research Center. His research interests include data mining, database and privacy. He has published more than 650 papers in refereed journals and conferences. He holds or has applied for more than 350 US patents. Dr. Yu is a Fellow of the ACM and the IEEE. He is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004). He received a Research Contributions Award from IEEE Intl. Conference on Data Mining (2003).

Seminar 6: Detecting Clones, Copying and Reuse on the Web

Xin Luna Dong is a researcher at AT&T Labs-Research. She received her Ph.D. from University of Washington in 2007, received a Master’s Degree from Peking University in China in 2001, and received a Bachelor’s Degree from Nankai University in China in 1998. Her research interests include databases, information retrieval and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She has led the Solomon project, whose goal is to detect copying between structured sources and to leverage the results in various aspects of data integration, and the Semex personal information management system, which won the Best Demo award (one of top-3) in Sigmod’05. She has co-chaired WebDB’10 and has served in the program committees of Sigmod’12, Sigmod’11, VLDB’11, PVLDB’10, WWW’10, ICDE’10, VLDB’09, etc.


Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech from the Indian Institute of Technology, Bombay. His research interests span a variety of topics in data management.


Panels 

PANEL 1: NSF ICDE 2012 CAREER PANEL

Philip A. Bernstein, Ph.D. (Microsoft Research) is a Distinguished Scientist at Microsoft Corporation. Over the past 35 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. During that time, he has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and metadata management. The second edition of his book “Transaction Processing” with Eric Newcomer was published in June 2009. His latest work focuses on database systems for cloud computing, on web search over structured data, and on object-to-relational mappings. He is an Editor-in-Chief of the VLDB Journal, an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, and a member of the Washington State Academy of Sciences and the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.


James M. Kang, Ph.D. (National Geospatial-Intelligence Agency) received his B.S. at Purdue University in 2000, his M.S. at Rochester Institute of Technology in 2005, and his Ph.D. at the University of Minnesota in 2010; all of his degrees were in Computer Science. His research interests are in the areas of Spatio-Temporal Data Mining and Databases. From 2005 to 2007, he was an NSF IGERT (Integrative Graduate Education and Research Traineeship) Fellow at the University of Minnesota. He is currently a project scientist in the Basic and Applied Research Office at National Geospatial-Intelligence Agency (NGA) in Springfield, Virginia.

Yuanyuan Tian, Ph.D. (IBM Almaden Research) is a Research Staff Member at IBM Almaden Research Center. Her primary research area is large scale data processing and analytics. She received her PhD in Computer Science & Engineering from University of Michigan in 2008. She is the recipient of Distinguished Achievement Award from University of Michigan in 2008 for her research and academic accomplishments.

Srinivasan Parthasarathy, Ph.D. (Ohio State University)

Dr. Srinivasan Parthasarathy (PhD, University of Rochester) is currently a Professor of Computer Science and Engineering at the Ohio State University (OSU). His research interests are broadly in the areas of Data Mining, Databases, Bioinformatics and Parallel and Distributed Computing. He is a recipient of an Ameritech Faculty Fellowship in 2001, a US National Science Foundation CAREER award in 2003, a US Department of Energy Early Career Award in 2004, multiple IBM Faculty Awards in 2007 and 2010, and a Google Research Award in 2009. His papers have received six best paper awards or similar honors from among ten nominations in leading conferences in the field, including ones at SIAM international conference on data mining (SDM), IEEE international conference on data mining (ICDM), Intelligent Systems for Molecular Biology (ISMB), the Very Large Databases Conference (VLDB) and at the ACM Knowledge Discovery and Data Mining (SIGKDD). He has served on the program, organizational and steering committees of leading conferences in the fields of data mining, databases, and high performance computing. He currently serves on the editorial boards of several journals including the Data Mining and Knowledge Discovery Journal (DMKDJ), the Distributed and Parallel Databases Journal (DAPDJ), the Journal of Parallel and Distributed Computing (JPDC), and the ACM Transactions on Knowledge Discovery from Data (ACM-TKDD).


Dr. Alexandros Labrinidis received his Ph.D. degree in Computer Science from the University of Maryland, College Park in 2002. He is currently an associate professor at the Department of Computer Science of the University of Pittsburgh and co-director of the Advanced Data Management Technologies Lab. He is also an adjunct associate professor at Carnegie Mellon University (CS Dept). Dr. Labrinidis’ research focuses on user-centric data management for network-centric applications, including web-databases, data stream management systems, sensor networks, and scientific data management (with an emphasis on big data). He has published over 60 papers at peer-reviewed journals, conferences, and workshops; he is the recipient of an NSF CAREER award in 2008. Dr. Labrinidis is currently the Secretary/Treasurer for ACM SIGMOD, and has served as the Editor of SIGMOD Record, and in numerous program committees of international conferences/workshops.

PANEL 2: FUNDERS SESSION

Dr. Frank Olken (Consultant, Panel Organizer) is a veteran database researcher. He has a Ph.D. in Computer Science from Univ. of California Berkeley. He has worked on a variety of topics in scientific and statistical databases including random sampling from relational databases, bioinformatics, building energy management systems, power grid informatics, workflow management, file migration, metadata registries, etc. Most of his 35 year career was at Lawrence Berkeley National Laboratory. He has also worked on standards development for metadata registries, RDF and XML schema languages, etc.

From 2006 to 2010 he was detailed to the U.S. National Science Foundation as a program director in the Computer and Information Science and Engineering (CISE) Directorate, Information and Intelligent Systems (IIS) Division, Information Integration and Informatics (III) program, where he managed proposal reviews and awards in the areas of database management, graph database and mining, data intensive computing, etc. His current research interests include semantic web technologies, rule systems, graph data management and mining, electronic health records, and social science data management and analytics.

He can be reached at: frankolken@gmail.com, @frankolken on Twitter, and on LinkedIn, Facebook and Google+.

Dr. Le Gruenwald (National Science Foundation) is a Program Director and the Cluster Lead of the Information Integration and Informatics (III) Program, in the Information and Intelligent Systems (IIS) Division of the Computer and Information Science and Engineering (CISE) Directorate at the National Science Foundation (NSF). The IIS program supports research in areas such as Databases, Data Mining, Informatics, Information Retrieval, and Social Media. She is also the Presidential and Dr. David W. Franke Professor in the School of Computer Science at The University of Oklahoma (OU). She received her Ph.D. in Computer Science from Southern Methodist University in 1990. Prior to joining OU, she was a Member of Technical Staff in the Database Management Group at the Advanced Switching Laboratory of NEC, America, a Software Engineer at WRT, and a Lecturer in the Computer Science and Engineering Department at Southern Methodist University.

Dr. Gruenwald’s major research interests include Mobile and Sensor Databases, Data Security, Privacy and Confidentiality, Stream Data Management, Data Mining, Real-Time Distributed Databases, Autonomic Data Management, Multimedia Databases and Web Databases. She has published numerous technical papers in these areas.

She can be reached at: lgruenwa@nsf.gov

Dr. Ceren Susut (Department of Energy) joined the Department of Energy (DOE) Office of Advanced Scientific Computing Research (ASCR) in January 2011 after completing a National Research Council Postdoctoral Fellowship at the Center for Nanoscale Science and Technology (CNST) at the National Institute of Standards and Technology (NIST).

She has diverse research experience in chemistry, chemical engineering, materials science and applied physics. At ASCR, she currently manages the Scientific Discovery through Advanced Computing (SciDAC) portfolio.

She can be reached at: ceren.susut-bennett@science.doe.gov

Dr. Olga Brazhnik (National Institutes of Health) has over 30 years of professional experience in computational sciences and health, biomedical and clinical informatics. She started as a physicist, applying theoretical and computational methods in biology and medicine, and earned a Ph.D. in Computational Physics from Moscow State University, Russia. Researching and developing technologies for transforming data into knowledge, she worked at the University of Chicago, Virginia Tech, Virginia Bioinformatics Institute, and the US Air Force Surgeon General Office.

She joined the National Institutes of Health (NIH) in 2004. Over her years with NIH, she managed grants, cooperative agreements and contracts in areas of health and biomedical informatics, semantics, visualization, multi-media, knowledge engineering, social network analysis and collaborative technologies. Among her other duties, she currently directs the Small Business Innovation Research (SBIR) program at the National Center for Advancing Translational Sciences (NCATS).

Following her passion for developing collective intelligence about human health, Dr. Brazhnik keeps exploring ways in which ubiquitous computing and cutting edge technology enable us to employ individual creativity, wisdom of the crowds, art, holistic approaches and solid science to benefit human health and wellbeing. She introduced numerous novel informatics and collaborative technologies to the NIH community and recently organized Crowdsourcing: the Art and Science of Open Innovation (http://videocast.nih.gov/summary.asp?live=10366).

She can be reached at: brazhnik@mail.nih.gov


PANEL 3: THE FUTURE OF SCIENTIFIC DATA BASES

Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland. Her research interests are in designing robust systems to support data-intensive applications, and in particular (a) in maximizing the potential of multicore hardware and solid-state drive storage for scalable query and transaction processing, and (b) in automating physical design to support demanding scientific applications. She has received a European Young Investigator Award from the European Science Foundation (2007), a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), an Alfred P. Sloan Research Fellowship (2005), seven best-paper awards at top conferences (2001-2011), and an NSF CAREER award (2002). She earned her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is a member of IEEE and ACM, and has also been a CRA-W mentor.

Jeremy Kepner received a B.A. with distinction in Astrophysics from Pomona College (Claremont, CA). After receiving a DoE Computational Science Graduate Fellowship in 1994 he obtained his Ph.D. from the Dept. of Astrophysics at Princeton University in 1998 and then joined MIT. His research is focused on the development of advanced libraries for the application of massively parallel computing to a variety of data intensive signal processing problems on which he has published many articles. Jeremy is most proud of the opportunity he has had to be the principal architect, PI or otherwise co-lead of several very talented teams. These teams have produced a number of innovative technologies that have broken new ground in several domains.

Alexander Szalay is the Alumni Centennial Professor of Astronomy at the Johns Hopkins University, and Professor in the Department of Computer Science. He is a cosmologist, working on the statistical measures of the spatial distribution of galaxies and galaxy formation. He was born and educated in Hungary. He is the architect for the Science Archive of the Sloan Digital Sky Survey. His papers cover areas from theoretical cosmology to observational astronomy, spatial statistics and computer science. He is a Corresponding Member of the Hungarian Academy of Sciences, and a Fellow of the American Academy of Arts and Sciences. In 2004 he received an Alexander Von Humboldt Award in Physical Sciences, in 2007 the Microsoft Jim Gray Award. In 2008 he became Doctor Honoris Causa of the Eötvös University.

Michael Stonebraker has been a pioneer of data base

research and technology for more than a quarter of a century. 

He was the main architect of the INGRES relational DBMS, and 

the object-relational DBMS, POSTGRES. These prototypes were 

developed at the University of California at Berkeley where 

Stonebraker was a Professor of Computer Science for twenty five 

years. More recently at M.I.T. he was a co-architect of the Aurora/ 

Borealis stream processing engine, the C-Store column-oriented 

DBMS, and the H-Store transaction processing engine. Currently, 

he is working on science-oriented DBMSs, OLTP DBMSs, and 

search engines for accessing the deep web. He is the founder of 

five venture-capital backed startups, which commercialized his 

prototypes. Presently he serves as Chief Technology Officer of 

VoltDB, Paradigm4, Inc. and Goby.com. 

Professor Stonebraker is the author of scores of research papers on data base technology,

operating systems and the architecture

of system software services. He was awarded the ACM System 

Software Award in 1992, for his work on INGRES. Additionally, 

he was awarded the first annual Innovation award by the ACM 

SIGMOD special interest group in 1994, and was elected to the 

National Academy of Engineering in 1997. He was awarded the

IEEE John von Neumann award in 2005, and is presently an Adjunct

Professor of Computer Science at M.I.T. 

Awards 

Influential Paper Award

Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das:

DBXplorer: A System for Keyword-Based Search over Relational Databases. ICDE 2002.

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan:

Keyword Searching and Browsing in Databases using BANKS. ICDE 2002.

Citation

Together, these two papers from ICDE 2002 laid the foundations for keyword search

over relational databases, paving the way for a significant body of follow-on work in the

area of Information Retrieval and Databases. The solutions presented in these papers

are elegant and highly effective. 


Best Paper Award

Winner 

“Temporal Analytics on Big Data for Web Advertising” 

Badrish Chandramouli (Microsoft Research) Jonathan Goldstein (Microsoft Corporation)

Songyun Duan (IBM T. J. Watson Research Center)


The paper beautifully combines the Map-Reduce framework and ideas from data-stream

management systems for scalable temporal analytics on big data for effective behavioral 

targeting on the web. 

Runner-Up

“Recomputing Materialized Instances after Changes to Mappings and Data”

Todd J. Green (University of California, Davis) Zachary G. Ives (University of Pennsylvania)


The paper elegantly applies novel ideas for optimizing queries with materialized views

to the practical problem of incrementally adapting declarative schema mappings in collaborative 

data sharing systems. 

Abstracts 

Session 1: Privacy

Privacy in Social Networks: How Risky is Your Social Graph?

Cuneyt Gurcan Akcora (University of Insubria)

Barbara Carminati (University of Insubria)

Elena Ferrari (University of Insubria)

Several efforts have been made for more privacy aware Online Social Networks 

(OSNs) to protect personal data against various privacy threats. However, despite 

the relevance of these proposals, we believe there is still a lack of a conceptual

model on top of which privacy tools have to be designed. Central to this model

should be the concept of risk. Therefore, in this paper, we propose a risk measure for

OSNs. The aim is to associate a risk level with social network users in order to provide

other users with a measure of how risky it might be, in terms of disclosure

of private information, to have interactions with them. We compute risk levels based 

on similarity and benefit measures, by also taking into account the user risk attitudes. 

In particular, we adopt an active learning approach for risk estimation, where 

user risk attitude is learned from few required user interactions. The risk estimation 

process discussed in this paper has been developed into a Facebook application and 

tested on real data. The experiments show the effectiveness of our proposal. 


Differentially Private Spatial Decompositions 

Graham Cormode (AT&T Labs – Research)

Cecilia Procopiuc (AT&T Labs – Research)

Entong Shen (North Carolina State University)

Divesh Srivastava (AT&T Labs – Research)

Ting Yu (North Carolina State University)

Differential privacy has recently emerged as the de facto standard for private data 

release. This makes it possible to provide strong theoretical guarantees on the 

privacy and utility of released data. While it is well-understood how to release data 

based on counts and simple functions under this guarantee, it remains to provide 

general purpose techniques to release data that is useful for a variety of queries. In 

this paper, we focus on spatial data such as locations and more generally any multidimensional 

data that can be indexed by a tree structure. Directly applying existing 

differential privacy methods to this type of data simply generates noise. We 

propose instead the class of “private spatial decompositions”: these adapt standard

spatial indexing methods such as quadtrees and kd-trees to provide a private description 

of the data distribution. Equipping such structures with differential privacy 

requires several steps to ensure that they provide meaningful privacy guarantees. 

Various basic steps, such as choosing splitting points and describing the distribution 

of points within a region, must be done privately, and the guarantees of the 

different building blocks composed to provide an overall guarantee. Consequently, 

we expose the design space for private spatial decompositions, and analyze some 

key examples. A major contribution of our work is to provide new techniques for 

parameter setting and post-processing the output to improve the accuracy of query 

answers. Our experimental study demonstrates that it is possible to build such 

decompositions efficiently, and use them to answer a variety of queries privately 

with high accuracy. 

Differentially Private Histogram Publication 

Jia Xu (Northeastern University, China)

Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

Xiaokui Xiao (Nanyang Technological University)

Yin Yang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

Ge Yu (Northeastern University, China)

Differential privacy (DP) is a promising scheme for releasing the results of statistical 

queries on sensitive data, with strong privacy guarantees against adversaries with 

arbitrary background knowledge. Existing studies on DP mostly focus on simple aggregations 

such as counts. This paper investigates the publication of DP-compliant 

histograms, which are an important analytical tool for showing the distribution of a

random variable, e.g., hospital bill size for certain patients. Compared to simple aggregations 

whose results are purely numerical, a histogram query is inherently more 

complex, since it must also determine its structure, i.e., the ranges of the bins. As 

we demonstrate in the paper, a DP-compliant histogram with finer bins may actually 

lead to significantly lower accuracy than a coarser one, since the former requires 

stronger perturbations in order to satisfy DP. Moreover, the histogram structure itself 

may reveal sensitive information, which further complicates the problem. Motivated 

by this, we propose two novel algorithms, namely NoiseFirst and StructureFirst, for 

computing DP-compliant histograms. Their main difference lies in the relative order 

of the noise injection and the histogram structure computation steps. NoiseFirst 

has the additional benefit that it can improve the accuracy of an already published 

DP-compliant histogram computed using a naive method. Going one step further,

we extend both solutions to answer arbitrary range queries. Extensive experiments, 

using several real data sets, confirm that the proposed methods output highly accurate 

query answers, and consistently outperform existing competitors. 
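
The accuracy trade-off described in this abstract can be illustrated with a minimal sketch. It is not the NoiseFirst or StructureFirst algorithm; it only assumes the standard Laplace mechanism with sensitivity 1 per bin, and the function names are hypothetical. Under that assumption, a range query that sums k noisy bins accumulates k independent noise terms, which is one reason finer bins can answer the same range less accurately.

```python
import random


def noisy_histogram(counts, epsilon, rng):
    """Add Laplace(scale=1/epsilon) noise to every bin count.

    Assumption: adding or removing one record changes exactly one bin by 1,
    so the sensitivity of the histogram is 1.
    """
    # The difference of two Exp(rate=epsilon) draws is Laplace with scale 1/epsilon.
    return [c + rng.expovariate(epsilon) - rng.expovariate(epsilon) for c in counts]


def range_sum_noise_variance(num_bins_in_range, epsilon):
    """Variance of the total noise when a range query sums independent noisy bins."""
    per_bin_variance = 2.0 / (epsilon ** 2)  # variance of Laplace(scale=1/epsilon)
    return num_bins_in_range * per_bin_variance


# Hypothetical usage: the same range covered by 8 fine bins or by 1 coarse bin.
fine = range_sum_noise_variance(8, 0.5)
coarse = range_sum_noise_variance(1, 0.5)
```

With epsilon = 0.5, the range covered by 8 fine bins carries noise variance 64, while the same range covered by a single coarse bin carries variance 8.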

Privacy-Preserving and Content-Protecting Location Based Queries 

Russell Paulet (Victoria University)

Md. Golam Kaosar (Victoria University)

Xun Yi (Victoria University)

Elisa Bertino (Purdue University) 

In this paper we present a solution to one of the location-based query problems. 

This problem is defined as follows: (i) a user wants to query a database of location 

data, known as Points Of Interest (POI), and does not want to reveal his/her location 

to the server due to privacy concerns; (ii) the owner of the location data, that 

is, the location server, does not want to simply distribute its data to all users. The 

location server desires to have some control over its data, since the data is its asset. 

Previous solutions have used a trusted anonymiser to address privacy, but introduced 

the impracticality of trusting a third party. More recent solutions have used 

homomorphic encryption to remove this weakness. Briefly, the user submits his/her 

encrypted coordinates to the server and the server would determine the user’s location 

homomorphically, and then the user would acquire the corresponding record 

using Private Information Retrieval techniques. We propose a major enhancement 

upon this result by introducing a similar two stage approach, where the homomorphic 

comparison step is replaced with Oblivious Transfer to achieve a more secure 

solution for both parties. The solution we present is efficient and practical in many 

scenarios. We also include the results of a working prototype to illustrate the efficiency 

of our protocol. 

Session 2: Web 2.0 Applications

GeoFeed: A Location-Aware News Feed

Jie Bao (University of Minnesota at Twin Cities)

Mohamed F. Mokbel (University of Minnesota at Twin Cities)

Chi-Yin Chow (City University of Hong Kong)

This paper presents the GeoFeed system, a location-aware news feed system that

provides a new platform for its users to get spatially related message updates from 

either their friends or favorite news sources. GeoFeed distinguishes itself from all 

existing news feed systems in that it takes into account the spatial extents of messages 

and user locations when deciding upon the selected news feed. GeoFeed 

is equipped with three different approaches for delivering the news feed to its 

users, namely, spatial pull, spatial push, and shared push. The main challenge

of GeoFeed is then to decide when to use each of these three approaches, and for which

users. GeoFeed is equipped with a smart decision model that decides about using

these approaches in a way that: (a) minimizes the system overhead for delivering 


the location-aware news feed, and (b) guarantees a certain response time for each 

user to obtain the requested location-aware news feed. Experimental results, based 

on real and synthetic data, show that GeoFeed is favorable over existing news feed 

systems, with a minimal system overhead. 

Temporal Analytics on Big Data for Web Advertising 

Badrish Chandramouli (Microsoft Research)

Jonathan Goldstein (Microsoft Corp.)

Songyun Duan (IBM T. J. Watson Research Center)

“Big Data” in map-reduce (M-R) clusters is often fundamentally temporal in nature, 

as are many analytics tasks over such data. For instance, display advertising uses 

Behavioral Targeting (BT) to select ads for users based on prior searches, page 

views, etc. Previous work on BT has focused on techniques that scale well for offline 

data using M-R. However, this approach has limitations for BT-style applications that 

deal with temporal data: (1) many queries are temporal and not easily expressible in 

M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not 

suitable for temporal processing; (2) as commercial systems mature, they may need 

to also directly analyze and react to real-time data feeds since a high turnaround 

time can result in missed opportunities, but it is difficult for current solutions to 

naturally also operate over real-time streams. Our contributions are twofold. First, 

we propose a novel framework called TiMR (pronounced timer), that combines a 

time-oriented data processing system with an M-R framework. Users write and submit

analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic,

and easy to write. They scale well on large-scale offline data using TiMR, 

and can work unmodified over real-time streams. We also propose new cost-based 

query fragmentation and temporal partitioning schemes for improving efficiency 

with TiMR. Second, we show the feasibility of this approach for BT, with new temporal 

algorithms that exploit new targeting opportunities. Experiments using real data 

from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude

lower development effort. Our BT solution is easy and succinct, and 

performs up to several times better than current schemes in terms of memory, 

learning time, and click-through-rate/coverage. 

Entity Search Strategies for Mashup Applications 

Stefan Endrullis (University of Leipzig) 

Andreas Thor (University of Leipzig) 

Erhard Rahm (University of Leipzig)

Programmatic data integration approaches such as mashups have become a viable 

approach to dynamically integrate web data at runtime. Key data sources 

for mashups include entity search engines and hidden databases that need to be 

queried via source-specific search interfaces or web forms. Current mashups are 

typically restricted to simple query approaches such as using keyword search. Such 

approaches may need a high number of queries if many objects have to be found. 

Furthermore, the effectiveness of the queries may be limited, i.e., they may miss 

relevant results. We therefore propose more advanced search strategies that aim at 

finding a set of entities with high efficiency and high effectiveness. Our strategies 

use different kinds of queries that are determined by source-specific query generators.

Furthermore, the queries are selected based on the characteristics of input

entities. We introduce a flexible model for entity search strategies that includes 

a ranking of candidate queries determined by different query generators. We 

describe different query generators and outline their use within four entity search 

strategies. These strategies apply different query ranking and selection approaches 

to optimize efficiency and effectiveness. We evaluate our search strategies in detail 

for two domains: product search and publication search. The comparison with a 

standard keyword search shows that the proposed search strategies provide significant 

improvements in both domains. 

CI-Rank: Ranking Keyword Search Results Based on Collective Importance 

Xiaohui Yu (York University & Shandong University)

Huxia Shi (York University)

Keyword search over databases, popularized by keyword search in WWW, allows 

ordinary users to access database information without the knowledge of structured 

query languages and database schemas. Most of the previous studies in this area 

use IR-style ranking, which fails to consider the importance of the query answers. In

this paper, we propose CI-Rank, a new approach for keyword search in databases,

which considers the importance of individual nodes in a query answer and the

cohesiveness of the result structure in a balanced way. CI-Rank is built upon a carefully

designed model called Random Walk with Message Passing that helps capture

the relationships between different nodes in the query answer. We develop a branch 

and bound algorithm to support the efficient generation of top-k query answers. 

Indexing methods are also introduced to further speed up the run-time processing 

of queries. Extensive experiments conducted on two real data sets with a real user 

query log confirm the effectiveness and efficiency of CI-Rank.

Session 3: Storage Management

Lookup Tables: Fine-Grained Partitioning for Distributed Databases

Aubrey L. Tatarowicz (MIT)

Carlo Curino (MIT)

Evan P. C. Jones (MIT)

Sam Madden (MIT)

The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally 

partition data across several nodes. Ideally, this partitioning will result in each query 

being executed at just one node, to avoid the overheads of distributed transactions 

and allow nodes to be added without increasing the amount of required coordination. 

For some applications, simple strategies, such as hashing on primary key, provide 

this property. Unfortunately, for many applications, including social networking 

and order-fulfillment, many-to-many relationships cause simple strategies to result 

in a large fraction of distributed queries. Instead, what is needed is a fine-grained 

partitioning, where related individual tuples (e.g., cliques of friends) are co-located

in the same partition. Maintaining such a fine-grained partitioning requires

the database to store a large amount of metadata about which partition each tuple 

resides in. We call such metadata a lookup table, and present the design of a data 

distribution layer that efficiently stores these tables and maintains them in the 


presence of inserts, deletes, and updates. We show that such tables can provide 

scalability for several difficult-to-partition database workloads, including Wikipedia,

Twitter, and TPC-E. Our implementation provides 40% to 300% better performance 

on these workloads than either simple range or hash partitioning and shows greater 

potential for further scale-out. 
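
A minimal sketch of the idea, under loud assumptions: integer tuple keys, two partitions, and an in-memory dictionary standing in for the lookup table. The names hash_partition, partitions_touched, and lookup_table are hypothetical and do not come from the system described in the paper.

```python
def hash_partition(key, num_partitions):
    """Simple strategy: place a tuple by hashing (here: modulo) its integer key."""
    return key % num_partitions


def partitions_touched(keys, locate):
    """Return the set of partitions that a query over `keys` must visit."""
    return {locate(key) for key in keys}


# Hypothetical lookup table: tuple key -> partition id, chosen so that a
# clique of friends (users 1, 2, 3) lives on the same partition.
lookup_table = {1: 0, 2: 0, 3: 0, 10: 1, 11: 1}

friends = [1, 2, 3]
by_hash = partitions_touched(friends, lambda key: hash_partition(key, 2))
by_lookup = partitions_touched(friends, lookup_table.__getitem__)
```

In this toy setup by_hash contains both partitions, so the query over the three friends is distributed, while by_lookup contains a single partition. The price is the per-tuple metadata that the abstract says must be stored and maintained.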

Temporal Support for Persistent Stored Modules 

Richard T. Snodgrass (University of Arizona)

Dengfeng Gao (IBM Silicon Valley Lab)

Rui Zhang (University of Arizona)

Stephen W. Thomas (Queen’s University, Kingston)

We show how to extend temporal support of SQL to the Turing-complete portion 

of SQL, that of persistent stored modules (PSM). Our approach requires minor new 

syntax beyond that already in SQL/Temporal to define and to invoke PSM routines, 

thereby extending the current, sequenced, and non-sequenced semantics of 

queries to PSM routines. Temporal upward compatibility (existing applications work 

as before when one or more tables are rendered temporal) is ensured. We provide 

a transformation that converts Temporal SQL/PSM to conventional SQL/PSM. To 

support sequenced evaluation of PSM routines, we define two different slicing approaches, 

maximal slicing and per-statement slicing. We compare these approaches 

empirically using a comprehensive benchmark and provide a heuristic for choosing 

between them. 

Energy Efficient Storage Management Cooperated with Large Data 

Intensive Applications 

Norifumi Nishikawa (The University of Tokyo) 

Miyuki Nakano (The University of Tokyo) 

Masaru Kitsuregawa (The University of Tokyo) 

Power, especially that consumed for storing data, and cooling costs for datacenters 

have increased rapidly. The main applications running at datacenters are data intensive 

applications such as large file servers or database systems. Recently, power 

management of the data intensive applications has been emphasized in the literature. 

Such reports discuss the importance of power savings. However, these reports 

lack research on power management models for the efficient use of data intensive 

applications’ I/O behaviors. This paper proposes a novel energy efficient storage 

management system that monitors both application- and device-level I/O patterns 

at run time, and uses not only the device-level I/O pattern but also application-level

patterns. First, the design of the proposed model combined with such large data 

intensive applications will be shown. The key features of the model are i) classifying 

application-level I/O into four patterns using run-time access behaviors such as the 

length of idle time and read/write frequency, and ii) adopting an appropriate power-saving

method based on these application-level I/O patterns. Next, the proposed

method is quantitatively evaluated with typical data intensive applications such as 

file servers, OLTP, and DSS. It is shown that energy efficient storage management 

is effective in achieving large power savings compared with traditional approaches 

while an application is running. 

ISOBAR Preconditioner for Effective and High-throughput Lossless 

Data Compression 

Eric R. Schendel (North Carolina State University)

Ye Jin (North Carolina State University)

Neil Shah (North Carolina State University)

Jackie Chen (Sandia National Laboratory)

C.S. Chang (Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA)

Seung-Hoe Ku (New York University)

Stephane Ethier (Princeton Plasma Physics Laboratory)

Scott Klasky (Oak Ridge National Laboratory)

Robert Latham (Argonne National Laboratory)

Robert Ross (Argonne National Laboratory)

Nagiza F. Samatova (North Carolina State University & Oak Ridge National Laboratory)

Efficient handling of large volumes of data is a necessity for exascale scientific applications 

and database systems. To address the growing imbalance between the 

amount of available storage and the amount of data being produced by high speed 

(FLOPS) processors on the system, data must be compressed to reduce the total 

amount of data placed on the file systems. General-purpose lossless compression 

frameworks, such as zlib and bzlib2, are commonly used on datasets requiring lossless 

compression. Quite often, however, many scientific data sets compress poorly, 

referred to as hard-to-compress datasets, due to the negative impact of highly entropic 

content represented within the data. An important problem in better lossless 

data compression is to identify the hard-to-compress information and subsequently 

optimize the compression techniques at the byte-level. To address this challenge, 

we introduce the In-Situ Orthogonal Byte Aggregate Reduction Compression 

(ISOBAR-compress) methodology as a preconditioner of lossless compression to 

identify and optimize the compression efficiency and throughput of hard-to-compress 

datasets. 

Session 4: Data Streams Processing

Physically Independent Stream Merging

Badrish Chandramouli (Microsoft Research)

David Maier (Portland State University)

Jonathan Goldstein (Microsoft Corp.)

A facility for merging equivalent data streams can support multiple capabilities 

in a data stream management system (DSMS), such as query-plan switching and 

high availability. One can logically view a data stream as a temporal table of events, 

each associated with a lifetime (time interval) over which the event contributes to 

output. In many applications, the “same” logical stream may present itself physically 

in multiple physical forms, for example, due to disorder arising in transmission or 

from combining multiple sources; and modifications of earlier events. Merging such 

streams correctly is challenging when the streams may differ physically in timing, 

order, and composition. This paper introduces a new stream operator called Logical 

Merge (LMerge) that takes multiple logically consistent streams as input and 

outputs a single stream that is compatible with all of them. LMerge can handle the 


dynamic attachment and detachment of input streams. We present a range of algorithms 

for LMerge that can exploit compile-time stream properties for efficiency. 

Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes 

orders-of-magnitude more efficient than enforcing determinism on inputs, 

and that there is benefit to using specialized algorithms when stream variability 

is limited. We also show that LMerge and its extensions can provide performance 

benefits in several real-world applications. 
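
As a rough illustration of merging physically different copies of one logical stream, the sketch below handles only the simplest case. It assumes insert-only events, strictly increasing timestamps within every input, and identical logical content across inputs. The function name and tuple layout are hypothetical, and this is not the family of LMerge algorithms from the paper.

```python
def merge_ordered_duplicates(arrivals):
    """Merge copies of one logical stream into a single duplicate-free stream.

    `arrivals` is an iterable of (stream_id, timestamp, payload) tuples in the
    order the merger receives them. Assumption: every input delivers the same
    logical events in strictly increasing timestamp order, with no revisions.
    """
    high_water_mark = float("-inf")  # largest timestamp already emitted
    output = []
    for _stream_id, timestamp, payload in arrivals:
        if timestamp > high_water_mark:
            # First copy of this event to arrive from any input: emit it.
            output.append((timestamp, payload))
            high_water_mark = timestamp
        # Otherwise another input already delivered this event: drop the copy.
    return output


# Hypothetical interleaving of two physical streams, A and B.
arrivals = [("A", 1, "x"), ("B", 1, "x"), ("B", 2, "y"), ("A", 2, "y"), ("A", 3, "z")]
merged = merge_ordered_duplicates(arrivals)
```

The output follows whichever input is currently ahead. Disorder and modifications of earlier events, which the abstract names as the hard cases, are outside what this sketch covers.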

On Computing Correlated Aggregates over a Data Stream 

Srikanta Tirthapura (Iowa State University)

David P. Woodruff (IBM Almaden Research Center)

On a stream of two dimensional data items (x,y) where x is an item identifier, and 

y is a numerical attribute, a correlated aggregate query requires us to first apply 

a selection predicate along the second (y) dimension, followed by an aggregation 

along the first (x) dimension. For selection predicates of the form (y < c) or (y > c), 

where parameter c is provided at query time, we present new streaming algorithms 

and lower bounds for estimating statistics of the resulting substream of elements 

that satisfy the predicate. We provide the first sublinear space algorithms for a large 

family of statistics in this model, including frequency moments. We experimentally 

validate our algorithms, showing that their memory requirements are significantly 

smaller than existing linear storage schemes for large datasets, while simultaneously 

achieving fast per-record processing time. We also study the problem when 

the items have weights. Allowing negative weights allows for analyzing values which 

occur in the symmetric difference of two datasets. We give a strong space lower 

bound which holds even if the algorithm is allowed up to a logarithmic number of 

passes over the data (before the query is presented). We complement this with a

small space algorithm which uses a logarithmic number of passes. 
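
The query model can be made concrete with an exact, linear-storage baseline of the kind the abstract compares against. This sketch is an illustration under stated assumptions, not the paper's sublinear-space algorithm: it keeps a counter for every selected identifier, handles only predicates of the form y < c, and the function name is hypothetical.

```python
from collections import Counter


def correlated_frequency_moment(stream, c, k):
    """Exact k-th frequency moment of the identifiers x whose attribute y < c.

    `stream` is an iterable of (x, y) pairs. Selection is applied along y,
    then the aggregate is computed along x.
    """
    # Step 1: selection predicate on the second (y) dimension.
    frequencies = Counter(x for x, y in stream if y < c)
    # Step 2: aggregation on the first (x) dimension.
    # k = 0 counts distinct identifiers, k = 1 counts selected items.
    return sum(f ** k for f in frequencies.values())


# Hypothetical stream; with c = 6 the selected items are a, b, a.
stream = [("a", 1), ("b", 5), ("a", 2), ("c", 9), ("a", 7)]
distinct = correlated_frequency_moment(stream, 6, 0)
second_moment = correlated_frequency_moment(stream, 6, 2)
```

Because c is only known at query time, this baseline has to retain the whole stream, which is the storage cost the streaming algorithms in the paper aim to avoid.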

Accuracy-Aware Uncertain Stream Databases 

Tingjian Ge (University of Kentucky) 

Fujun Liu (University of Kentucky) 

Previous work has introduced probability distributions as first-class components in 

uncertain stream database systems. What is still lacking is a notion of how accurate

these probability distributions are. This indeed has a profound impact on the accuracy 

of query results presented to end users. While there is some previous work 

that studies unreliable intermediate query results in the tuple uncertainty model, 

to the best of our knowledge, we are the first to consider an uncertain stream

database in which accuracy is taken into consideration all the way from the learned 

distributions based on raw data samples to the query results. We perform an initial 

study of various components in an accuracy-aware uncertain stream database 

system, including the representation of accuracy information and how to obtain 

query results’ accuracy. In addition, we propose novel predicates based on hypothesis 

testing for decision-making using data with limited accuracy. We augment our 

study with a comprehensive set of experimental evaluations. 

On Discovery of Traveling Companions from Streaming Trajectories 

Lu-An Tang (UIUC)

Yu Zheng (MSRA)

Jing Yuan (MSRA)

Jiawei Han (UIUC)

Alice Leung (BBN)

Chih-Chieh Hung (Yahoo!)

Wen-Chih Peng (NCTU)

The advance of object tracking technologies leads to huge volumes of spatio-temporal 

data collected in the form of trajectory data streams. In this study, we investigate

the problem of discovering object groups that travel together (i.e., traveling

companions) from trajectory streams. Such a technique has broad applications in the

areas of scientific study, transportation management and military surveillance. To 

discover traveling companions, the monitoring system should cluster the objects 

of each snapshot and intersect the clustering results to retrieve moving-together 

objects. Since both clustering and intersection steps involve high computational 

overhead, the key issue of companion discovery is to improve the algorithm’s efficiency. 

We propose the models of closed companion candidates and smart intersection 

to accelerate data processing. A new data structure termed traveling buddy 

is designed to facilitate scalable and flexible companion discovery on trajectory 

stream. The traveling buddies are micro-groups of objects that are tightly bound together. 

By only storing the object relationships rather than their spatial coordinates, 

the buddies can be dynamically maintained along trajectory stream with low cost. 

Based on traveling buddies, the system can discover companions without accessing 

the object details. The proposed methods are evaluated with extensive experiments 

on both real and synthetic datasets. The buddy-based method is an order of 

magnitude faster than existing methods. It also outperforms other competitors with 

higher precision and recall in companion discovery. 
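
The cluster-and-intersect step described above can be sketched as a naive baseline. It assumes the per-snapshot clusters are already computed and given as sets of object ids, and it uses hypothetical size and duration thresholds. It does not implement the closed-candidate, smart-intersection, or traveling-buddy techniques proposed in the paper.

```python
def discover_companions(snapshots, min_size, min_duration):
    """Find groups of at least `min_size` objects that share a cluster for at
    least `min_duration` consecutive snapshots.

    `snapshots` is a list; each entry is a list of clusters (sets of object ids).
    """
    candidates = {}  # frozenset of object ids -> consecutive snapshots together
    companions = set()
    for clusters in snapshots:
        next_candidates = {}
        for cluster in clusters:
            cluster = frozenset(cluster)
            # Every sufficiently large cluster may start a new candidate group.
            extended = {cluster: 1} if len(cluster) >= min_size else {}
            # Intersect the cluster with every candidate carried over so far.
            for group, duration in candidates.items():
                common = group & cluster
                if len(common) >= min_size:
                    extended[common] = max(extended.get(common, 0), duration + 1)
            for group, duration in extended.items():
                next_candidates[group] = max(next_candidates.get(group, 0), duration)
        candidates = next_candidates
        companions.update(g for g, d in candidates.items() if d >= min_duration)
    return companions


# Hypothetical input: objects 1 and 2 share a cluster in all three snapshots.
snapshots = [
    [{1, 2, 3}, {4, 5}],
    [{1, 2, 4}, {3, 5}],
    [{1, 2, 5}, {3, 4}],
]
found = discover_companions(snapshots, min_size=2, min_duration=3)
```

Each snapshot intersects every cluster with every surviving candidate, which is the cost the abstract identifies as the key efficiency issue.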

Session 5: Graphs

Iterative Graph Feature Mining for Graph Indexing

Dayu Yuan (Penn State University)

Prasenjit Mitra (Penn State University)

Huiwen Yu (Penn State University)

C. Lee Giles (Penn State University)

Subgraph search is a popular query scenario on graph databases. Given a query 

graph q, the subgraph search algorithm returns all database graphs having q as a 

subgraph. In order to quickly process the subgraph search, subgraph features are 

mined to index the graph database. Many subgraph feature mining approaches 

have been proposed. They are all mine-at-once algorithms in which the whole

feature set is mined with one run of the mining before building a stable graph index.

However, due to the change of the environments (such as the update of the graph

database and the increase of available memory), the index needs to be updated to

accommodate those changes. Since most of the “mine-at-once” algorithms involve 

frequent subgraph or subtree mining over the whole graph database, and constructing

and deploying a new index involve expensive disk operations, it is not efficient

to re-mine the features and rebuild the index from scratch. We observe that, 

in most cases, it is sufficient to update a small part of the graph index. In this

paper, we propose an “iterative subgraph mining” algorithm, finding one feature 

to insert into (or remove from) the index iteratively. Since the majority of indexing 

features and the index structure are not changed, the algorithm can be frequently 

invoked. We first introduce the objective function that guides the feature mining. 

Then, a basic branch and bound algorithm is proposed to mine the features. Finally, 

we design an advanced search algorithm, which quickly finds a near-optimum 

subgraph feature and reduces the search space. Experiments show that our feature 

mining algorithm is 5 times faster than GIndex on updating the graph index, and 

features mined by the iterative algorithm have a high filtering rate on the subgraph

search problem. 

An Efficient Graph Indexing Method 

Xiaoli Wang (National University of Singapore) 

Xiaofeng Ding (Huazhong University of Science and Technology) 

Anthony K.H. Tung (National University of Singapore) 

Shanshan Ying (National University of Singapore)

Hai Jin (Huazhong University of Science and Technology) 

Graphs are popular models for representing complex structure data and similarity 

search for graphs has become a fundamental research problem. Many techniques 

have been proposed to support similarity search based on the graph edit distance. 

However, they all suffer from certain drawbacks: high computational complexity, 

poor scalability in terms of database size, or not taking full advantage of indexes. To 

address these problems, in this paper, we propose SEGOS, an indexing and query 

processing framework for graph similarity search. First, an effective two-level index 

is constructed off-line based on sub-unit decomposition of graphs. Then, a novel 

search strategy based on the index is proposed. Two algorithms adapted from TA 

and CA methods are seamlessly integrated into the proposed strategy to enhance 

graph search. More specifically, the proposed framework is easy to pipeline to

support continuous graph pruning. Extensive experiments are conducted on two 

real datasets to evaluate the effectiveness and scalability of our approaches. 

PRAGUE: Towards Blending Practical Visual Subgraph Query 

Formulation and Query Processing 

Changjiu Jin (Nanyang Technological University)

Sourav S. Bhowmick (Nanyang Technological University)

Byron Choi (Hong Kong Baptist University)

Shuigeng Zhou (Fudan University) 

In a previous paper, we laid out the vision of a novel graph query processing paradigm 

where instead of processing a visual query graph after its construction, it interleaves 

visual query formulation and processing by exploiting the latency offered 

by the GUI to filter irrelevant matches and prefetch partial query results [8]. Our 

first attempt at implementing this vision, called GBLENDER [8], shows significant 

improvement in system response time (SRT) for subgraph containment queries. 

However, GBLENDER suffers from two key drawbacks, namely inability to handle 

visual subgraph similarity queries and inefficient support for visual query modification, 

limiting its usage in practical environments. In this paper, we propose a novel

algorithm called PRAGUE (PRactical visuAl Graph QUery blEnder), that addresses 

these limitations by exploiting a novel data structure called spindle-shaped graphs 

(SPIG). A SPIG succinctly records various information related to the set of supergraphs 

of a newly added edge in the visual query fragment. Specifically, PRAGUE 

realizes a unified visual framework to support SPIG-based processing of modification-efficient 

subgraph containment and similarity queries. Extensive experiments 

on real-world and synthetic datasets demonstrate the effectiveness of PRAGUE.

Ego-centric Graph Pattern Census 

Walaa Eldin Moustafa (University of Maryland, College Park)

Amol Deshpande (University of Maryland, College Park)

Lise Getoor (University of Maryland, College Park)

There is increasing interest in analyzing networks of all types including social, biological, 

sensor, computer, and transportation networks. Broadly speaking, we may 

be interested in global network-wide analysis (e.g., centrality analysis, community 

detection) where the properties of the entire network are of interest, or local ego-centric

analysis where the focus is on studying the properties of nodes (egos) by 

analyzing their neighborhood subgraphs. In this paper we propose and study ego-centric

pattern census queries, a new type of graph analysis query, where a given 

structural pattern is searched for in every node’s neighborhood and the counts are 

reported or used in further analysis. This kind of analysis is useful in many domains 

in social network analysis including opinion leader identification, node classification, 

link prediction, and role identification. We propose an SQL-based declarative 

language to support this class of queries, and develop a series of efficient query 

evaluation algorithms for it. We evaluate our algorithms on a variety of synthetically 

generated graphs. We also show an application of our language in a real-world 

scenario for predicting future collaborations from DBLP data. 
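
As a rough illustration of a pattern census, the sketch below counts, for every node, the triangles that contain it. It assumes an undirected graph stored as an adjacency dictionary, takes the 1-hop neighborhood as the ego network, and fixes the pattern to a triangle. The function name `ego_triangle_census` is hypothetical; this is not the SQL-based language or the evaluation algorithms proposed in the paper.

```python
from itertools import combinations

def ego_triangle_census(adj):
    """For each ego node, count triangles containing it.

    A triangle through the ego exists for every pair of its
    neighbors that are themselves connected by an edge.
    """
    counts = {}
    for ego, nbrs in adj.items():
        counts[ego] = sum(
            1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u]
        )
    return counts

# Toy undirected graph (assumed data): a-b-c form a triangle, d hangs off a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
census = ego_triangle_census(adj)
```

The per-node counts could then feed further analysis, such as node classification features.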

Session 6: Uncertain and Probabilistic Databases

Searching Uncertain Data Represented by Non-Axis Parallel Gaussian 

Mixture Models 

Katrin Haegler (University of Munich) 

Frank Fiedler (University of Munich) 

Christian Böhm (University of Munich)

Efficient similarity search in uncertain data is a central problem in many modern 

applications such as biometric identification, stock market analysis, sensor networks, 

medical imaging, etc. In such applications, the feature vector of an object 

is not exactly known but is rather defined by a probability density function like a 

Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian 

distributions, hence, correlations between different features are not considered in 

the similarity search. In this paper, we propose a novel, efficient similarity search 

technique for general GMMs without independence assumption for the attributes, 

named SUDN, which approximates the actual components of a GMM in a conservative 

but tight way. A filter-refinement architecture guarantees no false dismissals, 

due to conservativity, as well as a good filter selectivity, due to the tightness of 

our approximations. An extensive experimental evaluation of SUDN demonstrates 

a considerable speed-up of similarity queries on general GMMs and an increase in 

accuracy compared to existing approaches. 

Aggregate Query Answering on Possibilistic Data with 

Cardinality Constraints 

Graham Cormode (AT&T Labs – Research)

Entong Shen (North Carolina State University)

Divesh Srivastava (AT&T Labs – Research)

Ting Yu (North Carolina State University)

Uncertainties in data can arise for a number of reasons: when data is incomplete, 

contains conflicting information or has been deliberately perturbed or coarsened to 

remove sensitive details. An important case which arises in many real applications 

is when the data describes a set of possibilities, but with cardinality constraints. 

These constraints represent correlations between tuples, encoding, e.g., that at most

two possible records are correct, or that there is an (unknown) one-to-one mapping 

between a set of tuples and attribute values. Although there has been much effort to 

handle uncertain data, current systems are not equipped to handle such correlations, 

beyond simple mutual exclusion and co-existence constraints. Vitally, they have little 

support for efficiently handling aggregate queries on such data. In this paper, we aim 

to address some of these deficiencies, by introducing LICM (Linear Integer Constraint 

Model), which can succinctly represent many types of tuple correlations, particularly 

a class of cardinality constraints. We motivate and explain the model with 

examples from data cleaning and masking sensitive data, to show that it enables 

modeling and querying such data, which was not previously possible. We develop an 

efficient strategy to answer conjunctive and aggregate queries on possibilistic data 

by describing how to implement relational operators over data in the model. LICM 

compactly integrates the encoding of correlations, query answering and lineage 

recording. In combination with off-the-shelf linear integer programming solvers, our 

approach provides exact bounds for aggregate queries. Our prototype implementation 

demonstrates that query answering with LICM can be effective and scalable. 

Discovering Threshold-based Frequent Closed Itemsets over 

Probabilistic Data 

Yongxin Tong (Hong Kong University of Science and Technology)

Lei Chen (Hong Kong University of Science and Technology)

Bolin Ding (University of Illinois at Urbana-Champaign)

In recent years, many new applications, such as sensor network monitoring and 

moving object search, show the growing importance of uncertain data

management and mining. In this paper, we study the problem of discovering 

threshold-based frequent closed itemsets over probabilistic data. Frequent itemset 

mining over probabilistic databases has attracted much attention recently. However,

existing solutions may lead to an exponential number of results due to the downward

closure property over probabilistic data. Moreover, it is hard to directly extend the 

successful experiences from mining exact data to a probabilistic environment due 

to the inherent uncertainty of data. Thus, in order to obtain a reasonable result set 

with small size, we study discovering frequent closed itemsets over probabilistic 

data. We prove that even a sub-problem of this problem, computing the frequent 

closed probability of an itemset, is #P-hard. Therefore, we develop an efficient

mining algorithm based on a depth-first search strategy to obtain all probabilistic

frequent closed itemsets. To reduce the search space and avoid redundant computation, 

we further design several probabilistic pruning and bounding techniques. 

Finally, we verify the effectiveness and efficiency of the proposed methods through 

extensive experiments. 

Ranking Query Answers in Probabilistic Databases: Complexity and 

Efficient Algorithms 

Dan Olteanu (Oxford)

Hongkai Wen (Oxford)

In many applications of probabilistic databases, the probabilities are mere degrees 

of uncertainty in the data and are not otherwise meaningful to the user. Often, users 

care only about the ranking of answers in decreasing order of their probabilities 

or about a few most likely answers. In this paper, we investigate the problem of 

ranking query answers in probabilistic databases. We give a dichotomy for ranking 

in case of conjunctive queries without repeating relation symbols: it is either 

in polynomial time or #P-hard. Surprisingly, our syntactic characterisation of

tractable queries is not the same as for probability computation. The key observation

is that there are queries for which probability computation is #P-hard, yet

ranking can be computed in polynomial time. This is possible whenever probability 

computation for distinct answers has a common factor that is hard to compute but 

irrelevant for ranking. We complement this tractability analysis with an effective 

ranking technique for conjunctive queries. Given a query, we construct a share plan, 

which exposes subqueries whose probability computation can be shared or ignored 

across query answers. Our technique combines share plans with incremental approximate 

probability computation of subqueries. We implemented our technique 

in the SPROUT query engine and report on performance gains of orders of magnitude 

over Monte Carlo simulation using FPRAS and exact probability computation 

based on knowledge compilation. 
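
The key observation above, that a factor shared by all answers cannot change their order, can be checked with a toy calculation. The sketch assumes each answer's probability factorizes into a positive common factor times an answer-specific factor. The names and numbers are invented for illustration, and this is not SPROUT's share-plan machinery.

```python
def rank_by_specific_factor(specific):
    """Rank answers by their answer-specific factor only, in descending order."""
    return sorted(specific, key=specific.get, reverse=True)

# Hypothetical answer-specific factors for three query answers.
specific = {"a1": 0.30, "a2": 0.75, "a3": 0.10}

# Stand-in for a shared factor that would be expensive to compute.
# Any positive value yields the same ranking.
common = 0.42

ranking_without_common = rank_by_specific_factor(specific)

# Ranking by the full probabilities, for comparison only.
full = {answer: common * p for answer, p in specific.items()}
ranking_with_common = sorted(full, key=full.get, reverse=True)
```

Because multiplying every score by the same positive constant preserves order, the ranking never needs the value of `common`.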

Session 7: Data Integration and Extraction

Joint Entity Resolution 

Steven Euijong Whang (Stanford University) 

Hector Garcia-Molina (Stanford University) 

Entity resolution (ER) is the problem of identifying which records in a database 

represent the same entity. Often, records of different types are involved (e.g., 

authors, publications, institutions, venues), and resolving records of one type can 

impact the resolution of other types of records. In this paper we propose a flexible, 

modular resolution framework where existing ER algorithms developed for a given 

record type can be plugged in and used in concert with other ER algorithms. Our 

approach also makes it possible to run ER on subsets of similar records at a time, 

important when the full data is too large to resolve together. We study the scheduling 

and coordination of the individual ER algorithms in order to resolve the full data 

set. We then evaluate our joint ER techniques on synthetic and real data and show 

the scalability of our approach. 

A Self-Configuring Schema Matching System 

Eric Peukert (SAP Research Dresden)

Julian Eberius (Dresden University of Technology) 


Mapping complex metadata structures is crucial in a number of domains such as 

data integration, ontology alignment or model management. To speed up the generation 

of such mappings, automatic matching systems were developed to compute 

mapping suggestions that can be corrected by a user. However, constructing and 

tuning match strategies still requires a high manual effort by matching experts as 

well as correct mappings to evaluate generated mappings. We therefore propose 

a self-configuring schema matching system that is able to automatically adapt to 

the given mapping problem at hand. Our approach is based on analyzing the input 

schemas as well as intermediate matching results. A variety of matching rules use 

the analysis results to automatically construct and adapt an underlying matching 

process for a given match task. We comprehensively evaluate our approach on 

different mapping problems from the schema, ontology and model management 

domains. The evaluation shows that our system is able to robustly return good quality 

mappings across different mapping problems and domains. 

Incremental Detection of Inconsistencies in Distributed Data 

Wenfei Fan (University of Edinburgh) 

Jianzhong Li (Harbin Institute of Technology)

Nan Tang (University of Edinburgh & Qatar Computing Research Institute)

Wenyuan Yu (University of Edinburgh)

This paper investigates the problem of incremental detection of errors in distributed 

data. Given a distributed database D, a set Σ of conditional functional dependencies 

(CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, it

is to find, with minimum data shipment, changes ΔV to V in response to ΔD. The

need for the study is evident since real-life data is often dirty, distributed and is 

frequently updated. It is often prohibitively expensive to recompute the entire set 

of violations when D is updated. We show that the incremental detection problem 

is NP-complete for D partitioned either vertically or horizontally, even when Σ and D 

are fixed. Nevertheless, we show that it is bounded and better still, actually optimal: 

there exist algorithms to detect errors such that their computational cost and 

data shipment are both linear in the size of ΔD and ΔV, independent of the size of

the database D. We provide such incremental algorithms for vertically partitioned 

data, and show that the algorithms are optimal. We further propose optimization 

techniques for the incremental algorithm over vertical partitions to reduce data 

shipment. We verify experimentally, using real-life data on Amazon Elastic Compute 

Cloud (EC2), that our algorithms substantially outperform their batch counterparts 

even when ΔV is reasonably large.

Recomputing Materialized Instances after Changes to Mappings and Data 

Todd J. Green (University of California, Davis)

Zachary G. Ives (University of Pennsylvania)

A major challenge faced by today’s information systems is that of evolution as 

data usage evolves or new data resources become available. Modern organizations 

sometimes exchange data with one another via declarative mappings among 

their databases, as in data exchange and collaborative data sharing systems. Such 

mappings are frequently revised and refined as new data becomes available, new 

cross-reference tables are created, and corrections are made. A fundamental question 

is how to handle changes to these mapping definitions, when the organizations 

each materialize the results of applying the mappings to the available data. We 

consider how to incrementally recompute these database instances in this setting, 

reusing (if possible) previously computed instances to speed up computation. We 

develop a principled solution that performs cost-based exploration of recomputation 

versus reuse, and simultaneously handles updates to source data and mapping 

definitions through a single, unified mechanism. Our solution also takes advantage 

of provenance information, when present, to speed up computation even further. 

We present an implementation that takes advantage of an off-the-shelf DBMS’s 

query processing system, and we show experimentally that our approach provides 

substantial performance benefits. 

Session 8: Spatio-Temporal Data Management

SWST: A Disk Based Index for Sliding Window Spatio-Temporal Data 

Manish Singh (University of Michigan, Ann Arbor) 

Qiang Zhu (University of Michigan, Dearborn) 

H.V. Jagadish (University of Michigan, Ann Arbor)

Numerous applications such as wireless communication and telematics need to 

keep track of evolution of spatio-temporal data for a limited past. Limited retention 

may even be required by regulations. In general, each data entry can have its own 

user specified lifetime. It is desired that expired entries are automatically removed 

by the system through some garbage collection mechanism. This kind of limited 

retention can be achieved by using a sliding window semantics similar to that from 

stream data processing. However, due to the large volume and relatively long lifetime 

of data in the aforementioned applications (in contrast to the real-time transient 

streaming data), the sliding window here needs to be maintained for data on 

disk rather than in memory. It is a new challenge to provide fast access to the information 

from the recent past and, at the same time, facilitate efficient deletion of the 

expired entries. In this paper, we propose a disk based, two-layered, sliding window 

indexing scheme for discretely moving spatio-temporal data. Our index can support 

efficient processing of standard timeslice and interval queries and delete expired 

entries with almost no overhead. In existing historical spatio-temporal indexing 

techniques, deletion is either infeasible or very inefficient. Our sliding window based 

processing model can support both current and past entries, while many existing 

historical spatio-temporal indexing techniques cannot keep these two types of data 

together in the same index. Our experimental comparison with the best known historical 

index (i.e., the MV3R tree) for discretely moving spatio-temporal data shows 

that our index is about five times faster in terms of insertion time and comparable 

in terms of search performance. MV3R follows a partial persistency model, whereas 

our index can support very efficient deletion and update. 

Querying Uncertain Spatio-Temporal Data 

Tobias Emrich (Ludwig-Maximilians-Universität München) 

Hans-Peter Kriegel (Ludwig-Maximilians-Universität München) 

Nikos Mamoulis (University of Hong Kong) 

Matthias Renz (Ludwig-Maximilians-Universität München)

Andreas Züfle (Ludwig-Maximilians-Universität München) 

The problem of modeling and managing uncertain data has received a great deal 

of interest, due to its manifold applications in spatial, temporal, multimedia and 

sensor databases. There exists a wide range of work covering spatial uncertainty in 

the static (snapshot) case, where only one point of time is considered. In contrast, 

the problem of modeling and querying uncertain spatio-temporal data has only 

been treated as a simple extension of the spatial case, disregarding time dependencies 

between consecutive timestamps. In this work, we present a framework for 

efficiently modeling and querying uncertain spatio-temporal data. The key idea of 

our approach is to model possible object trajectories by stochastic processes. This 

approach has three major advantages over previous work. First, it allows answering

queries in accordance with the possible worlds model. Second, dependencies

between object locations at consecutive points in time are taken into account. Third,

it is possible to reduce all queries on this model to simple matrix multiplications.

Based on these concepts we propose efficient solutions for different probabilistic 

spatio-temporal queries. In an experimental evaluation we show that our approaches 

are several orders of magnitude faster than state-of-the-art competitors.
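
To illustrate how a stochastic-process model can turn a query into matrix multiplication, the sketch below assumes a first-order, time-homogeneous Markov chain over a small discrete set of locations. The transition matrix, the state layout, and the function names are illustrative assumptions, not the framework's actual model.

```python
def step(dist, P):
    """Advance one time step: row vector `dist` times transition matrix `P`."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def prob_in_region_at(start, P, t, region):
    """Probability that the object occupies a state in `region` at time t."""
    dist = list(start)
    for _ in range(t):
        dist = step(dist, P)
    return sum(dist[s] for s in region)

# Assumed toy model: three locations on a line; state 2 is absorbing.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
start = [1.0, 0.0, 0.0]  # the object is known to be at location 0 at time 0

# Probability of being at location 2 exactly two steps later.
p = prob_in_region_at(start, P, 2, {2})
```

The dependency between consecutive timestamps is carried by `P`, which a snapshot-only model would ignore.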

The Min-dist Location Selection Query 

Jianzhong Qi (University of Melbourne) 

Rui Zhang (University of Melbourne)

Lars Kulik (University of Melbourne) 

Dan Lin (Missouri University of Science and Technology) 

Yuan Xue (University of Melbourne)

We propose and study a new type of location optimization problem: given a set of 

clients and a set of existing facilities, we select a location from a given set of potential 

locations for establishing a new facility so that the average distance between a 

client and her nearest facility is minimized. We call this problem the min-dist location 

selection problem, which has a wide range of applications in urban development 

simulation, massively multiplayer online games, and decision support systems. 

We explore two common approaches to location optimization problems and propose 

methods based on those approaches for solving this new problem. However, 

those methods either need to maintain an extra index or fall short in efficiency. To 

address their drawbacks, we propose a novel method (named MND), which has 

very close performance to the fastest method but does not need an extra index. 

We provide a detailed comparative cost analysis on the various algorithms. We also 

perform extensive experiments to evaluate their empirical performance and validate 

the efficiency of the MND method. 
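
As a point of reference, the min-dist objective defined above can be evaluated by brute force. The sketch below is not the MND method. It assumes Euclidean distance and small in-memory point lists, and the function name `min_dist_location` is hypothetical.

```python
import math

def min_dist_location(clients, facilities, candidates):
    """Return (best candidate, resulting average client-to-nearest-facility distance)."""
    # Distance from each client to its nearest existing facility.
    nearest = [min(math.dist(c, f) for f in facilities) for c in clients]

    best, best_avg = None, math.inf
    for p in candidates:
        # A new facility at p helps a client only if it is closer than the current nearest.
        total = sum(min(d, math.dist(c, p)) for c, d in zip(clients, nearest))
        avg = total / len(clients)
        if avg < best_avg:
            best, best_avg = p, avg
    return best, best_avg

# Assumed toy data.
clients = [(0.0, 0.0), (10.0, 0.0), (10.0, 2.0)]
facilities = [(0.0, 1.0)]
candidates = [(5.0, 0.0), (10.0, 1.0)]

best, avg = min_dist_location(clients, facilities, candidates)
```

This exhaustive scan costs time proportional to the number of clients times the number of candidates, which is the kind of cost an index or a dedicated method would aim to avoid.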

Bi-level Locality Sensitive Hashing for K-Nearest Neighbor Computation 

Jia Pan (UNC Chapel Hill)

Dinesh Manocha (UNC Chapel Hill)

We present a new Bi-level LSH algorithm to perform approximate k-nearest neighbor 

search in high dimensional spaces. Our formulation is based on a two-level 

scheme. In the first level, we use an RP-tree that divides the dataset into sub-groups

with bounded aspect ratios and is used to distinguish well-separated clusters. During 

the second level, we compute a single LSH hash table for each sub-group along 

with a hierarchical structure based on space-filling curves. Given a query, we first 

determine the sub-group that it belongs to and perform k-nearest neighbor search 

within the suitable buckets in the LSH hash table corresponding to the sub-group. 

Our algorithm also maps well to current GPU architectures and can improve the 

quality of approximate KNN queries as compared to prior LSH-based algorithms. 

We highlight its performance on two large, high-dimensional image datasets. Given 

a runtime budget, Bi-level LSH can provide better accuracy in terms of recall or 

error ratio. Moreover, our formulation reduces the variation in runtime cost or the

quality of results. 

Session 9: Query Processing

Learning-based Query Performance Modeling and Prediction 

Mert Akdere (Brown University) 

Ugur Cetintemel (Brown University)

Matteo Riondato (Brown University)

Eli Upfal (Brown University) 

Stanley B. Zdonik (Brown University) 

Accurate query performance prediction (QPP) is central to effective resource management, 

query optimization and query scheduling. Analytical cost models, used in 

the current generation of query optimizers, have been successful in comparing the costs

of alternative query plans, but they are poor predictors of execution latency. As a 

more promising approach to QPP, this paper studies the practicality and utility of 

sophisticated learning-based models, which have recently been applied to a variety 

of predictive tasks with great success, in both static (i.e., fixed) and dynamic query 

workloads. We propose and evaluate predictive modeling techniques that learn query 

execution behavior at different granularities, ranging from coarse-grained plan-level

models to fine-grained operator-level models. We demonstrate that these two 

extremes offer a tradeoff between high accuracy for static workload queries and 

generality to unforeseen queries in dynamic workloads, respectively, and introduce a 

hybrid approach that combines their respective strengths by selectively composing 

them in the process of QPP. We discuss how we can use a training workload to (i) 

pre-build and materialize such models offline, so that they are readily available for 

future predictions, and (ii) build new models online as new predictions are needed. 

All prediction models are built using only static features (available prior to query 

execution) and the performance values obtained from the offline execution of the 

training workload. We fully implemented all these techniques and extensions on top 

of PostgreSQL and evaluated them experimentally by quantifying their effectiveness 

over analytical workloads, represented by well-established TPC-H data and queries. 

The results provide quantitative evidence that learning-based modeling for QPP is 

both feasible and effective for both static and dynamic workload scenarios. 

Parametric Plan Caching Using Density-Based Clustering 

Gunes Aluc (University of Waterloo) 

David E. DeHaan (Sybase, an SAP Company)

Ivan T. Bowman (Sybase, an SAP Company)

Query plan caching eliminates the need for repeated query optimization; hence, it 

has strong practical implications for relational database management systems

(RDBMSs). Unfortunately, existing approaches consider only the query plan generated at

the expected values of parameters that characterize the query, data and the current 

state of the system, while these parameters may take different values during the lifetime 

of a cached plan. A better alternative is to harvest the optimizer’s plan choice 

for different parameter values, populate the cache with promising query plans, and 

select a cached plan based upon current parameter values. To address this challenge, 

we propose a parametric plan caching (PPC) framework that uses an online plan 

space clustering algorithm. The clustering algorithm is density-based, and it exploits 

locality-sensitive hashing as a pre-processing step so that clusters in the plan spaces 

can be efficiently stored in database histograms and queried in constant time. We 

experimentally validate that our approach is precise, efficient in space-and-time and 

adaptive, requiring no eager exploration of the plan spaces of the optimizer. 

Effective and Robust Pruning for Top-Down Join 


Pit Fender (Mannheim University) 

Guido Moerkotte (Mannheim University) 

Thomas Neumann (Technical University of Munich) 

Viktor Leis (Technical University of Munich)

Finding the optimal execution order of join operations is a crucial task of today’s 

cost-based query optimizers. There are two approaches to identify the best plan: 

bottom-up and top-down join enumeration. For both optimization strategies efficient 

algorithms have been published. However, only the top-down approach allows 

for branch-and-bound pruning. Two pruning techniques can be found in the literature. 

We add six new ones. Combined, they improve performance roughly by an 

average factor of 2-5. Even more important, our techniques improve the worst case 

by two orders of magnitude. Additionally, we introduce a new, very efficient, and 

easy to implement top-down join enumeration algorithm. This algorithm, together 

with our improved pruning techniques, outperforms the original top-down enumeration

algorithm with the original pruning methods by an average

factor of 6-9.

Towards Preference-aware Relational Databases 

Anastasios Arvanitis (National Technical University of Athens) 

Georgia Koutrika (IBM Almaden Research Center)

In implementing preference-aware query processing, a straightforward option is 

to build a plug-in on top of the database engine. However, treating the DBMS as 

a black box affects both the expressivity and performance of queries with preferences. 

In this paper, we argue that preference-aware query processing needs to be 

pushed closer to the DBMS. We present a preference-aware relational data model 

that extends database tuples with preferences and an extended algebra that captures 

the essence of processing queries with preferences. A key novelty of our preference 

model itself is that it defines a preference in three dimensions showing the 

tuples affected, their preference scores and the credibility of the preference. Our 

query processing strategies push preference evaluation inside the query plan and 

leverage its algebraic properties for finer-grained query optimization. We experimentally 

evaluate the proposed strategies. Finally, we compare our framework to a 

pure plug-in implementation and we show its feasibility and advantages. 

Session 10: Location Aware Data Processing

A Foundation for Efficient Indoor Distance-Aware Query Processing 

Hua Lu (Aalborg University) 

Xin Cao (Nanyang Technological University)

Christian S. Jensen (Aarhus University)

Indoor spaces accommodate large numbers of spatial objects, e.g., points of interest 

(POIs), and moving populations. A variety of services, e.g., location-based 

services and security control, are relevant to indoor spaces. Such services can be 

improved substantially if they are capable of utilizing indoor distances. However, existing 

indoor space models do not account well for indoor distances. To address this 

shortcoming, we propose a data management infrastructure that captures indoor 

distance and facilitates distance-aware query processing. In particular, we propose 

a distance-aware indoor space model that integrates indoor distance seamlessly. To 

enable the use of the model as a foundation for query processing, we develop accompanying, 

efficient algorithms that compute indoor distances for different indoor 

entities like doors as well as locations. We also propose an indexing framework 

that accommodates indoor distances that are pre-computed using the proposed 

algorithms. On top of this foundation, we develop efficient algorithms for typical 

indoor, distance-aware queries. The results of an extensive experimental evaluation 

demonstrate the efficacy of the proposals. 

LARS: A Location-Aware Recommender System 

Justin J. Levandoski (Microsoft Research)

Mohamed Sarwat (University of Minnesota) 

Ahmed Eldawy (University of Minnesota) 

Mohamed F. Mokbel (University of Minnesota) 

This paper proposes LARS, a location-aware recommender system that uses location-based 

ratings to produce recommendations. Traditional recommender systems 

do not consider spatial properties of users or items; LARS, on the other hand, supports

a taxonomy of three novel classes of location-based ratings, namely, spatial 

ratings for non-spatial items, non-spatial ratings for spatial items, and spatial ratings 

for spatial items. LARS exploits user rating locations through user partitioning, a 

technique that influences recommendations with ratings spatially close to querying 

users in a manner that maximizes system scalability while not sacrificing recommendation 

quality. LARS exploits item locations using travel penalty, a technique that favors 

recommendation candidates closer in travel distance to querying users in a way 

that avoids exhaustive access to all spatial items. LARS can apply these techniques 

separately, or in concert, depending on the type of location-based rating available. 

Experimental evidence using large-scale real-world data from both the Foursquare 

location-based social network and the MovieLens movie recommendation system 

reveals that LARS is efficient, scalable, and capable of producing recommendations 

twice as accurate compared to existing recommendation approaches. 

Approximate Shortest Distance Computing: A Query-Dependent Local 

Landmark Scheme 

Miao Qiao (The Chinese University of Hong Kong) 

Hong Cheng (The Chinese University of Hong Kong) 

Lijun Chang (The Chinese University of Hong Kong) 

Jeffrey Xu Yu (The Chinese University of Hong Kong) 

Shortest distance query between two nodes is a fundamental operation in large-scale 

networks. Most existing methods in the literature take a landmark embedding 

approach, which selects a set of graph nodes as landmarks and computes the 

shortest distances from each landmark to all nodes as an embedding. To handle a 

shortest distance query between two nodes, the precomputed distances from the 

landmarks to the query nodes are used to compute an approximate shortest distance 

based on the triangle inequality. In this paper, we analyze the factors that affect 

the accuracy of the distance estimation in the landmark embedding approach. 

In particular we find that a globally selected, query-independent landmark set plus 

the triangulation based distance estimation introduces a large relative error, especially 

for nearby query nodes. To address this issue, we propose a query-dependent 

local landmark scheme, which identifies a local landmark close to the specific query 

nodes and provides a more accurate distance estimation than the traditional global 

landmark approach. Specifically, a local landmark is defined as the least common 

ancestor of the two query nodes in the shortest path tree rooted at a global landmark. 

We propose efficient local landmark indexing and retrieval techniques, which 

are crucial to achieve low offline indexing complexity and online query complexity. 

Two optimization techniques on graph compression and graph online search are 

also proposed, with the goal to further reduce index size and improve query accuracy. 

Our experimental results on large-scale social networks and road networks 

demonstrate that the local landmark scheme reduces the shortest distance estimation 

error significantly when compared with global landmark embedding. 
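
The difference between the global and the local estimate can be illustrated with a minimal sketch. It assumes an unweighted, undirected graph stored as an adjacency dictionary and a single global landmark; the function names (`bfs_tree`, `global_estimate`, `local_estimate`) are hypothetical, and the sketch omits the indexing, compression, and online search techniques mentioned in the abstract.

```python
from collections import deque

def bfs_tree(graph, root):
    """Return (dist, parent) for the BFS shortest path tree rooted at `root`."""
    dist = {root: 0}
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent

def global_estimate(dist, s, t):
    """Triangle-inequality upper bound that routes through the global landmark."""
    return dist[s] + dist[t]

def local_estimate(dist, parent, s, t):
    """Upper bound that routes through the least common ancestor of s and t."""
    a, b = s, t
    # Climb from whichever node is at least as deep until both pointers meet.
    while a != b:
        if dist[a] >= dist[b]:
            a = parent[a]
        else:
            b = parent[b]
    return dist[s] + dist[t] - 2 * dist[a]

# Toy graph (an assumption for illustration): landmark 0, query nodes 3 and 4.
graph = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
dist, parent = bfs_tree(graph, 0)
global_bound = global_estimate(dist, 3, 4)
local_bound = local_estimate(dist, parent, 3, 4)
```

In this toy graph the true distance between nodes 3 and 4 is 2; the bound through the global landmark is much looser than the bound through their least common ancestor (node 2), which is the effect on nearby query nodes that the abstract describes.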

Desks: Direction-Aware Spatial Keyword Search 

Guoliang Li (Tsinghua University) 

Jianhua Feng (Tsinghua University) 

Jing Xu (Tsinghua University) 

Location-based services (LBS) have been widely accepted by mobile users. Many 

LBS users have a direction-aware search requirement: answers must be in 

the search direction. However, to the best of our knowledge, there is not yet any 

research available that investigates direction-aware search. A straightforward 

method first finds candidates without considering the direction constraint, and then 

generates the answers by pruning those candidates which invalidate the direction 

constraint. However this method is rather expensive as it involves a lot of useless 

computation on many unnecessary directions. To address this problem, we propose 

a direction-aware spatial keyword search method which inherently supports 

direction-aware search. We devise novel direction-aware indexing structures to 

prune unnecessary directions. We develop effective pruning techniques and search 

algorithms to efficiently answer a direction-aware query. As users may dynamically 

change their search directions, we propose to incrementally answer a query. Experimental 

results on real datasets show that our method achieves high performance 

and outperforms existing methods significantly. 

SESSION 11: MAP-REDUCE BASED DATA PROCESSING 

Extending Map-Reduce for Efficient Predicate-Based Sampling 

Raman Grover (University of California, Irvine) 

Michael Carey (University of California, Irvine) 

In this paper we address the problem of using MapReduce to sample a massive 

data set in order to produce a fixed-size sample whose contents satisfy a given 

predicate. While it is simple to express this computation using MapReduce, its 

default Hadoop execution is dependent on the input size and is wasteful of cluster 

resources. This is unfortunate, as sampling queries are fairly common (e.g., for 

exploratory data analysis at Facebook), and the resulting waste can significantly 

impact the performance of a shared cluster. To address such use cases, we present 

the design, implementation and evaluation of a Hadoop execution model extension 

that supports incremental job expansion. Under this model, a job consumes input 

as required and can dynamically govern its resource consumption while producing 

the required results. The proposed mechanism is able to support a variety of policies 

regarding job growth rates as they relate to cluster capacity and current load. 

We have implemented the mechanism in Hadoop, and we present results from an 

experimental performance study of different job growth policies under both single- 

and multi-user workloads. 
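
The idea of consuming input only as required can be sketched in a few lines. This is a single-process illustration, not the Hadoop extension itself: `sample_with_predicate` and `splits_per_round` are hypothetical names, the growth policy is a fixed number of splits per round, and the sketch simply keeps the first k matching records rather than drawing a statistically random sample.

```python
def sample_with_predicate(splits, predicate, k, splits_per_round=1):
    """Collect up to k records satisfying `predicate`, reading splits in rounds."""
    sample = []
    splits_read = 0
    remaining = iter(splits)
    while len(sample) < k:
        # Grow the job by a fixed number of input splits per round.
        batch = []
        for _ in range(splits_per_round):
            split = next(remaining, None)
            if split is None:
                break
            batch.append(split)
        if not batch:
            break  # input exhausted before k matches were found
        splits_read += len(batch)
        for split in batch:
            for record in split:
                if len(sample) < k and predicate(record):
                    sample.append(record)
    return sample, splits_read

# Toy input (an assumption): four splits of ten integers each.
splits = [list(range(i, i + 10)) for i in range(0, 40, 10)]
sample, splits_read = sample_with_predicate(splits, lambda x: x % 2 == 0, k=7)
```

Here only two of the four splits are read before the sample is complete, whereas a plain scan would touch all of them regardless of k.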

Fuzzy Joins Using MapReduce 

Foto Afrati (National Technical University Athens) 

Anish Das Sarma (Google, Inc. - work initiated at Yahoo! Research) 

David Menestrina (Google, Inc.) 

Aditya Parameswaran (Stanford University) 

Jeffrey D. Ullman (Stanford University) 

Fuzzy/similarity joins have been widely studied in the research community and extensively 

used in real-world applications. This paper proposes and evaluates several 

algorithms for finding all pairs of elements from an input set that meet a similarity 

threshold. The computation model is a single MapReduce job. Because we allow only 

one MapReduce round, the Reduce function must be designed so a given output pair 

is produced by only one task; for many algorithms, satisfying this condition is one of 

the biggest challenges. We break the cost of an algorithm into three components: the 

execution cost of the mappers, the execution cost of the reducers, and the communication 

cost from the mappers to reducers. The algorithms are presented first in terms 

of Hamming distance, but extensions to edit distance and Jaccard distance are shown 

as well. We find that there are many different approaches to the similarity-join problem 

using MapReduce, and none dominates the others when both communication 

and reducer costs are considered. Our cost analyses enable applications to pick the 

optimal algorithm based on their communication, memory, and cluster requirements. 
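
The requirement that each output pair be produced by only one reduce task can be illustrated with a small simulation. The sketch below uses a segment-splitting strategy for Hamming distance: each string is cut into d + 1 segments, and two equal-length strings within distance d must agree on at least one segment. It is offered as one plausible single-round approach, not necessarily one of the paper's algorithms; all function names are hypothetical, and the map and reduce phases are simulated in plain Python rather than run on a cluster.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def segments(s, d):
    """Split s into d + 1 contiguous segments of near-equal length."""
    n, parts = len(s), d + 1
    bounds = [i * n // parts for i in range(parts + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(parts)]

def map_phase(strings, d):
    """Emit each string once per segment, keyed by (segment index, segment)."""
    for s in strings:
        for i, seg in enumerate(segments(s, d)):
            yield (i, seg), s

def reduce_phase(key, values, d):
    """Emit a qualifying pair only at the first segment index where it agrees."""
    i, _ = key
    for a, b in combinations(sorted(set(values)), 2):
        if hamming(a, b) > d:
            continue
        sa, sb = segments(a, d), segments(b, d)
        first_match = next(j for j in range(d + 1) if sa[j] == sb[j])
        if first_match == i:
            yield a, b

def fuzzy_join(strings, d):
    """Simulate one map round, a shuffle, and one reduce round."""
    groups = defaultdict(list)
    for key, s in map_phase(strings, d):
        groups[key].append(s)
    out = []
    for key, values in groups.items():
        out.extend(reduce_phase(key, values, d))
    return sorted(out)

pairs = fuzzy_join(["0000", "0001", "0011", "1111"], 1)
```

Each string is replicated d + 1 times, which is the communication cost of this strategy; the check on `first_match` is what prevents a pair that agrees on several segments from being reported by several reducers.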

Parallel Top-K Similarity Join Algorithms Using MapReduce 

Younghoon Kim (Seoul National University) 

Kyuseok Shim (Seoul National University) 

There is a wide range of applications that require finding the top-k most similar 

pairs of records in a given database. However, computing such top-k similarity joins 

is a challenging problem today, as there is an increasing trend of applications that 

expect to deal with vast amounts of data. For such data-intensive applications, 

parallel executions of programs on a large cluster of commodity machines using 

the MapReduce paradigm have recently received a lot of attention. In this paper, we 

investigate how the top-k similarity join algorithms can get benefits from the popular 

MapReduce framework. We first develop the divide-and-conquer and branch-and-bound 

algorithms. We next propose the all pair partitioning and essential pair 

partitioning methods to minimize the amount of data transfers between map and 

reduce functions. We finally perform the experiments with not only synthetic but 

also real-life data sets. Our performance study confirms the effectiveness and scalability 

of our MapReduce algorithms. 

Load Balancing in MapReduce Based on Scalable Cardinality Estimates 

Benjamin Gufler (Technische Universität München) 

Nikolaus Augsten (Free University of Bolzano-Bozen) 

Angelika Reiser (Technische Universität München) 

Alfons Kemper (Technische Universität München) 

MapReduce has emerged as a popular tool for distributed and scalable processing 

of massive data sets and is being used increasingly in e-science applications. Unfortunately, 

the performance of MapReduce systems strongly depends on an even 

data distribution while scientific data sets are often highly skewed. The resulting 

load imbalance, which raises the processing time, is even amplified by high runtime 

complexity of the reducer tasks. An adaptive load balancing strategy is required for 

appropriate skew handling. In this paper, we address the problem of estimating the 

cost of the tasks that are distributed to the reducers based on a given cost model. 

An accurate cost estimation is the basis for adaptive load balancing algorithms and 

requires to gather statistics from the mappers. This is challenging: (a) Since the 

statistics from all mappers must be integrated, the mapper statistics must be small. 

(b) Although each mapper sees only a small fraction of the data, the integrated 

statistics must capture the global data distribution. (c) The mappers terminate after 

sending the statistics to the controller, and no second round is possible. Our solution 

to these challenges consists of two components. First, a monitoring component 

executed on every mapper captures the local data distribution and identifies 

its most relevant subset for cost estimation. Second, an integration component 

aggregates these subsets approximating the global data distribution. 

SESSION 12: SOCIAL MEDIA 

Community Detection with Edge Content in Social Media Networks 

Guo-Jun Qi (University of Illinois at Urbana-Champaign) 

Charu C. Aggarwal (IBM T. J. Watson Research Center) 

Thomas S. Huang (University of Illinois at Urbana-Champaign) 

The problem of community detection in social media has been widely studied in 

the social networking community in the context of the structure of the underlying 

graphs. Most community detection algorithms use the links between the nodes in 

order to determine the dense regions in the graph. These dense regions are the 

communities of social media in the graph. Such methods are typically based purely 

on the linkage structure of the underlying social media network. However, in many 

recent applications, edge content is available in order to provide better supervision 

to the community detection process. Many natural representations of edges in social 

interactions such as shared images and videos, user tags and comments are naturally 

associated with content on the edges. While some work has been done on utilizing 

node content for community detection, the presence of edge content presents 

unprecedented opportunities and flexibility for the community detection process. 

We will show that such edge content can be leveraged in order to greatly improve 

the effectiveness of the community detection process in social media networks. We 

present experimental results illustrating the effectiveness of our approach. 

Cross Domain Search by Exploiting Wikipedia 

Chen Liu (National University of Singapore) 

Sai Wu (National University of Singapore) 

Shouxu Jiang (Harbin Institute of Technology) 


The abundance of Web 2.0 resources in various media formats calls for better 

resource integration to enrich user experience. This naturally leads to a new cross-modal 

resource search requirement, in which a query is a resource in one modality 

and the results are closely related resources in other modalities. With cross-modal 

search, we can better exploit existing resources. Tags associated with Web 2.0 

resources are an intuitive medium for linking resources of different modalities. 

However, tagging is by nature an ad hoc activity. Tags often contain noise and are 

affected by the subjective inclination of the tagger. Consequently, linking resources 

simply by tags will not be reliable. In this paper, we propose an approach for linking 

tagged resources to concepts extracted from Wikipedia, which has become a fairly 

reliable reference over the last few years. Compared to the tags, the concepts are 

therefore of higher quality. We develop effective methods for cross-modal search 

based on the concepts associated with resources. Extensive experiments were conducted, 

and the results show that our solution achieves good performance. 

Provenance-based Indexing Support in Micro-blog Platforms 

Junjie Yao (Peking University) 

Bin Cui (Peking University) 

Zijun Xue (Peking University) 

Qingyun Liu (Peking University) 

Recently, lots of micro-blog message sharing applications have emerged on the 

web. Users can publish short messages freely and get notified by the subscriptions 

instantly. Prominent examples include Twitter, Facebook’s statuses, and Sina Weibo 

in China. The Micro-blog platform becomes a useful service for real time information 

creation and propagation. However, these messages’ short length and dynamic 

characters have posed great challenges for effective content understanding. Additionally, 

the noise and fragments make it difficult to discover the temporal propagation 

trail to explore development of micro-blog messages. In this paper, we propose 

a provenance model to capture connections between micro-blog messages. Provenance 

refers to data origin identification and transformation logging, demonstrating 

of great value in recent database and workflow systems. To cope with the real time 

micro-message deluge, we utilize a novel message grouping approach to encode 

and maintain the provenance information. Furthermore, we adopt a summary index 

and several adaptive pruning strategies to implement efficient provenance updating. 

Based on the index, our provenance solution can support rich query retrieval 

and intuitive message tracking for effective message organization. Experiments 

conducted on a real dataset verify the effectiveness and efficiency of our approach. 

Provenance refers to data origin identification and transformation monitoring, which 

has been demonstrated of great value in database and workflow systems. In this 

paper, we propose a provenance model in micro-blog platforms, and design an indexing 

scheme to support provenance-based message discovery and maintenance, 

which can capture the interactions of messages for effective message organization. 

To cope with the real time micro-message tornadoes, we introduce a novel virtual 

annotation grouping approach to encode and maintain the provenance information. 

Furthermore, we design a summary index and adaptive pruning strategies to facilitate 

efficient message update. Based on this provenance index, our approach can 

support query and message tracking in micro-blog systems. Experiments conducted 

on real datasets verify the effectiveness and efficiency of our approach. 

Learning Stochastic Models of Information Flow 

Luke Dickens (Imperial College London) 

Ian Molloy (IBM T. J. Watson Research Center) 

Jorge Lobo (IBM T. J. Watson Research Center) 

Pau-Chen Cheng (IBM T. J. Watson Research Center) 

Alessandra Russo (Imperial College London) 

An understanding of information flow has many applications, including for maximizing 

marketing impact on social media, limiting malware propagation, and managing 

undesired disclosure of sensitive information. This paper presents scalable methods 

for both learning models of information flow in networks from data, based 

on the Independent Cascade Model; and predicting probabilities of unseen flow 

from these models. Our approach is based on a principled probabilistic construction 

and results compare favourably with existing methods in terms of accuracy of 

prediction and scalable evaluation, with the addition that we are able to evaluate a 

broader range of queries than previously shown, including probability of joint and/ 

or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow 

probabilities is exponential in the number of edges and naive sampling can also 

be expensive, so we propose sampling in an efficient Markov-Chain Monte-Carlo 

fashion using the Metropolis-Hastings algorithm — details described in the paper. 

We identify two types of data, those where the paths of past flows are known — attributed 

data, and those where only the endpoints are known — unattributed data. 

Both data types are addressed in this paper, including training methods, example 

real world data sets, and experimental evaluation. In particular, we investigate 

flow data from the Twitter micro-blogging service, exploring the flow of messages 

through retweets (tweet forwards) for the attributed case, and the propagation of 

hashtags (metadata tags) and urls for the unattributed case. 
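
For readers unfamiliar with the Independent Cascade Model, the following minimal sketch estimates a flow probability by naive sampling, the baseline the abstract describes as potentially expensive. It is not the Metropolis-Hastings sampler from the paper. The edge-probability dictionary, the function names, and the toy network are assumptions made for illustration.

```python
import random

def simulate_cascade(edge_prob, source, rng):
    """One run of the Independent Cascade Model; returns the set of reached nodes."""
    active = {source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets a single chance per outgoing edge.
            for v, p in edge_prob.get(u, {}).items():
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def flow_probability(edge_prob, source, target, runs=10000, seed=0):
    """Estimate the probability that information from source reaches target."""
    rng = random.Random(seed)
    hits = sum(target in simulate_cascade(edge_prob, source, rng)
               for _ in range(runs))
    return hits / runs

# Toy network (an assumption): a -> b -> c, each edge firing with probability 0.5.
edge_prob = {"a": {"b": 0.5}, "b": {"c": 0.5}}
estimate = flow_probability(edge_prob, "a", "c")
```

On this chain the exact flow probability from a to c is 0.25, so the estimate should land close to that value. On larger graphs, exact evaluation is no longer practical, as the abstract notes.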

SESSION 13: P2P AND DISTRIBUTED PROCESSING 

BestPeer++: A Peer-to-Peer based Large-scale Data Processing 

Gang Chen (NetEase.com Inc. & Zhejiang University) 

Tianlei Hu (NetEase.com Inc. & Zhejiang University) 

Dawei Jiang (National University of Singapore) 

Peng Lu (National University of Singapore) 

Kian-Lee Tan (National University of Singapore) 

Hoang Tam Vo (National University of Singapore) 

Sai Wu (BestPeer Pte. Ltd. & National University of Singapore) 

The corporate network is often used for sharing information among the participating 

companies and facilitating collaboration in a certain industry sector where companies 

share a common interest. It can effectively help the companies to reduce 

their operational costs and increase the revenues. However, the inter-company data 

sharing and processing poses unique challenges to such a data management system 

including scalability, performance, throughput, and security. In this paper, we 

present BestPeer++, a system which delivers elastic data sharing services for corporate 

network applications in the cloud based on BestPeer — a peer-to-peer (P2P) 

based data management platform. By integrating cloud computing, database, and 

P2P technologies into one system, BestPeer++ provides an economical, flexible and 

scalable platform for corporate network applications and delivers data sharing services 

to participants based on the widely accepted pay-as-you-go business model. 

We evaluate BestPeer++ on Amazon EC2 Cloud platform. The benchmarking results 

show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale 

data processing system, in performance when both systems are employed to handle 

typical corporate network workloads. The benchmarking results also demonstrate 

that BestPeer++ achieves near linear scalability for throughput with respect to the 

number of peer nodes. 

Effective Data Density Estimation in Ring-based P2P Networks 

Minqi Zhou (East China Normal University) 

Heng Tao Shen (The University of Queensland) 

Xiaofang Zhou (The University of Queensland) 

Weining Qian (East China Normal University) 

Aoying Zhou (East China Normal University) 

Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important 

issue and has yet to be well addressed. It can benefit many P2P applications, 

such as load balancing analysis, query processing, and data mining. Inspired by the 

inversion method for random variate generation, in this paper we present a novel 

model named distribution-free data density estimation for dynamic ring-based P2P 

networks to achieve high estimation accuracy with low estimation cost regardless 

of distribution models of the underlying data. It generates random samples for any 

arbitrary distribution by sampling the global cumulative distribution function and is 

free from sampling bias. In P2P networks, the key idea for distribution-free estimation 

is to sample a small subset of peers for estimating the global data distribution 

over the data domain. Algorithms on computing and sampling the global cumulative 

distribution function based on which global data distribution is estimated are 

introduced with detailed theoretical analysis. Our extensive performance study confirms 

the effectiveness and efficiency of our methods in ring-based P2P networks. 
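
The inversion method that inspires this work can be shown in a short, centralized sketch: a uniform random number is mapped through the inverse of a cumulative distribution function to obtain a sample from that distribution. The code below builds an empirical CDF from a local list of values; it does not model peers or the ring overlay, and the names `build_cdf` and `sample_by_inversion` are hypothetical.

```python
import bisect
import random

def build_cdf(values):
    """Return the sorted values and their empirical cumulative probabilities."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def sample_by_inversion(xs, cdf, u):
    """Return the smallest value whose cumulative probability is at least u."""
    return xs[bisect.bisect_left(cdf, u)]

# Toy data (an assumption): four values, so each has probability 0.25.
xs, cdf = build_cdf([40, 10, 30, 20])
rng = random.Random(0)
draws = [sample_by_inversion(xs, cdf, rng.random()) for _ in range(1000)]
```

Because the uniform draw is pushed through the CDF itself, no assumption about the shape of the underlying distribution is needed, which is the property the abstract refers to as distribution-free.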

Processing of Rank Joins in Highly Distributed Systems 

Christos Doulkeridis (Norwegian University of Science and Technology (NTNU)) 

Akrivi Vlachou (Norwegian University of Science and Technology (NTNU)) 

Kjetil Nørvåg (Norwegian University of Science and Technology (NTNU)) 

Yannis Kotidis (Athens University of Economics and Business (AUEB)) 

Neoklis Polyzotis (UC Santa Cruz (UCSC)) 

In this paper, we study efficient processing of rank joins in highly distributed 

systems, where servers store fragments of relations in an autonomous manner. 

Existing rank-join algorithms exhibit poor performance in this setting due to excessive 

communication costs or high latency. We propose a novel distributed rank-join 

framework that employs data statistics, maintained as histograms, to determine the 

subset of each relational fragment that needs to be fetched to generate the top-k 

join results. At the heart of our framework lies a distributed score bound estimation 

algorithm that produces sufficient score bounds for each relation, that guarantee 

the correctness of the rank-join result set, when the histograms are accurate. Furthermore, 

we propose a generalization of our framework that supports approximate 

statistics, in the case that the exact statistical information is not available. An extensive 

experimental study validates the efficiency of our framework and demonstrates 

its advantages over existing methods. 

Load Balancing for MapReduce-based Entity Resolution 

Lars Kolb (University of Leipzig) 

Andreas Thor (University of Leipzig) 


The effectiveness and scalability of MapReduce-based implementations of complex data-intensive 

tasks depend on an even redistribution of data between map and reduce 

tasks. In the presence of skewed data, sophisticated redistribution approaches thus 

become necessary to achieve load balancing among all reduce tasks to be executed 

in parallel. For the complex problem of entity resolution, we propose and evaluate 

two approaches for such skew handling and load balancing. The approaches support 

blocking techniques to reduce the search space of entity resolution, utilize a preprocessing 

MapReduce job to analyze the data distribution, and distribute the entities of 

large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure 

shows the value and effectiveness of the proposed load balancing approaches. 

SESSION 14: XML AND RDF DATA MANAGEMENT 

Mapping XML to a Wide Sparse Table 

Liang Jeff Chen (UCSD) 

Philip A. Bernstein (Microsoft Corp.) 

Peter Carlin (Microsoft Corp.) 

Dimitrije Filipovic (Microsoft Corp.) 

Michael Rys (Microsoft Corp.) 

Nikita Shamgunov (Facebook Inc.) 

James F. Terwilliger (Microsoft Corp.) 

Milos Todic (Microsoft Corp.) 

Sasa Tomasevic (Microsoft Corp.) 

Dragan Tomic (Microsoft Corp.) 

XML is commonly supported by SQL database systems. However, existing mappings 

of XML to tables can only deliver satisfactory query performance for limited use 

cases. In this paper, we propose a novel mapping of XML data into one wide table 

whose columns are sparsely populated. This mapping provides good performance 

for document types and queries that are observed in enterprise applications but are 

not supported efficiently by existing work. XML queries are evaluated by translating 

them into SQL queries over the wide sparsely-populated table. We show how to 

translate full XPath 1.0 into SQL. Based on the characteristics of the new mapping, 

we present rewriting optimizations that minimize the number of joins. Experiments 

demonstrate that query evaluation over the new mapping delivers considerable 

improvements over existing techniques for the target use cases. 

Querying XML Data: As You Shape It 

Curtis E. Dyreson (Utah State University) 

Sourav S. Bhowmick (Nanyang Technological University) 

A limitation of XQuery is that a programmer has to be familiar with the shape of the 

data to query it effectively. And if that shape changes, or if the shape is other than 

what the programmer expects, the query may fail. One way to avoid this limitation 

is to transform the data into a desired shape. A data transformation is a rearrangement 

of data into a new shape. In this paper, we present the semantics and implementation 

of XMorph 2.0, a shape-polymorphic data transformation language for 

XML. An XMorph program can act as a query guard. The guard both transforms 

data to the shape needed by the query and determines whether and how the transformation 

potentially loses information; a transformation that loses information 

may lead to a query yielding an inaccurate result. This paper describes how to use 

XMorph as a query guard, gives a formal semantics for shape-to-shape transformations, 

documents how XMorph determines how a transformation potentially loses 

information, and describes the XMorph implementation. 

Branch Code: A Labeling Scheme for Efficient Query Answering on Trees 

Yanghua Xiao (Fudan University) 

Ji Hong (Fudan University) 

Wanyun Cui (Fudan University) 

Zhenying He (Fudan University) 

Wei Wang (Fudan University) 

Guodong Feng (Fudan University) 

Labeling schemes lie at the core of query processing for many tree-structured data 

such as XML data that is flooding the web. A labeling scheme that can simultaneously 

and efficiently support various relationship queries on trees (such as parent/ 

children, descendant/ancestor, etc.), computation of lowest common ancestors 

(LCA) and update of trees, is desired for effective and efficient management of 

tree-structured data. Although a variety of labeling schemes such as prefix-based 

labeling, interval-based labeling and prime-based labeling as well as their variants 

have been available to us for encoding static and dynamic trees, these labeling 

schemes usually show weakness in one aspect or another. In this paper, we propose 

an integer-based labeling scheme branch code as well as its compressed version 

as our major solution to simultaneously support efficient query processing on both 

static and dynamic ordered trees with affordable storage cost. The proposed branch 

code can answer common queries on ordered trees in constant time, which comes 

at the cost of consuming O(N log N) storage. To reduce storage cost to O(N), a compressed 

branch code is further developed. We also give a relationship determination 

algorithm purely using compressed branch code, which has a low probability of 

producing false positive results, as verified by experimental results. With the support 

of splay trees, branch code can also support dynamic trees so that updates and 

queries can be implemented with O(log N) amortized cost. All the results above are 

either theoretically proved or verified by experimental studies. 

Scalable Multi-Query Optimization for SPARQL 

Wangchao Le (University of Utah) 

Anastasios Kementsietsidis (IBM T. J. Watson Research Center) 

Songyun Duan (IBM T. J. Watson Research Center) 

Feifei Li (University of Utah) 

This paper revisits the classical problem of multi-query optimization in the context 

of RDF/SPARQL. We show that the techniques developed for relational and 

semi-structured data/query languages are hard, if not impossible, to be extended 

to account for RDF data model and graph query patterns expressed in SPARQL. In 

light of the NP-hardness of the multi-query optimization for SPARQL, we propose 

heuristic algorithms that partition the input batch of queries into groups such that 

each group of queries can be optimized together. An essential component of the 

optimization incorporates an efficient algorithm to discover the common substructures 

of multiple SPARQL queries and an effective cost model to compare 

candidate execution plans. Since our optimization techniques do not make any 

assumption about the underlying SPARQL query engine, they have the advantage 

of being portable across different RDF stores. The extensive experimental studies, 

performed on three popular RDF stores, show that the proposed techniques are 

effective, efficient and scalable. 

SESSION 15: PERFORMANCE 

GSLPI: a Cost-based Query Progress Indicator 

Jiexing Li (University of Wisconsin-Madison) 

Rimma V. Nehme (Microsoft Jim Gray Systems Lab) 

Jeffrey Naughton (University of Wisconsin-Madison) 

Progress indicators for SQL queries were first published in 2004 with the simultaneous 

and independent proposals from Chaudhuri et al. and Luo et al. In this paper, 

we implement both progress indicators in the same commercial RDBMS to investigate 

their performance. We summarize common cases in which they are both accurate 

and cases in which they fail to provide reliable estimates. Although there are 

differences in their performance, much more striking is the similarity in the errors 

they make due to a common simplifying uniform future speed assumption. While 

the developers of these progress indicators were aware that this assumption could 

cause errors, they neither explored how large the errors might be nor did they 

investigate the feasibility of removing the assumption. To rectify this we propose a 

new query progress indicator, similar to these early progress indicators but without 

the uniform speed assumption. Experiments show that on the TPC-H benchmark, 

on queries for which the original progress indicators have errors up to 30X the 

query running time, the new progress indicator is accurate to within 10 percent. We 

also discuss the sources of the errors that still remain and shed some light on what 

would need to be done to eliminate them. 

Micro-Specialization in DBMSes 

Rui Zhang (The University of Arizona) 

Richard T. Snodgrass (The University of Arizona) 

Saumya Debray (The University of Arizona) 

Relational database management systems are general in the sense that they can 

handle arbitrary schemas, queries, and modifications; this generality is implemented 

using runtime metadata lookups and tests that ensure that control is channelled 

to the appropriate code in all cases. Unfortunately, these lookups and tests are 

carried out even when information is available that renders some of these operations 

superfluous, leading to unnecessary runtime overheads. This paper introduces 

micro-specialization, an approach that uses relation- and query-specific 

information to specialize the DBMS code at runtime and thereby eliminate some of 

these overheads. We develop a taxonomy of approaches and specialization times 

and propose a general architecture that isolates most of the creation and execution 

of the specialized code sequences in a separate DBMS-independent module. 

Through three illustrative types of micro-specializations applied to PostgreSQL, 

we show that this approach requires minimal changes to a DBMS and can improve 

the performance simultaneously across a wide range of queries, modifications, and 

bulk-loading, in terms of storage, CPU usage, and I/O time of the TPC-H and TPC-C 

benchmarks. 

Towards Multi-Tenant Performance SLOs 

Willis Lang (University of Wisconsin-Madison) 

Srinath Shankar (Microsoft Jim Gray Systems Lab) 

Jignesh M. Patel (University of Wisconsin-Madison) 

Ajay Kalhan (Microsoft Corp.) 

As traditional and mission-critical relational database workloads migrate to the 

cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation 

to provide performance goals in Service Level Objectives (SLOs). Providing 

such performance goals is challenging for DaaS providers as they must balance the 

performance that they can deliver to tenants and the data center’s operating costs. 

In general, aggressively aggregating tenants on each server reduces the operating 

costs but degrades performance for the tenants, and vice versa. In this paper, we 

present a framework that takes as input the tenant workloads, their performance 

SLOs, and the server hardware that is available to the DaaS provider, and outputs 

a cost-effective recipe that specifies how much hardware to provision and how 

to schedule the tenants on each hardware resource. We evaluate our method and 

show that it produces effective solutions that can reduce the costs for the DaaS 

provider while meeting performance goals. 

Multi-Version Concurrency via Timestamp Range Conflict Management 

David Lomet (Microsoft Research) 

Alan Fekete (University of Sydney) 

Rui Wang (Microsoft Research) 

Peter Ward (University of Sydney) 

A database supporting multiple versions of records may use the versions to support 

queries of the past or to increase concurrency by enabling reads and writes to 

be concurrent. We introduce a new concurrency control approach that enables all 

SQL isolation levels including serializability to utilize multiple versions to increase 

concurrency while also supporting transaction time database functionality. The 

key insight is to manage a range of possible timestamps for each transaction that 

captures the impact of conflicts that have occurred. Using these ranges as constraints 

often permits concurrent access where lock-based concurrency control 

would block. This can also allow blocking instead of some aborts that are common 

in earlier multi-version concurrency techniques. Also, timestamp ranges can be 

used to conservatively find deadlocks without graph-based cycle detection. Thus, 

our multi-version support can enhance performance of current time data access via 

improved concurrency, while supporting transaction time functionality. 

SESSION 16: DATA EXTRACTION AND QUALITY 

Automatic Extraction of Structured Web Data with Domain Knowledge 

Nora Derouiche (Télécom ParisTech – CNRS LTCI) 

Bogdan Cautis (Télécom ParisTech – CNRS LTCI) 

Talel Abdessalem (Télécom ParisTech – CNRS LTCI) 

We present in this paper a novel approach for extracting structured data from the 

Web, whose goal is to harvest real-world items from template-based HTML pages 

(the structured Web). It illustrates a two-phase querying of the Web, in which an 

intentional description of the data that is targeted is first provided, in a flexible and 

widely applicable manner. The extraction process then leverages both the input 

description and the source structure. Our approach is domain-independent, in the 

sense that it applies to any relation, either flat or nested, describing real-world 

items. Extensive experiments on five different domains and comparison with the 

main state-of-the-art extraction systems from the literature illustrate its flexibility and 

precision. We advocate via our technique that automatic extraction and integration 

of complex structured data can be done fast and effectively, when the redundancy 

of the Web meets knowledge over the to-be-extracted data. 

Discovering Conservation Rules 

Lukasz Golab (University of Waterloo) 

Howard Karloff (AT&T Labs–Research) 

Flip Korn (AT&T Labs–Research) 

Barna Saha (AT&T Labs–Research) 

Divesh Srivastava (AT&T Labs–Research) 

Many applications process data in which there exists a “conservation law” between 

related quantities. For example, in traffic monitoring, every incoming event, such as 

a packet’s entering a router or a car’s entering an intersection, should ideally have 

an immediate outgoing counterpart. We propose a new class of constraints, Conservation 

Rules, that express the semantics and characterize the data quality of 

such applications. We give confidence metrics that quantify how strongly a conservation 

rule holds and present approximation algorithms (with error guarantees) for 

the problem of discovering a concise summary of subsets of the data that satisfy a 

given conservation rule. Using real data, we demonstrate the utility of conservation 

rules and we show order-of-magnitude performance improvements of our discovery 

algorithms over naive approaches. 

Answering Why-not Questions on Top-k Queries 

Zhian He (Hong Kong Polytechnic University) 

Eric Lo (Hong Kong Polytechnic University) 

After decades of effort working on database performance, the quality and the 

usability of database systems have received more attention in recent years. In 

particular, the feature of explaining missing tuples in a query result, or the so-called 

“why-not” questions, has recently become an active topic. In this paper, we study 

the problem of answering why-not questions on top-k queries. Our motivation is 

that we know many users love to use top-k queries when they are making multi-criteria 

decisions. However, they often feel frustrated when they are asked to quantify 

their feeling as a set of numeric weightings, and feel even more frustrated after they 

see the query results do not include their expected answers. In this paper, we use 

the query-refinement method to approach the problem. Given as inputs the original 

top-k query and a set of missing tuples, our algorithm returns to the user a refined 

top-k query that includes the missing tuples. A case study and experimental results 

show that our approach returns high quality explanations to users efficiently. 
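
As a rough illustration of the problem setting (not the authors' algorithm), the sketch below scores tuples with a weighted sum, evaluates a top-k query, and applies the simplest conceivable refinement: keeping the weights fixed and enlarging k until every missing tuple appears. The function names, the sample data in the usage note, and the enlarge-k-only strategy are all assumptions made for illustration; a real refinement could also adjust the weights.

```python
from typing import Dict, List, Sequence


def score(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted-sum score of one tuple."""
    return sum(w * v for w, v in zip(weights, values))


def rank_all(data: Dict[str, Sequence[float]], weights: Sequence[float]) -> List[str]:
    """All tuple names, best score first (ties broken by name)."""
    return sorted(data, key=lambda name: (-score(data[name], weights), name))


def top_k(data: Dict[str, Sequence[float]], weights: Sequence[float], k: int) -> List[str]:
    """Names of the k highest-scoring tuples."""
    return rank_all(data, weights)[:k]


def refine_k(data: Dict[str, Sequence[float]], weights: Sequence[float],
             k: int, missing: Sequence[str]) -> int:
    """Smallest k' >= k such that every missing tuple is in the top-k'.

    Hypothetical refinement: only k changes, the weights stay as given.
    Raises ValueError if a missing tuple is not in the data at all.
    """
    ranked = rank_all(data, weights)
    worst_rank = max(ranked.index(name) for name in missing) + 1
    return max(k, worst_rank)
```

For example, with tuples a=(0.9, 0.1), b=(0.5, 0.5), c=(0.2, 0.9), d=(0.1, 0.2) and weights (0.8, 0.2), the top-2 result omits c; `refine_k` then reports that k must grow to 3 for c to appear.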

An Efficient Trie-based Method for Approximate Entity Extraction with 

Edit-Distance Constraints 

Dong Deng (Tsinghua University) 

Guoliang Li (Tsinghua University) 

Jianhua Feng (Tsinghua University) 

Dictionary-based entity extraction has attracted much attention from the database 

community recently, which maps substrings in a document to predefined entities 

(e.g., person names or locations). To improve extraction recall, a recent trend is 

to provide approximate matching between substrings of the document and entities 

by tolerating minor errors. In this paper we study dictionary-based approximate 

entity extraction with edit-distance constraints. Existing methods have several 

limitations. First, they need to tune many parameters to achieve high performance. 

Second, they are inefficient for large edit-distance thresholds. We propose a trie-based 

method to address these problems. We first partition each entity into a set of 

segments, and then use a trie structure to index segments. To extract similar entities, 

we search segments from the document, and extend the matching segments 

in both entities and the document to find similar pairs. We develop an extension-based 

method to efficiently find similar string pairs by extending the matching 

segments. We optimize our partition scheme and select the best partition strategy 

to improve the extraction performance. Experimental results show that our method 

achieves much higher performance compared with state-of-the-art studies. 
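
The segment idea rests on a pigeonhole argument: if an entity is split into tau + 1 disjoint segments, tau edits can touch at most tau of them, so any string within edit distance tau of the entity must contain at least one segment unchanged. The sketch below is a minimal illustration under stated assumptions: a plain dictionary stands in for the trie, segments are cut evenly (not the optimized partition scheme of the paper), verification is brute force instead of extension-based, and every entity is assumed to be longer than tau. All function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def partition(entity: str, tau: int) -> List[str]:
    """Split entity into tau + 1 contiguous segments of near-equal length."""
    parts = tau + 1
    base, extra = divmod(len(entity), parts)
    segments, pos = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        segments.append(entity[pos:pos + length])
        pos += length
    return segments


def build_index(entities: List[str], tau: int) -> Dict[str, Set[str]]:
    """Map each segment to the entities containing it (stand-in for a trie)."""
    index: Dict[str, Set[str]] = {}
    for entity in entities:
        for segment in partition(entity, tau):
            if segment:
                index.setdefault(segment, set()).add(entity)
    return index


def extract(document: str, entities: List[str], tau: int) -> Set[Tuple[str, str]]:
    """Return (substring, entity) pairs with edit distance <= tau."""
    index = build_index(entities, tau)
    # Filter: an entity is a candidate only if one of its segments occurs.
    candidates: Set[str] = set()
    for segment, owners in index.items():
        if segment in document:
            candidates |= owners
    # Verify: brute-force check of all substrings of plausible length.
    results: Set[Tuple[str, str]] = set()
    for entity in candidates:
        for length in range(max(1, len(entity) - tau), len(entity) + tau + 1):
            for start in range(len(document) - length + 1):
                sub = document[start:start + length]
                if edit_distance(sub, entity) <= tau:
                    results.add((sub, entity))
    return results
```

The filter step is what saves work: entities none of whose segments occur in the document are never verified at all.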

SESSION 17: TOP-K PROCESSING 

On Top-k Structural Similarity Search 

Pei Lee (University of British Columbia) 

Laks V.S. Lakshmanan (University of British Columbia) 

Jeffrey Xu Yu (Chinese University of Hong Kong) 

Search for objects similar to a given query object in a network has numerous applications 

including web search and collaborative filtering. We use the notion of 

structural similarity to capture the commonality of two objects in a network, e.g., 

if two nodes are referenced by the same node, they may be similar. Meeting-based 

methods including SimRank and P-Rank capture structural similarity very well. 

Deriving inspiration from PageRank, SimRank has gained popularity by a natural 

intuition and domain independence. Since it’s computationally expensive, subsequent 

work has focused on optimizing and approximating the computation of 

SimRank. In this paper, we approach SimRank from a top-k querying perspective 

where given a query node v, we are interested in finding the top-k nodes that have 

the highest SimRank score w.r.t. v. The only known approaches for answering such 

queries are either a naive algorithm of computing the similarity matrix for all node 

pairs or computing the similarity vector by comparing the query node v with each 

other node independently, and then picking the top-k. None of these approaches 

can handle top-k structural similarity search efficiently by scaling to very large 

graphs consisting of millions of nodes. We propose an algorithmic framework called 

TopSim based on transforming the top-k SimRank problem on a graph G to one 

of finding the top-k nodes with highest authority on the product graph G × G. We 

further accelerate TopSim by merging similarity paths and develop a more efficient 

algorithm called TopSim-SM. Two heuristic algorithms, Trun-TopSim-SM and 

Prio-TopSim-SM, are also proposed to approximate TopSim-SM on scale-free graphs to 

trade accuracy for speed, based on truncated random walk and prioritizing propagation 

respectively. We analyze the accuracy and performance of TopSim family 

algorithms and report the results of a detailed experimental study. 
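
For orientation, the naive baseline mentioned in the abstract (compute all-pairs SimRank, then pick the top-k for the query node) can be sketched as follows. This is the standard iterative SimRank definition, not TopSim; the decay factor `c`, the iteration count, and the function names are illustrative assumptions.

```python
from typing import Dict, List


def simrank(in_neighbors: Dict[str, List[str]], c: float = 0.8,
            iterations: int = 10) -> Dict[str, Dict[str, float]]:
    """All-pairs SimRank by fixed-point iteration.

    s(a, a) = 1; for a != b,
    s(a, b) = c / (|I(a)| |I(b)|) * sum of s(i, j) over in-neighbors i, j,
    and 0 if either node has no in-neighbors.
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


def top_k_similar(in_neighbors: Dict[str, List[str]], v: str, k: int,
                  c: float = 0.8, iterations: int = 10) -> List[str]:
    """The k nodes with the highest SimRank score w.r.t. v (naive approach)."""
    sim = simrank(in_neighbors, c, iterations)
    others = [u for u in in_neighbors if u != v]
    return sorted(others, key=lambda u: (-sim[v][u], u))[:k]
```

Each iteration touches every node pair, which is exactly the cost that makes this baseline impractical on graphs with millions of nodes.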

Relevance Matters: Capitalizing on Less (Top-k Matching in 

Publish/Subscribe) 

Mohammad Sadoghi (University of Toronto) 

Hans-Arno Jacobsen (University of Toronto) 

The efficient processing of large collections of Boolean expressions plays a central 

role in major data intensive applications ranging from user-centric processing 

and personalization to real-time data analysis. Emerging applications such 

as computational advertising and selective information dissemination demand 

determining and presenting to an end-user only the most relevant content that is 

both user-consumable and suitable for limited screen real estate of target devices. 

To retrieve the most relevant content, we present BE*-Tree, a novel indexing data 

structure designed for effective hierarchical top-k pattern matching, which as its 

by-product also reduces the operational cost of processing millions of patterns. To 

further reduce processing cost, BE*-Tree employs an adaptive and non-rigid space-cutting 

technique designed to efficiently index Boolean expressions over a high-dimensional 

continuous space. At the core of BE*-Tree lie two innovative ideas: (1) 

a bi-directional tree expansion built as a top-down (data and space clustering) and 

a bottom-up (space clustering) growth, which together enable indexing only non-empty 

continuous sub-spaces, and (2) an overlap-free splitting strategy. Finally, the 

performance of BE*-Tree is proven through a comprehensive experimental comparison 

against state-of-the-art index structures for matching Boolean expressions. 

Efficiently Monitoring Top-k Pairs over Sliding Windows 

Zhitao Shen (UNSW) 

Muhammad Aamir Cheema (UNSW) 

Xuemin Lin (UNSW & ECNU) 

Wenjie Zhang (UNSW) 

Haixun Wang (Microsoft Research Asia) 

Top-k pairs queries have received significant attention from the research community. 

k-closest pairs queries, k-furthest pairs queries and their variants are among the 

best-studied special cases of top-k pairs queries. In this paper, we present 

the first approach to answer a broad class of top-k pairs queries over sliding 

windows. Our framework handles multiple top-k pairs queries and each query is 

allowed to use a different scoring function, a different value of k and a different size 

of the sliding window. Although the number of possible pairs in the sliding window 

is quadratic in the number of objects N in the sliding window, we efficiently answer 

the top-k pairs query by maintaining a small subset of pairs called K-skyband which 

is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same 

scoring function, we need to maintain only one K-skyband. We present efficient 

techniques for the K-skyband maintenance and query answering. We conduct a 

detailed complexity analysis and show that the expected cost of our approach is 

reasonably close to the lower bound cost. We experimentally verify this by comparing 

our approach with a specially designed supreme algorithm that assumes the 

existence of an oracle and meets the lower bound cost. 

Processing and Notifying Range Top-k Subscriptions 

Albert Yu (Duke University) 

Pankaj K. Agarwal (Duke University) 

Jun Yang (Duke University) 

We consider how to support a large number of users over a wide-area network 

whose interests are characterised by range top-k continuous queries. Given an 

object update, we need to notify users whose top-k results are affected. Simple 

solutions include using a content-driven network to notify all users whose interest 

ranges contain the update (ignoring top-k), or using a server to compute only the 

affected queries and notifying them individually. The former solution generates too 

much network traffic, while the latter overwhelms the server. We present a geometric 

framework for the problem that allows us to describe the set of affected queries 

succinctly with messages that can be efficiently disseminated using content-driven 

networks. We give fast algorithms to reformulate each update into a set of messages 

whose number is provably optimal, with or without knowing all user interests. 

We also present extensions to our solution, including an approximate algorithm that 

trades off between the cost of server-side reformulation and that of user-side post-processing, 

as well as efficient techniques for batch updates. 

SESSION 18: SIMILARITY 

Efficient Exact Similarity Searches using Multiple Token Orderings 

Jongik Kim (Chonbuk National University) 

Hongrae Lee (Google Inc.) 

Similarity searches are essential in many applications including data cleaning and near 

duplicate detection. Many similarity search algorithms first generate candidate records, 

and then identify true matches among them. A major focus of those algorithms has 

been on how to reduce the number of candidate records in the early stage of similarity 

query processing. One of the most commonly used techniques to reduce the candidate 

size is the prefix filtering principle, which exploits the document frequency ordering of 

tokens. In this paper, we propose a novel partitioning technique that considers multiple 

token orderings based on token co-occurrence statistics. Experimental results show 

that the proposed technique is effective in reducing the number of candidate records 

and as a result improves the performance of existing algorithms significantly. 
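
The prefix filtering principle referred to above can be sketched for Jaccard similarity as follows. This shows only the single-ordering baseline (tokens sorted by ascending document frequency), not the multiple-ordering partitioning proposed in the paper; the function names and the choice of Jaccard as the similarity measure are assumptions for illustration. With threshold t, a record x keeps a prefix of |x| - ceil(t·|x|) + 1 of its rarest tokens, and two records can reach similarity t only if their prefixes share a token.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


def prefix(tokens: Set[str], order: Dict[str, int], threshold: float) -> Set[str]:
    """Rarest-first prefix of length |x| - ceil(t * |x|) + 1."""
    ordered = sorted(tokens, key=lambda tok: order[tok])
    length = len(ordered) - math.ceil(threshold * len(ordered)) + 1
    return set(ordered[:length])


def similar_pairs(records: Dict[str, Set[str]], threshold: float
                  ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (candidate pairs after prefix filtering, verified matches)."""
    # Global token order: ascending document frequency, ties broken by token.
    df = Counter(tok for rec in records.values() for tok in rec)
    order = {tok: rank
             for rank, tok in enumerate(sorted(df, key=lambda t: (df[t], t)))}
    prefixes = {rid: prefix(rec, order, threshold)
                for rid, rec in records.items()}
    # Candidate generation: keep only pairs whose prefixes overlap.
    candidates = [(a, b) for a, b in combinations(sorted(records), 2)
                  if prefixes[a] & prefixes[b]]
    # Verification: compute the exact similarity for surviving pairs.
    matches = [(a, b) for a, b in candidates
               if jaccard(records[a], records[b]) >= threshold]
    return candidates, matches
```

Ordering by document frequency puts rare tokens in the prefix, so unrelated records seldom collide; a different token ordering yields different prefixes and therefore a different candidate set.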

Efficient Graph Similarity Joins with Edit Distance Constraints 

Xiang Zhao (The University of New South Wales & NICTA) 

Chuan Xiao (The University of New South Wales) 

Xuemin Lin (The University of New South Wales & East China Normal University) 

Wei Wang (The University of New South Wales) 

Graphs are widely used to model complicated data semantics in many applications 

in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend 

is to tolerate noise arising from various sources, such as erroneous data entry, and 

find similarity matches. In this paper, we study the graph similarity join problem that 

returns pairs of graphs such that their edit distances are no larger than a threshold. 

Inspired by the q-gram idea for string similarity problem, our solution extracts 

paths from graphs as features for indexing. We establish a lower bound of common 

features to generate candidates. An efficient algorithm is proposed to exploit both 

matching and mismatching features to improve the filtering and verification on candidates. 

We demonstrate the proposed algorithm significantly outperforms existing 

approaches with extensive experiments on publicly available datasets. 

Parameter-Free Determination of Distance Thresholds for Metric 

Distance Constraints 

Shaoxu Song (Tsinghua University) 

Lei Chen (The Hong Kong University of Science and Technology) 

Hong Cheng (The Chinese University of Hong Kong) 

The importance of introducing distance constraints to data dependencies, such as 

differential dependencies (DDs) [28], has recently been recognized. The metric distance 

constraints are tolerant to small variations, which enables them to apply to a wide 

range of data quality checking applications, such as detecting data violations. However, the 

determination of distance thresholds for the metric distance constraints is non-trivial. 

It often relies on a truth data instance which embeds the distance constraints. 

To find useful distance threshold patterns from data, there are several guidelines 

of statistical measures to specify, e.g., support, confidence and dependent quality. 

Unfortunately, given a data instance, users might not have any knowledge about 

the data distribution, thus it is very challenging to set the right parameters. In 

this paper, we study the determination of distance thresholds for metric distance 

constraints, in a parameter-free style. Specifically, we compute an expected utility 

based on the statistical measures from the data. According to our analysis as well 

as experimental verification, distance threshold patterns with higher expected 

utility could offer better usage in real applications, such as violation detection. We 

then develop efficient algorithms to determine the distance thresholds having the 

maximum expected utility. Finally, our extensive experimental evaluation demonstrates 

the effectiveness and efficiency of the proposed methods. 

Random Error Reduction in Similarity Search on Time Series: 

A Statistical Approach 

Wush Chi-Hsuan Wu (Academia Sinica) 

Mi-Yen Yeh (Academia Sinica) 

Jian Pei (Simon Fraser University) 

Errors in measurement can be categorized into two types: systematic errors that 

are predictable, and random errors that are inherently unpredictable and have null 

expected value. Random error is always present in a measurement. More often 

than not, readings in time series may contain inherent random errors due to causes 

like dynamic error, drift, noise, hysteresis, digitalization error and limited sampling 

frequency. Random errors may affect the quality of time series analysis substantially. 

Unfortunately, most of the existing time series mining and analysis methods, 

such as similarity search, clustering, and classification tasks, do not address random 

errors, possibly because random error in a time series, which can be modeled as 

a random variable of unknown distribution, is hard to handle. In this paper, we 

tackle this challenging problem. Taking similarity search as an example, which is an 

essential task in time series analysis, we develop MISQ, a statistical approach for 

random error reduction in time series analysis. The major intuition in our method is 

to use only the readings at different time instants in a time series to reduce random 

errors. We achieve a highly desirable property in MISQ: it can ensure that the recall 

is above a user-specified threshold. An extensive empirical study on 20 benchmark 

real data sets clearly shows that our method can lead to better performance than 

the baseline method without random error reduction in real applications such as 

classification. Moreover, MISQ achieves good quality in similarity search. 

SESSION 19: TEXT AND STRINGS 

Optimizing Statistical Information Extraction Programs Over 

Evolving Text 

Fei Chen (HP Labs China) 

Xixuan Feng (University of Wisconsin-Madison) 

Christopher Ré (University of Wisconsin-Madison) 

Min Wang (HP Labs China) 

Statistical information extraction (IE) programs are increasingly used to build real-world 

IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical 

IE approaches consider the text corpora underlying the extraction program to be 

static. However, many real-world text corpora are dynamic (documents are inserted, 

modified, and removed). As the corpus evolves, IE programs must be applied 

repeatedly to consecutive corpus snapshots to keep extracted information up to 

date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive 

snapshots may change very little, but unaware of this, the program must 

run again from scratch. In this paper, we present CRFlex, a system that efficiently 

executes such repeated statistical IE, by recycling previous IE results to enable incremental 

update. We focus on statistical IE programs which use a leading statistical 

model, Conditional Random Fields (CRFs). We show how to model properties 

of the CRF inference algorithms for incremental update and how to exploit them 

to correctly recycle previous inference results. Then we show how to efficiently 

capture and store intermediate results of IE programs for subsequent recycling. 

We find that there is a tradeoff between the I/O cost spent on reading and writing 

intermediate results, and CPU cost we can save from recycling those intermediate 

results. Therefore we present a cost-based solution to determine the most efficient 

recycling approach for any given CRF-based IE program and an evolving corpus. 

We present extensive experiments with CRF-based IE programs for 3 IE tasks over 

a real-world data set to demonstrate the utility of our approach. 

Approximate String Membership Checking: A Multiple Filter, 

Optimization-Based Approach 

Chong Sun (University of Wisconsin-Madison) 

Jeffrey F. Naughton (University of Wisconsin-Madison) 

Siddharth Barman (University of Wisconsin-Madison) 

We consider the approximate string membership checking (ASMC) problem of extracting 

all the strings or substrings in a document that approximately match some 

string in a given dictionary. To solve this problem, the current state-of-the-art approach 

involves first applying an approximate, fast filter, then applying a more expensive 

exact verification algorithm to the strings that pass the filter. Correspondingly, 

many string filters have been proposed. We note that different filters are good at 

eliminating different strings, depending on the characteristics of the strings in both 

the documents and the dictionary. We suspect that no single filter will dominate all 

other filters everywhere. Given an ASMC problem instance and a set of string filters, 

we need to select the optimal filter to maximize the performance. Furthermore, in 

our experiments we found that in some cases a sequence of filters dominates any 

of the filters of the sequence in isolation, and that the best set of filters and their 

ordering depend upon the specific problem instance encountered. Accordingly, we 

propose that the approximate match problem be viewed as an optimization problem, 

and evaluate a number of techniques for solving this optimization problem. 

On Text Clustering with Side Information 

Charu C. Aggarwal (IBM T. J. Watson Research Center) 

Yuchen Zhao (University of Illinois at Chicago) 

Philip S. Yu (University of Illinois at Chicago) 

Text clustering has become an increasingly important problem in recent years 

because of the tremendous amount of unstructured data which is available in various 

forms in online forums such as the web, social networks, and other information 

networks. In most cases, the data is not purely available in text form. A lot of side-information 

is available along with the text documents. Such side-information may be of 

different kinds, such as the links in the document, user-access behavior from web logs, 

or other non-textual attributes which are embedded into the text document. Such 

attributes may contain a tremendous amount of information for clustering purposes. 

However, the relative importance of this side-information may be difficult to estimate, 

especially when some of the information is noisy. In such cases, it can be risky to 

incorporate side-information into the clustering process, because it can either improve 

the quality of the representation for clustering, or can add noise to the process. Therefore, 

we need a principled way to perform the clustering process, so as to maximize 

the advantages from using this side information. In this paper, we design an algorithm 

which combines classical partitioning algorithms with probabilistic models in order to 

create an effective clustering approach. We present experimental results on a number 

of real data sets in order to illustrate the advantages of using such an approach. 

Fast SLCA and ELCA Computation for XML Keyword Queries based on 

Set Intersection 

Junfeng Zhou (Yanshan University) 

Zhifeng Bao (National University of Singapore) 

Wei Wang (The University of New South Wales) 

Tok Wang Ling (National University of Singapore) 

Ziyang Chen (Yanshan University) 

Xudong Lin (Yanshan University) 

Jingfeng Guo (Yanshan University) 

In this paper, we focus on efficient keyword query processing for XML data based 

on the SLCA and ELCA semantics. We propose a novel form of inverted lists for keywords 

which include IDs of nodes that directly or indirectly contain a given keyword. 

We propose a family of efficient algorithms that are based on the set intersection operation 

for both semantics. We show that the problem of SLCA/ELCA computation 

becomes finding a set of nodes that appear in all involved inverted lists and satisfy 

certain conditions. We also propose several optimization techniques to further improve 

the query processing performance. We have conducted extensive experiments 

with many alternative methods. The results demonstrate that our proposed methods 

outperform previous methods by up to two orders of magnitude in many cases. 
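
The reduction to set intersection can be illustrated with a small sketch for the SLCA case only. It assumes, purely for illustration, that nodes are identified by Dewey labels stored as tuples; each keyword's inverted list then holds every node that directly or indirectly contains the keyword (the node itself plus all its ancestors). Intersecting the lists gives the nodes whose subtrees contain all keywords, and the SLCA nodes are those with no child in that intersection. Function names are hypothetical, and the query is assumed to be non-empty.

```python
from typing import Dict, Iterable, List, Set, Tuple

Dewey = Tuple[int, ...]


def ancestors_or_self(node: Dewey) -> Set[Dewey]:
    """All prefixes of a Dewey label, i.e. the node and its ancestors."""
    return {node[:i] for i in range(1, len(node) + 1)}


def build_lists(node_keywords: Dict[Dewey, Iterable[str]]) -> Dict[str, Set[Dewey]]:
    """Inverted lists of nodes that directly or indirectly contain a keyword."""
    lists: Dict[str, Set[Dewey]] = {}
    for node, words in node_keywords.items():
        for word in words:
            lists.setdefault(word, set()).update(ancestors_or_self(node))
    return lists


def slca(lists: Dict[str, Set[Dewey]], query: List[str]) -> List[Dewey]:
    """Smallest LCA nodes for a non-empty keyword query."""
    # Nodes whose subtree contains every query keyword.
    common = set.intersection(*(lists.get(word, set()) for word in query))
    # The intersection is closed under ancestors, so a node has a
    # descendant in it exactly when it has a child in it.
    parents = {node[:-1] for node in common if len(node) > 1}
    return sorted(node for node in common if node not in parents)
```

Storing ancestors in the lists trades space for the ability to answer the query with set operations alone, without navigating the tree at query time.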

SESSION 20: QUERY PROCESSING II 

Optimization of Massive Pattern Queries by Dynamic 

Configuration Morphing 

Nikolay Laptev (University of California, Los Angeles) 

Carlo Zaniolo (University of California, Los Angeles) 

Complex pattern queries play a critical role in many applications that must efficiently 

search databases and data streams. Current techniques support the search 

for multiple patterns using deterministic or non-deterministic automata. In practice 

however, the static pattern representation does not fully utilize available system 

resources, subsequently suffering from poor performance. Therefore a low overhead 

auto-reconfigurable automaton is needed that optimizes pattern matching 

performance. In this paper, we propose a dynamic system that entails the efficient 

and reliable evaluation of a very large number of pattern queries on a resource-constrained 

system under changing stress-load. Our system prototype, Morpheus, precomputes 

several query pattern representations, named templates, which are then 

morphed into a required form during run-time. Morpheus uses templates to speed 

up dynamic automaton reconfiguration. Results from empirical studies confirm the 

benefits of our approach, with three orders of magnitude improvement achieved in 

the overall pattern matching performance with the help of dynamic reconfiguration. 

This is accomplished only with a modest increase in amortized memory usage. 

Three-level Processing of Multiple Aggregate Continuous Queries 

Shenoda Guirguis (University of Pittsburgh) 

Mohamed A. Sharaf (The University of Queensland) 

Panos K. Chrysanthis (University of Pittsburgh) 

Alexandros Labrinidis (University of Pittsburgh) 

Aggregate Continuous Queries (ACQs) are both a very popular class of Continuous 

Queries (CQs) and also have a potentially high execution cost. As such, optimizing 

the processing of ACQs is imperative for Data Stream Management Systems 

(DSMSs) to reach their full potential in supporting (critical) monitoring applications. 

For multiple ACQs that vary in window specifications and pre-aggregation filters, 

existing multiple ACQs optimization schemes assume a processing model where 

each ACQ is computed as a final-aggregation of a sub-aggregation. In this paper, 

we propose a novel processing model for ACQs, called TriOps, with the goal of 

minimizing the repetition of operator execution at the sub-aggregation level. We 

also propose TriWeave, a TriOps-aware multi-query optimizer. We analytically and 

experimentally demonstrate the performance gains of our proposed schemes which 

show their superiority over alternative schemes. Finally, we generalize TriWeave to 

incorporate the classical subsumption-based multi-query optimization techniques. 

Accelerating Range Queries For Brain Simulations 

Farhan Tauheed (EPFL) 

Laurynas Biveinis (Aalborg University) 

Thomas Heinis (EPFL) 

Felix Schürmann (EPFL) 

Henry Markram (EPFL) 

Anastasia Ailamaki (EPFL) 

Neuroscientists increasingly use computational tools in building and simulating 

models of the brain. The amounts of data involved in these simulations are immense 

and efficiently managing this data is key. One particular problem in analyzing this 

data is the scalable execution of range queries on spatial models of the brain. 

Known indexing approaches do not perform well even on today’s small models 

which represent a small fraction of the brain, containing only a few million densely 

packed spatial elements. The problem of current approaches is that with the increasing 

level of detail in the models, also the overlap in the tree structure increases, 

ultimately slowing down query execution. The neuroscientists’ need to work 

with bigger and more detailed (denser) models thus motivates us to develop a new 

indexing approach. To this end we develop FLAT, a scalable indexing approach for 

dense data sets. We base the development of FLAT on the key observation that 

current approaches suffer from overlap in case of dense data sets. We hence design 

FLAT as an approach with two phases, each independent of density. In the first 

phase it uses a traditional spatial index to retrieve an initial object efficiently. In the 

second phase it traverses the initial object’s neighborhood to retrieve the remaining 

query result. Our experimental results show that FLAT not only outperforms R-Tree 

variants by a factor of two to eight but that it also achieves independence 

from data set size and density. 

Keyword Query Reformulation on Structured Data 

Junjie Yao (Peking University) 

Bin Cui (Peking University) 

Liansheng Hua (Peking University) 

Yuxin Huang (Peking University) 

Textual web pages dominate web search engines nowadays. However, there is 

also a striking increase of structured data on the web. Efficient keyword query 

processing on structured data has attracted considerable attention, but effective query 

understanding has yet to be investigated. In this paper, we focus on the problem of 

keyword query reformulation in the structured data scenario. These reformulated 

queries provide alternative descriptions of original input. They could better capture 

users’ information need and guide users to explore related items in the target 

structured data. We propose an automatic keyword query reformulation approach 

by exploiting structural semantics in the underlying structured data sources. The 

reformulation solution is decomposed into two stages, i.e., offline term relation 

extraction and online query generation. We first utilize a heterogeneous graph to 

model the words and items in structured data, and design an enhanced Random 

Walk approach to extract relevant terms from the graph context. In the online query 

reformulation stage, we introduce an efficient probabilistic generation module to 

suggest substitutable reformulated queries. Extensive experiments are conducted 

on a real-life data set, and our approach yields promising results. 

Session 21: DATA MINING 

Predicting Approximate Protein-DNA Binding Cores Using 

Association Rule Mining 

Po-Yuen Wong (The Chinese University of Hong Kong) 

Tak-Ming Chan (The Chinese University of Hong Kong) 

Man-Hon Wong (The Chinese University of Hong Kong) 

Kwong-Sak Leung (The Chinese University of Hong Kong) 

The studies of protein-DNA bindings between transcription factors (TFs) and transcription 

factor binding sites (TFBSs) are important bioinformatics topics. High-resolution 

(length490) are shown promising in identifying 

accurate binding cores without using any 3D structures. While the current association 

rule mining method on this problem addresses exact sequences only, the most 

recent ad hoc method for approximation does not establish any formal model and is 

limited by experimentally known patterns. As biological mutations are common, it is 

desirable to formally extend the exact model into an approximate one. In this paper, 

we formalize the problem of mining approximate protein-DNA association rules 

from sequence data and propose a novel efficient algorithm to predict protein-DNA 

binding cores. Our two-phase algorithm first constructs two compact intermediate 

structures called frequent sequence tree (FS-Tree) and frequent sequence class tree 

(FSCTree). Approximate association rules are efficiently generated from the structures 

and bioinformatics concepts (position weight matrix and information content) 


are further employed to prune meaningless rules. Experimental results on real data 

show the performance and applicability of the proposed algorithm. 

Upgrading Uncompetitive Products Economically 

Hua Lu (Aalborg University) 

Christian S. Jensen (Aarhus University) 

The skyline of a multidimensional point set consists of the points that are not 

dominated by other points. In a scenario where product features are represented by 

multidimensional points, the skyline points may be viewed as representing competitive 

products. A product provider may wish to upgrade uncompetitive products to 

become competitive, but wants to take into account the upgrading cost. We study 

the top-k product upgrading problem. Given a set P of competitor products, a set 

T of products that are candidates for upgrade, and an upgrading cost function f 

that applies to T, the problem is to return the k products in T that can be upgraded 

to not be dominated by any products in P at the lowest cost. This problem is nontrivial 

due not only to the large data set sizes but also to the many possibilities for 

upgrading a product. We identify and provide solutions for the different options for 

upgrading an uncompetitive product, and combine the solutions into a single solution. 

We also propose a spatial join-based solution that assumes P and T are indexed 

by an R-tree. Given a set of products in the same R-tree node, we derive three lower 

bounds on their upgrading costs. These bounds are employed by the join approach 

to prune upgrade candidates with uncompetitive upgrade costs. Empirical studies 

with synthetic and real data show that the join approach is efficient and scalable. 
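
The dominance and skyline notions used in this abstract can be illustrated with a minimal Python sketch. It assumes that smaller values are better in every dimension (the abstract does not fix a direction), and the names `dominates`, `skyline`, and the sample `products` are hypothetical. It is a naive quadratic scan, not the R-tree join approach described above.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q: p is no worse in every dimension and
    strictly better in at least one (assumption: smaller is better)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical products described by (price, weight); lower is better for both.
products = [(1.0, 9.0), (3.0, 3.0), (4.0, 4.0), (9.0, 1.0), (5.0, 8.0)]
competitive = skyline(products)
uncompetitive = [p for p in products if p not in competitive]
```

In this sketch, the points left in `uncompetitive` correspond to the upgrade candidates: each is dominated by at least one skyline point.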

Attribute-Based Subsequence Matching and Mining 

Yu Peng (The Hong Kong University of Science and Technology) 

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology) 

Liangliang Ye (The Hong Kong University of Science and Technology) 


Sequence analysis is very important in our daily life. Typically, each sequence is 

associated with an ordered list of elements. For example, in a movie rental application, 

a customer’s movie rental record containing an ordered list of movies is a 

sequence example. Most studies about sequence analysis focus on subsequence 

matching which finds all sequences stored in the database such that a given query 

sequence is a subsequence of each of these sequences. In many applications, 

elements are associated with properties or attributes. For example, each movie is 

associated with some attributes like “Director” and “Actors”. Unfortunately, to the 

best of our knowledge, all existing studies about sequence analysis do not consider 

the attributes of elements. In this paper, we propose two problems. The first problem 

is: given a query sequence and a set of sequences, considering the attributes of 

elements, we want to find all sequences which are matched by this query sequence. 

This problem is called attribute-based subsequence matching (ASM). All existing 

applications for the traditional subsequence matching problem can also be applied 

to our new problem provided that we are given the attributes of elements. We propose 

an efficient algorithm for problem ASM. The key idea to the efficiency of this 

algorithm is to compress each whole sequence with potentially many associated 

attributes into just a triplet of numbers. By dealing with these very compressed 

representations, we greatly speed up the attribute-based subsequence matching. The 

second problem is to find all frequent attribute-based subsequences. We also adapt 

an existing efficient algorithm for this second problem to show we can use the algorithm 

developed for the first problem. Empirical studies show that our algorithms 

scale to large datasets. In particular, our algorithms run at least an order of 

magnitude faster than a straightforward method in most cases. This work can stimulate 

a number of existing data mining problems which are fundamentally based on 

subsequence matching such as sequence classification, frequent sequence mining, 

motif detection and sequence matching in bioinformatics. 
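
The traditional subsequence matching problem that this abstract starts from can be sketched in a few lines of Python. This is only the baseline problem (find every stored sequence that contains the query as a subsequence); it ignores element attributes and does not implement the triplet compression of the ASM algorithm. The function names and the sample rental records are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def is_subsequence(query: Sequence[Hashable], sequence: Sequence[Hashable]) -> bool:
    """Return True if the elements of query occur in sequence in the same
    order, not necessarily contiguously."""
    remaining = iter(sequence)
    # `element in remaining` consumes the iterator up to the first match,
    # so later query elements can only match later positions.
    return all(element in remaining for element in query)


def match_sequences(query: Sequence[Hashable],
                    database: Dict[str, List[Hashable]]) -> List[str]:
    """Return the keys of all stored sequences that contain query as a subsequence."""
    return [key for key, seq in database.items() if is_subsequence(query, seq)]


# Hypothetical movie rental records: customer -> ordered list of rented movies.
rentals = {
    "alice": ["Alien", "Brazil", "Casablanca"],
    "bob": ["Brazil", "Alien"],
    "carol": ["Alien", "Dune", "Casablanca"],
}
matches = match_sequences(["Alien", "Casablanca"], rentals)
```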

Integrating Frequent Pattern Mining from Multiple Data Domains 

for Classification 

Dhaval Patel (National University of Singapore) 

Wynne Hsu (National University of Singapore) 

Mong Li Lee (National University of Singapore) 

Many frequent pattern mining algorithms have been developed for categorical, 

numerical, time series, or interval data. However, little attention has been given to 

integrating these algorithms so as to mine frequent patterns from multiple domain 

datasets for classification. In this paper, we introduce the notion of a heterogeneous 

pattern to capture the associations among different kinds of data. We propose a 

unified framework for mining multiple domain datasets and design an iterative algorithm 

called HTMiner. HTMiner discovers essential heterogeneous patterns for classification 

and performs instance elimination. This instance elimination step reduces 

the problem size progressively by removing training instances which are correctly 

covered by the discovered essential heterogeneous pattern. Experiments on two real 

world datasets show that HTMiner is efficient and can significantly improve the 

classification accuracy. 

Session 22: SCIENTIFIC DATA, ANALYSIS AND VISUALIZATION 

Efficient Versioning for Scientific Array Databases 

Adam Seering (MIT CSAIL) 

Philippe Cudre-Mauroux (University of Fribourg) 

Samuel Madden (MIT CSAIL) 

Michael Stonebraker (MIT CSAIL) 

In this paper, we describe a versioned database storage manager we are developing 

for the SciDB scientific database. The system is designed to efficiently store and 

retrieve array-oriented data, exposing a “no-overwrite” storage model in which 

each update creates a new “version” of an array. This makes it possible to perform 

comparisons of versions produced at different times or by different algorithms, and 

to create complex chains and trees of versions. We present algorithms to efficiently 

encode these versions, minimizing storage space while still providing efficient access 

to the data. Additionally, we present an optimal algorithm that, given a long 

sequence of versions, determines which versions to encode in terms of each other 

(using delta compression) to minimize total storage space or query execution cost. 


We compare the performance of these algorithms on real world data sets from the 

National Oceanic and Atmospheric Administration (NOAA), OpenStreetMaps, and 

several other sources. We show that our algorithms provide better performance 

than existing version control systems not optimized for array data, both in terms of 

storage size and access time, and that our delta-compression algorithms are able to 

substantially reduce the total storage space when versions exist with a high degree 

of similarity. 

Multidimensional Analysis of Atypical Events in Cyber-Physical Data 

Lu-An Tang (UIUC) 

Xiao Yu (UIUC) 

Sangkyum Kim (UIUC) 

Jiawei Han (UIUC) 

Wen-Chih Peng (National Chiao Tung University) 

Yizhou Sun (UIUC) 

Hector Gonzalez (Google) 

Sebastian Seith (Morning Star) 

A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) 

with cyber (or informational) components to form a situation-integrated analytical 

system that may respond intelligently to dynamic changes of the real-world situations. 

CPS claims many promising applications, such as traffic observation, battlefield 

surveillance and sensor-network-based monitoring. One important research 

topic in CPS is atypical event analysis, i.e., retrieving the events from 

large amounts of data and analyzing them with spatial, temporal and other multidimensional 

information. Many traditional approaches are not feasible for such 

analysis since they use numeric measures and cannot describe the complex atypical 

events. In this study, we propose a new model of atypical cluster to effectively 

represent those events and efficiently retrieve them from massive data. The micro-cluster 

is designed to summarize individual events, and the macro-cluster is used 

to integrate the information from multiple events. To facilitate scalable, flexible and 

online analysis, the concept of significant cluster is defined and a guided clustering 

algorithm is proposed to retrieve significant clusters in an efficient manner. We 

conduct experiments on real datasets of more than 50 GB in size; the results 

show that the proposed method can provide more accurate information at only 

15% to 20% of the time cost of the baselines. 

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking 

Fabian Keller (Karlsruhe Institute of Technology) 


Klemens Böhm (Karlsruhe Institute of Technology) 

Outlier mining is a major task in data analysis. Outliers are objects that highly deviate 

from regular objects in their local neighborhood. Density-based outlier ranking 

methods score each object based on its degree of deviation. In many applications, 

these ranking methods degenerate to random listings due to low contrast between 

outliers and regular objects. Outliers do not show up in the scattered full space; 

they are hidden in multiple high contrast subspace projections of the data. Measuring 

the contrast of such subspaces for outlier rankings is an open research challenge. 

In this work, we propose a novel subspace search method that selects high 

contrast subspaces for density-based outlier ranking. It is designed as a pre-processing 

step to outlier ranking algorithms. It searches for high contrast subspaces with 

a significant amount of conditional dependence among the subspace dimensions. 

With our approach, we propose a first measure for the contrast of subspaces. Thus, 

we enhance the quality of traditional outlier rankings by computing outlier scores in 

high contrast projections only. The evaluation on real and synthetic data shows that 

our approach outperforms traditional dimensionality reduction techniques, naive 

random projections as well as state-of-the-art subspace search techniques and 

provides enhanced quality for outlier ranking. 

Extracting, Analyzing and Visualizing Triangle K-Core Motifs 

within Networks 

Yang Zhang (The Ohio State University) 

Srinivasan Parthasarathy (The Ohio State University) 

Cliques are topological structures that usually provide important information 

for understanding the structure of a graph or network. However, detecting and 

extracting cliques efficiently is known to be very hard. In this paper, we define and 

introduce the notion of a Triangle K-Core, a simpler topological structure and one 

that is more tractable and can moreover be used as a proxy for extracting clique-like 

structure from large graphs. Based on this definition we first develop a localized 

algorithm for extracting Triangle K-Cores from large graphs. Subsequently we 

extend the simple algorithm to accommodate dynamic graphs (where edges can 

be dynamically added and deleted). Finally, we extend the basic definition to support 

various template pattern cliques with applications to network visualization and 

event detection on graphs and networks. Our empirical results reveal the efficiency 

and efficacy of the proposed methods on many real world datasets. 

Session 23: SIMILARITY SEARCH AND DETECTION 

Horizontal Reduction: Instance-Level Dimensionality Reduction for 

Similarity Search in Large Document Databases 

Min Soo Kim (KAIST) 

Kyu-Young Whang (KAIST) 

Yang-Sae Moon (Kangwon National University) 

Dimensionality reduction is essential in text mining since the dimensionality of text 

documents could easily reach several tens of thousands. Most recent efforts on 

dimensionality reduction, however, are not adequate for large document databases 

due to lack of scalability. We hence propose a new type of simple but effective 

dimensionality reduction, called horizontal (dimensionality) reduction, for large 

document databases. Horizontal reduction converts each text document to a few 

bitmap vectors and provides tight lower bounds of inter-document distances using 

those bitmap vectors. Bitmap representation is very simple and extremely fast, and 

its instance-based nature makes it suitable for large and dynamic document databases. 

Using the proposed horizontal reduction, we develop an efficient k-nearest 

neighbor (k-NN) search algorithm for text mining such as classification and clustering, 

and we formally prove its correctness. The proposed algorithm decreases I/O 


and CPU overheads simultaneously since horizontal reduction (1) reduces the number 

of accesses to documents significantly by exploiting the bitmap-based lower 

bounds in filtering dissimilar documents at an early stage, and accordingly, (2) 

decreases the number of CPU-intensive computations for obtaining a real distance 

between high-dimensional document vectors. Extensive experimental results show 

that horizontal reduction improves the performance of the reduction (preprocessing) 

process by one to two orders of magnitude compared with existing reduction 

techniques, and our k-NN search algorithm significantly outperforms the existing 

ones by one to three orders of magnitude. 

Adaptive Windows for Duplicate Detection 

Uwe Draisbach (Hasso-Plattner-Institute) 

Felix Naumann (Hasso-Plattner-Institute) 

Sascha Szott (Zuse Institute) 

Oliver Wonneberg (R. Lindner GmbH & Co. KG) 

Duplicate detection is the task of identifying all groups of records within a data set 

that represent the same real-world entity, respectively. This task is difficult, because 

(i) representations might differ slightly, so some similarity measure must be defined 

to compare pairs of records and (ii) data sets might have a high volume making a 

pair-wise comparison of all records infeasible. To tackle the second problem, many 

algorithms have been suggested that partition the data set and compare all record 

pairs only within each partition. One well-known such approach is the Sorted Neighborhood 

Method (SNM), which sorts the data according to some key and then advances 

a window over the data comparing only records that appear within the same 

window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that 

uses a varying window size. It is based on the intuition that there might be regions of 

high similarity suggesting a larger window size and regions of lower similarity suggesting 

a smaller window size. Next to the basic variant of DCS, we also propose and 

thoroughly evaluate a variant called DCS++ which is provably better than the original 

SNM in terms of efficiency (same results with fewer comparisons). 
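
The fixed-window Sorted Neighborhood Method that this abstract builds on can be sketched as follows. This is a minimal illustration of basic SNM only, not of DCS or DCS++, whose window-adaptation rules are not given here. The function name, the sorting key, the similarity threshold of 0.85, and the sample records are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Set, Tuple


def sorted_neighborhood(records: List[str],
                        key: Callable[[str], str],
                        is_duplicate: Callable[[str, str], bool],
                        window: int) -> Set[Tuple[int, int]]:
    """Basic SNM: sort by key, slide a fixed-size window over the sorted
    order, and compare only records that fall inside the same window."""
    if window < 2:
        raise ValueError("window must cover at least two records")
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: Set[Tuple[int, int]] = set()
    for pos, i in enumerate(order):
        # Each record is compared with the next (window - 1) records.
        for j in order[pos + 1:pos + window]:
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def similar(a: str, b: str) -> bool:
    """Hypothetical similarity measure: string ratio above an assumed threshold."""
    return SequenceMatcher(None, a, b).ratio() >= 0.85


names = ["Jon Smith", "Ann Lee", "John Smith", "Anne Lee", "Bob Ray"]
duplicates = sorted_neighborhood(names, key=str.lower, is_duplicate=similar, window=2)
```

With `window=2` only adjacent records in the sorted order are compared, so the number of comparisons grows linearly with the data set size instead of quadratically.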

Efficient Dual-Resolution Layer Indexing for Top-k Queries 

Jongwuk Lee (Pohang University of Science and Technology (POSTECH)) 

Hyunsouk Cho (Pohang University of Science and Technology (POSTECH)) 

Seung-won Hwang (Pohang University of Science and Technology (POSTECH)) 

Top-k queries have gained considerable attention as an effective means for narrowing 

down the overwhelming amount of data. This paper studies the problem 

of constructing an indexing structure that efficiently supports top-k queries for 

varying scoring functions and retrieval sizes. The existing work can be categorized 

into three classes: list-, layer-, and view-based approaches. This paper focuses on 

the layer-based approach, pre-materializing tuples into consecutive multiple layers. 

The layer-based index enables us to return top-k answers efficiently by restricting 

access to tuples in the k layers. However, we observe that the number of tuples 

accessed in each layer can be reduced further. For this purpose, we propose a dual-resolution 

layer structure. Specifically, we iteratively build coarse-level layers using 

skylines, and divide each coarse-level layer into fine-level sublayers using convex 

skylines. The dual-resolution layer is able to leverage not only the dominance 

relationship between coarse-level layers, named forall-dominance, but also a relaxed 

dominance relationship between fine-level sublayers, named exists-dominance. Our 

extensive evaluation results demonstrate that our proposed method significantly 

reduces the number of tuples accessed compared with state-of-the-art methods. 

Evaluating Probabilistic Queries over Uncertain Matching 

Reynold Cheng (The University of Hong Kong) 

Jian Gong (The University of Hong Kong) 

David W. Cheung (The University of Hong Kong) 

Jiefeng Cheng (Shenzhen Institute of Advanced Technology) 

A matching between two database schemas, generated by machine learning 

techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema 

matching has recently raised a lot of research interest, because the quality of applications 

relies on the matching result. We study query evaluation over an inexact 

schema matching, which is represented as a set of “possible mappings”, as well 

as the probabilities that they are correct. Since the number of possible mappings 

can be large, evaluating queries through these mappings can be expensive. By 

observing the fact that the possible mappings between two schemas often exhibit 

a high degree of overlap, we develop two efficient solutions. We also present a fast 

algorithm to compute answers with the k highest probabilities. An extensive evaluation 

on real schemas shows that our approaches improve the query performance by 

almost an order of magnitude. 

Session 24: SENSORS NETWORK AND TRAJECTORY 

Detecting Outliers in Sensor Networks using the Geometric Approach 

Sabbas Burdakis (Technical University of Crete) 

Antonios Deligiannakis (Technical University of Crete) 

The topic of outlier detection in sensor networks has received significant attention 

in recent years. Detecting when the measurements of a node become “abnormal” 

is interesting, because this event may help detect either a malfunctioning node, or a 

node that starts observing a local interesting phenomenon (e.g., a fire). In this paper 

we present a new algorithm for detecting outliers in sensor networks, based on the 

geometric approach. Unlike prior work, our algorithms perform a distributed monitoring 

of outlier readings, exhibit 100% accuracy in their monitoring (assuming no 

message losses), and require the transmission of messages only at a fraction of the 

epochs, thus allowing nodes to safely refrain from transmitting in many epochs. Our 

approach is based on transforming common similarity metrics in a way that admits 

the application of the recently proposed geometric approach. We then propose 

a general framework and suggest multiple modes of operation, which allow each 

sensor node to accurately monitor its similarity to other nodes. Our experiments 

demonstrate that our algorithms can accurately detect outliers at a fraction of the 

communication cost that a centralized approach would require (even in the case 

where the central node lies just one hop away from all sensor nodes). Moreover, we 

demonstrate that these bandwidth savings become even larger as we incorporate 

further optimizations in our proposed modes of operation. 


Efficient Threshold Monitoring for Distributed Probabilistic Data 

Mingwang Tang (University of Utah) 

Feifei Li (University of Utah) 

Jeff M. Phillips (University of Utah) 

Jeffrey Jestes (University of Utah) 


In distributed data management, a primary concern is monitoring the distributed 

data and generating an alarm when a user-specified constraint is violated. A particularly 

useful instance is the threshold based constraint, which is commonly known 

as the distributed threshold monitoring problem. This work extends this useful and 

fundamental study to distributed probabilistic data that emerge in a lot of applications, 

where uncertainty naturally exists when massive amounts of data are 

produced at multiple sources in distributed, networked locations. Examples include 

distributed observing stations, large sensor fields, geographically separate scientific 

institutes/units and many more. When dealing with probabilistic data, there 

are two thresholds involved, the score and the probability thresholds. One must 

monitor both simultaneously; as such, techniques developed for deterministic data 

are no longer directly applicable. This work presents a comprehensive study of this 

problem. Our algorithms have significantly outperformed the baseline method in 

terms of both the communication cost (number of messages and bytes) and the 

running time, as shown by an extensive experimental evaluation using several large 

real datasets. 

Incorporating Duration Information for Trajectory Classification 

Dhaval Patel (National University of Singapore) 

Chang Sheng (DBS Bank) 

Wynne Hsu (National University of Singapore) 

Mong Li Lee (National University of Singapore) 

Trajectory classification has many useful applications. Existing works on trajectory 

classification do not consider the duration information of trajectories. In this 

paper, we extract duration-aware features from trajectories to build a classifier. Our 

method utilizes information theory to obtain regions where the trajectories have 

similar speeds and directions. Further, trajectories are summarized into a network 

based on the MDL principle that takes into account the duration difference among 

trajectories of different classes. A graph traversal is performed on this trajectory 

network to obtain the top-k covering path rules for each trajectory. Based on the 

discovered regions and top-k path rules, we build a classifier to predict the class 

labels of new trajectories. Experimental results on real-world datasets show that the 

proposed duration-aware classifier can obtain higher classification accuracy than 

the state-of-the-art trajectory classifier. 



Reducing Uncertainty of Low-Sampling-Rate Trajectories 

Kai Zheng (The University of Queensland) 

Yu Zheng (Microsoft Research Asia) 

Xing Xie (Microsoft Research Asia) 

Xiaofang Zhou (The University of Queensland) 

The increasing availability of GPS-embedded mobile devices has given rise to a new 

spectrum of location-based services, which have accumulated a huge collection of 

location trajectories. In practice, a large portion of these trajectories have a low sampling rate. 

For instance, the time interval between consecutive GPS points of 

some trajectories can be several minutes or even hours. With such a low sampling 

rate, most details of their movement are lost, which makes them difficult to process 

effectively. In this work, we investigate how to reduce the uncertainty in such 

trajectories. Specifically, given a low-sampling-rate trajectory, we aim to infer its 

possible routes. The methodology adopted in our work is to take full advantage 

of the rich information extracted from the historical trajectories. We propose a 

systematic solution, History based Route Inference System (HRIS), which covers a 

series of novel algorithms that can derive the travel pattern from historical data and 

incorporate it into the route inference process. To validate the effectiveness of the 

system, we apply our solution to the map-matching problem which is an important 

application scenario of this work, and conduct extensive experiments on a real 

taxi trajectory dataset. The experimental results demonstrate that HRIS can achieve 

higher accuracy than the existing map-matching algorithms for low-sampling-rate 

trajectories. 

Session 25: ERROR REDUCTION AND DATA SECURITY 

Efficient Similarity Search over Encrypted Data 

Mehmet Kuzu (The University of Texas at Dallas) 

Mohammad Saiful Islam (The University of Texas at Dallas) 

Murat Kantarcioglu (The University of Texas at Dallas) 

In recent years, due to the appealing features of cloud computing, large amounts 

of data have been stored in the cloud. Although cloud based services offer many 

advantages, privacy and security of the sensitive data is a big concern. To mitigate 

the concerns, it is desirable to outsource sensitive data in encrypted form. Encrypted 

storage protects the data against illegal access, but it complicates some basic, 

yet important functionality such as the search on the data. To achieve search over 

encrypted data without compromising privacy, a considerable number of searchable 

encryption schemes have been proposed in the literature. However, almost all 

of them handle exact query matching but not similarity matching, a crucial requirement 

for real world applications. Although some sophisticated secure multi-party 

computation based cryptographic techniques are available for similarity tests, they 

are computationally intensive and do not scale for large data sources. In this paper, 

we propose an efficient scheme for similarity search over encrypted data. To do so, 

we utilize a state-of-the-art algorithm for fast near neighbor search in high dimensional 

spaces called locality sensitive hashing. To ensure the confidentiality of the 

sensitive data, we provide a rigorous security definition and prove the security of 

the proposed scheme under the provided definition. In addition, we provide a real 


world application of the proposed scheme and verify the theoretical results with 

empirical observations on a real dataset. 

Obfuscating the Topical Intention in Enterprise Text Search 

HweeHwa Pang (Singapore Management University) 

Xiaokui Xiao (Nanyang Technological University) 

Jialie Shen (Singapore Management University) 

The text search queries in an enterprise can reveal the users’ topic of interest, and 

in turn confidential staff or business information. To safeguard the enterprise from 

consequences arising from a disclosure of the query traces, it is desirable to obfuscate 

the true user intention from the search engine, without requiring it to be reengineered. 

In this paper, we advocate a unique approach to profile the topics that 

are relevant to the user intention. Based on this approach, we introduce an (ε1, ε2)-privacy 

model that allows a user to stipulate that topics relevant to her 

intention at ε1 level should appear to any adversary to be innocuous at ε2 

level. We then present a TopPriv algorithm to achieve the customized (ε1, ε2)-privacy 

requirement of individual users through injecting automatically 

formulated fake queries. The advantages of TopPriv over existing techniques are 

confirmed through benchmark queries on a real corpus, with experiment settings 

fashioned after an enterprise search application. 

Correlation Support for Risk Evaluation in Databases 

Katrin Eisenreich (SAP Research) 

Jochen Adamek (Technische Universität Berlin) 

Philipp Rösch (SAP Research) 

Volker Markl (Technische Universität Berlin) 

Gregor Hackenbroich (SAP Research) 

Investigating potential dependencies in data and their effect on future business 

developments can help experts to prevent misestimations of risks and chances. This 

makes correlation a highly important factor in risk analysis tasks. Previous research 

on correlation in uncertain data management has mainly addressed the handling of 

dependencies between discrete rather than continuous distributions. Also, none of 

the existing approaches provides a clear method for extracting correlation structures 

from data and introducing assumptions about correlation to independently 

represented data. To enable risk analysis under correlation assumptions, we use 

an approximation technique based on copula functions. This technique enables 

analysts to introduce arbitrary correlation structures between arbitrary distributions 

and calculate relevant measures over thus correlated data. The correlation information 

can either be extracted at runtime from historic data or be accessed from a 

parametrically precomputed structure. We discuss the construction, application and 

querying of approximate correlation representations for different analysis tasks. Our 

experiments demonstrate the efficiency and accuracy of the proposed approach, 

and point out several possibilities for optimization. 



A Game-Theoretic Approach for High-Assurance of Data Trustworthiness 

in Sensor Networks 

Hyo-Sang Lim (Purdue University & Computer and Telecommunications Engineering 

Division, South Korea) 

Gabriel Ghinita (University of Massachusetts at Boston) 

Elisa Bertino (Purdue University) 

Murat Kantarcioglu (University of Texas at Dallas) 

Sensor networks are being increasingly deployed in many application domains 

ranging from environment monitoring to supervising critical infrastructure systems 

(e.g., the power grid). Due to their ability to continuously collect large amounts of 

data, sensor networks represent a key component in decision-making, enabling 

timely situation assessment and response. However, sensors deployed in hostile environments 

may be subject to attacks by adversaries who intend to inject false data 

into the system. In this context, data trustworthiness is an important concern, as 

false readings may result in wrong decisions with serious consequences (e.g., large-scale 

power outages). To defend against this threat, it is important to establish trust 

levels for sensor nodes and adjust node trustworthiness scores to account for malicious 

interferences. In this paper, we develop a game-theoretic defense strategy 

to protect sensor nodes from attacks and to guarantee a high level of trustworthiness 

for sensed data. We use a discrete time model, and we consider that there is a 

limited attack budget that bounds the capability of the attacker in each round. The 

defense strategy objective is to ensure that sufficient sensor nodes are protected in 

each round such that the discrepancy between the value accepted and the truthful 

sensed value is below a certain threshold. We model the attack-defense interaction 

as a Stackelberg game, and we derive the Nash equilibrium condition that is sufficient 

to ensure that the sensed data are truthful within a nominal error bound. We 

implement a prototype of the proposed strategy and we show through extensive 

experiments that our solution provides an effective and efficient way of protecting 

sensor networks from attacks. 

Industrial Session 1:

SUPPORT FOR LARGE SCALE DATA ANALYTICS

Exploiting Common Subexpressions for Cloud Query Processing 

Yasin N. Silva (Arizona State University)

Per-Ake Larson (Microsoft Research)

Jingren Zhou (Microsoft Corp.)

Many companies now routinely run massive data analysis jobs – expressed in some 

scripting language – on large clusters of low-end servers. Many analysis scripts are 

complex and contain common subexpressions, that is, intermediate results that are 

subsequently joined and aggregated in multiple different ways. Applying conventional 

optimization techniques to such scripts will produce plans that execute a 

common subexpression multiple times, once for each consumer, which is clearly 

wasteful. Moreover, different consumers may have different physical requirements 

on the result: one consumer may want it partitioned on a column A and another 

one partitioned on column B. To find a truly optimal plan, the optimizer must trade 

off such conflicting requirements in a cost-based manner. In this paper we show 

how to extend a Cascade-style optimizer to correctly optimize scripts containing 

common subexpressions. The approach has been prototyped in SCOPE, Microsoft’s

system for massive data analysis. Experimental analysis of both simple and large 

real-world scripts shows that the extended optimizer produces plans with 21 to 57% 

lower estimated costs. 

Vectorwise: a Vectorized Analytical DBMS 

Marcin Zukowski (Actian Netherlands) 

Mark van de Wiel (Actian Corp.)

Peter Boncz (CWI)

Vectorwise is a new entrant in the analytical database marketplace whose technology

comes straight from innovations in the database research community in the past 

years. The product has since made waves due to its excellent performance in analytical 

customer workloads as well as benchmarks. We describe the history of Vectorwise, as

well as its basic architecture and the experiences in turning a technology developed in 

an academic context into a commercial-grade product. Finally, we turn our attention to 

recent performance results, most notably on the TPC-H benchmark at various sizes.

Scalable and Numerically Stable Descriptive Statistics in SystemML 

Yuanyuan Tian (IBM Almaden Research Center)

Shirish Tatikonda (IBM Almaden Research Center)

Berthold Reinwald (IBM Almaden Research Center)

There has been a growing need for applying machine learning (ML) algorithms on

very large datasets. SystemML is a declarative approach to scalable statistical ML.

In SystemML, statistical ML algorithms are expressed as simple scripts in a high-level

language. SystemML then compiles and optimizes the scripts, and eventually

translates them into efficient runtime on MapReduce. As the basis of virtually 

every quantitative analysis, descriptive statistics provide powerful tools to explore 

data in SystemML. This paper describes our experience in implementing descriptive 

statistics in SystemML. In particular, we elaborate on how to overcome the two 

major challenges: (1) numerical stability while operating on large datasets in the 

distributed setting of MapReduce; (2) efficient implementation of order statistics in 

MapReduce. 
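
To illustrate the first challenge, the following is a minimal Python sketch of a mergeable, numerically stable mean and variance computation. It is not taken from SystemML: the `Moments` class and `summarize` helper are hypothetical names, and the update and merge rules are the well-known incremental (Welford-style) and pairwise-combination formulas, assumed here as one plausible way to combine per-partition results in a MapReduce-like setting.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Partial summary of a partition: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of (x - mean)^2 over the values seen so far

    def add(self, x: float) -> None:
        # Incremental update; avoids forming a large sum of squares.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Moments") -> "Moments":
        # Combine two partial summaries, as a reducer might do.
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    def variance(self) -> float:
        # Sample variance; defined as 0.0 here when fewer than two values exist.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(values) -> Moments:
    """Map-side step (hypothetical): summarize one partition."""
    acc = Moments()
    for x in values:
        acc.add(x)
    return acc


# Two partitions with a large common offset, where the naive
# sum-of-squares formula would lose precision.
left = summarize([1e9 + 4, 1e9 + 7])
right = summarize([1e9 + 13, 1e9 + 16])
merged = left.merge(right)
print(merged.n, merged.mean, merged.variance())
```

Because each partition is reduced to three numbers that can be merged in any order, a summary of this kind fits a combiner/reducer pattern; whether SystemML uses this particular formulation is not stated in the abstract.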

EVOLVING PLATFORMS FOR NEW APPLICATIONS

Earlybird: Real-Time Search at Twitter 

Michael Busch (Twitter) 

Krishna Gade (Twitter) 

Brian Larson (Twitter) 

Patrick Lok (Twitter) 

Samuel Luckenbill (Twitter) 

Jimmy Lin (Twitter) 

The web today is increasingly characterized by social and real-time signals, which 

we believe represent two frontiers in information retrieval. In this paper, we present 

Earlybird, the core retrieval engine that powers Twitter’s real-time search service. 

Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval 

engines, its index structures differ from those built to support traditional web 

search. We describe these differences and present the rationale behind our design. 

A key requirement of real-time search is the ability to ingest content rapidly and 

make it searchable immediately, while concurrently supporting low-latency, high-throughput

query evaluation. These demands are met with a single-writer, multiple-reader

concurrency model and the targeted use of memory barriers. Earlybird represents 

a point in the design space of real-time search engines that has worked well 

for Twitter’s needs. By sharing our experiences, we hope to spur additional interest 

and innovation in this exciting space. 

Data Infrastructure at LinkedIn 

LinkedIn Data Infrastructure Team

LinkedIn is among the largest social networking sites in the world. As the company 

has grown, our core data sets and request processing requirements have grown as 

well. In this paper, we describe a few selected data infrastructure projects at LinkedIn 

that have helped us accommodate this increasing scale. Most of those projects 

build on existing open source projects and are themselves available as open source. 

The projects covered in this paper include: (1) Voldemort: a scalable and fault tolerant 

key-value store; (2) Databus: a framework for delivering database changes to 

downstream applications; (3) Espresso: a distributed data store that supports flexible 

schemas and secondary indexing; (4) Kafka: a scalable and efficient messaging 

system for collecting various user activity events and log data. 

The Credit Suisse Meta-data Warehouse 

Claudio Jossen (Credit Suisse AG)

Lukas Blunschi (ETH Zurich)

Magdalini Mori (Credit Suisse AG)

Donald Kossmann (ETH Zurich)

Kurt Stockinger (Credit Suisse AG)

This paper describes the meta-data warehouse of Credit Suisse that has been in production

since 2009. Like most other large organizations, Credit Suisse has a complex 

application landscape and several data warehouses in order to meet the information 

needs of its users. The problem addressed by the meta-data warehouse is to 

increase the agility and flexibility of the organization with regard to changes such

as the development of a new business process, a new business analytics report, or 

the implementation of a new regulatory requirement. The meta-data warehouse 

supports these changes by providing services to search for information items in 

the data warehouses and to extract the lineage of information items. One difficulty 

in the design of such a meta-data warehouse is that there is no standard or well-known

meta-data model that can be used to support such search services. Instead, 

the meta-data structures need to be flexible themselves and evolve with the changing 

IT landscape. This paper describes the current data structures and implementation 

of the Credit Suisse meta-data warehouse and shows how its services help to 

increase the flexibility of the whole organization. A series of example meta-data 

structures, use cases, and screenshots are given in order to illustrate the concepts 

used and the lessons learned based on feedback of real business and IT users 

within Credit Suisse. 


INDEXING, UPDATES AND PROCESSING

Efficient Support of XQuery Update Facility in XML Enabled RDBMS 

Zhen Hua Liu (Oracle)

Hui J. Chang (Oracle)

Balasubramanyam Sthanikam (Oracle)

XQuery Update Facility (XQUF), which provides a declarative way of updating

XML, has become a W3C recommendation. The SQL/XML standard, on the other

hand, defines XMLType as a column data type in an RDBMS environment and defines

standard SQL/XML operators, such as XMLQuery(), to embed XQuery for querying

XMLType columns in an RDBMS. Based on this SQL/XML standard, XML-enabled

RDBMSs have become industrial-strength platforms for hosting XML applications in a

standards-compliant way by providing XML store and query capability. However,

support for updating XML remained proprietary in RDBMSs until XQUF became

a recommendation. XQUF is agnostic of how XML is stored, so propagation

of the actual update to any persistent XML store is beyond the scope of XQUF. In this

paper, we show how XQUF can be incorporated into XMLQuery() to effectively

update XML stored in an XMLType column in an XML-enabled RDBMS,

such as Oracle XMLDB. We present various compile-time and run-time optimisation

techniques to show how XQUF can be efficiently implemented to declaratively

update XML stored in an RDBMS. We present our approaches to optimising XQUF

for common physical XML storage models: the native binary XML storage model and

the relational decomposition of XML storage model. Although our study is done using

Oracle XMLDB, all of the presented optimisation techniques are generic to XML

stores that need to support updates of persistent XML and are not specific to

the Oracle XMLDB implementation.

Making Unstructured Data SPARQL Using Semantic Indexing in Oracle 

Database 

Souripriya Das (Oracle)

Seema Sundara (Oracle)

Matthew Perry (Oracle)

Jagannathan Srinivasan (Oracle)

Jayanta Banerjee (Oracle)

Aravind Yalamanchi (Oracle)

This paper describes the Semantic Indexing feature introduced in Oracle Database 

for indexing unstructured text (document) columns. This capability enables searching 

for concepts (such as people, places, organizations, and events), in addition to 

words or phrases, with further options for sense disambiguation and term expansion 

by consulting knowledge captured in OWL/RDF ontologies. The distinguishing 

aspects of our approach are: 1) Indexing: Instead of building a traditional inverted 

index of (annotated) token and/or named entity occurrences, we extract the entities, 

associations, and events present in the text column data and store them as RDF

named graphs in the Oracle Database Semantic Store. This base content can be 

further augmented with knowledge bases and inferred triples (obtained by applying 

domain-specific ontologies and rulebases). 2) Querying: Instead of relying on 

proprietary extensions for specifying a search, we allow users to specify a complete 

SPARQL query pattern that can capture arbitrarily complex relationships between 

query terms. We have implemented this feature by introducing a sem_contains 

SQL operator and the associated sem_indextype indexing scheme. The indexing 

scheme employs an extensible architecture that supports indexing of unstructured 

text using native as well as third party text extraction tools. The paper presents a 

model for the semantic index and querying, describes the feature, and outlines its 

implementation leveraging Oracle’s native support for RDF/OWL storage, inferencing, 

and querying. We also report a study involving use of this feature on a TREC 

collection of over 130,000 news articles. 

A meta-language for MDX queries in eLog Business Solution 

Sonia Bergamaschi (University of Modena and Reggio Emilia)

Matteo Interlandi (University of Modena and Reggio Emilia)

Mario Longo (eBilling S.p.A.)

Laura Po (University of Modena and Reggio Emilia)

Maurizio Vincini (University of Modena and Reggio Emilia)

The adoption of business intelligence technology in industries is growing rapidly. 

Business managers are not satisfied with ad hoc and static reports and they ask for 

more flexible and easy to use data analysis tools. Recently, application interfaces 

that expand the range of operations available to the user, hiding the underlying 

complexity, have been developed. The paper presents eLog, a business intelligence 

solution designed and developed in collaboration between the database group of 

the University of Modena and Reggio Emilia and eBilling, an Italian SME supplier of 

solutions for the design, production and automation of documentary processes for 

top Italian companies. eLog enables business managers to define OLAP reports by 

means of a web interface and to customize analysis indicators adopting a simple 

meta-language. The framework translates the user’s reports into MDX queries and 

is able to automatically select the data cube suitable for each query. Over 140 

medium and large companies have exploited the technological services of eBilling 

S.p.A. to manage their document flows. In particular, eLog services have been used

by major Italian media and telecommunications companies and their foreign

affiliates, such as Sky, Mediaset, H3G, Tim Brazil, etc. The largest customer can provide

up to 30 million mail pieces within 6 months (about 200 GB of data in the relational

DBMS). Over a period of 18 months, eLog may have to handle up to 150 million mail pieces (1

TB of data).

Demo Group 1:

SMIX Live – A Self-Managing Index Infrastructure for Dynamic Workloads 

Thomas Kissinger (Dresden University of Technology) 

Hannes Voigt (Dresden University of Technology)

Wolfgang Lehner (Dresden University of Technology) 

As databases accumulate growing amounts of data at an increasing rate, adaptive 

indexing becomes more and more important. At the same time, applications and 

their use get more agile and flexible, resulting in less steady and less predictable 

workload characteristics. Being inert and coarse-grained, state-of-the-art index tuning 

techniques become less useful in such environments. In particular, the full-column

indexing paradigm results in a lot of indexed but never queried data and prohibitively

high memory and maintenance costs. In our demonstration, we present Self-Managing 

Indexes, a novel, adaptive, fine-grained, autonomous indexing infrastructure. 

At its core, our approach builds on a novel access path that automatically collects

useful index information, discards useless index information, and competes with 

its kind for resources to host its index information. Compared to existing technologies 

for adaptive indexing, we are able to dynamically grow and shrink our indexes, 

instead of incrementally enhancing the index granularity. In the demonstration, we 

visualize performance and system measures for different scenarios and allow the 

user to interactively change several system parameters. 

Multi-Query Stream Processing on FPGAs 

Mohammad Sadoghi (University of Toronto) 

Rija Javed (University of Toronto)

Naif Tarafdar (University of Toronto)

Harsh Singh (University of Toronto)

Rohan Palaniappan (University of Toronto)

Hans-Arno Jacobsen (University of Toronto) 

We present an efficient multi-query event stream platform to support query processing 

over high-frequency event streams. Our platform is built over reconfigurable 

hardware (FPGAs) to achieve line-rate multi-query processing by exploiting

unprecedented degrees of parallelism and potential for pipelining, only available 

through custom-built, application-specific and low-level logic design. Moreover, a 

multi-query event stream processing engine is at the core of a wide range of applications 

including real-time data analytics, algorithmic trading, targeted advertisement, 

and (complex) event processing. 

EUDEMON: A System for Online Video Frame Copy Detection by Earth 

Mover’s Distance 

Jia Xu (Northeastern University, China)

Qiushi Bai (Northeastern University, China)

Yu Gu (Northeastern University, China)

Guoren Wang (Northeastern University, China)

Ge Yu (Northeastern University, China)

Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

The Earth Mover’s Distance, or EMD for short, has been proven to be effective for 

content-based image retrieval. However, due to the cubic complexity of EMD computation, 

it remains difficult to use EMD in applications with stringent requirements

for efficiency. In this paper, we present our new system, called EUDEMON, which 

utilizes new techniques to support fast Online Video Frame Copy Detection based 

on the EMD. Given a group of registered frames as queries and a set of targeted 

detection videos, EUDEMON is capable of identifying relevant frames from the 

video stream in real time. The significant improvement on efficiency mainly relies 

on the primal-dual theory in linear programming and well-designed B+ tree filters 

for adaptive candidate pruning. Generally speaking, our system includes a variety 

of new features crucial to the deployment of EUDEMON in real applications. First, 

EUDEMON achieves high throughput even when a large number of queries are registered 

in the system. Second, EUDEMON contains a self-optimization component to

automatically enhance the effectiveness of the filters based on the recent content 

of the video stream. Finally, EUDEMON provides a user-friendly visualization interface, 

named EMD Flow Chart, to help the users to better understand the alarm from

the perspective of the EMD. 

A Dataset Search Engine for the Research Document Corpus 

Meiyu Lu (National University of Singapore) 

Srinivas Bangalore (AT&T Labs–Research)

Graham Cormode (AT&T Labs–Research)

Marios Hadjieleftheriou (AT&T Labs–Research)


A key step in validating a proposed idea or system is to evaluate over a suitable 

dataset. However, to date there have been no useful tools for researchers to

understand which datasets have been used for what purpose, or in what prior work. 

Instead, they have to manually browse through papers to find the suitable datasets 

and their corresponding URLs, which is laborious and inefficient. To better aid the 

dataset discovery process, and provide a better understanding of how and where 

datasets have been used, we propose a framework to effectively identify datasets 

within the scientific corpus. The key technical challenges are identification of datasets, 

and discovery of the association between a dataset and the URLs where it

can be accessed. Based on this, we have built a user-friendly web-based search

interface for users to conveniently explore the dataset-paper relationships, and find 

relevant datasets and their properties. 

AskFuzzy: Attractive Visual Fuzzy Query Builder 

Keivan Kianmehr (University of Western Ontario)

Negar Koochakzadeh (University of Calgary)

Reda Alhajj (University of Calgary)

The user-centric query interface is a very common application that allows expressing

both the input and the output using fuzzy terms. This is becoming a need in the 

evolving internet-based era where web-based applications are very common and 

the number of users accessing structured databases is increasing rapidly. Restricting 

the user group to only experts in query coding must be avoided. The AskFuzzy 

system has been developed to address this vital issue which has social and industrial 

impact. It is an attractive and friendly visual user interface that facilitates 

expressing queries using both fuzziness and traditional methods. The fuzziness is 

not expressed explicitly inside the database; it is rather absorbed and effectively 

handled by an intermediate layer which is cleverly incorporated between the front-end

visual user-interface and the back-end database. 

F2DB: The Flash-Forward Database System 

Ulrike Fischer (Dresden University of Technology) 

Frank Rosenthal (Dresden University of Technology)

Wolfgang Lehner (Dresden University of Technology) 

Forecasts are important to decision-making and risk assessment in many domains. 

Since current database systems do not provide integrated support for forecasting, 

it is usually done outside the database system by specially trained experts using 

forecast models. However, integrating model-based forecasting as a first-class 

citizen inside a DBMS speeds up the forecasting process by avoiding exporting the 

data and by applying database-related optimizations like reusing created forecast 

models. It especially allows subsequent processing of forecast results inside the database. 

In this demo, we present our prototype F2DB based on PostgreSQL, which 

allows for transparent processing of forecast queries. Our system automatically 

takes care of model maintenance when the underlying dataset changes. In addition, 

we offer optimizations to save maintenance costs and increase accuracy by using 

derivation schemes for multidimensional data. Our approach reduces the required 

expert knowledge by enabling arbitrary users to apply forecasting in a declarative 

way. 

Provenance-Based Debugging and Drill-Down in Data-Oriented Workflows 

Robert Ikeda (Stanford University)

Junsang Cho (Stanford University)

Charlie Fang (Stanford University)

Semih Salihoglu (Stanford University) 

Satoshi Torikai (Stanford University) 

Jennifer Widom (Stanford University) 

Panda (for Provenance and Data) is a system that supports the creation and execution 

of data-oriented workflows, with automatic provenance generation and built-in 

provenance tracing operations. Workflows in Panda are arbitrary acyclic graphs 

containing both relational (SQL) processing nodes and opaque processing nodes 

programmed in Python. For both types of nodes, Panda generates logical provenance (provenance

information stored at the processing-node level) and uses

the generated provenance to support record-level backward tracing and forward 

tracing operations. In our demonstration we use Panda to integrate, process, and 

analyze actual education data from multiple sources. We specifically demonstrate 

how Panda’s provenance generation and tracing capabilities can be very useful for 

workflow debugging, and for drilling down on specific results of interest. 

Demo Group 2:

M3: Stream Processing on Main-Memory MapReduce

Ahmed M. Aly (Purdue University) 

Asmaa Sallam (Purdue University) 

Bala M. Gnanasekaran (Purdue University) 

Long-Van Nguyen-Dinh (Purdue University)

Walid G. Aref (Purdue University)

Mourad Ouzzani (Qatar Computing Research Institute)

Arif Ghafoor (Purdue University) 

The continuous growth of social web applications along with the development of 

sensor capabilities in electronic devices is creating countless opportunities to analyze 

the enormous amounts of data that is continuously streaming from these applications

and devices. To process large scale data on large scale computing clusters, 

MapReduce has been introduced as a framework for parallel computing. However, 

most of the current implementations of the MapReduce framework support only 

the execution of fixed-input jobs. Such a restriction makes these implementations

inapplicable for most streaming applications, in which queries are continuous in 

nature, and input data streams are continuously received at high arrival rates. In 

this demonstration, we showcase M3, a prototype implementation of the MapReduce

framework in which continuous queries over streams of data can be efficiently

answered. M3 extends Hadoop, the open source implementation of MapReduce, bypassing

the Hadoop Distributed File System (HDFS) to support main-memory-only

processing. Moreover, M3 supports continuous execution of the Map and Reduce

phases where individual Mappers and Reducers never terminate. 

A Deep Embedding of Queries into Ruby 

Torsten Grust (University of Tübingen) 

Manuel Mayr (University of Tübingen) 

We demonstrate SWITCH, a deep embedding of relational queries into Ruby and 

Ruby on Rails. With SWITCH, there is no syntactic or stylistic difference between 

Ruby programs that operate over in-memory array objects or database-resident 

tables, even if these programs rely on array order or nesting. SWITCH’s built-in 

compiler and SQL code generator guarantee to emit few queries, addressing longstanding 

performance problems that trace back to Rails’ ActiveRecord database 

binding. “Looks like Ruby, but performs like handcrafted SQL,” is the ideal that

drives the research and development effort behind SWITCH. 

Asking the Right Questions in Crowd Data Sourcing 

Rubi Boim (Tel-Aviv University)

Ohad Greenshpan (Tel-Aviv University)

Tova Milo (Tel-Aviv University) 

Slava Novgorodov (Tel-Aviv University) 

Neoklis Polyzotis (University of California, Santa Cruz)

Wang-Chiew Tan (University of California, Santa Cruz)

Crowd-based data sourcing is a new and powerful data procurement paradigm that 

engages Web users to collectively contribute information. In this work, we target 

the problem of gathering data from the crowd in an economical and principled 

fashion. We present AskIt!, a system that allows interactive data sourcing applications 

to effectively determine which questions should be directed to which users 

for reducing the uncertainty about the collected data. AskIt! uses a set of novel 

algorithms for minimizing the number of probes (questions) required from the

different users. We demonstrate the challenge and our solution in the context of a 

multiple-choice question game played by the ICDE’12 attendees, targeted to gather 

information on the conference’s publications, authors and colleagues. 

LotusX: A Position-Aware XML Graphical Search System with 

Auto-Completion 

Chunbin Lin (Renmin University of China)

Jiaheng Lu (Renmin University of China)

Tok Wang Ling (National University of Singapore)

Bogdan Cautis (Télécom ParisTech)

Formulating queries in existing XML query languages (e.g., XQuery) requires professional

programming skills, and such complex query languages burden

query processing. In addition, when issuing an XML query, users are required to

be familiar with the content (including the structural and textual information) of

the hierarchical XML, which is difficult for common users. The need for designing

user-friendly interfaces to reduce the burden of query formulation is fundamental to

the spread of the XML community. We present a twig-based XML graphical search

system, called LotusX, that provides a graphical interface to simplify query

processing without requiring users to learn a query language, the data schemas, or the

content of the XML document. The basic idea is that LotusX provides

“position-aware” and “auto-completion” features to help users create tree-modeled

queries (twig patterns) by suggesting possible candidates on the fly.

In addition, complex twig queries (including order-sensitive queries) are supported

in LotusX. Furthermore, a new ranking strategy and a query rewriting solution are

implemented to rank and rewrite the query effectively.

Efficient Top-k Keyword Search in Graphs with Polynomial Delay 

Mehdi Kargar (York University)

Aijun An (York University)

A system for efficient keyword search in graphs is demonstrated. The system has 

two components: a search through only the nodes containing the input keywords

for a set of nodes that are close to each other and together cover the input keywords 

and an exploration for finding how these nodes are related to each other. The 

system generates all or top-k answers in polynomial delay. Answers are presented 

to the user according to a ranking criterion so that the answers with nodes closer to 

each other are presented before the ones with nodes farther away from each other. 

In addition, the set of answers produced by our system is duplication free. The 

system uses two methods for presenting the final answer to the user. The presentation 

methods reveal relationships among the nodes in an answer through a tree or 

a multi-center graph. We will show that each method has its own advantages and 

disadvantages. The system is demonstrated using two challenging datasets, very 

large DBLP and highly cyclic Mondial. Challenges and difficulties in implementing 

an efficient keyword search system are also demonstrated. 

TEDAS: a Twitter-based Event Detection and Analysis System 

Rui Li (University of Illinois at Urbana-Champaign)

Kin Hou Lei (Brigham Young University)

Ravi Khadiwala (University of Illinois at Urbana-Champaign)

Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign)

Witnessing the emergence of Twitter, we propose a Twitter-based Event Detection 

and Analysis System (TEDAS), which helps to (1) detect new events, to (2) analyze 

the spatial and temporal pattern of an event, and to (3) identify the importance of

events. In this demonstration, we show the overall system architecture, explain in 

detail the implementation of the components that crawl, classify, and rank tweets 

and extract location from tweets, and present some interesting results of our system. 

AutoDict: Automated Dictionary Discovery 

Fei Chiang (University of Toronto)

Periklis Andritsos (University of Toronto)

Erkang Zhu (University of Toronto)

Renee J. Miller (University of Toronto)

An attribute dictionary is a set of attributes together with a set of common values 

of each attribute. Such dictionaries are valuable in understanding unstructured 

or loosely structured textual descriptions of entity collections, such as product 

catalogs. Dictionaries provide the supervised data for learning product or entity 

descriptions. In this demonstration, we will present AutoDict, a system that analyzes 

input data records, and discovers high quality dictionaries using information 

theoretic techniques. To the best of our knowledge, AutoDict is the first end-to-end 

system for building attribute dictionaries. Our demonstration will showcase the 

different information analysis and extraction features within AutoDict, and highlight 

the process of generating high quality attribute dictionaries. 

Demo Group 3:

Trust & Share: Trusted Information Sharing in Online Social Networks 

Barbara Carminati (University of Insubria)

Elena Ferrari (University of Insubria)

Jacopo Girardi (University of Insubria)

Trust & Share (T&S) aims at providing relationship-based access control in the

Facebook realm. T&S is a third-party Facebook application, designed to support

flexible and controlled sharing of user data. It enables users to upload resources

(i.e., any file) and specify for each of them which users are to be authorized by

T&S to access them. To enforce this controlled information sharing, T&S relies on

the OSN access control model proposed in \cite{tissec}, where social network relationships

have richer semantics than the contacts in Facebook. According to

\cite{tissec}, OSN users associate with each of their contacts a type, representing

the nature of the relationship (e.g., friends, colleagues, parents). Moreover, the creator

of the relationship can also assign it a trust level to represent the strength

of the connection. This graph enables users to specify more expressive rules for

controlled information sharing. Indeed, on top of this enhanced social graph,

T&S users can specify access constraints on the type, trust level and depth of the

relationship that must exist with a given Facebook contact in order to access a certain

resource.

Evaluation of Clusterings – Metrics and Visual Support 

Elke Achtert (Ludwig-Maximilians-Universität München) 

Sascha Goldhofer (Ludwig-Maximilians-Universität München) 

Hans-Peter Kriegel (Ludwig-Maximilians-Universität München) 

Erich Schubert (Ludwig-Maximilians-Universität München) 

Arthur Zimek (Ludwig-Maximilians-Universität München) 

When comparing clustering results, any evaluation metric breaks down the available 

information to a single number. However, many evaluation metrics exist,

and they are neither always concordant nor easily interpretable in judging the agreement of

a pair of clusterings. Here, we provide a tool to visually support the assessment of 

clustering results in comparing multiple clusterings. Along the way, the suitability of 

a couple of clustering comparison measures can be judged in different scenarios. 
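The point that different metrics can disagree is easy to see with two common pair-counting measures. The abstract does not name the measures the tool covers, so the Rand index and the Jaccard index are used here only as an assumed illustration, and the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(a, b):
    """Classify every object pair by whether each clustering groups it together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(a, b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(a, b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(a, b):
    """Like the Rand index, but ignoring pairs separated in both clusterings.

    Undefined (division by zero) if no pair is grouped by either clustering.
    """
    both, only_a, only_b, _ = pair_counts(a, b)
    return both / (both + only_a + only_b)


# Cluster labels of six objects under two clusterings (made-up data).
A = [0, 0, 1, 1, 2, 2]
B = [0, 0, 1, 2, 3, 4]
```

For these two labelings the Rand index is 13/15 while the Jaccard index is 1/3: the first is dominated by the many pairs both clusterings separate, which is the kind of discrepancy a single number hides.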

Horton: Online Query Execution Engine For Large Distributed Graphs 

Mohamed Sarwat (University of Minnesota) 

Sameh Elnikety (Microsoft Research)

Yuxiong He (Microsoft Research)

Gabriel Kliot (Microsoft Research)

Graphs are used in many large-scale applications, such as social networking. The 

management of these graphs poses new challenges as such graphs are too large 

for a single server to manage efficiently. Current distributed techniques such as 

map-reduce and Pregel are not well-suited to processing interactive ad-hoc queries 

against large graphs. In this paper we demonstrate Horton, a distributed interactive

query execution engine for large graphs. Horton defines a query language that

allows the expression of regular language reachability queries and provides a query 

execution engine with a query optimizer that allows interactive execution of queries 

on large distributed graphs in parallel. In the demo, we show the functionality of 

Horton managing a large graph for a social networking application called Codebook, 

whose graph represents data on software components, developers, development 

artifacts such as bug reports, and their interactions in large software projects. 
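A regular language reachability query asks which nodes can be reached along paths whose label sequence matches a regular expression. The sketch below is a single-machine illustration, not Horton's query language or engine: the toy graph, the node types, and the hand-written automaton for the pattern Person Bug File* Person are all assumptions.

```python
from collections import deque

# Hypothetical typed graph: node -> (node type, list of neighbours).
GRAPH = {
    "ann": ("Person", ["bug1"]),
    "bug1": ("Bug", ["file1"]),
    "file1": ("File", ["bob"]),
    "bob": ("Person", []),
}

# Hypothetical DFA for the node-type pattern: Person Bug File* Person.
# State 0 is the start state; state 3 is accepting.
DFA = {
    (0, "Person"): 1,
    (1, "Bug"): 2,
    (2, "File"): 2,
    (2, "Person"): 3,
}
ACCEPTING = {3}


def reachable(graph, dfa, accepting, source):
    """Nodes reachable from source along paths whose type sequence the DFA accepts.

    Performs a breadth-first search over (node, automaton state) pairs.
    """
    start = dfa.get((0, graph[source][0]))
    if start is None:
        return set()
    seen = {(source, start)}
    queue = deque(seen)
    results = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            results.add(node)
        for neighbour in graph[node][1]:
            next_state = dfa.get((state, graph[neighbour][0]))
            if next_state is not None and (neighbour, next_state) not in seen:
                seen.add((neighbour, next_state))
                queue.append((neighbour, next_state))
    return results
```

Searching the product of graph and automaton keeps the traversal finite even on cyclic graphs, because each (node, state) pair is visited at most once.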

MXQuery With Hardware Acceleration 

Jens Teubner (ETH Zurich) 

Peter Fischer (University of Freiburg) 

We demonstrate MXQuery/H, a modified version of MXQuery that uses hardware 

acceleration to speed up XML processing. The main goal of this demonstration is to 

give an interactive example of hardware/software co-design and show how system 

performance and energy efficiency can be improved by off-loading tasks to FPGA 

hardware. To this end, we equipped MXQuery/H with various hooks to inspect the 

different parts of the system. In addition, our system can fully leverage the

idea of XML projection. Though the idea of projection has been around for a while,

its effectiveness always remained limited because of the unavoidable and high parsing

overhead. By performing the task in hardware, we relieve the software part from 

this overhead and achieve processing speed-ups of several factors. 

Data 3 – A Kinect Interface for OLAP using Complex Event Processing 

Steffen Hirte (Ilmenau University of Technology)

Andreas Seifert (Ilmenau University of Technology)

Stephan Baumann (Ilmenau University of Technology)

Daniel Klan (Ilmenau University of Technology)

Kai-Uwe Sattler (Ilmenau University of Technology)

Motion sensing input devices like Microsoft’s Kinect offer an alternative to traditional 

computer input devices like keyboards and mice. New applications using

this interface appear daily. Most of them implement their own gesture detection. In our

demonstration we show a new approach using the data stream engine AnduIN. The 

gesture detection is done based on AnduIN’s complex event processing functionality. 

This way we build a system that allows new and complex gestures to be defined on

the basis of a declarative programming interface. On this basis our demonstration 

Data 3 provides a basic natural interaction OLAP interface for a sample star schema

database using Microsoft’s Kinect. 

Analyzing Query Optimization Process: Portraits of Join 


Anisoara Nica (Sybase, An SAP company) 

Ian Charlesworth (University of Waterloo)

Maysum Panju (University of Waterloo) 

Search spaces generated by query optimizers during the optimization process 

encapsulate characteristics of the join enumeration algorithms, the cost models, as 

well as critical decisions made for pruning and choosing the best plan. We demonstrate 

the JoinEnumerationViewer which is a tool designed for visualizing, mining, 

and comparing plan search spaces generated by different join enumeration algorithms 

when optimizing the same SQL statement. We have enhanced the Sybase SQL Anywhere

relational database management system to log, in a very compact format, 

its search space during an optimization process. Such an optimization log can then

be analyzed by the JoinEnumerationViewer which internally builds the logical and 

physical plan graphs representing complete and partial plans considered during the 

optimization process. The optimization logs also contain statistics of the resource 

consumption during the query optimization such as optimization time breakdown, 

for example, for logical join enumeration versus costing physical plans, and memory 

allocation for different optimization structures. The SQL Anywhere Optimizer 

implements a highly adaptable, self-managing, search space generation algorithm 

by having several join enumeration algorithms to choose from, each enhanced with 

different ordering and pruning techniques. The emphasis of the demonstration will 

be on comparing and contrasting these join enumeration algorithms by analyzing 

their optimization logs. The demonstration scenarios will include optimizing 

SQL statements under various conditions which will exercise different algorithms, 

pruning and ordering techniques. These search spaces will then be visualized and 

compared using the JoinEnumerationViewer. 

DPCube: Releasing Differentially Private Data Cubes for Health Information 

Yonghui Xiao (Emory University)

James Gardner (Digital Reasoning Systems Inc.)

Li Xiong (Emory University) 

We propose to demonstrate DPCube, a component in our Health Information DE-identification

(HIDE) framework, for releasing differentially private data cubes (or 

multidimensional histograms) for sensitive data. HIDE is a framework we developed 

for integrating heterogeneous structured and unstructured health information and

provides methods for privacy preserving data publishing. The DPCube component 

provides the differentially private multidimensional data cube release. The DPCube 

algorithm uses the differentially private access mechanisms as provided by HIDE 

and guarantees differential privacy for the released data. It utilizes an innovative 

two-step multidimensional partitioning technique to publish a generalized data 

cube or multi-dimensional histogram that achieves good utility while satisfying the

privacy requirement. We demonstrate that the released data cubes can serve as a 

sanitized synopsis of the raw database and, together with an optional synthesized 

dataset based on the data cubes, can support various Online Analytical Processing 

(OLAP) queries and learning tasks. 
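The basic building block behind a differentially private histogram release can be sketched as follows. This is not the DPCube algorithm: it omits the two-step multidimensional partitioning and shows only the standard Laplace mechanism applied to cell counts. The function names, the toy attributes, and the neighbouring-database definition (adding or removing one record, so each count has sensitivity 1) are assumptions.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_cube(records, domain, epsilon, seed=None):
    """Release a noisy count for every cell of the cube.

    Every cell of the domain is released, including empty ones; releasing
    only the cells that occur in the data would itself leak information.
    Noise scale 1/epsilon assumes sensitivity 1 (one record changes one
    cell count by one).
    """
    rng = random.Random(seed)
    counts = Counter(records)
    scale = 1 / epsilon
    return {cell: counts[cell] + laplace_noise(scale, rng) for cell in domain}


# Made-up two-attribute example: (sex, age band).
SEXES = ["F", "M"]
AGE_BANDS = ["30-39", "40-49"]
DOMAIN = list(product(SEXES, AGE_BANDS))
RECORDS = [("F", "30-39"), ("F", "30-39"), ("M", "40-49")]
```

A call such as `noisy_cube(RECORDS, DOMAIN, epsilon=0.5)` returns one perturbed count per cell; smaller values of `epsilon` give stronger privacy and noisier counts, which is the utility trade-off that partitioning techniques try to improve.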


Demo Group 4:

Nyaya: a System Supporting the Uniform Management of Large Sets of 

Semantic Data 

Roberto De Virgilio (Università Roma Tre)

Giorgio Orsi (University of Oxford)

Letizia Tanca (Politecnico di Milano)

Riccardo Torlone (Università Roma Tre)

We present Nyaya, a flexible system for the management of large-scale semantic 

data which couples a general-purpose storage mechanism with efficient ontological 

query answering. Nyaya rapidly imports semantic data expressed in different 

formalisms into semantic data kiosks. Each kiosk exposes the native ontological 

constraints in a uniform fashion using Datalog±, a very general rule-based language

for the representation of ontological constraints. A group of kiosks forms a semantic 

data market where the data in each kiosk can be uniformly accessed using conjunctive 

queries and where users can specify user-defined constraints over the data. 

Nyaya is easily extensible and robust to updates of both data and meta-data in the 

kiosk and can readily adapt to different logical organizations of the persistent storage. 

In the demonstration, we will show the capabilities of Nyaya over real-world 

case studies and demonstrate its efficiency over well-known benchmarks. 

R2DB: A System for Querying and Visualizing Weighted RDF Graphs 

Songling Liu (Arizona State University) 

Juan P. Cedeno (Arizona State University)

K. Selcuk Candan (Arizona State University)

Maria Luisa Sapino (University of Torino) 

Shengyu Huang (Arizona State University) 

Xinsheng Li (Arizona State University) 

Existing RDF query languages and RDF stores fail to support a large class of 

knowledge applications which associate utilities or costs on the available knowledge 

statements. A recent proposal includes (a) a ranked RDF (R2DF) specification 

to enhance RDF triples with application-specific weights and (b) a SPARankQL

query language specification, which provides novel primitives on top of the 

SPARQL language to express top-k queries using traditional query patterns as well 

as novel flexible path predicates. We introduce and demonstrate R2DB, a database 

system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing 

engine, which leverages novel index structures to support efficient ranked 

path search and includes query optimization strategies based on proximity and 

sub-result inter-arrival times. In addition to being the first data management system 

for the R2DF data model, R2DB also provides an innovative features-of-interest 

(FoI) based method for visualizing large sets of query results (i.e., subgraphs of the 

data graph). 

Project Daytona: Data Analytics as a Cloud Service 

Roger Barga (Microsoft)

Jaliya Ekanayake (Microsoft Research)

Wei Lu (Microsoft Research)

Spreadsheets are established data collection and analysis tools in business, technical 

computing and academic research. Excel, for example, offers an attractive 

user interface, provides an easy to use data entry model, and offers substantial 

interactivity for what-if analysis. However, spreadsheets and other common client 

applications do not offer scalable computation for large scale data analytics and 

exploration. Increasingly, researchers in domains ranging from the social sciences

to environmental sciences are faced with a deluge of data, often sitting in spreadsheets 

such as Excel or other client applications, and they lack a convenient way to 

explore the data, to find related data sets, or to invoke scalable analytical models 

over the data. To address these limitations, we have developed a cloud data analytics 

service based on Daytona, which is an iterative MapReduce runtime optimized 

for data analytics. In our model, Excel and other existing client applications provide 

the data entry and user interaction surfaces, Daytona provides a scalable runtime 

on the cloud for data analytics, and our service seamlessly bridges the gap between 

the client and cloud. Any analyst can use our data analytics service to discover 

and import data from the cloud, invoke cloud scale data analytics algorithms 

to extract information from large datasets, invoke data visualization, and then store 

the data back to the cloud all through a spreadsheet or other client application they 

are already familiar with. 

Interactive User Feedback in Ontology Matching Using Signature Vectors 

Isabel F. Cruz (University of Illinois at Chicago)

Cosmin Stroe (University of Illinois at Chicago)

Matteo Palmonari (University of Milano-Bicocca) 

When compared to a gold standard, the set of mappings generated by an

automatic ontology matching process is not complete, nor are the individual

mappings always correct. However, given the explosion in the number, size, and

complexity of available ontologies, domain experts no longer have the capability 

to create ontology mappings without considerable effort. We present a solution 

to this problem that consists of making the ontology matching process interactive 

so as to incorporate user feedback in the loop. Our approach clusters mappings to 

identify where user feedback will be most beneficial in reducing the number of user 

interactions and system iterations. This feedback process has been implemented 

in the AgreementMaker system and is supported by visual analytic techniques that 

help users to better understand the matching process. Experimental results using 

the OAEI benchmarks show the effectiveness of our approach. We will demonstrate 

how users can interact with the ontology matching process through the AgreementMaker 

user interface to match real-world ontologies. 


DObjects+: Enabling Privacy-Preserving Data Federation Services 

Pawel Jurczyk (Google Inc.)

Li Xiong (Emory University) 

Slawomir Goryczka (Emory University) 

The emergence of cloud computing implies and facilitates managing large collections 

of highly distributed, autonomous, and possibly private databases. While 

there is an increasing need for services that allow integration and sharing of various 

data repositories, it remains a challenge to ensure the privacy, interoperability, and 

scalability for such services. In this paper we demonstrate a scalable and extensible 

framework that aims to enable privacy-preserving data federations. The framework

is built on top of a distributed mediator-wrapper architecture where nodes 

can form collaborative groups for secure anonymization and secure query processing 

when private data need to be accessed. New anonymization models and protocols 

will be demonstrated that counter potential attacks in the distributed setting. 

DRAGOON: An Information Accountability System for 

High-Performance Databases 

Kyriacos E. Pavlou (The University of Arizona) 

Richard T. Snodgrass (The University of Arizona)

Regulations and societal expectations have recently emphasized the need to mediate 

access to valuable databases, even access by insiders. Fraud occurs when a 

person, often an insider, tries to hide illegal activity. Companies would like to be 

assured that such tampering has not occurred, or if it does, that it will be quickly 

discovered and used to identify the perpetrator. At one end of the compliance spectrum 

lies the approach of restricting access to information and at the other that of

information accountability. We focus on effecting information accountability of data 

stored in high-performance databases. The demonstrated work ensures appropriate 

use and thus end-to-end accountability of database information via a continuous 

assurance technology based on cryptographic hashing techniques. A prototype 

tamper detection and forensic analysis system named DRAGOON was designed and 

implemented to determine when tampering(s) occurred and what data were tampered 

with. DRAGOON is scalable, customizable, and intuitive. This work will show 

that information accountability is a viable alternative to information restriction for 

ensuring the correct storage, use, and maintenance of databases on extant DBMSes. 
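The idea of detecting tampering with cryptographic hashes can be illustrated with a simple hash chain. The abstract does not describe DRAGOON's actual hashing scheme, so the chaining below, the function name `chain_digest`, and the sample rows are assumptions chosen only to show the principle.

```python
import hashlib


def chain_digest(rows):
    """Fold a sequence of rows into one SHA-256 digest.

    Each step hashes the previous digest together with the next row, so
    changing, inserting, or removing any row changes the final value.
    """
    digest = b""
    for row in rows:
        digest = hashlib.sha256(digest + repr(row).encode("utf-8")).digest()
    return digest.hex()


# Made-up audit data: the digest of the original rows is stored somewhere
# the database administrator cannot modify, then recomputed later.
ORIGINAL = [("acct-1", 100), ("acct-2", 250)]
TAMPERED = [("acct-1", 100), ("acct-2", 999)]
```

Comparing `chain_digest(ORIGINAL)` with a digest recomputed from the current table reveals that something changed; locating when the change happened and which data it touched, as the demonstrated system does, requires additional structure beyond this sketch.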

Intuitive Interaction With Encrypted Query Execution in DataStorm 

Ken Smith (MITRE)

Ameet Kini (MITRE)

William Wang (MITRE)

Chris Wolf (MITRE)

M. David Allen (MITRE)

Andrew Sillers (MITRE)

The encrypted execution of database queries promises powerful security protections;

however, users are currently unlikely to benefit without significant expertise. In

this demonstration, we illustrate a simple workflow enabling users to design secure 

executions of their queries. The DataStorm system demonstrated simplifies both the 

design and execution of encrypted execution plans, and represents progress toward 

the challenge of developing a general planner for encrypted query execution. 


Oktie Hassanzadeh (University of Toronto & IBM Research)

Anastasios Kementsietsidis (IBM Research)

Yannis Velegrakis (University of Trento)

We provide an overview of the current data management research issues in the 

context of the Semantic Web. The objective is to introduce the audience to the

area of the Semantic Web, and to highlight the fact that the area provides many 

interesting research opportunities for the data management community. A new 

model, the Resource Description Framework (RDF), coupled with a new query 

language, called SPARQL, leads us to revisit some classical data management problems,

including efficient storage, query optimization, and data integration. These 

are problems that the Semantic Web community has only recently started to explore, 

and therefore the experience and long tradition of the database community 

can prove valuable. We target both experienced and novice researchers who are

looking for a thorough presentation of the area and its key research topics. 

Seminar 2: Discovering Multiple Clustering Solutions: Grouping Objects 

in Different Views of the Data 


Stephan Günnemann (RWTH Aachen University)

Ines Färber (RWTH Aachen University)

Thomas Seidl (RWTH Aachen University)

Traditional clustering algorithms identify just a single clustering of the data. Today’s 

complex data, however, allow multiple interpretations leading to several valid 

groupings hidden in different views of the database. Each of these multiple clustering 

solutions is valuable and interesting as different perspectives on the same data 

and several meaningful groupings for each object are given. Especially for high 

dimensional data, where each object is described by multiple attributes, alternative 

clusters in different attribute subsets are of major interest. In this tutorial, we 

describe several real world application scenarios for multiple clustering solutions. 

We abstract from these scenarios and provide the general challenges in this emerging 

research area. We describe state-of-the-art paradigms, we highlight specific 

techniques, and we give an overview of this topic by providing a taxonomy of the 

existing clustering methods. By focusing on open challenges, we aim to attract

young researchers to participate in this emerging research field.

Seminar 3: Detecting Clones, Copying and Reuse on the Web 

Xin Luna Dong (AT&T Labs–Research)


The Web has enabled the availability of a vast amount of useful information in 

recent years. However, the web technologies that have enabled sources to share 

their information have also made it easy for sources to copy from each other and 

often publish without proper attribution. Understanding the copying relationships 

between sources has many benefits, including helping data providers protect their 

own rights, improving various aspects of data integration, and facilitating in-depth

analysis of information flow. The importance of copy detection has led to a

substantial amount of research in many disciplines of Computer Science, based on 

the type of information considered, such as text, images, videos, software code, and 

structured data. This seminar explores the similarities and differences between the 

techniques proposed for copy detection across the different types of information. 

We also examine the computational challenges associated with large-scale copy 

detection, indicating how copying could be detected efficiently, and identify a range of

open problems for the community. 

Seminar 4: Mining Knowledge from Data: An Information Network 

Analysis Approach 

Jiawei Han (University of Illinois at Urbana-Champaign)

Yizhou Sun (University of Illinois at Urbana-Champaign)

Xifeng Yan (University of California at Santa Barbara)


Most people consider a database to be merely a data repository that supports data

storage and retrieval. Actually, a database contains rich, inter-related, multi-typed 

data and information, forming one or a set of gigantic, interconnected, heterogeneous 

information networks. Much knowledge can be derived from such information 

networks if we systematically develop an effective and scalable database-oriented 

information network analysis technology. In this tutorial, we systematically introduce 

database-oriented information network analysis methods and demonstrate how 

such a technology can be used to turn database data into useful knowledge and 

how such information networks can be used to enhance data quality, consistency, 

and the generation of interesting knowledge. This tutorial presents an organized 

picture on how to turn a database into one or a set of organized heterogeneous 

information networks, how such information networks can be used for data cleaning, 

data consolidation, and data quality improvement, how to perform OLAP in

such information networks, how to discover various kinds of knowledge from such 

information networks, and how to transform database data into knowledge by 

information network analysis. Moreover, we present interesting case studies on real 

datasets, including DBLP and Flickr, and show how interesting and organized knowledge 

can be generated from such database-oriented information networks. 

Seminar 5: Emerging Graph Queries In Linked Data 

Arijit Khan (University of California, Santa Barbara)

Yinghui Wu (University of California, Santa Barbara)

Xifeng Yan (University of California, Santa Barbara)

In a wide array of disciplines, data can be modeled as an interconnected network of 

entities, where various attributes could be associated with both the entities and the 

relations among them. Knowledge is often hidden in the complex structure and attributes 

inside these networks. While querying and mining these linked datasets are essential 

for various applications, traditional graph queries may not be able to capture 

the rich semantics in these networks. With the advent of complex information networks, 

new graph queries are emerging, including graph pattern matching and mining, 

similarity search, ranking and expert finding, graph aggregation and OLAP. These 

queries require both the topology and content information of the network data, and 

hence differ from classical graph algorithms such as shortest path, reachability

and minimum cut, which depend only on the structure of the network. In this tutorial, 

we shall give an introduction to the emerging graph queries, their indexing and resolution

techniques, the current challenges and the future research directions. 

Seminar 6: Boolean Matrix Decomposition Problem: Theory, Variations 

and Applications to Data Engineering 

Jaideep Vaidya (Rutgers University)

With the ubiquitous nature and sheer scale of data collection, the problem of data 

summarization is most critical for effective data management. Classical matrix 

decomposition techniques have often been used for this purpose, and have been 

the subject of much study. In recent years, several other forms of decomposition, 

including Boolean matrix decomposition, have become of significant practical

interest. Since much of the data collected is categorical in nature, it can be viewed 

in terms of a Boolean matrix. Boolean matrix decomposition (BMD), wherein a 

Boolean matrix is expressed as a product of two Boolean matrices, can be used

to provide concise and interpretable representations of Boolean data sets. The 

decomposed matrices give the set of meaningful concepts and their combination 

which can be used to reconstruct the original data. Such decompositions are useful 

in a number of application domains including role engineering, text mining as 

well as knowledge discovery from databases. In this seminar, we look at the theory 

underlying the BMD problem, study some of its variants and solutions, and examine 

different practical applications. 
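A small worked instance may help make the Boolean product concrete. The matrices below are made up in the spirit of the role engineering application mentioned above (users, roles, permissions); the function name `boolean_product` is hypothetical, and the sketch only verifies a given decomposition rather than searching for one.

```python
def boolean_product(b, c):
    """Boolean matrix product: entry (i, j) is OR over k of (b[i][k] AND c[k][j])."""
    inner = len(c)
    cols = len(c[0])
    return [
        [int(any(row[k] and c[k][j] for k in range(inner))) for j in range(cols)]
        for row in b
    ]


# Made-up role engineering example.
# A: users x permissions (the observed data).
A = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
# B: users x roles, C: roles x permissions (the two factors).
B = [
    [1, 0],
    [0, 1],
    [1, 1],
]
C = [
    [1, 1, 0],
    [0, 1, 1],
]
```

Here `boolean_product(B, C)` reproduces `A` exactly: the third user holds both roles, and under Boolean arithmetic the shared second permission stays 1, whereas the ordinary matrix product would give 2 in that position.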

Co-Located Workshops 

ICDE Workshop on Data-Driven Decision Support

and Guidance Systems (DGSS 2012)

http://dgss.vse.gmu.edu/ 

Decision support systems (DSS) are widely used to support business or organizational

decision-making at the management, operations and planning levels of an organization.

Decision guidance systems (DGS) are decision support systems that go beyond organizing

and displaying information, providing actionable recommendations to and extracting

knowledge from human decision-makers. This workshop will bring together DGSS

researchers and practitioners to present novel methodologies, models, algorithms,

systems, tools, applications and case studies of DGSS. Most importantly, the workshop

will be a forum to discuss how to utilize advances from multiple disciplines for building

DGSS that can intelligently merge human knowledge and expertise with formal

mathematical models to make better decisions. The workshop will include both formal

presentations and informal discussion of important research directions in DGSS, and

their interactions with knowledge and data engineering.


Program 

8:50 – 9 am Opening remarks

9 – 10 am Paper session 1

10 – 10:30 am Coffee break

10:30 am – noon Paper session 2

noon – 2 pm Lunch

A MAUT Approach for Reusing Ontologies 

Antonio Jiménez, Mari Carmen Suárez-Figueroa, Alfonso

Mateos, Mariano Fernández-López and Asunción Gómez-Pérez

Online Optimization through Preprocessing for Multi-Stage

Production Decision Guidance Queries

Nathan Egge, Alexander Brodsky and Igor Griva 

A Decision-theoretic Model of Disease Surveillance 

and Control and a Prototype Implementation for the 

Disease Influenza 

Michael Wagner, Gregory Cooper, Fuchiang Tsui, 

Jeremy Espino, Hendrik Harkema, John Levander, 

Ricardo Villamarin, Nicholas Millett, Shawn Brown and 

Anthony Gallagher 

Personal Health Explorer: A Semantic Health Recommendation 

System 

Thomas Morrell and Larry Kerschberg 

Striving for Market Dominance in UK’s Private Healthcare 

Sector: A Case of Cygnet Healthcare 

Mlungisi Masilela, Fenio Annansingh and Shaofeng Liu 

2 – 3:30 pm Poster session: brief overview presentations followed by

parallel poster presentations

Towards a DGSS Prototype for Early Warning for Ski 

Injuries 

Boris Delibašić and Zoran Obradović

3:30 – 4 pm Coffee break 

4 – 5 pm Paper session 3


Non-Parametric Synthesis Of Private Probabilistic 

Predictions 

Phan Giang 

Battle Management System: An Optimization for Military 

Decision Makers 

Richard Haberlin and Alexander Brodsky 

An explanation of decision-making under uncertainty – 

a qualitative research approach 

Eurico Lopes 

Agent Negotiation Strategies for Composing Service 

Workflows 

John Mcdowall and Larry Kerschberg 

A Scalable Data Warehouse Model based on Complex 

Semantic Event Processing in Distributed Systems 

Dingyu Yang and Jian Cao 

A Stigmergic Guiding System to Facilitate the Group 

Decision Process 

Constantin-Bala Zamfirescu and Ciprian Candea 

A Regression Based Algorithm for Optimizing Top-K 

Selection in Simulation Query Language 

Susan Farley, Alexander Brodsky and Chun-Hung Chen 

Towards a Training-Oriented Adaptive Decision Guidance 

and Support System 

Farhana Zulkernine, Patrick Martin, Sima Soltani, Wendy 

Powley, Serge Mankovskii and Mark Addleman 

5 – 5:30 pm Wrap-up session: Discussion on the future and 

organization of DGSS


3rd International Workshop on Data Engineering

Meets the Semantic Web (DESWeb 2012)

https://sites.google.com/site/desweb2012/ 

DESWeb brings together researchers and practitioners from Data Management and

Semantic Web. On one hand, the Semantic Web brings several new data management

problems, while on the other hand, several data management problems can be

solved with the help of Semantic Web technologies. DESWeb attracts papers on three

broad areas: Semantics in Data Management, Management of Semantic Web Data, and

Semantic Search and Linked Data. DESWeb 2012 features an invited talk by Prof. Tim

Finin on how to make Semantic Web tools easier to use, as well as four regular contributions

on the topics of improving query processing, benchmarking, schema matching,

and challenges related to enabling Semantic Web tools within Dataspaces.

Program 

9 – 10 am Session 1


10:30 am – noon Invited talk 

noon – 2 pm Lunch 

2 – 3 pm Session 2

Scientific SPARQL

Andrej Andrejev and Tore Risch 

A Benchmark for RDF-Based Metadata

Ivan Subotic, Lukas Rosenthaler and Heiko Schuldt 

Making the Semantic Web Easier to Use 

Tim Finin 

Opaque Attribute Alignment 

Jennifer Sleeman, Rafael Alonso, Hua Li, Art Pope and 

Antonio Badia 

Linked Data and Live Querying for Enabling Support 

Platforms for Web Dataspaces 

Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, 

Axel Polleres and Manfred Hauswirth 

3 – 3:30 pm Discussion and Wrap-up


1st International Workshop on Data Management

in the Cloud (DMC 2012)

http://www.nec-labs.com/dm/dmc2012/ 

Cloud computing has emerged as a promising computing and business model. By providing

on-demand scaling capabilities without any large upfront investment or long-term

commitment, it is attracting a wide range of users. The database community has also shown

great interest in exploiting this new platform for data management services in a highly

scalable and cost-efficient manner. As a result, cloud computing presents challenges

and opportunities for data management. The DMC workshop aims at bringing researchers

and practitioners in cloud computing and data management systems together to discuss

the research issues at the intersection of those areas, and also to draw more attention from

the larger data management research community to this new and highly promising field.

Program 

8:50 – 9 am Welcome 

9 – 10 am Keynote


10:30 am – noon Session 1

noon – 2:30 pm Lunch 

Supporting Extensible Performance SLAs for Cloud 


Olga Papaemmanouil (Brandeis University) 

Application-Managed Database Replication on Virtualized 

Cloud Environments 

Liang Zhao (National ICT Australia), Sherif Sakr (National 

ICT Australia), Alan Fekete (University of Sydney, Australia), 

Hiroshi Wada (National ICT Australia), and Anna Liu 

(National ICT Australia) 

Efficient Updates for Web-scale Indexes over the Cloud 

Panagiotis Antonopoulos (Microsoft Corp), Ioannis 

Konstantinou (National Technical University of Athens), 

Dimitrios Tsoumakos, and Nectarios Koziris (National 

Technical University of Athens) 


2:30 – 3:30 pm Session 2



Secure Access for Healthcare Data in the Cloud Using 

Ciphertext-Policy Attribute-Based Encryption 

Suhair Alshehri (Rochester Inst. of Technology), 

Stanislaw Radziszowski, and Rajendra Raj (Rochester 

Inst. of Technology) 

Achieving Database Information Accountability in the 

Cloud 

Kyriacos Pavlou (The University of Arizona), and Richard 

Snodgrass (The University of Arizona) 

Building Large XML Stores in the Amazon Cloud 

Jesús Camacho-Rodríguez* (LRI, Universite Paris-Sud 11), 

Dario Colazzo (LRI, Universite Paris-Sud 11), and Ioana 

Manolescu (INRIA Saclay) 

Stream As You Go: The Case for Incremental Data 

Access and Processing in the Cloud 

Romeo Kienzler (ETH Zurich), Rémy Bruggmann (University 

of Berne), Anand Ranganathan (IBM Research), and 

Nesime Tatbul* (ETH Zurich) 

3rd International Workshop on Graph Data

Management: Techniques and Applications

(GDM 2012)

http://www.cse.unsw.edu.au/~iwgdm/2012/ 

Recently, there has been a lot of interest in the application of graphs in different domains.

They have been widely used for data modeling of different application domains

such as chemical compounds, multimedia databases, protein networks, social networks

and the Semantic Web. With the continued emergence and increase of massive and complex

structural graph data, a graph database that efficiently supports elementary data

management mechanisms is crucially required to effectively understand and utilize any

collection of graphs. The overall goal of the workshop is to bring people from different

fields together, exchange research ideas and results, and encourage discussion about

how to provide efficient graph data management techniques in different application

domains and to understand the research challenges of this area.

Program 

9 – 10 am Welcome and keynote presentation 


10:30 – noon Research session 

noon – 2 pm Lunch break 


Keynote Speaker 

Prof. Jiawei Han - Univ. of Illinois at Urbana-Champaign 

A Comparison of Current Graph Database Models 

Renzo Angles 

Design of Declarative Graph Query Languages: On the 

Choice between Value, Pattern and Object-based Representations 

for Graphs 

Hasan M Jamil 

Benchmarking traversal operations over graph databases 

Marek Ciglan, Alex Averbuch, and Ladislav Hluchy 

Mining Associations Using Directed Hypergraphs 

Ramanuja Simha, Rahul Tripathi, and Mayur Thakur 

2 – 3:30 pm Research session (Invited papers) 


Finding Skyline Nodes in Large Networks 

Arijit Khan, Vishwakarma Singh, and Jian Wu 

Partitioning Social Networks for Fast Retrieval of Time-dependent Queries 

Mindi Yuan, David Stein, Berenice Carrasco, Joana M. F. 

Trindade, and Yi Lu 

Will Graph Data Management Techniques Contribute 

to the Successful Large-Scale Deployment of Semantic 

Web Technologies? 

Philippe Cudre-Mauroux 



4 – 6 pm Industrial session 


Virtuoso 7 - Column Store and Adaptive Techniques for 

Graph 

Orri Erling 

HyperGraphDB: Model and Applications 

Borislav Iordanov 

The Bigdata(r) parallel graph database 

Bryan Thompson 

RDF Graph Stores 

Christopher J. Matheus 

ICDE Workshop on Secure Data Management on Smartphones and Mobiles (SDMSM 2012)

http://dig.csail.mit.edu/2012/ICDE-SDMSM/ 

There has been a widespread adoption of powerful mobile devices such as smartphones and tablets within the enterprise in the recent past. This widespread adoption of mobile devices raises serious data management challenges around data privacy and security of personal and enterprise data on these devices. The further adoption of mobile devices within the enterprise depends on strong guarantees that the enterprise is still in control of its sensitive data on mobile endpoints in the wild, and no data leakage or unauthorized modifications to the data can happen through these devices. Popular mobile platforms such as Android and iOS allow users to download apps from respective marketplaces, and enterprises can host their own marketplaces to distribute their own apps. However, given the personal nature of these devices, most users run both enterprise as well as personal apps on the same device simultaneously. Since most apps on the public marketplaces are not security certified, and existing platform security solutions are lacking, for example by being coarse grained or being checked only at application install time, it is possible for malicious apps to steal/modify enterprise sensitive information that is resident on these devices. Similarly, given the compact dimensions of mobile devices such as smartphones, users could potentially lose their phones, which carry sensitive data. Furthermore, most devices come packed with an array of sensors and communication capabilities such as GPS, cameras, near field communication (NFC), accelerometers, WiFi and Bluetooth. These myriad on-device sensors generate large amounts of raw sensor data and managing this data to infer high-level events about the user and the end device remains a challenge. Additionally, devices like iPads and Internet tablets are now being increasingly used in a multi-user environment where continuous and secure authentication and authorization for data access are critical. In this workshop, we focus on the data management challenges that arise from the use of enterprise and other privacy sensitive data on mobile devices such as smartphones.

Program 

9 – 9:15 am Opening address & speaker introduction 

9:15 – 10 am Invited Talk: “Privacy in Mobile, Collaborative, Context-aware Systems” 

Prof. Tim Finin 

10 – 10:30 am Break 

10:30 – noon Research papers (3 papers, 30 mins each) 

noon – 2 pm Lunch break 

2 – 2:45 pm Invited talk 

2:45 – 3:30 pm Panel: “Managing Data on Smart Phones: Enterprises and Beyond” 

3:30 – 4 pm Break 

4 – 5 pm Research papers (3 papers, 30 mins each) 

5 – 5:15 pm Group discussion 

5:15 – 5:30 pm Closing remarks 

7th International Workshop on Self-Managing Database Systems (SMDB 2012)

http://smdb2012.dvs.informatik.tu-darmstadt.de/ 

Autonomic, or self-managing, systems are a promising approach to achieve the goal of systems that are easier to use and maintain in the face of growing system complexity. A system is considered to be autonomic if it is self-configuring, self-optimizing, self-healing and/or self-protecting. The aim of the SMDB workshop is to provide a forum for researchers from both industry and academia to present and discuss ideas and experiences related to self-management and self-organization in all areas of Information Management (IM) in general. SMDB targets not only classical databases but also the new generation of storage engines such as column stores, key-value stores and in-memory databases. Beyond databases, SMDB aims to cover autonomic aspects of data intensive systems represented by large-scale map-reduce (e.g., Hadoop) and cloud environments where much work on self-management is needed. Last but not least, SMDB wants to expand its horizons to include self-management of non-traditional, new areas of IM such as social networks and peer-to-peer systems. 



Program 

9 – 10 am Session 1 

10 – 10:30 am Break 

10:30 am – 12:30 pm Session 2 

12:30 – 2 pm Lunch break 


Opening 

Alejandro Buchmann (TU Darmstadt), Malu Castellanos 

(HP Labs) 

Keynote: Quantitative Methods for Workload Management 

in Integrated Large Scale Data Platforms 

Nachum Shacham (eBay) 

Discovering Indicators for Congestion in DBMSs 

Mingyi Zhang (Queen’s University, Canada), Pat Martin 

(Queen’s University), Wendy Powley (Queen’s University), 

Paul Bird (IBM Toronto Lab), and Keith McDonald (IBM 

Toronto Lab) 

Online load balancing in parallel database queries with 

model predictive control 

Anastasios Gounaris (Aristotle University of Thessaloniki), 

and Christos Yfoulis (ATEI of Thessaloniki) 

Same Queries, Different Data: Can we Predict Runtime 

Performance? 

Adrian Daniel Popescu (EPFL), Vuk Ercegovac (IBM 

Almaden), Andrey Balmin (IBM Almaden), Miguel Branco 

(EPFL), and Anastasia Ailamaki (EPFL) 

Elastic Scale-out for Partition-Based Database Systems 

Umar Farooq Minhas (University of Waterloo), Rui Liu 

(University of Waterloo), Ashraf Aboulnaga (University of 

Waterloo), Ken Salem (University of Waterloo), Jonathan 

Ng (University of Waterloo), and Sean Robertson 

(University of Waterloo)

2 – 3:30 pm Session 3 

3:30 – 4 pm Break 



Adaptive class-based scheduling of continuous queries 

Lory Al Moakar (University of Pittsburgh), Alexandros 

Labrinidis (University of Pittsburgh), and Panos 

Chrysanthis (University of Pittsburgh) 

Adaptive Provisioning of Stream Processing Systems in 

the Cloud 

Javier Cerviño (Universidad Politécnica de Madrid), Evangelia Kalyvianaki (Imperial College London), Joaquín Salvachúa (Universidad Politécnica de Madrid), and Peter Pietzuch (Imperial College London) 

Lifting the burden of history in adaptive ordering of 

pipelined stream filters 

Efthymia Tsamoura (Aristotle University of Thessaloniki), 

Anastasios Gounaris (Aristotle University of Thessaloniki), 

and Yannis Manolopoulos (Aristotle University of 

Thessaloniki) 

Adaptive Index Buffer 

Hannes Voigt (TU Dresden), Tobias Jaekel (TU Dresden), 

Thomas Kissinger (TU Dresden), and Wolfgang Lehner 

(TU Dresden) 

Application of Micro-Specialization to Query Evaluation 

Operators 

Rui Zhang (University of Arizona), Richard Snodgrass (University of Arizona), and Saumya Debray (University of Arizona) 

Automatic Data Placement in MPP Databases 

Carlos Garcia-Alvarado (University of Houston), Venkatesh 

Raghavan (Greenplum EMC), Sivaramakrishnan 

Narayanan (Greenplum EMC), and Florian Waas (Greenplum 

EMC) 

Discussion & closing 

Alejandro Buchmann (TU Darmstadt), and Malu Castellanos 

(HP Labs) 



International Workshop on Spatio Temporal Data Integration and Retrieval (STIR 2012)

http://research.ihost.com/stir12/ 

The increasing world population is putting higher demands on the planet’s limited resources due to shifting life-styles. Consequently, we not only need to monitor how we consume resources but also optimize resource usage. Some examples of the planet’s limited resources are water, energy, land, food and air. Today, significant challenges exist for reducing usage of these resources, while maintaining quality of life. The challenges range from understanding regionally varied impacts of global environmental change, through tracking diffusion of avian flu and responding to natural disasters, to adapting business practice to dynamically changing resources, markets and geopolitical situations. This workshop is focused on making the research in information integration and retrieval more relevant to the challenges in systems with significant spatial and temporal components. The workshop will build upon traditional themes of interest, namely integration architectures, information extraction, record linkage, named entity extraction, source meta-data learning, query execution and optimization. However, we give special emphasis to how this can be applied to integrating information arising from systems that are (likely to be) deployed over wide geographic spaces, and collect and use data that changes over time. 

Program 

8:30 – 10 am Session 1 

10:30 – noon Session 2 


Opening and Welcome 

Invited Talk: “On the Roles of Spatio-Temporal Data in 

Web Search” 

Prof. Christian S Jensen, ACM & IEEE Fellow (Aarhus 

University, Denmark) 

TNeT: Tensor-based Neighborhood Discovery in 

Traffic Networks 

Yanan Sun, Vandana P Janeja, Aryya Gangopadhyay 

(University of Maryland, Baltimore County, USA) and 

Michael P McGuire (Towson University, USA)

noon – 1:30 pm Lunch 

1:30 – 3:30 pm Session 3 

4 – 5:30 pm Session 4 


A Study of the Correlation between the Spatial Attributes 

on Twitter 

Bumsuk Lee and Byung-Yeon Hwang (The Catholic 

University of Korea, Korea) 

Multi-representation Lens for Visual Analytics 

Sandro Danilo Gatto and Andre Santanche 

(UNICAMP, Brazil) 

Invited Talk/Panel Discussion - TBD 

Who was Where, When? Spatiotemporal Analysis of 

Researcher Mobility in Nuclear Science 

Miray Kas, Kathleen M Carley, and L. Richard Carley 

(Carnegie Mellon University, USA) 

Architecting the Database Access for a IT Infrastructure 

and Data Center Monitoring tool 

Pradeep Unde, Harrick Vin, Maitreya Natu, Vaishali Kulkarni, 

Dilys Thomas, Sreeram Vasudevan, Amruta Dhondage, 

Chinmay Jog, Shivam Sahai, and Rekha Pathak (Tata 

Research Development and Design Center, Pune, India) 

Moving Objects and KML Files 

Karine Reis Ferreira, Lúbia Vinhas, Antônio Miguel Vieira 

Monteiro and Gilberto Camara (National Institute of 

Space Research, Brazil) 

Closing Remarks 


Local Information 

Washington, DC: Capital of the USA 

The city, which is located on the north bank of the Potomac River, is bordered by the states of Virginia to the southwest and Maryland to the other sides. The District has a resident population of 599,657; because of commuters from the surrounding suburbs, its population rises to over one million during the workweek. The Washington Metropolitan Area, of which the District is a part, has a population of 5.3 million, the ninth-largest metropolitan area in the country. The District has a total area of 68.3 square miles (177 km²), of which 61.4 square miles (159 km²) is land and 6.9 square miles (18 km²) (10.16%) is water. The District has three major natural flowing streams: the Potomac River and its tributaries, the Anacostia River and Rock Creek. Tiber Creek, a watercourse that once passed through the National Mall, was fully enclosed underground during the 1870s. 

The highest natural point in the District of Columbia is Point Reno, located in Fort Reno Park, in the Tenleytown neighborhood, at 409 feet (125 m) above sea level. The lowest point is sea level at the Potomac River. The geographic center of Washington is located near the intersection of 4th and L Streets NW. 

Approximately 19.4% of Washington, D.C. is parkland, which ties New York City for largest percentage of parkland among high-density U.S. cities. The U.S. National Park Service manages most of the natural habitat in Washington, D.C., including Rock Creek Park, the Chesapeake and Ohio Canal National Historical Park, the National Mall, Theodore Roosevelt Island, the Constitution Gardens, Meridian Hill Park, and Anacostia Park. The only significant area of natural habitat not managed by the National Park Service is the U.S. National Arboretum, which is operated by the U.S. Department of Agriculture. The Great Falls of the Potomac River are located upstream (northwest) of Washington. During the 19th century, the Chesapeake and Ohio Canal, which starts in Georgetown, was used to allow barge traffic to bypass the falls. 

Washington is located in the humid subtropical climate zone, exhibiting four distinct seasons. Its climate is typical of Mid-Atlantic U.S. areas removed from bodies of water. Spring and fall are warm, with low humidity, while winter is cool, with annual snowfall averaging 14.7 inches (370 mm). Average winter lows tend to be around 30 °F (-1 °C) from mid-December to mid-February. Blizzards affect Washington on average once every four to six years. The most violent storms are called “nor’easters”, which typically feature high winds, heavy rains, and occasional snow. These storms often affect large sections of the U.S. East Coast. Summers are hot and humid, with highs averaging in the upper 80s °F (lower 30s °C) and lows averaging in the upper 60s °F (lower 20s °C). The combination of heat and humidity in the summer brings very frequent thunderstorms, some of which occasionally produce tornadoes in the area. While hurricanes (or their remnants) occasionally track through the area in late summer and early fall, they have often weakened by the time they reach Washington, partly due to the city’s inland location. Flooding of the Potomac River, however, caused by a combination of high tide, storm surge, and runoff, has been known to cause extensive property damage in Georgetown. 

History 

An Algonquian people known as the Nacotchtank inhabited the area around the Anacostia River where Washington now lies when the first Europeans arrived in the 17th century; however, Native American people had largely relocated from the area by the early 18th century. Georgetown was chartered by the Province of Maryland on the north bank of the Potomac River in 1751. The town would be included within the new federal territory established nearly 40 years later. The City of Alexandria, Virginia, founded in 1749, was also originally included within the District. 

James Madison expounded the need for a federal district on January 23, 1788, in his “Federalist No. 43”, arguing that the national capital needed to be distinct from the states in order to provide for its own maintenance and safety. An attack on the Congress at Philadelphia by a mob of angry soldiers, known as the Pennsylvania Mutiny of 1783, had emphasized the need for the government to see to its own security. Therefore, the authority to establish a federal capital was provided in Article One, Section Eight, of the United States Constitution, which permits a “District (not exceeding ten miles square) as may, by cession of particular states, and the acceptance of Congress, become the seat of the government of the United States”. The Constitution does not, however, specify a location for the new capital. In what later became known as the Compromise of 1790, Madison, Alexander Hamilton, and Thomas Jefferson came to an agreement that the federal government would assume war debt carried by the states, on the condition that the new national capital would be located in the South. 

On July 16, 1790, the Residence Act provided for a new permanent capital to be located on the Potomac River, the exact area to be selected by President Washington. As permitted by the U.S. Constitution, the initial shape of the federal district was a square, measuring 10 miles (16 km) on each side, totaling 100 square miles (260 km²). During 1791–1792, Andrew Ellicott and several assistants, including Benjamin Banneker, surveyed the border of the District with both Maryland and Virginia, placing boundary stones at every mile point; many of the stones are still standing. A new “federal city” was then constructed on the north bank of the Potomac, to the east of the established settlement at Georgetown. On September 9, 1791, the federal city was named in honor of George Washington, and the district was named the Territory of Columbia, Columbia being a poetic name for the United States in use at that time. Congress held its first session in Washington on November 17, 1800. 

The Organic Act of 1801 officially organized the District of Columbia and placed the entire federal territory, including the cities of Washington, Georgetown, and Alexandria, under the exclusive control of Congress. Further, the unincorporated territory within the District was organized into two counties: the County of Washington to the east of the Potomac and the County of Alexandria to the west. Following this Act, citizens located in the District were no longer considered residents of Maryland or Virginia, thus ending their representation in Congress. 

On August 24–25, 1814, in a raid known as the Burning of Washington, British forces invaded the capital during the War of 1812, following the sacking and burning of York (modern-day Toronto). The Capitol, Treasury, and White House were burned and gutted during the attack. Most government buildings were quickly repaired, but the Capitol, which was at the time largely under construction, was not completed in its current form until 1868. 

Since 1800, the District’s residents have protested their lack of voting representation in Congress. To correct this, various proposals have been offered to return the land ceded to form the District back to Maryland and Virginia. This process is known as retrocession. However, such efforts failed to earn enough support until the 1830s when the District’s southern county of Alexandria went into economic decline due to neglect by Congress. Alexandria was also a major market in the American slave trade, and rumors circulated that abolitionists in Congress were attempting to end slavery in the District; such an action would have further depressed Alexandria’s economy. Unhappy with congressional authority over Alexandria, in 1840 the people began to petition for the retrocession of the District’s southern territory back to Virginia. The state legislature complied in February 1846, partly because the return of Alexandria provided two additional pro-slavery delegates to the Virginia General Assembly. On July 9, 1846, Congress agreed to return all of the District’s territory south of the Potomac River to the Commonwealth of Virginia. 

Confirming the fears of pro-slavery Alexandrians, the Compromise of 1850 outlawed the slave trade in the District, though not slavery itself. By 1860, approximately 80% of the city’s African American residents were free blacks. The outbreak of the American Civil War in 1861 led to notable growth in the District’s population due to the expansion of the federal government and a large influx of freed slaves. In 1862, President Abraham Lincoln signed the Compensated Emancipation Act, which ended slavery in the District of Columbia and freed about 3,100 enslaved persons, nine months prior to the Emancipation Proclamation. By 1870, the District’s population had grown to nearly 132,000. Despite the city’s growth, Washington still had dirt roads and lacked basic sanitation; the situation was so bad that some members of Congress proposed moving the capital elsewhere. 

With the Organic Act of 1871, Congress created a new government for the entire federal territory. This Act effectively combined the City of Washington, Georgetown, and Washington County into a single municipality officially named the District of Columbia. Even though the City of Washington legally ceased to exist after 1871, the name continued in use and the whole city became commonly known as Washington, D.C. In the same Organic Act, Congress also appointed a Board of Public Works charged with modernizing the city. In 1873, President Grant appointed the board’s most influential member, Alexander Shepherd, to the new post of governor. That year, Shepherd spent $20 million on public works ($357 million in 2007), which modernized Washington but also bankrupted the city. In 1874, Congress abolished Shepherd’s office in favor of direct rule. Additional projects to renovate the city were not executed until the McMillan Plan in 1901. 

The District’s population remained relatively stable until the Great Depression in the 1930s when President Franklin D. Roosevelt’s New Deal legislation expanded the bureaucracy in Washington. World War II further increased government activity, adding to the number of federal employees in the capital; by 1950, the District’s population had reached a peak of 802,178 residents. The Twenty-third Amendment to the United States Constitution was ratified in 1961, granting the District three votes in the Electoral College. 

After the assassination of civil rights leader Dr. Martin Luther King, Jr., on April 4, 1968, riots broke out in the District, primarily in the U Street, 14th Street, 7th Street, and H Street corridors, centers of black residential and commercial areas. The riots raged for three days until over 13,000 federal and National Guard troops managed to quell the violence. Many stores and other buildings were burned; rebuilding was not complete until the late 1990s. In 1973, Congress enacted the District of Columbia Home Rule Act, providing for an elected mayor and city council for the District. In 1975, Walter Washington became the first elected and first black mayor of the District. However, during the later 1980s and 1990s, city administrations were criticized for mismanagement and waste. In 1995, Congress created the District of Columbia Financial Control Board to oversee all municipal spending and rehabilitate the city government. The District regained control over its finances in September 2001 and the oversight board’s operations were suspended. 

Attractions in Washington, D.C. 

White House 

The White House is the official residence and principal workplace of the President of the United States. Located at 1600 Pennsylvania Avenue NW in Washington, D.C., it was designed by Irish-born James Hoban and built between 1792 and 1800 in the late Georgian style. It has been the residence of every U.S. President since John Adams. In 1814, during the War of 1812, the mansion was set ablaze by the British Army in the Burning of Washington, destroying the interior and charring much of the exterior. Reconstruction began almost immediately, and President James Monroe moved into the partially reconstructed house in October 1817. Under Harry S. Truman, the interior rooms were completely dismantled and a new internal load-bearing steel frame constructed inside the walls. Once this work was completed, the interior rooms were rebuilt. Today, the White House Complex includes the Executive Residence, West Wing, Cabinet Room, Roosevelt Room, East Wing, and the Old Executive Office Building, which houses the executive offices of the President and Vice President. 

The Washington Monument is an obelisk near the west end of the National Mall in Washington, D.C., built to commemorate the first U.S. president, General George Washington. The monument is both the world’s tallest stone structure and the world’s tallest obelisk, standing 555 feet 5⅛ inches (169.294 m). There are taller monumental columns, but they are neither all stone nor true obelisks. The cornerstone was laid on July 4, 1848. The same trowel was used that George Washington used to lay the cornerstone of the Capitol way back in 1793.

Lincoln Memorial

The Lincoln Memorial commemorates the life of Abraham Lincoln, the 16th President of the United States. It is located in Potomac Park, Washington, D.C. The Memorial was designed by Henry Bacon; the style is that of a Greek Doric temple with 36 enormous columns. Inside the building is a huge statue of a sitting Lincoln. Also in the Memorial are two murals, and stone engravings of Lincoln’s second inaugural address and the Gettysburg Address.

On August 28, 1963, Martin Luther King, Jr., made his “I Have a Dream” speech on the steps of the Lincoln Memorial (the speech was delivered on the landing 18 steps below Lincoln’s statue); there is now an inscription on the step where Dr. King stood, commemorating that historic event. Dr. King was speaking at the March on Washington for Jobs and Freedom.

National World War II Memorial

The World War II Memorial honors the 16 million who served in the armed forces of the U.S., the more than 400,000 who died, and all who supported the war effort from home. Symbolic of the defining event of the 20th century, the memorial is a monument to the spirit, sacrifice, and commitment of the American people. The Second World War is the only 20th century event commemorated on the National Mall’s central axis.

Japanese Cherry Blossom Trees 

The National Cherry Blossom Festival is a spring celebration in Washington, D.C. commemorating the March 27, 1912, gift of Japanese cherry trees from Mayor Yukio Ozaki of Tokyo to the city of Washington. Mayor Ozaki donated the trees in an effort to enhance the growing friendship between the United States and Japan and also celebrate the continued close relationship between the two nations.

In 1994 the Festival was expanded to two weeks to accommodate the many activities that happen during the trees’ blooming. Today the National Cherry Blossom Festival is coordinated by the National Cherry Blossom Festival, Inc., an umbrella organization consisting of representatives of business, civic, and governmental organizations. More than 700,000 people visit Washington each year to admire the blossoming cherry trees that herald the beginning of spring in the nation’s capital.

This year’s Festival (100th Anniversary of the Gift of Trees) will be March 31 – April 15, with the parade on Saturday, April 14. (www.nationalcherryblossomfestival.org)



Franklin Delano Roosevelt Memorial 

Located along the famous Cherry Tree Walk on the Western edge of the Tidal Basin near the National Mall, this is a memorial not only to FDR, but also to the era he represents. The memorial traces twelve years of American history through a sequence of four outdoor rooms - each one devoted to one of FDR’s terms of office. Sculptures inspired by photographs depict the 32nd president: a 10-foot statue shows him in a wheeled chair; a bas-relief depicts him riding in a car during his first inaugural. At the very beginning of the memorial in a prologue room there is a statue with FDR seated in a wheelchair much like the one he actually used.

Jefferson Memorial 

This presidential memorial is dedicated to Thomas Jefferson, an American founding father and the third president of the United States. The neoclassical building was designed by John Russell Pope. Construction began in 1939, the building was completed in 1943, and the bronze statue of Jefferson was added in 1947. When completed, the memorial occupied one of the last significant sites left in the city. In 2007, it was ranked fourth on the List of America’s Favorite Architecture by the American Institute of Architects.

Smithsonian 

This is an educational foundation chartered by Congress in 1846 that maintains most of the nation’s official museums and galleries in Washington, D.C. The U.S. government partially funds the Smithsonian, thus making its collections open to the public free of charge. The most visited of the Smithsonian museums in 2007 was the National Museum of Natural History located on the National Mall. Other Smithsonian Institution museums and galleries located on the mall are: the National Air and Space Museum; the National Museum of African Art; the National Museum of American History; the National Museum of the American Indian; the Sackler and Freer galleries, which both focus on Asian art and culture; the Hirshhorn Museum and Sculpture Garden; the Arts and Industries Building; the S. Dillon Ripley Center; and the Smithsonian Institution Building (also known as “The Castle”), which serves as the institution’s headquarters. The Smithsonian American Art Museum (formerly known as the National Museum of American Art) and the National Portrait Gallery are located in the same building, the Donald W. Reynolds Center, near Washington’s Chinatown. The Reynolds Center is also known as the Old Patent Office Building. The Renwick Gallery is officially part of the Smithsonian American Art Museum but is located in a separate building near the White House. Other Smithsonian museums and galleries include: the Anacostia Community Museum in Southeast Washington; the National Postal Museum near Union Station; and the National Zoo in Woodley Park.

National Gallery of Art 

The National Gallery is located on the National Mall near the Capitol, but is not a part of the Smithsonian Institution. It is instead wholly owned by the U.S. government; thus admission to the gallery is free. The gallery’s West Building features the nation’s collection of American and European art through the 19th century. The East Building, designed by architect I. M. Pei, features works of modern art. The Smithsonian American Art Museum and the National Portrait Gallery are often confused with the National Gallery of Art when they are in fact entirely separate institutions. The National Building Museum occupies the former Pension Building located near Judiciary Square, and was chartered by Congress as a private institution to host exhibits on architecture, urban planning, and design. There are many private art museums in the District of Columbia, which house major collections and exhibits open to the public such as: the National Museum of Women in the Arts; the Corcoran Gallery of Art, the largest private museum in Washington; and the Phillips Collection in Dupont Circle, the first museum of modern art in the United States. Other private museums in Washington include the Newseum, the International Spy Museum, the National Geographic Society Museum, and the Marian Koshland Science Museum. The United States Holocaust Memorial Museum located near the National Mall maintains exhibits, documentation, and artifacts related to the Holocaust.

Performing Arts and Music 

Washington, D.C. is a national center for the arts. The John F. Kennedy Center for the Performing Arts, which is located along the Potomac River, is home to the National Symphony Orchestra, the Washington National Opera, and the Washington Ballet. The Kennedy Center Honors are awarded each year to those in the performing arts who have contributed greatly to the cultural life of the United States. The President and First Lady typically attend the Honors ceremony, as the First Lady is the honorary chair of the Kennedy Center Board of Trustees. Washington also has a local independent theater tradition. Institutions such as Arena Stage, the Shakespeare Theatre Company, and the Studio Theatre feature classic works and new American plays.

The U Street Corridor in Northwest D.C., known as “Washington’s Black Broadway”, is home to institutions like Bohemian Caverns and the Lincoln Theatre, which hosted music legends such as Washington-native Duke Ellington, John Coltrane, and Miles Davis. Other venues feature jazz and modern blues, such as Madam’s Organ in Adams Morgan and Blues Alley in Georgetown. D.C. has its own native music genre called go-go, a post-funk, percussion-driven flavor of R&B that blends live sets with relentless dance rhythms. The most accomplished practitioner was D.C. band leader Chuck Brown, who brought go-go to the brink of national recognition with his 1979 LP Bustin’ Loose.

Green Initiatives 

• 70 percent of land in Washington, DC is controlled by the National Park Service. There are 250,000 acres of parkland in the Greater Washington Metropolitan area.

• In 2007, DC was named the most walkable city in the US in a study by the Brookings Institution.

• In late 2006, City Council passed an initiative making the nation’s capital the first major city to require developers to adhere to guidelines established by the U.S. Green Building Council.

• The Washington Nationals Ballpark is striving to be the country’s first green-certified ballpark.

• The Walter E. Washington Convention Center is a green meeting facility, with earth-friendly features like low emission glass that controls heat gain and loss and maximizes natural lighting; energy-conserving heating, ventilation and air conditioning systems that operate in zones; high-efficiency lighting; automatic controls on restroom fixtures; plus recycling programs and easy public transportation access.

• DC’s hotels have implemented green initiatives, including wind power, renewable energy credits, recycling and adopt-a-park programs with neighborhood green spaces.

International DC

• 84,000 DC residents (15%) speak a language other than English at home.

• 74,000 DC residents (12%) are foreign-born.

• The Greater Washington region is home to 400 international associations, 700 internationally owned companies and more than 150 embassies and international cultural centers.


Platinum Sponsors 

Gold Sponsors 

Silver Sponsors 

Bronze Sponsor 

Supported By
