
28th IEEE International Conference on Data Engineering - ICDE 2012


28th IEEE International Conference on Data Engineering (ICDE)

Washington, DC • April 1-5, 2012

Conference Program

COVER PHOTOS: Copyright © 2012 by Tasos Kementsietsidis


Table of Contents

Message from the ICDE 2012 Program Chairs and the General Chair

Conference Organization

Conference Venue

Program at a Glance

Session Contents

Keynotes

Seminars

Panels

Awards

Abstracts

Co-Located Workshops

Local Information



Message from the ICDE 2012 Program Chairs and the General Chair

Since 1984, ICDE has established itself as a premier forum in the area of data management, providing a unique opportunity for database researchers, users, practitioners, and developers to exchange new ideas. The 28th IEEE International Conference on Data Engineering takes place in the city of Washington, United States, from April 1 to 5, 2012. We are proud to present its program in these proceedings.

Each of the main days of the conference starts out with a keynote by a distinguished scientist: Serge Abiteboul from INRIA in France on April 2; Surajit Chaudhuri from Microsoft Research in the United States on April 3; and Peter Druschel from the Max-Planck Institute for Software Systems in Germany on April 4.

We thank all the authors who submitted their work to ICDE for making the conference happen. We received 413 paper submissions for the research track, 22 submissions for the industrial track, and 68 demo proposals. The program committee was organized into fifteen topic-based tracks. Each track was headed by a vice-chair who formed a committee to evaluate the papers assigned to that track. This resulted in a research program committee consisting of 188 members for the research tracks, 12 members for the industrial track, and 30 members for the demo track. The evaluation process consisted of three distinct phases: initial reviews of the papers by PC members, some initial discussions, author responses to these reviews, and then further discussion by the PC and fine-tuning of the reviews.

The research program features 100 papers, the industrial program 9 papers, and the demonstration program 28 demos. The conference program also includes 6 seminar tutorials and one panel. As a feature of ICDE conferences in recent years, all papers are presented at a poster session. Accompanying the main conference are seven workshops.

The success of ICDE 2012 is a result of collegial teamwork from many individuals who worked tirelessly to make the conference a success. We thank Nico Bruno and Ken Ross, who served as Industrial Chairs; Christof Bornhoevd, Richard Goodwin, and Mirek Riedewald, who served as Demo Chairs; Aryya Gangopadhyay, who served as Seminar Chair; Michael Gertz and Alex Tuzhilin, who served as Panel Chairs; Anupam Joshi and Sharad Mehrotra, who served as Workshop Chairs; and also the organizers of the accompanying workshops. We also express our deep appreciation of the outstanding work put in over many months by the organization team: Nabil Adam, Alex Brodsky, and Vijay Atluri served as general (vice-)chairs, Carlotta Domeniconi and Huzefa Rangwala were the Local Organization and Sponsorship Chairs, Hui Xiong served as Finance Chair, Soon Ae Chun served as Publicity Chair, Anastasios Kementsietsidis and Marcos Vaz Salles as Proceedings Chairs, and Micah Sherr as Web Chair. We thank Carmen Saliba and Alkenia Winston from the IEEE Computer Society’s Conference Support Services for helping secure the various necessary contracts in a timely manner, and Beth Grohnke of GMU’s Office of Event Management for helping with many local arrangement issues. The Best Paper Award Committee included Minos Garofalakis (chair), Anthony Tung, and Ugur Cetintemel. Without the contributions of all of these excellent conference officers, this conference would not have been a success. We are also thankful to the many student volunteers from George Mason University.

We also thank the Microsoft CMT Team and the IEEE Conference Publications Team for their assistance and quick replies to our multitude of requests.

We also gratefully acknowledge the financial support of our sponsors: Microsoft and the National Science Foundation as Platinum Sponsors, EMC and Greenplum as Gold Sponsors, HP and IBM Research as Silver Sponsors, and Google as a Bronze Sponsor.

Finally, we thank all the authors, presenters, and participants of the conference. We hope that all of you enjoy the conference!

ICDE 2012 PC Chairs

Johannes Gehrke (Cornell University, USA)

Beng Chin Ooi (National University of Singapore, Singapore)

Evaggelia Pitoura (University of Ioannina, Greece)

ICDE 2012 General Chair

X. Sean Wang (Fudan University, China)



Conference Organization

Organizing Committee

General Chairs

X. Sean Wang (Fudan University)

Nabil R. Adam (US DHS S&T, Rutgers University)

General Vice Chairs

Alex Brodsky (George Mason University)

Vijay Atluri (Rutgers University)

Program Chairs

Johannes Gehrke (Cornell University)

Beng Chin Ooi (National University of Singapore)

Evaggelia Pitoura (University of Ioannina)

Industrial Program Chairs

Nicolas Bruno (Microsoft Research)

Liang-Jie Zhang (IBM Research)

Kenneth Ross (Columbia University)

Seminar/Tutorial Chair

Aryya Gangopadhyay (UMBC)


Workshop Chairs

Sharad Mehrotra (Univ. of California, Irvine)

Anupam Joshi (UMBC)

Panel Chairs

Alex Tuzhilin (New York University)

Michael Gertz (University of Heidelberg)

Poster Chairs

Jaideep Vaidya (Rutgers University)

Zachary Ives (University of Pennsylvania)

Demo Chairs

Christof Bornhoevd (SAP Research)

Richard Goodwin (IBM)

Mirek Riedewald (Northeastern University)

Proceedings Chairs

Anastasios Kementsietsidis (IBM)

Marcos Vaz Salles (University of Copenhagen)

Local Organization Chairs and Sponsorship Chairs

Carlotta Domeniconi (George Mason University)

Huzefa Rangwala (George Mason University)

Finance Chair

Hui Xiong (Rutgers University)

Publicity Chair

Soon Ae Chun (City University of New York)

Web Chair

Micah Sherr (Georgetown University)

Program Committee

Program Committee Area Vice Chairs

Cloud, data warehousing, and large data

Volker Markl (TU Berlin, Germany)

Data integration, metadata management, interoperability

Erhard Rahm (Univ. of Leipzig, Germany)

Data mining and knowledge discovery

Anthony Tung (National University of Singapore, Singapore)

Distributed, peer-to-peer, grid, and mobile data management

Aoying Zhou (East China Normal University, China)

Indexing and storage

Lei Chen (University of Science and Technology, Hong Kong)

Privacy and security

Elena Ferrari (University of Insubria, Italy)

Query processing and query optimization

Kaushik Chakrabarti (Microsoft Research, USA)

Scientific data and data visualization

Zachary Ives (University of Pennsylvania, USA)

Semistructured data, XML

Ioana Manolescu (INRIA, France)

Social networks, web, and personal information management

Aris Gionis (Yahoo! Research, Spain)

Spatial, temporal, and multimedia data

Heng Tao Shen (University of Queensland, Australia)

Streams, sensor networks, and complex events processing

Ugur Cetintemel (Brown University, USA)

Systems, performance, and transaction management

Bettina Kemme (McGill University, Canada)

Text, graphs, and search

Venkatesh Ganti (Google)

Uncertain and probabilistic data

Minos Garofalakis (Technical University of Crete, Greece)

Research Program Committee Members

Yanif Ahmad, Johns Hopkins University

Aris Anagnostopoulos, Sapienza University of Rome

Walid Aref, Purdue University

Ismail Ari, Ozyegin University

Soeren Auer, Leipzig School of Media


Shivnath Babu, Duke University

Roger Barga, Microsoft

Zohra Bellahsene, University of Montpellier II

Elisa Bertino, Purdue University

Claudio Bettini, University of Milan

Michael Bohlen, University of Zurich

Paolo Boldi, University of Milan

Francesco Bonchi, Yahoo! Research

Peter Boncz, CWI

Angela Bonifati, ICAR-CNR, Italy

Vinayak Borkar, University of California, Irvine

Christof Bornhoevd, SAP

Randal Burns, Johns Hopkins University

Andrea Cali, University of Oxford

Selcuk Candan, Arizona State University

Barbara Carminati, University of Insubria, Italy

Deepayan Chakrabarti, Yahoo! Research

Chee Yong Chan, National University of Singapore

Badrish Chandramouli, Microsoft

Gang Chen, Zhejiang University, China

Shimin Chen, Intel Labs Pittsburgh

Su Chen, National University of Singapore

Yi Chen, Arizona State University

Reynold Cheng, University of Hong-Kong

Sarah Cohen-Boulakia, LRI Orsay

Gao Cong, Nanyang Technological University, Singapore

Stefan Conrad, University of Dortmund

Mariano Consens, University of Toronto

Graham Cormode, AT&T Research

Isabel Cruz, University of Illinois at Chicago

Bin Cui, Beijing University, China

Alfredo Cuzzocrea, ICAR-CNR and University of Calabria, Italy

Colazzo Dario, University Paris Sud

Gautam Das, University of Texas-Arlington

Anish Das Sarma, Google

Khuzaima Daudjee, University of Waterloo

Antonios Deligiannakis, Technical University of Crete

Stefan Dessloch, University of Kaiserslautern

Zhiming Ding, Institute of Software, Chinese Academy of Science

Jens Dittrich, Universitaet Saarland

Anhai Doan, University of Wisconsin

Eduard Dragut, Purdue University

Sameh Elnikety, Microsoft

Vuk Ercegovac, IBM Almaden

Wenfei Fan, University of Edinburgh

Alan Fekete, University of Sydney


Alvaro Fernandes, University of Manchester

Johann Christoph Freytag, University of Berlin

Avigdor Gal, Technion

Helena Galhardas, Instituto Superior Tecnico, Portugal

Tingjian Ge, University of Kentucky

Bugra Gedik, IBM

Floris Geerts, University of Edinburgh

Sreenivas Gollapudi, Microsoft Research

Le Gruenwald, University of Oklahoma

Torsten Grust, University of Tuebingen

Amarnath Gupta, San Diego Supercomputing Center

Peter Haas, IBM Almaden

Jeff Hammerbacher, Cloudera

Wook-Shin Han, Korean National University

Oktie Hassanzadeh, University of Toronto

Magnus Lie Hetland, NTNU, Norway

Vagelis Hristidis, Florida International University

Zi Huang, University of Queensland

Seung-won Hwang, POSTECH, Korea

Stratos Idreos, CWI

Yoshiharu Ishikawa, Nagoya University

Ryan Johnson, University of Toronto

Theodore Johnson, AT&T Research

Panos Kalnis, King Abdullah University of Science and Technology (KAUST)

Murat Kantarcioglu, University of Texas-Dallas

Panagiotis Karras, National University of Singapore

Alfons Kemper, TU Muenchen

Eamonn Keogh, University of California, Riverside

Christoph Koch, EPFL

George Kollios, Boston University

Nick Koudas, University of Toronto

Tim Kraska, University of California Berkeley

Wang-Chien Lee, Penn State University

Ulf Leser, Humboldt University Berlin

Jure Leskovec, Stanford

Guiping Li, Renmin University of China

Feifei Li, Florida State University

Guoliang Li, Tsinghua University

Ninghui Li, Purdue University

Xiang Lian, Hong Kong University of Science and Technology

Xueming Lin, University of South Wales

Kun Liu, Yahoo! Labs

Ling Liu, Georgia Tech

Eric Lo, Hong Kong Polytechnic University

Boon Thau Loo, University of Pennsylvania

Alexander Losup, TU Delft


Hua Lu, Aalborg University

Bertram Ludaescher, University of California Davis

Bradley Malin, Vanderbilt University

Nikos Mamoulis, The University of Hong Kong

Stefan Manegold, CWI

Sebastian Maneth, NICTA, Australia

Ioana Manolescu, Inria

Alexandra Meliou, University of Washington

Paolo Missier, Newcastle University

Mohamed F. Mokbel, University of Minnesota

Mirella Moro, Universidade Federal de Minas Gerais, Brazil

Vivek Narasayya, Microsoft Research

Thomas Neumann, Technical University Munich

Silvia Nittel, University of Maine

Dan Olteanu, Oxford University

Tamer Ozsu, University of Waterloo

Thanasis Papaioannou, EPFL

Marta Patino-Martinez, Technical University of Madrid

Glenn Paulley, Sybase

Dino Pedreschi, University of Pisa

Jian Pei, Simon Fraser University

Peter Pietzuch, Imperial College London

Neoklis Polyzotis, University of California Santa Cruz

Rachel Pottinger, UBC

Sunil Prabhakar, Purdue University

Weining Qian, East China Normal University

Christoph Quix, RWTH Aachen

Ravi Ramamurthy, Microsoft

Vijayshankar Raman, IBM

Vibhor Rastogi, Yahoo! Research

Indrakshi Ray, Colorado State University

Christopher Re, University of Wisconsin-Madison

Matthias Renz, Ludwig-Maximilians-University Munich

Marcos Vaz Salles, University of Copenhagen

Jagan Sankaranarayanan, NEC Labs America

Ralf Schenkel, Saarland University

Heiko Schuldt, University of Basel

Sudipta Sengupta, Microsoft Research

Jayavel Shanmugasundaram, Google

Jie Shao, University of Melbourne

Jialie Shen, Singapore Management University

Elaine Shi, UC Berkeley

Kyuseok Shim, Seoul National University

Pavel Shvaiko, Informatica Trentina

Claudio Silva, University of Utah

Mauro Sozio, Max Planck Institute for Computer Science, Germany



Divesh Srivastava, AT&T Research

Jessica Staddon, Google

S Sudarshan, IIT Bombay

Torsten Suel, Polytechnic Institute of NYU

Kian-Lee Tan, National University of Singapore

Yufei Tao, Chinese University of Hong Kong

James Terwilliger, Microsoft

Evimaria Terzi, Boston University

Jens Teubner, ETH Zurich

Hannu Toivonen, University of Helsinki

Panayiotis Tsaparas, Microsoft Research

Antti Ukkonen, Yahoo! Research

Shivakumar Vaithyanathan, IBM Almaden

Vasilis Vassalos, Athens University of Economics and Business

Yannis Velegrakis, University of Trento

Quang Hieu Vu, EBTIC

Daisy Zhe Wang, University of Florida

Guoren Wang, Northeastern University of China

Haixun Wang, Microsoft Research

Jianyong Wang, Tsinghua University

Junhu Wang, Griffith University, Australia

Wei Wang, UNC

Kyu-Young Whang, KAIST

Andrew Witkowski, Oracle

Raymond Wong, Hong Kong University of Science and Technology

Sai Wu, National University of Singapore

Tianyi Wu, Microsoft

Xiaokui Xiao, Nanyang Technological University, Singapore

Dong Xin, Google

Jianliang Xu, Hong Kong Baptist University

Linhao Xu, IBM Research

Xifeng Yan, UCSB

Bin Yang, Max-Planck-Institut für Informatik

Jun Yang, Duke University

Linjun Yang, Microsoft Research Asia

Ke Yi, Hong-Kong University of Science and Technology

Ge Yu, Northeastern University, China

Hwanjo Yu, POSTECH

Carlo Zaniolo, UCLA

Dongxiang Zhang, National University of Singapore

Rui Zhang, University of Melbourne

Zhenjie Zhang, NUS

Minqi Zhou, East China Normal University

Xiangmin Zhou, CSIRO

Freida Zhu, Singapore Management University


Industrial Program Committee Members

Bishwaranjan Bhattacharjee, IBM

Philippe Bonnet, IT University of Copenhagen

John Cieslewicz, Aster Data

Amol Deshpande, University of Maryland

Cesar Galindo-Legaria, Microsoft

Leo Giakoumakis, Microsoft

Masaru Kitsuregawa, The University of Tokyo

Harumi Kuno, HP

Jun Rao, LinkedIn

Rajeev Rastogi, Yahoo!

Florian Waas, EMC

Mohammed Zait, Oracle

Demo Program Committee Members

Sihem Amer-Yahia, Qatar Computing Research Institute

Arvind Arasu, Microsoft Research

Sunil Arvindam, SAP Research, India

Magdalena Balazinska, University of Washington

Fabio Casati, University of Trento, Italy

Malu Castellanos, HP Labs, USA

Mariano Cilia, Intel Corporation, Argentina

Brian F Cooper, Google

Adina Crainiceanu, US Naval Academy

Abhinandan Das, Google

Alin Dobra, University of Florida

Javier Garcia-Garcia, UNAM University, Mexico

Pablo Guerrero, TU Darmstadt, Germany

Melanie Herschel, Tubingen University

Christian Konig, Microsoft Research

Georgia Koutrika, IBM Almaden Research Center

Wolfgang Lehner, TU Dresden, Germany

Feifei Li, Florida State University

Ashwin Machanavajjhala, Yahoo Research

Thomas Neumann, TU Munchen

Dan Olteanu, University of Oxford

Carlos Ordonez, University of Houston

Peter Pietzuch, Imperial College London

Lin Qiao, IBM Almaden

Berthold Reinwald, IBM Almaden, USA

Vladislav Shkapenyuk, ATT Research

Adam Silberstein, Yahoo Research

Alkis Simitsis, HP Labs

Ioana R Stanoi, IBM Almaden

Ming-Chuan Wu, Microsoft, USA

External Reviewers

Albert Angel

Pantelis Aravogliadis

Vassilis Athitsos

Evandrino Barros

Nicole Bidoit

Nicolas Bonvin

Daniele Braga

Lorenz Buehmann

Ruichu Cai

Xin Cao

Bogdan Cautis

Yi-Ling Chen

Songting Chen

Shiwen Cheng

Fei Chiang

Byron Choi

Juan Da Cruz Pinto

Maria Daltayanni

Mahashweta Das

David DeHaan

Bolin Ding

Marius Dumitru

Santiago Ezcurra

Ju Fan

Wei Feng

Chuancong Gao

Shen Ge

Haris Georgiadis

Christan Grant

Nitin Gupta

Yeye He

Arvid Heise

Haibo Hu

Heng Huang

Lili Jiang

Xin Jin

Alekh Jindal

Matti Järvisalo

Abhijith Kashyap

Asterios Katsifodimos

Batya Kenig

Arijit Khan

Julien Leblay

Jae-Gil Lee

Aurelien Lemay

Jianxin Li

Nan Li

Xingjie Liu

Shuai Ma

Vincenzo Maltese

Bruno Martins

Michael Mathioudakis

Manuel Mayr

Giansalvatore Mecca

Gengxin Miao

Pablo Michelis

Nabeel Mohamed

Miyuki Nakano

Akash Nanavati

Axel Ngonga

Anisoara Nica

Bart Niechweij

Tomasz Nykiel

Matteo Palmonari

Panagiotis Papadimitriou

Charalampos Papamanthou

Xu Pu

Jianzhong Qi

Hongda Ren

Astrid Rheinlaender

Daniele Riboni

Jan Rittinger

Senjuti Basu Roy

Eduardo Ruiz

Michael Rys

Tomer Sagi

Simonas Saltenis

Carlo Sartiani

Jörg Schad

Stefan Schuh

Pierre Senellart

Chih-Ya Shen

Reza Sherkat

Kelvin Sim

Guojie Son

Claus Stadler

Johannes Starlinger

Yizhou Sun

Andrej Taliun

Takayuki Tamura

Nan Tang

Saravanan Thirumuruganathan

Andreas Thor

Xinmei Tian

Masashi Toyoda

Frederico Ulliana

Jörg Unbehauen

Jiannan Wang

Gerhard Weikum

Zeyi Wen

Raymond Chi-Wing Wong

Yinghui Wu

Mao Ye

Peifeng Ying

Man Lung Yiu

Wenyuan Yu

Ning Zhang

Qijun Zhu

Bo Zong



Conference Venue

The conference will take place in the Renaissance Arlington Capital View Hotel, located at 2800 South Potomac Avenue, Arlington, Virginia 22202 USA. If using a GPS navigator, you may try searching for 2899 Jefferson Davis Highway, Arlington, VA 22202 as an alternative address for locating the destination.


The nearest Metro station is Crystal City (Blue and Yellow Lines).

To reach the hotel by Metro, get off at the Crystal City Metro Station. A complimentary hotel shuttle runs to and from the Crystal City Metro station every 20 minutes between 7am and 11pm. (Call 1-703-413-1300 if there is a problem.)


Complimentary hotel shuttle to and from Reagan (DCA) airport every (20) thirty minutes between 5am-11pm. Pick up and drop off at Terminal A (hotel shuttle area) or Gates 5 and 9 on Level 1 of Terminal B & C.

[Map labels: Nation’s Capital (Washington, DC); ICDE Hotel; Historic Old Towne (Alexandria, VA)]


The conference will be held on the second floor of the hotel.

[Floor plan labels: Registration; Internet Room]


Program at a Glance

8am: Breakfast each day (Prefunction Area)

9am-10am

10am-10:30am

10:30am-noon

Noon-2pm

2pm-3:30pm

3:30pm-4pm

4pm-5:30pm

Afternoon & evening

Sunday, April 1

Workshops: DGSS (Studio F), SMDB (Studio B), STIR (Studio D), and DESWEB (Studio E)

9am-10am: Workshops

10am-10:30am: Coffee Break

10:30am-noon: Workshops

Noon-2pm: Lunch Break (on your own)

2pm-3:30pm: Workshops

3:30pm-4pm: Coffee Break

4pm-5:30pm: Workshops

Evening: Reception, 5:30-8 (Salon 4567)

MOnday, april 2<br />

Keynote 1 (Sal<strong>on</strong> 4567)<br />

Serge Abiteboul<br />

Coffee Break<br />

Sessi<strong>on</strong> 1 (Studio F)<br />

Privacy<br />

Sessi<strong>on</strong> 2 (Studio B)<br />

Web 2.0 Applicati<strong>on</strong>s<br />

Sessi<strong>on</strong> 3 (Studio C)<br />

Storage Management<br />

Sessi<strong>on</strong> 4 (Studio D)<br />

<strong>Data</strong> Streams Processing<br />

Seminar 1 (Sal<strong>on</strong> 123)<br />

Demo Group 1 (Studio E)<br />

Business Lunch and Award<br />

Cerem<strong>on</strong>y (Sal<strong>on</strong> 4567)<br />

Sessi<strong>on</strong> 5 (Studio F) Graphs<br />

Sessi<strong>on</strong> 6 (Studio B)<br />

Uncertain and Probabilistic<br />

<strong>Data</strong>bases<br />

Sessi<strong>on</strong> 7 (Studio C) <strong>Data</strong><br />

Integrati<strong>on</strong> and Extracti<strong>on</strong><br />

Sessi<strong>on</strong> 8 (Studio D)<br />

Spatio-Temporal <strong>Data</strong><br />

Management<br />

Seminar 2 (Sal<strong>on</strong> 123)<br />

Demo Group 2 (Studio E)<br />

Coffee Break<br />

Sessi<strong>on</strong> 9 (Studio F)<br />

Query Processing<br />

Sessi<strong>on</strong> 10 (Studio B) Locati<strong>on</strong><br />

Aware <strong>Data</strong> Processing<br />

Sessi<strong>on</strong> 11 (Studio C) Map-<br />

Reduce based <strong>Data</strong> Processing<br />

Sessi<strong>on</strong> 12 (Studio D)<br />

Social Media<br />

Seminar 3 (Sal<strong>on</strong> 123)<br />

Demo Group 3 (Studio E)<br />

nSF icde <strong>2012</strong> career<br />

panel 7:30-9PM (Sal<strong>on</strong> 123)


tueSday, april 3<br />

Keynote 2 (Sal<strong>on</strong> 4567)<br />

Surajit Chaudhuri<br />

Coffee Break<br />

Sessi<strong>on</strong> 13 (Studio F)<br />

P2P and Distributed<br />

Processing<br />

Sessi<strong>on</strong> 14 (Studio B)<br />

XML and RDF <strong>Data</strong><br />

Management<br />

Sessi<strong>on</strong> 15 (Studio C)<br />

Performance<br />

Industrial Sessi<strong>on</strong> 1<br />

(Studio D) Support for<br />

Large Scale <strong>Data</strong> Analytics<br />

Seminar 4 (Sal<strong>on</strong> 123)<br />

Demo Group 4 (Studio E)<br />

Funders sessi<strong>on</strong> with lunch<br />

(Sal<strong>on</strong> 4567)<br />

Sessi<strong>on</strong> 16 (Studio F) <strong>Data</strong><br />

Extracti<strong>on</strong> and Quality<br />

Sessi<strong>on</strong> 17 (Studio B)<br />

Top-K Processing<br />

Industrial Sessi<strong>on</strong> 2<br />

(Studio C) Evolving Platforms<br />

for New Applicati<strong>on</strong>s<br />

Seminar 5 (Studio 123)<br />

Panel (Studio D) The Future<br />

of Scientific <strong>Data</strong> Bases<br />

Demo Group 1 (Studio E)<br />

Coffee Break<br />

Posters (Sal<strong>on</strong> 4567)<br />

cruiSe and Banquet<br />

5:30PM (Bus leaves hotel)<br />

WedneSday, april 4<br />

Keynote 3 (Sal<strong>on</strong> 4567)<br />

Peter Druschel<br />

Coffee Break<br />

Sessi<strong>on</strong> 18 (Studio F)<br />

Similarity<br />

Sessi<strong>on</strong> 19 (Studio B)<br />

Text and Strings<br />

Sessi<strong>on</strong> 20 (Studio C)<br />

Query Processing II<br />

Industrial Sessi<strong>on</strong> 3<br />

(Studio D) Indexing,<br />

Updates and Processing<br />

Seminar 6 (Sal<strong>on</strong> 123)<br />

Demo Group 2 (Studio E)<br />

Lunch (Sal<strong>on</strong> 4567)<br />

Sessi<strong>on</strong> 21 (Studio F)<br />

<strong>Data</strong> Mining<br />

Sessi<strong>on</strong> 22 (Studio B)<br />

Scientific <strong>Data</strong>, Analysis<br />

and Visualizati<strong>on</strong><br />

Sessi<strong>on</strong> 23 (Studio D)<br />

Similarity Search and<br />

Detecti<strong>on</strong><br />

Demo Group 3 (Studio E)<br />

Coffee Break<br />

Sessi<strong>on</strong> 24 (Studio B)<br />

Sensors Network and<br />

Trajectory<br />

Sessi<strong>on</strong> 25 (Studio D)<br />

Error Reducti<strong>on</strong> and<br />

<strong>Data</strong> Security<br />

Demo Group 4 (Studio E)<br />

Program at a Glance<br />

tHurSday, april 5<br />

WOrKSHOpS<br />

DMC (Studio B),<br />

GDM (Studio D), and<br />

SDMSM (Studio F)<br />

Coffee Break<br />

DMC (Studio B),<br />

GDM (Studio D), and<br />

SDMSM (Studio F)<br />

Lunch Break<br />

DMC (Studio B),<br />

GDM (Studio D), and<br />

SDMSM (Studio F)<br />

Coffee Break<br />

DMC (Studio B),<br />

GDM (Studio D), and<br />

SDMSM (Studio F)<br />



Session Contents

Sunday, April 1

8AM - 9AM Breakfast (Prefunction)

9AM - 5:30PM Workshops
Studio F: Data-Driven Decision Guidance and Support Systems (DGSS)
Studio B: Self-Managing Database Systems (SMDB)
Studio D: Spatio Temporal Data Integration and Retrieval (STIR)
Studio E: Data Engineering Meets the Semantic Web (DESWEB)

5:30PM - 8PM Conference Reception (Salon 4567)

Monday, April 2

8AM - 9AM Breakfast (Prefunction)

9AM - 10AM Keynote 1 (Salon 4567): Serge Abiteboul - Viewing the Web as a Distributed Knowledge Base
Session Chair: Evaggelia Pitoura

10AM - 10:30AM Coffee Break

10:30AM - Noon Sessions 1-4, Seminar 1, Demo Group 1

Session 1: Privacy (Studio F)
Session Chair: Murat Kantarcioglu

Privacy in Social Networks: How Risky is Your Social Graph?
Cuneyt Gurcan Akcora (University of Insubria)
Barbara Carminati (University of Insubria)
Elena Ferrari (University of Insubria)

Differentially Private Spatial Decompositions
Graham Cormode (AT&T Labs – Research)
Cecilia Procopiuc (AT&T Labs – Research)
Entong Shen (North Carolina State University)
Divesh Srivastava (AT&T Labs – Research)
Ting Yu (North Carolina State University)

Differentially Private Histogram Publication
Jia Xu (Northeastern University, China)
Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)
Xiaokui Xiao (Nanyang Technological University)
Yin Yang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)
Ge Yu (Northeastern University, China)

Privacy-Preserving and Content-Protecting Location Based Queries
Russell Paulet (Victoria University)
Md. Golam Kaosar (Victoria University)
Xun Yi (Victoria University)
Elisa Bertino (Purdue University)

Session 2: Web 2.0 Applications (Studio B)
Session Chair: Kyuseok Shim

GeoFeed: A Location-Aware News Feed
Jie Bao (University of Minnesota at Twin Cities)
Mohamed F. Mokbel (University of Minnesota at Twin Cities)
Chi-Yin Chow (City University of Hong Kong)

Entity Search Strategies for Mashup Applications
Stefan Endrullis (University of Leipzig)
Andreas Thor (University of Leipzig)
Erhard Rahm (University of Leipzig)

CI-Rank: Ranking Keyword Search Results Based on Collective Importance
Xiaohui Yu (York University & Shandong University)
Huxia Shi (York University)

Temporal Analytics on Big Data for Web Advertising
Badrish Chandramouli (Microsoft Research)
Jonathan Goldstein (Microsoft Corporation)
Songyun Duan (IBM T. J. Watson Research Center)

Session 3: Storage Management (Studio C)
Session Chair: Alfons Kemper

Lookup Tables: Fine-Grained Partitioning for Distributed Databases
Aubrey L. Tatarowicz (MIT)
Carlo Curino (MIT)
Evan P. C. Jones (MIT)
Sam Madden (MIT)

Temporal Support for Persistent Stored Modules
Richard T. Snodgrass (University of Arizona)
Dengfeng Gao (IBM Silicon Valley Lab)
Rui Zhang (University of Arizona)
Stephen W. Thomas (Queen’s University, Kingston)

Energy Efficient Storage Management Cooperated with Large Data Intensive Applications
Norifumi Nishikawa (The University of Tokyo)
Miyuki Nakano (The University of Tokyo)
Masaru Kitsuregawa (The University of Tokyo)

ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression
Eric R. Schendel (North Carolina State University)
Ye Jin (North Carolina State University)
Neil Shah (North Carolina State University)
Jackie Chen (Sandia National Laboratory)
C.S. Chang (Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA)
Seung-Hoe Ku (New York University)
Stephane Ethier (Princeton Plasma Physics Laboratory)
Scott Klasky (Oak Ridge National Laboratory)
Robert Latham (Argonne National Laboratory)
Robert Ross (Argonne National Laboratory)
Nagiza F. Samatova (North Carolina State University & Oak Ridge National Laboratory)


Session 4: Data Streams Processing (Studio D)
Session Chair: Bugra Gedik

Physically Independent Stream Merging
Badrish Chandramouli (Microsoft Research)
David Maier (Portland State University)
Jonathan Goldstein (Microsoft Corporation)

On Computing Correlated Aggregates over a Data Stream
Srikanta Tirthapura (Iowa State University)
David P. Woodruff (IBM Almaden Research Center)

Accuracy-Aware Uncertain Stream Databases
Tingjian Ge (University of Kentucky)
Fujun Liu (University of Kentucky)

On Discovery of Traveling Companions from Streaming Trajectories
Lu-An Tang (UIUC)
Yu Zheng (MSRA)
Jing Yuan (MSRA)
Jiawei Han (UIUC)
Alice Leung (BBN)
Chih-Chieh Hung (Yahoo!)
Wen-Chih Peng (NCTU)

Seminar 1 (Salon 123)

Data Management Issues on the Semantic Web
Oktie Hassanzadeh (University of Toronto & IBM Research)
Anastasios Kementsietsidis (IBM Research)
Yannis Velegrakis (University of Trento)

Demo Group 1 (Studio E)

SMIX Live – A Self-Managing Index Infrastructure for Dynamic Workloads
Thomas Kissinger (Dresden University of Technology)
Hannes Voigt (Dresden University of Technology)
Wolfgang Lehner (Dresden University of Technology)

Multi-Query Stream Processing on FPGAs
Mohammad Sadoghi (University of Toronto)
Rohan Palaniappan (University of Toronto)
Rija Javed (University of Toronto)
Naif Tarafdar (University of Toronto)
Harsh Singh (University of Toronto)
Hans-Arno Jacobsen (University of Toronto)

EUDEMON: A System for Online Video Frame Copy Detection by Earth Mover Distance
Jia Xu (Northeastern University, China)
Qiushi Bai (Northeastern University, China)
Yu Gu (Northeastern University, China)
Anthony Tung (National University of Singapore)
Guoren Wang (Northeastern University, China)
Ge Yu (Northeastern University, China)
Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)

A Dataset Search Engine for the Research Document Corpus
Meiyu Lu (National Univ. of Singapore)
Srinivas Bangalore (AT&T Research Labs)
Graham Cormode (AT&T Labs – Research)
Marios Hadjieleftheriou (AT&T Labs – Research)
Divesh Srivastava (AT&T Labs – Research)

AskFuzzy: Attractive Visual Fuzzy Query Builder
Keivan Kianmehr (University of Western Ontario)
Negar Koochakzadeh (University of Calgary)
Reda Alhajj (University of Calgary)

F2DB: The Flash-Forward Database System
Ulrike Fischer (Dresden University of Technology)
Frank Rosenthal (Dresden University of Technology)
Wolfgang Lehner (Dresden University of Technology)

Provenance-Based Debugging and Drill-Down in Data-Oriented Workflows
Robert Ikeda (Stanford University)
Junsang Cho (Stanford University)
Charlie Fang (Stanford University)
Semih Salihoglu (Stanford University)
Satoshi Torikai (Stanford University)
Jennifer Widom (Stanford University)

Noon - 2PM Business Lunch & Award Ceremony (Salon 4567)

2PM - 3:30PM Sessions 5-8, Seminar 2, Demo Group 2

Session 5: Graphs (Studio F)
Session Chair: Sameh Elnikety

Iterative Graph Feature Mining for Graph Indexing
Dayu Yuan (Penn State University)
Prasenjit Mitra (Penn State University)
Huiwen Yu (Penn State University)
C. Lee Giles (Penn State University)

An Efficient Graph Indexing Method
Xiaoli Wang (National University of Singapore)
Xiaofeng Ding (Huazhong University of Science and Technology)
Anthony K.H. Tung (National University of Singapore)
Shanshan Ying (National University of Singapore)
Hai Jin (Huazhong University of Science and Technology)

PRAGUE: Towards Blending Practical Visual Subgraph Query Formulation and Query Processing
Changjiu Jin (Nanyang Technological University)
Sourav S Bhowmick (Nanyang Technological Univ)
Byron Choi (Hong Kong Baptist University)
Shuigeng Zhou (Fudan University)

Ego-centric Graph Pattern Census
Walaa Eldin Moustafa (University of Maryland, College Park)
Amol Deshpande (University of Maryland, College Park)
Lise Getoor (University of Maryland, College Park)

Session 6: Uncertain and Probabilistic Databases (Studio B)
Session Chair: Elena Ferrari

Searching Uncertain Data Represented by Non-Axis Parallel Gaussian Mixture Models
Katrin Haegler (University of Munich)
Frank Fiedler (University of Munich)
Christian Boehm (University of Munich)

Aggregate Query Answering on Possibilistic Data with Cardinality Constraints
Graham Cormode (AT&T Labs – Research)
Entong Shen (North Carolina State University)
Divesh Srivastava (AT&T Labs – Research)
Ting Yu (North Carolina State University)

Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data
Yongxin Tong (Hong Kong University of Science and Technology)
Lei Chen (Hong Kong University of Science and Technology)
Bolin Ding (University of Illinois at Urbana-Champaign)

Ranking Query Results in Probabilistic Databases: Complexity and Efficient Algorithms
Dan Olteanu (University of Oxford)
Hongkai Wen (University of Oxford)

Session 7: Data Integration and Extraction (Studio C)
Session Chair: Daisy Zhe Wang

Joint Entity Resolution
Steven Whang (Stanford University)
Hector Garcia-Molina (Stanford University)

A Self-Configuring Schema Matching System
Eric Peukert (SAP Research Dresden)
Julian Eberius (Dresden University of Technology)
Erhard Rahm (University of Leipzig)

Incremental Detection of Inconsistencies in Distributed Data
Wenfei Fan (University of Edinburgh)
Jianzhong Li (Harbin Institute of Technology)
Nan Tang (University of Edinburgh & Qatar Computing Research Institute)
Wenyuan Yu (University of Edinburgh)

Recomputing Materialized Instances after Changes to Mappings and Data
Todd J. Green (University of California, Davis)
Zachary G. Ives (University of Pennsylvania)

Session 8: Spatio-Temporal Data Management (Studio D)
Session Chair: Lei Chen

SWST: A Disk Based Index for Sliding Window Spatio-Temporal Data
Manish Singh (University of Michigan, Ann Arbor)
Qiang Zhu (University of Michigan, Dearborn)
H.V. Jagadish (University of Michigan, Ann Arbor)

Querying Uncertain Spatio-Temporal Data
Tobias Emrich (Ludwig-Maximilians-Universität München)
Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)
Nikos Mamoulis (University of Hong Kong)
Matthias Renz (Ludwig-Maximilians-Universität München)
Andreas Züfle (Ludwig-Maximilians-Universität München)

The Min-dist Location Selection Query
Jianzhong Qi (University of Melbourne)
Rui Zhang (University of Melbourne)
Lars Kulik (University of Melbourne)
Dan Lin (Missouri University of Science and Technology)
Yuan Xue (University of Melbourne)

Bi-level Locality Sensitive Hashing for K-Nearest Neighbor Computation
Jia Pan (UNC Chapel Hill)
Dinesh Manocha (UNC Chapel Hill)

Seminar 2 (Salon 123)

Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data
Emmanuel Müller (Karlsruhe Institute of Technology)
Stephan Günnemann (RWTH Aachen University)
Ines Färber (RWTH Aachen University)
Thomas Seidl (RWTH Aachen University)

Demo Group 2 (Studio E)

M3: Stream Processing on Main-Memory MapReduce
Ahmed M. Aly (Purdue University)
Asmaa Sallam (Purdue University)
Bala M. Gnanasekaran (Purdue University)
Long-Van Nguyen-Dinh (Purdue University)
Walid G. Aref (Purdue University)
Mourad Ouzzani (Qatar Computing Research Institute)
Arif Ghafoor (Purdue University)

A Deep Embedding of Queries into Ruby
Torsten Grust (University of Tübingen)
Manuel Mayr (University of Tübingen)

Asking the Right Questions in Crowd Data Sourcing
Rubi Boim (Tel-Aviv University)
Ohad Greenshpan (Tel-Aviv University)
Tova Milo (Tel-Aviv University)
Slava Novgorodov (Tel-Aviv University)
Neoklis Polyzotis (University of California, Santa Cruz)
Wang-Chiew Tan (University of California, Santa Cruz)

LotusX: A Position-Aware XML Graphical Search System with Auto-Completion
Chunbin Lin (Renmin University of China)
Jiaheng Lu (Renmin University of China)
Tok Wang Ling (National University of Singapore)
Bogdan Cautis (Télécom ParisTech)

Efficient Top-k Keyword Search in Graphs with Polynomial Delay
Mehdi Kargar (York University)
Aijun An (York University)

TEDAS: a Twitter Based Event Detection and Analysis System
Rui Li (University of Illinois at Urbana-Champaign)
Kin Hou Lei (Brigham Young University)
Ravi Khadiwala (University of Illinois at Urbana-Champaign)
Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign)

AutoDict: Automated Dictionary Discovery
Fei Chiang (University of Toronto)
Periklis Andritsos (University of Toronto)
Erkang Zhu (University of Toronto)
Renee Miller (University of Toronto)

3:30PM - 4PM Coffee Break

4PM - 5:30PM Sessions 9-12, Seminar 3, Demo Group 3

Session 9: Query Processing (Studio F)
Session Chair: Walid G. Aref

Learning-based Query Performance Modeling and Prediction
Mert Akdere (Brown University)
Ugur Cetintemel (Brown University)
Matteo Riondato (Brown University)
Eli Upfal (Brown University)
Stanley B. Zdonik (Brown University)

Parametric Plan Caching Using Density-Based Clustering
Gunes Aluc (University of Waterloo)
David E. DeHaan (Sybase, an SAP Company)
Ivan T. Bowman (Sybase, an SAP Company)

Effective and Robust Pruning for Top-Down Join Enumeration Algorithms
Pit Fender (Mannheim University)
Guido Moerkotte (Mannheim University)
Thomas Neumann (Technical University of Munich)
Viktor Leis (Technical University of Munich)

Towards Preference-aware Relational Databases
Anastasios Arvanitis (National Technical University of Athens)
Georgia Koutrika (IBM Almaden Research Center)

Session 10: Location Aware Data Processing (Studio B)
Session Chair: Oktie Hassanzadeh

A Foundation for Efficient Indoor Distance-Aware Query Processing
Hua Lu (Aalborg University)
Xin Cao (Nanyang Technological University)
Christian S. Jensen (Aarhus University)

LARS: A Location-Aware Recommender System
Justin J. Levandoski (Microsoft Research)
Mohamed Sarwat (University of Minnesota)
Ahmed Eldawy (University of Minnesota)
Mohamed F. Mokbel (University of Minnesota)

Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme
Miao Qiao (The Chinese University of Hong Kong)
Hong Cheng (The Chinese University of Hong Kong)
Lijun Chang (The Chinese University of Hong Kong)
Jeffrey Xu Yu (The Chinese University of Hong Kong)

Desks: Direction-Aware Spatial Keyword Search
Guoliang Li (Tsinghua University)
Jianhua Feng (Tsinghua University)
Jing Xu (Tsinghua University)


Session 11: Map-Reduce based Data Processing (Studio C)
Session Chair: Minqi Zhou

Extending Map-Reduce for Efficient Predicate-Based Sampling
Raman Grover (University of California, Irvine)
Michael Carey (University of California, Irvine)

Fuzzy Joins Using MapReduce
Foto Afrati (National Technical University Athens)
Anish Das Sarma (Google, Inc.; work initiated at Yahoo! Research)
David Menestrina (Google, Inc.)
Aditya Parameswaran (Stanford University)
Jeffrey D. Ullman (Stanford University)

Parallel Top-K Similarity Join Algorithms Using MapReduce
Younghoon Kim (Seoul National University)
Kyuseok Shim (Seoul National University)

Load Balancing in MapReduce Based on Scalable Cardinality Estimates
Benjamin Gufler (Technische Universität München)
Nikolaus Augsten (Free University of Bolzano-Bozen)
Angelika Reiser (Technische Universität München)
Alfons Kemper (Technische Universität München)

Session 12: Social Media (Studio D)
Session Chair: Zack Ives

Community Detection with Edge Content in Social Media Networks
Guo-Jun Qi (University of Illinois at Urbana-Champaign)
Charu C. Aggarwal (IBM T. J. Watson Research Center)
Thomas S. Huang (University of Illinois at Urbana-Champaign)

Cross Domain Search by Exploiting Wikipedia
Chen Liu (National University of Singapore)
Sai Wu (National University of Singapore)
Shouxu Jiang (Harbin Institute of Technology)
Anthony K.H. Tung (National University of Singapore)

Provenance-based Indexing Support in Micro-blog Platforms
Junjie Yao (Peking University)
Bin Cui (Peking University)
Zijun Xue (Peking University)
Qingyun Liu (Peking University)

Learning Stochastic Models of Information Flow
Luke Dickens (Imperial College London)
Ian Molloy (IBM T. J. Watson Research Center)
Jorge Lobo (IBM T. J. Watson Research Center)
Pau-Chen Cheng (IBM T. J. Watson Research Center)
Alessandra Russo (Imperial College London)

Seminar 3 (Salon 123)

Detecting Clones, Copying and Reuse on the Web
Xin Luna Dong (AT&T Labs–Research)
Divesh Srivastava (AT&T Labs–Research)

Demo Group 3 (Studio E)

Trust & Share: Trusted Information Sharing in Online Social Networks
Barbara Carminati (University of Insubria)
Elena Ferrari (University of Insubria)
Jacopo Girardi (University of Insubria)

Evaluation of Clusterings – Metrics and Visual Support
Elke Achtert (Ludwig-Maximilians-Universität München)
Sascha Goldhofer (Ludwig-Maximilians-Universität München)
Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)
Erich Schubert (Ludwig-Maximilians-Universität München)
Arthur Zimek (Ludwig-Maximilians-Universität München)

Horton: Online Query Execution Engine For Large Distributed Graphs
Mohamed Sarwat (University of Minnesota)
Sameh Elnikety (Microsoft Research)
Yuxiong He (Microsoft Research)
Gabriel Kliot (Microsoft Research)

MXQuery With Hardware Acceleration
Jens Teubner (ETH Zurich)
Peter Fischer (University of Freiburg)

Data3 – A Kinect Interface for OLAP using Complex Event Processing
Steffen Hirte (Ilmenau University of Technology)
Andreas Seifert (Ilmenau University of Technology)
Stephan Baumann (Ilmenau University of Technology)
Daniel Klan (Ilmenau University of Technology)
Kai-Uwe Sattler (Ilmenau University of Technology)

Analyzing Query Optimization Process: Portraits of Join Enumeration Algorithms
Anisoara Nica (Sybase, an SAP Company)
Ian Charlesworth (University of Waterloo)
Maysum Panju (University of Waterloo)

DPCube: Releasing Differentially Private Data Cubes for Health Information
Yonghui Xiao (Emory University)
James Gardner (Digital Reasoning Systems Inc.)
Li Xiong (Emory University)

7:30PM - 9PM NSF ICDE 2012 Career Panel (Salon 123)
Panel Moderator: Philip Bernstein (Microsoft Research)
Panelists: Alexandros Labrinidis (CS, UPitt), James M. Kang (NGA), Srinivasan Parthasarathy (CS, OSU), and Yuanyuan Tian (IBM Research)

Tuesday, April 3

8AM - 9AM Breakfast (Prefunction)

9AM - 10AM Keynote 2 (Salon 4567): Surajit Chaudhuri - How Different Is Big Data?
Session Chair: Beng Chin Ooi

10AM - 10:30AM Coffee Break

10:30AM - Noon Sessions 13-15, Industrial Session 1, Seminar 4, Demo Group 4

Sessi<strong>on</strong> 13: P2P and Distributed<br />

Processing (Studio F)<br />

Sessi<strong>on</strong> Chair: Guoliang Li<br />

BestPeer++: A Peer-to-Peer based Large-scale<br />

<strong>Data</strong> Processing<br />

Gang Chen (netEase.com inc. & Zhejiang university)<br />

Tianlei Hu (netEase.com inc. & Zhejiang university)<br />

dawei Jiang (nati<strong>on</strong>al university of Singapore)<br />

peng lu (nati<strong>on</strong>al university of Singapore)<br />

Kian-lee Tan (nati<strong>on</strong>al university of Singapore)<br />

Hoang Tam Vo (nati<strong>on</strong>al university of Singapore)<br />

Sai Wu (Bestpeer pte. ltd. & nati<strong>on</strong>al university of Singapore)<br />

Effective Data Density Estimation in Ring-based P2P Networks<br />

Minqi Zhou (East China Normal University)<br />

Heng Tao Shen (The University of Queensland)<br />

Xiaofang Zhou (The University of Queensland)<br />

Weining Qian (East China Normal University)<br />

Aoying Zhou (East China Normal University)<br />

Processing of Rank Joins in Highly Distributed Systems<br />

Christos Doulkeridis (Norwegian University of Science and Technology (NTNU))<br />

Akrivi Vlachou (Norwegian University of Science and Technology (NTNU))<br />

Kjetil Nørvåg (Norwegian University of Science and Technology (NTNU))<br />

Yannis Kotidis (Athens University of Economics and Business (AUEB))<br />

Neoklis Polyzotis (UC Santa Cruz (UCSC))<br />

Load Balancing for MapReduce-based Entity Resolution<br />

Lars Kolb (University of Leipzig)<br />

Andreas Thor (University of Leipzig)<br />

Erhard Rahm (University of Leipzig)<br />

Session 14: XML and RDF Data Management (Studio B)<br />

Session Chair: Dan Olteanu<br />

Mapping XML to a Wide Sparse Table<br />

Liang Jeff Chen (UCSD)<br />

Philip A. Bernstein (Microsoft Corp.)<br />

Peter Carlin (Microsoft Corp.)<br />

Dimitrije Filipovic (Microsoft Corp.)<br />

Michael Rys (Microsoft Corp.)<br />

Nikita Shamgunov (Facebook Inc.)<br />

James F. Terwilliger (Microsoft Corp.)<br />

Milos Todic (Microsoft Corp.)<br />

Sasa Tomasevic (Microsoft Corp.)<br />

Dragan Tomic (Microsoft Corp.)<br />

Querying XML Data: As You Shape It<br />

Curtis E. Dyreson (Utah State University)<br />

Sourav S. Bhowmick (Nanyang Technological University)<br />


Branch Code: A Labeling Scheme for Efficient Query Answering on Trees<br />

Yanghua Xiao (Fudan University)<br />

Ji Hong (Fudan University)<br />

Wanyun Cui (Fudan University)<br />

Zhenying He (Fudan University)<br />

Wei Wang (Fudan University)<br />

Guodong Feng (Fudan University)<br />

Scalable Multi-Query Optimization for SPARQL<br />

Wangchao Le (University of Utah)<br />

Anastasios Kementsietsidis (IBM T. J. Watson Research Center)<br />

Songyun Duan (IBM T. J. Watson Research Center)<br />

Feifei Li (University of Utah)<br />

Session 15: Performance (Studio C)<br />

Session Chair: Eric Lo<br />

GSLPI: a Cost-based Query Progress Indicator<br />

Jiexing Li (University of Wisconsin-Madison)<br />

Rimma V. Nehme (Microsoft Jim Gray Systems Lab)<br />

Jeffrey Naughton (University of Wisconsin-Madison)<br />

Micro-Specialization in DBMSes<br />

Rui Zhang (University of Arizona)<br />

Richard T. Snodgrass (University of Arizona)<br />

Saumya Debray (University of Arizona)<br />

Towards Multi-Tenant Performance SLOs<br />

Willis Lang (University of Wisconsin-Madison)<br />

Srinath Shankar (Microsoft Jim Gray Systems Lab)<br />

Jignesh M. Patel (University of Wisconsin-Madison)<br />

Ajay Kalhan (Microsoft Corp.)<br />

Multi-Version Concurrency via Timestamp Range Conflict Management<br />

David Lomet (Microsoft Research)<br />

Alan Fekete (University of Sydney)<br />

Rui Wang (Microsoft Research)<br />

Peter Ward (University of Sydney)<br />

Industrial Session 1: Support for Large Scale Data Analytics (Studio D)<br />

Session Chair: Arbee L.P. Chen<br />

Exploiting Common Subexpressions for Cloud Query Processing<br />

Yasin N. Silva (Arizona State University)<br />

Per-Ake Larson (Microsoft Research)<br />

Jingren Zhou (Microsoft Corp.)<br />

Vectorwise: a Vectorized Analytical DBMS<br />

Marcin Zukowski (Actian Netherlands)<br />

Mark van de Wiel (Actian Corp.)<br />

Peter Boncz (CWI)<br />

Scalable and Numerically Stable Descriptive Statistics in SystemML<br />

Yuanyuan Tian (IBM Almaden Research Center)<br />

Shirish Tatikonda (IBM Almaden Research Center)<br />

Berthold Reinwald (IBM Almaden Research Center)<br />

Seminar 4 (Salon 123)<br />

Mining Knowledge from Data: An Information Network Analysis Approach<br />

Jiawei Han (University of Illinois at Urbana-Champaign)<br />

Yizhou Sun (University of Illinois at Urbana-Champaign)<br />

Xifeng Yan (University of California at Santa Barbara)<br />

Philip S. Yu (University of Illinois at Chicago)<br />

Demo Group 4 (Studio E)<br />

Nyaya: a System Supporting the Uniform Management of Large Sets of Semantic Data<br />

Roberto De Virgilio (Universita’ Roma Tre)<br />

Giorgio Orsi (University of Oxford)<br />

Letizia Tanca (Politecnico di Milano)<br />

Riccardo Torlone (Universita’ Roma Tre)<br />

R2DB: A System for Querying and Visualizing Weighted RDF Graphs<br />

Songling Liu (Arizona State University)<br />

Juan Cedeno (Arizona State University)<br />

Selcuk Candan (Arizona State University)<br />

Maria Luisa Sapino (University of Turin)<br />

Shengyu Huang (Arizona State University)<br />

Xinsheng Li (Arizona State University)<br />

Project Daytona: Data Analytics as a Cloud Service<br />

Roger Barga (Microsoft)<br />

Jaliya Ekanayake (Microsoft Research)<br />

Wei Lu (Microsoft Research)<br />

Interactive User Feedback in Ontology Matching Using Signature Vector<br />

Isabel Cruz (University of Illinois at Chicago)<br />

Cosmin Stroe (University of Illinois at Chicago)<br />

Matteo Palmonari (University of Milano-Bicocca)<br />

DObjects+: Enabling Privacy-Preserving Data Federation Services<br />

Pawel Jurczyk (Google Inc.)<br />

Li Xiong (Emory University)<br />

Slawomir Goryczka (Emory University)<br />

Dragoon: An Information Accountability System for High-Performance Databases<br />

Kyriacos Pavlou (University of Arizona)<br />

Richard Snodgrass (University of Arizona)<br />

Intuitive Interaction With Encrypted Query Execution in DataStorm<br />

Ken Smith (MITRE)<br />

Ameet Kini (MITRE)<br />

William Wang (MITRE)<br />

Chris Wolf (MITRE)<br />

M. David Allen (MITRE)<br />

Andrew Sillers (MITRE)<br />

Noon - 2PM Funders Session and Lunch (Salon 4567)<br />

Panel Organizer: Frank Olken (Consultant)<br />

Panelists: Le Gruenwald (National Science Foundation), Ceren Sust (Department of Energy), and Olga Brazhnik (National Institutes of Health)<br />

2PM - 3:30PM Sessions 16-17, Industrial Session 2, Seminar 5, Panel, Demo Group 1<br />

Session 16: Data Extraction and Quality (Studio F)<br />

Session Chair: Anish Das Sarma<br />

Automatic Extraction of Structured Web Data with Domain Knowledge<br />

Nora Derouiche (Télécom ParisTech – CNRS LTCI)<br />

Bogdan Cautis (Télécom ParisTech – CNRS LTCI)<br />

Talel Abdessalem (Télécom ParisTech – CNRS LTCI)<br />

Discovering Conservation Rules<br />

Lukasz Golab (University of Waterloo)<br />

Howard Karloff (AT&T Labs–Research)<br />

Flip Korn (AT&T Labs–Research)<br />

Barna Saha (AT&T Labs–Research)<br />

Divesh Srivastava (AT&T Labs–Research)<br />

Answering Why-not Questions on Top-k Queries<br />

Zhian He (Hong Kong Polytechnic University)<br />

Eric Lo (Hong Kong Polytechnic University)<br />

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints<br />

Dong Deng (Tsinghua University)<br />

Guoliang Li (Tsinghua University)<br />

Jianhua Feng (Tsinghua University)<br />

Session 17: Top-K Processing (Studio B)<br />

Session Chair: Tingjian Ge<br />

On Top-k Structural Similarity Search<br />

Pei Lee (University of British Columbia)<br />

Laks V.S. Lakshmanan (University of British Columbia)<br />

Jeffrey Xu Yu (Chinese University of Hong Kong)<br />

Relevance Matters: Capitalizing on Less (Top-k Matching in Publish/Subscribe)<br />

Mohammad Sadoghi (University of Toronto)<br />

Hans-Arno Jacobsen (University of Toronto)<br />

Efficiently Monitoring Top-k Pairs over Sliding Windows<br />

Zhitao Shen (UNSW)<br />

Muhammad Aamir Cheema (UNSW)<br />

Xuemin Lin (UNSW & ECNU)<br />

Wenjie Zhang (UNSW)<br />

Haixun Wang (Microsoft Research Asia)<br />

Processing and Notifying Range Top-k Subscriptions<br />

Albert Yu (Duke University)<br />

Pankaj K. Agarwal (Duke University)<br />

Jun Yang (Duke University)<br />

Industrial Session 2: Evolving Platforms for New Applications (Studio C)<br />

Session Chair: Rui Zhang<br />

Earlybird: Real-Time Search at Twitter<br />

Michael Busch (Twitter)<br />

Krishna Gade (Twitter)<br />

Brian Larson (Twitter)<br />

Patrick Lok (Twitter)<br />

Samuel Luckenbill (Twitter)<br />

Jimmy Lin (Twitter)<br />

Data Infrastructure at LinkedIn<br />

LinkedIn Data Infrastructure Team<br />

The Credit Suisse Meta-data Warehouse<br />

Claudio Jossen (Credit Suisse AG)<br />

Lukas Blunschi (ETH Zurich)<br />

Magdalini Mori (Credit Suisse AG)<br />

Donald Kossmann (ETH Zurich)<br />

Kurt Stockinger (Credit Suisse AG)<br />

Panel: The Future of Scientific Data Bases (Studio D)<br />

Panel Moderator: Michael Stonebraker (MIT)<br />

Panelists: Anastasia Ailamaki (EPFL), Jeremy Kepner (MIT), and Alex Szalay (Johns Hopkins University)<br />

Seminar 5 (Salon 123)<br />

Emerging Graph Queries In Linked Data<br />

Arijit Khan (University of California, Santa Barbara)<br />

Yinghui Wu (University of California, Santa Barbara)<br />

Xifeng Yan (University of California, Santa Barbara)<br />

Demo Group 1 (Studio E)<br />

See “Demo Group 1” listing above<br />

3:30PM - 4PM Coffee Break<br />

4PM - 5:30PM Poster Session, all papers (Salon 4567)<br />

5:30PM Departure for cruise and conference banquet<br />

Wednesday, April 4<br />

8AM - 9AM Breakfast (Prefunction)<br />

9AM - 10AM Keynote 3 (Salon 4567): Peter Druschel - Accountability and Trust in Cooperative Information Systems<br />

Session Chair: Johannes Gehrke<br />

10AM - 10:30AM Coffee Break<br />

10:30AM - Noon Sessions 18-20, Industrial Session 3, Seminar 6, Demo Group 2<br />

Session 18: Similarity (Studio F)<br />

Session Chair: Matthias Renz<br />

Efficient Exact Similarity Searches using Multiple Token Orderings<br />

Jongik Kim (Chonbuk National University)<br />

Hongrae Lee (Google Inc.)<br />

Efficient Graph Similarity Joins with Edit Distance Constraints<br />

Xiang Zhao (The University of New South Wales & NICTA)<br />

Chuan Xiao (The University of New South Wales)<br />

Xuemin Lin (The University of New South Wales & East China Normal University)<br />

Wei Wang (The University of New South Wales)<br />

Parameter-Free Determination of Distance Thresholds for Metric Distance Constraints<br />

Shaoxu Song (Tsinghua University)<br />

Lei Chen (The Hong Kong University of Science and Technology)<br />

Hong Cheng (The Chinese University of Hong Kong)<br />

Random Error Reduction in Similarity Search on Time Series: A Statistical Approach<br />

Wush Chi-Hsuan Wu (Academia Sinica)<br />

Mi-Yen Yeh (Academia Sinica)<br />

Jian Pei (Simon Fraser University)<br />

Session 19: Text and Strings (Studio B)<br />

Session Chair: Feifei Li<br />

Optimizing Statistical Information Extraction Programs Over Evolving Text<br />

Fei Chen (HP Labs China)<br />

Xixuan Feng (University of Wisconsin-Madison)<br />

Christopher Re (University of Wisconsin-Madison)<br />

Min Wang (HP Labs China)<br />

Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach<br />

Chong Sun (University of Wisconsin-Madison)<br />

Jeffrey F. Naughton (University of Wisconsin-Madison)<br />

Siddharth Barman (University of Wisconsin-Madison)<br />

On Text Clustering with Side Information<br />

Charu C. Aggarwal (IBM T. J. Watson Research Center)<br />

Yuchen Zhao (University of Illinois at Chicago)<br />

Philip S. Yu (University of Illinois at Chicago)<br />


Fast SLCA and ELCA Computation for XML Keyword Queries based on Set Intersection<br />

Junfeng Zhou (Yanshan University)<br />

Zhifeng Bao (National University of Singapore)<br />

Wei Wang (The University of New South Wales)<br />

Tok Wang Ling (National University of Singapore)<br />

Ziyang Chen (Yanshan University)<br />

Xudong Lin (Yanshan University)<br />

Jingfeng Guo (Yanshan University)<br />

Session 20: Query Processing II (Studio C)<br />

Session Chair: Volker Markl<br />

Optimization of Massive Pattern Queries by Dynamic Configuration Morphing<br />

Nikolay Laptev (University of California, Los Angeles)<br />

Carlo Zaniolo (University of California, Los Angeles)<br />

Three-level Processing of Multiple Aggregate Continuous Queries<br />

Shenoda Guirguis (University of Pittsburgh)<br />

Mohamed A. Sharaf (The University of Queensland)<br />

Panos K. Chrysanthis (University of Pittsburgh)<br />

Alexandros Labrinidis (University of Pittsburgh)<br />

Accelerating Range Queries For Brain Simulations<br />

Farhan Tauheed (EPFL)<br />

Laurynas Biveinis (Aalborg University)<br />

Thomas Heinis (EPFL)<br />

Felix Schürmann (EPFL)<br />

Henry Markram (EPFL)<br />

Anastasia Ailamaki (EPFL)<br />

Keyword Query Reformulation on Structured Data<br />

Junjie Yao (Peking University)<br />

Bin Cui (Peking University)<br />

Liansheng Hua (Peking University)<br />

Yuxin Huang (Peking University)<br />

Industrial Session 3: Indexing, Updates and Processing (Studio D)<br />

Efficient Support of XQuery Update Facility in XML Enabled RDBMS<br />

Zhen Hua Liu (Oracle)<br />

Hui Chang (Oracle)<br />

Balasubramanyam Sthanikam (Oracle)<br />

Making Unstructured Data SPARQL Using Semantic Indexing in Oracle Database<br />

Souripriya Das (Oracle)<br />

Seema Sundara (Oracle)<br />

Matthew Perry (Oracle)<br />

Jagannathan Srinivasan (Oracle)<br />

Jayanta Banerjee (Oracle)<br />

Aravind Yalamanchi (Oracle)<br />

A meta-language for MDX queries in eLog Business Solution<br />

Sonia Bergamaschi (University of Modena and Reggio Emilia)<br />

Matteo Interlandi (University of Modena and Reggio Emilia)<br />

Mario Longo (eBilling S.p.A.)<br />

Laura Po (University of Modena and Reggio Emilia)<br />

Maurizio Vincini (University of Modena and Reggio Emilia)<br />

Seminar 6 (Salon 123)<br />

Boolean Matrix Decomposition Problem: Theory, Variations and Applications to Data Engineering<br />

Jaideep Vaidya (Rutgers University)<br />

Demo Group 2 (Studio E)<br />

See “Demo Group 2” listing above<br />

Noon - 2PM Lunch (Provided by Conference with Salon 4567)<br />

2PM - 3:30PM Sessions 21-23, Demo Group 3<br />

Session 21: Data Mining (Studio F)<br />

Session Chair: Anthony Tung<br />

Predicting Approximate Protein-DNA Binding Cores Using Association Rule Mining<br />

Po-Yuen Wong (The Chinese University of Hong Kong)<br />

Tak-Ming Chan (The Chinese University of Hong Kong)<br />

Man-Hon Wong (The Chinese University of Hong Kong)<br />

Kwong-Sak Leung (The Chinese University of Hong Kong)<br />

Upgrading Uncompetitive Products Economically<br />

Hua Lu (Aalborg University)<br />

Christian S. Jensen (Aarhus University)<br />


Attribute-Based Subsequence Matching and Mining<br />

Yu Peng (The Hong Kong University of Science and Technology)<br />

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)<br />

Liangliang Ye (The Hong Kong University of Science and Technology)<br />

Philip S. Yu (University of Illinois at Chicago)<br />

Integrating Frequent Pattern Mining from Multiple Data Domains for Classification<br />

Dhaval Patel (National University of Singapore)<br />

Wynne Hsu (National University of Singapore)<br />

Mong Li Lee (National University of Singapore)<br />

Session 22: Scientific Data, Analysis and Visualization (Studio B)<br />

Session Chair: Christopher Re<br />

Efficient Versioning for Scientific Array Databases<br />

Adam Seering (MIT CSAIL)<br />

Philippe Cudre-Mauroux (University of Fribourg)<br />

Samuel Madden (MIT CSAIL)<br />

Michael Stonebraker (MIT CSAIL)<br />

Multidimensional Analysis of Atypical Events in Cyber-Physical Data<br />

Lu-An Tang (UIUC)<br />

Xiao Yu (UIUC)<br />

Sangkyum Kim (UIUC)<br />

Jiawei Han (UIUC)<br />

Wen-Chih Peng (National Chiao Tung University)<br />

Yizhou Sun (UIUC)<br />

Hector Gonzalez (Google)<br />

Sebastian Seith (Morning Star)<br />

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking<br />

Fabian Keller (Karlsruhe Institute of Technology)<br />

Emmanuel Müller (Karlsruhe Institute of Technology)<br />

Klemens Böhm (Karlsruhe Institute of Technology)<br />

Extracting Analyzing and Visualizing Triangle K-Core Motifs within Networks<br />

Yang Zhang (The Ohio State University)<br />

Srinivasan Parthasarathy (The Ohio State University)<br />

Session 23: Similarity Search and Detection (Studio D)<br />

Session Chair: Xuemin Lin<br />

Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document Databases<br />

Min Soo Kim (KAIST)<br />

Kyu-Young Whang (KAIST)<br />

Yang-Sae Moon (Kangwon National University)<br />

Adaptive Windows for Duplicate Detection<br />

Uwe Draisbach (Hasso-Plattner-Institute)<br />

Felix Naumann (Hasso-Plattner-Institute)<br />

Sascha Szott (Zuse Institute)<br />

Oliver Wonneberg (R. Lindner GmbH & Co. KG)<br />

Efficient Dual-Resolution Layer Indexing for Top-k Queries<br />

Jongwuk Lee (Pohang University of Science and Technology (POSTECH))<br />

Hyunsouk Cho (Pohang University of Science and Technology (POSTECH))<br />

Seung-won Hwang (Pohang University of Science and Technology (POSTECH))<br />

Evaluating Probabilistic Queries over Uncertain Matching<br />

Reynold Cheng (The University of Hong Kong)<br />

Jian Gong (The University of Hong Kong)<br />

David W. Cheung (The University of Hong Kong)<br />

Jiefeng Cheng (Shenzhen Institute of Advanced Technology)<br />

Demo Group 3 (Studio E)<br />

See “Demo Group 3” listing above<br />

3:30PM - 4PM Coffee Break<br />

4PM - 5:30PM Sessions 24-25, Demo Group 4<br />

Session 24: Sensors Network and Trajectory (Studio B)<br />

Session Chair: Flip Korn<br />

Detecting Outliers in Sensor Networks using the Geometric Approach<br />

Sabbas Burdakis (Technical University of Crete)<br />

Antonios Deligiannakis (Technical University of Crete)<br />

Efficient Threshold Monitoring for Distributed Probabilistic Data<br />

Mingwang Tang (University of Utah)<br />

Feifei Li (University of Utah)<br />

Jeff M. Phillips (University of Utah)<br />

Jeffrey Jestes (University of Utah)<br />

Incorporating Duration Information for Trajectory Classification<br />

Dhaval Patel (National University of Singapore)<br />

Chang Sheng (DBS Bank)<br />

Wynne Hsu (National University of Singapore)<br />

Mong Li Lee (National University of Singapore)<br />

Reducing Uncertainty of Low-Sampling-Rate Trajectories<br />

Kai Zheng (The University of Queensland)<br />

Yu Zheng (Microsoft Research Asia)<br />

Xing Xie (Microsoft Research Asia)<br />

Xiaofang Zhou (The University of Queensland)<br />

Session 25: Error Reduction and Data Security (Studio D)<br />

Session Chair: Graham Cormode<br />

Efficient Similarity Search over Encrypted Data<br />

Mehmet Kuzu (The University of Texas at Dallas)<br />

Mohammad Saiful Islam (The University of Texas at Dallas)<br />

Murat Kantarcioglu (The University of Texas at Dallas)<br />

Obfuscating the Topical Intention in Enterprise Text Search<br />

HweeHwa Pang (Singapore Management University)<br />

Xiaokui Xiao (Nanyang Technological University)<br />

Jialie Shen (Singapore Management University)<br />

Correlation Support for Risk Evaluation in Databases<br />

Katrin Eisenreich (SAP Research)<br />

Jochen Adamek (Technische Universität Berlin)<br />

Philipp Rösch (SAP Research)<br />

Volker Markl (Technische Universität Berlin)<br />

Gregor Hackenbroich (SAP Research)<br />

A Game-Theoretic Approach for High-Assurance of Data Trustworthiness in Sensor Networks<br />

Hyo-Sang Lim (Purdue University & Computer and Telecommunications Engineering Division, South Korea)<br />

Gabriel Ghinita (University of Massachusetts at Boston)<br />

Elisa Bertino (Purdue University)<br />

Murat Kantarcioglu (University of Texas at Dallas)<br />

Demo Group 4 (Studio E)<br />

See “Demo Group 4” listing above<br />

Thursday, April 5<br />

8AM - 9AM Breakfast (Prefunction)<br />

9AM - 5:30PM Workshops<br />

Studio B: Data Management in the Cloud (DMC)<br />

Studio D: Graph Data Management: Techniques and Applications (GDM)<br />

Studio F: Secure Data Management on Smartphones and Mobiles (SDMSM)<br />


Keynotes<br />

Keynote 1: Monday, April 2<br />

Viewing the Web as a Distributed Knowledge Base<br />

Serge Abiteboul (Professor at Collège de France and Senior Researcher at INRIA Saclay)<br />

Abstract: Information of interest may be found on the Web in a variety of forms,<br />

in many systems, and with different access protocols. A typical user may have<br />

information on many devices (smartphone, laptop, TV box, etc.), many systems<br />

(mailers, blogs, Web sites, etc.), many social networks (Facebook, Picasa, etc.).<br />

This same user may have access to more information from family, friends,<br />

associations, companies, and organizations. Today, the control and management of<br />

the diversity of data and tasks in this setting are beyond the skills of casual<br />

users. Facing similar issues, companies see the cost of managing and integrating<br />

information skyrocketing.<br />

We are interested here in the management of such data. Our focus is not on<br />

harvesting all the data of a particular user or a group of users and then managing<br />

it in a centralized manner. Instead, we are concerned with the management of Web<br />

data in place in a distributed manner, with a possibly large number of autonomous,<br />

heterogeneous systems collaborating to support certain tasks.<br />

Our thesis is that managing the richness and diversity of user-centric data residing<br />

on the Web can be tamed using a holistic approach based on a distributed knowledge<br />

base. All Web information is represented as logical facts, and Web data<br />

management tasks as logical rules. We discuss Webdamlog, a variant of datalog for<br />

distributed data management that we use for this purpose. The automatic reasoning<br />

provided by its inference engine, operating over the Web knowledge base,<br />

greatly benefits a variety of complex data management tasks that currently require<br />

intense work and deep expertise.<br />

This work is part of the Webdam European project, http://webdam.inria.fr/.<br />

Bio: Serge Abiteboul, Telecom Paris, PhD computer science, USC Los Angeles, and<br />

Thèse d’Etat, University of Paris Sud. He has held professor positions at Stanford<br />

and Ecole Polytechnique. He is one of the co-authors of Foundations of Databases,<br />

and, recently, of Web Data Management. In 2000 he co-founded a start-up<br />

named Xyleme. He received the 1998 ACM SIGMOD Innovation Award. He has been<br />

program chair of a number of conferences including ACM PODS-95, ICALP-94,<br />

ICDT-90, ECDL-99 and VLDB-09, ICDE-11, track of WWW-12. In 2008 he was awarded<br />

an ERC Advanced Grant, namely Webdam, on Foundations of Web Data<br />

Management. He has been a member of the French Academy of Sciences since 2008.<br />

Keynote 2: Tuesday, April 3<br />

How Different is Big Data?<br />

Surajit Chaudhuri (Microsoft Corp)<br />

Talk Abstract: One buzzword that has been popular in the<br />

last couple of years is Big Data. In simplest terms, Big Data<br />

symbolizes the aspiration to build platforms and tools to ingest,<br />

store and analyze data that can be voluminous, diverse, and<br />

possibly fast changing. In this talk, I will try to reflect on a few of<br />

the technical problems presented by the exploration of Big Data.<br />

Some of these challenges in data analytics have been addressed<br />

by our community in the past in a more traditional relational database context but<br />

only with mixed results. I will review these quests and study some of the key lessons<br />

learned. At the same time, significant developments such as the emergence of<br />

cloud infrastructure and availability of data rich web services hold the potential for<br />

transforming our industry. I will discuss the unique opportunities they present for<br />

Big Data Analytics.<br />

Biographical Sketch: Surajit Chaudhuri is a Distinguished Scientist at Microsoft<br />

Research. His current areas of interest are enterprise data analytics,<br />

self-manageability and multi-tenant technology for cloud database services. Surajit is<br />

an ACM Fellow, a recipient of the ACM SIGMOD Edgar F. Codd Innovations Award,<br />

ACM SIGMOD Contributions Award, and a VLDB 10 year Best Paper Award. Surajit<br />

received his Ph.D. from Stanford University and B.Tech from the Indian Institute of<br />

Technology, Kharagpur.<br />


Keynote 3: Wednesday, April 4<br />

Accountability and Trust in Cooperative Information Systems<br />

Peter Druschel (Max Planck Institute for Software Systems (MPI-SWS),<br />

Kaiserslautern and Saarbrücken, Germany)<br />

Cooperation and trust play an increasingly important role in<br />

today’s information systems. For instance, peer-to-peer systems<br />

like BitTorrent, Sopcast and Skype are powered by resource<br />

contributions from participating users; federated systems like<br />

the Internet have to respect the interests, policies and laws of<br />

participating organizations and countries; in the Cloud, users entrust their data and<br />

computation to third-party infrastructure.<br />

In this talk, we consider accountability as a way to facilitate transparency and trust<br />

in cooperative systems. We look at practical techniques to account for the integrity<br />

of distributed, cooperative computations, and look at some of the difficulties and<br />

open problems in accountability.<br />

This talk describes joint work with Paarijaat Aditya, Ioannis Avramopoulos, Michael<br />

Backes, Andreas Haeberlen, Petr Kuznetsov, Yin Lin, Bruce Maggs, Jennifer Rexford,<br />

Rodrigo Rodrigues, Dominique Unruh, Bill Wishon and Mingchen Zhao.<br />

Bio: Peter Druschel is the founding director of the Max Planck Institute for Software<br />

Systems (MPI-SWS) in Germany. Previously, he was a Professor of Computer<br />

Science and Electrical and Computer Engineering at Rice University in Houston,<br />

Texas. He received the Dipl.-Ing. (FH) in Data Systems Engineering from Fachhochschule<br />

Munich, Germany in 1986 and the Ph.D. degree in Computer Science from the<br />

University of Arizona in 1994. His research interests include distributed systems and<br />

operating systems. He is the recipient of an NSF CAREER Award, Alfred P. Sloan<br />

Fellowship and the ACM SIGOPS Mark Weiser Award, and a member of Academia<br />

Europaea and the German Academy of Sciences Leopoldina.<br />



Seminars<br />

Seminar 1: Data Management Issues on the Semantic Web<br />

Oktie Hassanzadeh is a Research Staff Member at IBM T.J.<br />

Watson Research Center. His research interests are in the areas<br />

of data cleaning and integration, Web data management and<br />

online data analytics. He received the IBM PhD fellowship in<br />

2010, and is a recipient of the 2010 Yahoo! Key Scientific Challenges<br />

award. He is a two-time recipient of the first prize at the<br />

Triplification Challenge, an annual contest that awards prizes to<br />

the most promising projects in the area of Linked Data. He is a<br />

graduate of the University of Toronto (M.Sc., Ph.D.) and Sharif<br />

University of Technology (B.Sc.).<br />

Dr. Anastasios Kementsietsidis is a Research Staff Member<br />

at IBM T.J. Watson Research Center at Hawthorne, NY. Anastasios<br />

has a PhD in computer science from the University of Toronto. He<br />

is currently interested in various aspects of RDF data management<br />

(including querying, storing and benchmarking RDF data).<br />

In the past, he worked (and is still interested in continuing working)<br />

on data integration, cleaning, provenance and annotation,<br />

security, as well as (distributed) query evaluation and optimization<br />

on relational or semi-structured data. He has several publications<br />

in the leading database conferences, including a best paper<br />

award in ICDE 2007, a best demo award in EDBT 2006, and his<br />

CIKM 2009 paper was a runner-up for a best paper award. He has<br />

served on the program committee of several leading conferences<br />

and workshops.<br />

Yannis Velegrakis is a faculty member of the Department of<br />

Information Engineering and Computer Science of the University<br />

of Trento. He holds a PhD degree in Computer Science from<br />

the University of Toronto. His research areas of expertise include<br />

information integration, mappings across heterogeneous data<br />

sources, interoperability, keyword searching, semantic web, social<br />

applications, and large-scale data management. Prior to joining<br />

the University of Trento, he held a researcher position at AT&T<br />

Research Labs in the US. He has also spent time as a visitor at the<br />

University of California, Santa Cruz, the IBM Almaden Research<br />

Center, and the Center of Advanced Studies of the IBM Toronto<br />

Lab. He was a member of the committee for the CIMI cultural<br />

profile of the ANSI/NISO Z39.50 standard. He has served on many<br />

program committees of national and international conferences<br />

and as a reviewer for numerous international journals. He is a general<br />

co-chair for VLDB 2013 and a PC co-chair for WebDB 2012.<br />

He has also been a general co-chair for DESWEB 2010 and 2011<br />

and for SWAE2007. He holds 2 US patents and has been a Marie<br />

Curie Fellow for the period 2006-2008.<br />

Seminar 2: Discovering Multiple Clustering Solutions: Grouping Objects in<br />

Different Views of the Data<br />

Emmanuel Müller is a senior researcher at the institute for<br />

program structures and data organization at the Karlsruhe Institute<br />

of Technology (KIT), Germany. In the past years, he was a<br />

research assistant in computer science at the data management<br />

and data exploration group at RWTH Aachen University, Germany.<br />

His research interests cover efficient data mining in high dimensional<br />

data, detection of clusters in subspace projections and<br />

outlier detection. Leading the open-source initiative OpenSubspace,<br />

he provides a general contribution to the research community<br />

especially by a repeatable and comparable evaluation study<br />

on recent data mining approaches. Dr. Müller received his Diplom<br />

(MSc) in 2007 and his PhD in 2010 from RWTH Aachen University.<br />

He is an active member of program committees such as SDM,<br />

ECML PKDD, and recent MultiClust-Workshops.<br />

Stephan Günnemann is a PhD student and research<br />

assistant in computer science at the data management and<br />

data exploration group at RWTH Aachen University,<br />

Germany. His research interests include the mining of non-redundant<br />

and multiple clustering solutions for high<br />

dimensional and structured data. He contributes to the open<br />

source initiative OpenSubspace for the evaluation and<br />

exploration of subspace clustering algorithms. Stephan<br />

Günnemann received his Diplom (MSc) in 2008 from RWTH<br />

Aachen University.<br />


Ines Färber is a PhD student and research assistant in computer<br />

science at the data management and data exploration group at<br />

RWTH Aachen University, Germany. Her research interests<br />

include mining of alternative and multi-view clustering solutions<br />

for high dimensional data. She contributes to the OpenSubspace<br />

initiative for evaluation and exploration of multiple clustering<br />

solutions. Ines Färber received her Diplom (MSc) in 2009 from<br />

RWTH Aachen University.<br />

Thomas Seidl is a professor for computer science and head<br />

of the data management and data exploration group at RWTH<br />

Aachen University, Germany. His research interests include data<br />

mining and database technology for multimedia and spatio-temporal<br />

databases in engineering, communication and life science<br />

applications. Prof. Seidl received his Diplom (MSc) in 1992 from<br />

TU Muenchen and his PhD (1997) and venia legendi (2001) from<br />

LMU Muenchen. He is an active member of several program committees<br />

including ACM SIGKDD, IEEE ICDE, SDM, recent MultiClust-Workshops<br />

and others. He is a member of the editorial board of<br />

The VLDB Journal as associate editor.<br />

Seminar 3: Emerging Graph Queries in Linked Data<br />

Arijit Khan is a PhD student of the Department of Computer<br />

Science, University of California, Santa Barbara (UCSB). He is<br />

currently working with Professor Xifeng Yan in Graph Mining.<br />

Arijit received his Bachelor degree in Computer Science and<br />

Engineering from Jadavpur University, India in 2008. He is the<br />

recipient of the prestigious CITRIX GO-TO fellowship award for<br />

the academic year 2008-2009 and P1 fellowship award for the<br />

Spring Quarter in 2009-10 from the Department of Computer<br />

Science, UCSB. He was also awarded gold medals by Tata<br />

Consultancy Services Ltd for being the best student of the<br />

Department of Computer Science & Engineering, Jadavpur<br />

University, for 2008-2009. He published papers in SIGMOD’10<br />

and ’11.<br />


Yinghui Wu is a research scientist of the Department of Computer<br />

Science, University of California, Santa Barbara (UCSB).<br />

He is currently working with Professor Xifeng Yan in graph<br />

data management. Yinghui got his PhD from the University of<br />

Edinburgh, UK in 2010. His research interests lie in the area of<br />

database theory and graph database management, with emphasis<br />

on graph database models and query languages. He published<br />

papers in SIGMOD, VLDB, ICDE and ICDT.<br />

Xifeng Yan is an assistant professor at the University of California at Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. He has been working on modeling, managing, and mining large-scale graphs in bioinformatics, social networks, information networks, and computer systems. His works were extensively referenced, with over 5,000 citations per Google Scholar. He received NSF CAREER Award, IBM Invention Achievement Award, ACM-SIGMOD Dissertation Runner-Up Award, and IEEE ICDM 10-year Highest Impact Paper Award.<br />

Seminar 4: Boolean Matrix Decomposition Problem: Theory, Variations and Applications to Data Engineering<br />

Dr. Jaideep Vaidya is an Associate Professor of Computer Information Systems at Rutgers University. He received his Masters and Ph.D. from Purdue University and his Bachelors degree from the University of Mumbai. His research interests are in Data Mining, Data Management, Privacy, and Security. He has published over 60 papers in international conferences and archival journals, and has received three best paper awards from the premier conferences in data mining, databases, and digital government. He is also the recipient of a NSF Career Award and a Rutgers Board of Trustees Research Fellowship for Scholarly Excellence.<br />



Seminar 5: Mining Knowledge from Data: An Information Network Analysis Approach<br />

Seminars<br />

Jiawei Han is Abel Bliss Professor in Engineering, in the Department of Computer Science at the University of Illinois. He has been researching into data mining, information network analysis, and database systems, with over 600 publications. He served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (TKDD) and on the editorial boards of several other journals. Jiawei has received IBM Faculty Awards, HP Innovation Awards, ACM SIGKDD Innovation Award (2004), IEEE Computer Society Technical Achievement Award (2005), and IEEE Computer Society W. Wallace McDowell Award (2009), and Daniel C. Drucker Eminent Faculty Award (2011). He is a Fellow of ACM and a Fellow of IEEE. He is currently the Director of Information Network Academic Research Center (INARC) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of U.S. Army Research Lab. His book with Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques” (Morgan Kaufmann) has been used worldwide as a textbook.<br />

Yizhou Sun is a Ph.D. candidate at the University of Illinois at Urbana-Champaign. Her principal research interest is in large-scale information and social networks, and more generally in data mining, database systems, applied statistics, machine learning, information retrieval, and network science, with a focus on modeling novel problems and proposing scalable algorithms for large-scale, real-world applications. Yizhou has over 30 publications in book chapters, journals, and major conferences such as SIGKDD, SIGMOD, VLDB, NIPS and so on, and tutorials on “mining heterogeneous information networks” in premier conferences.<br />

Xifeng Yan is an assistant professor at the University of California at Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. He has been working on modeling, managing, and mining large-scale graphs in bioinformatics, social networks, information networks, and computer systems. His works were extensively referenced, with over 5,000 citations per Google Scholar. He received NSF CAREER Award, IBM Invention Achievement Award, ACM-SIGMOD Dissertation Runner-Up Award, and IEEE ICDM 10-year Highest Impact Paper Award.<br />


Philip S. Yu received his Ph.D. degree in E.E. from Stanford University. He is a Professor in Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information Technology. Dr. Yu spent most of his career at IBM, where he was manager of the Software Tools and Techniques group at the Watson Research Center. His research interests include data mining, database and privacy. He has published more than 650 papers in refereed journals and conferences. He holds or has applied for more than 350 US patents. Dr. Yu is a Fellow of the ACM and the IEEE. He is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004). He received a Research Contributions Award from IEEE Intl. Conference on Data Mining (2003).<br />

Seminar 6: Detecting Clones, Copying and Reuse on the Web<br />

Xin Luna Dong is a researcher at AT&T Labs-Research. She received her Ph.D. from University of Washington in 2007, received a Master’s Degree from Peking University in China in 2001, and received a Bachelor’s Degree from Nankai University in China in 1998. Her research interests include databases, information retrieval and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She has led the Solomon project, whose goal is to detect copying between structured sources and to leverage the results in various aspects of data integration, and the Semex personal information management system, which won the Best Demo award (one of top-3) in Sigmod’05. She has co-chaired WebDB’10 and has served in the program committees of Sigmod’12, Sigmod’11, VLDB’11, PVLDB’10, WWW’10, ICDE’10, VLDB’09, etc.<br />


Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech from the Indian Institute of Technology, Bombay. His research interests span a variety of topics in data management.<br />



Panels<br />

Panel 1: NSF ICDE 2012 Career Panel<br />

Philip A. Bernstein, Ph.D. (Microsoft Research) is a Distinguished Scientist at Microsoft Corporation. Over the past 35 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. During that time, he has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and metadata management. The second edition of his book “Transaction Processing” with Eric Newcomer was published in June 2009. His latest work focuses on database systems for cloud computing, on web search over structured data, and on object-to-relational mappings. He is an Editor-in-Chief of the VLDB Journal, an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, and a member of the Washington State Academy of Sciences and the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.<br />


James M. Kang, Ph.D. (National Geospatial-Intelligence Agency) received his B.S. at Purdue University in 2000, his M.S. at Rochester Institute of Technology in 2005, and his Ph.D. at the University of Minnesota in 2010; all of his degrees were in Computer Science. His research interests are in the areas of Spatio-Temporal Data Mining and Databases. From 2005 to 2007, he was an NSF IGERT (Integrative Graduate Education and Research Traineeship) Fellow at the University of Minnesota. He is currently a project scientist in the Basic and Applied Research Office at National Geospatial-Intelligence Agency (NGA) in Springfield, Virginia.<br />

Yuanyuan Tian, Ph.D. (IBM Almaden Research) is a Research Staff Member at IBM Almaden Research Center. Her primary research area is large scale data processing and analytics. She received her PhD in Computer Science & Engineering from University of Michigan in 2008. She is the recipient of Distinguished Achievement Award from University of Michigan in 2008 for her research and academic accomplishments.<br />

Srinivasan Parthasarathy, Ph.D. (Ohio State University)<br />

Dr. Srinivasan Parthasarathy (PhD, University of Rochester) is currently a Professor of Computer Science and Engineering at the Ohio State University (OSU). His research interests are broadly in the areas of Data Mining, Databases, Bioinformatics and Parallel and Distributed Computing. He is a recipient of an Ameritech Faculty Fellowship in 2001, a US National Science Foundation CAREER award in 2003, a US Department of Energy Early Career Award in 2004, multiple IBM Faculty Awards in 2007 and 2010, and a Google Research Award in 2009. His papers have received six best paper awards or similar honors from among ten nominations in leading conferences in the field, including ones at SIAM International Conference on Data Mining (SDM), IEEE International Conference on Data Mining (ICDM), Intelligent Systems for Molecular Biology (ISMB), the Very Large Databases Conference (VLDB) and at the ACM Knowledge Discovery and Data Mining (SIGKDD). He has served on the program, organizational and steering committees of leading conferences in the fields of data mining, databases, and high performance computing. He currently serves on the editorial boards of several journals including the Data Mining and Knowledge Discovery Journal (DMKDJ), the Distributed and Parallel Databases Journal (DAPDJ), the Journal of Parallel and Distributed Computing (JPDC), and the ACM Transactions on Knowledge Discovery and Data Mining (ACM-TKDD).<br />



Panels<br />

Dr. Alexandros Labrinidis received his Ph.D. degree in Computer<br />

Science from the University of Maryland, College Park in<br />

2002. He is currently an associate professor at the Department of<br />

Computer Science of the University of Pittsburgh and co-director<br />

of the Advanced Data Management Technologies Lab. He is also<br />

an adjunct associate professor at Carnegie Mellon University<br />

(CS Dept). Dr. Labrinidis’ research focuses on user-centric data<br />

management for network-centric applications, including web databases,<br />

data stream management systems, sensor networks,<br />

and scientific data management (with an emphasis on big data).<br />

He has published over 60 papers in peer-reviewed journals, conferences,<br />

and workshops; he is the recipient of an NSF CAREER<br />

award in 2008. Dr. Labrinidis is currently the Secretary/Treasurer<br />

for ACM SIGMOD, and has served as the Editor of SIGMOD<br />

Record, and on numerous program committees of international<br />

conferences/workshops.<br />

PANEL 2: FUNDERS SESSION<br />

Dr. Frank Olken (Consultant, Panel Organizer) is a veteran database researcher.<br />

He has a Ph.D. in Computer Science from Univ. of California Berkeley. He has<br />

worked on a variety of topics in scientific and statistical databases including random<br />

sampling from relational databases, bioinformatics, building energy management<br />

systems, power grid informatics, workflow management, file migration, metadata<br />

registries, etc. Most of his 35 year career was at Lawrence Berkeley National Laboratory.<br />

He has also worked on standards development for metadata registries, RDF<br />

and XML schema languages, etc.<br />

From 2006 to 2010 he was detailed to the U.S. National Science Foundation<br />

as a program director in the Computer and Information Science and Engineering<br />

(CISE) Directorate, Information and Intelligent Systems (IIS) Division, Information<br />

Integration and Informatics (III) program, where he managed proposal reviews<br />

and awards in the areas of database management, graph database and mining,<br />

data intensive computing, etc. His current research interests include semantic web<br />

technologies, rule systems, graph data management and mining, electronic health<br />

records, and social science data management and analytics.<br />

He can be reached at: frankolken@gmail.com, @frankolken on Twitter, and on LinkedIn, Facebook<br />

and Google+.<br />

Dr. Le Gruenwald (National Science Foundation) is a Program Director and the<br />

Cluster Lead of the Information Integration and Informatics (III) Program, in the<br />

Information and Intelligent Systems (IIS) Division of the Computer and Information Science<br />

and Engineering (CISE) Directorate at the National Science Foundation (NSF).<br />

The IIS program supports research in areas such as Databases, Data Mining, Informatics,<br />

Information Retrieval, and Social Media. She is also the Presidential and Dr.<br />

David W. Franke Professor in the School of Computer Science at The University of<br />

Oklahoma (OU). She received her Ph.D. in Computer Science from Southern Methodist<br />

University in 1990. Prior to joining OU, she was a Member of Technical Staff in<br />

the Database Management Group at the Advanced Switching Laboratory of NEC,<br />

America, a Software Engineer at WRT, and a Lecturer in the Computer Science and<br />

Engineering Department at Southern Methodist University.<br />

Dr. Gruenwald’s major research interests include Mobile and Sensor Databases, Data<br />

Security, Privacy and Confidentiality, Stream Data Management, Data Mining, Real-<br />

Time Distributed Databases, Autonomic Data Management, Multimedia Databases<br />

and Web Databases. She has published numerous technical papers in these areas.<br />

She can be reached at: lgruenwa@nsf.gov<br />

Dr. Ceren Susut (Department of Energy) joined the Department of Energy (DOE)<br />

Office of Advanced Scientific Computing Research (ASCR) in January 2011 after<br />

completing a National Research Council Postdoctoral Fellowship at the Center for<br />

Nanoscale Science and Technology (CNST) at the National Institute of Standards<br />

and Technology (NIST).<br />

She has diverse research experience in chemistry, chemical engineering, materials<br />

science and applied physics. At ASCR, she currently manages the Scientific Discovery<br />

through Advanced Computing (SciDAC) portfolio.<br />

She can be reached at: ceren.susut-bennett@science.doe.gov<br />

Dr. Olga Brazhnik (National Institutes of Health) has a professional career of over<br />

30 years in computational sciences and health, biomedical and clinical informatics.<br />

She started as a physicist, applying theoretical and computational methods<br />

in biology and medicine, and earned a Ph.D. in Computational Physics from Moscow<br />

State University, Russia. Researching and developing technologies for transforming<br />

data into knowledge, she worked at the University of Chicago, Virginia Tech,<br />

Virginia Bioinformatics Institute, and the US Air Force Surgeon General Office.<br />

She joined the National Institutes of Health (NIH) in 2004. Over her years with NIH,<br />

she managed grants, cooperative agreements and contracts in areas of health and<br />

biomedical informatics, semantics, visualization, multi-media, knowledge engineering,<br />

social network analysis and collaborative technologies. Among her other duties,<br />

she currently directs the Small Business Innovation Research (SBIR) program at the<br />

National Center for Advancing Translational Sciences (NCATS).<br />

Following her passion for developing collective intelligence about human health,<br />

Dr. Brazhnik keeps exploring ways in which ubiquitous computing and cutting edge<br />

technology enable us to employ individual creativity, wisdom of the crowds, art,<br />

holistic approaches and solid science to benefit human health and wellbeing. She<br />

introduced numerous novel informatics and collaborative technologies to the NIH<br />

community and recently organized Crowdsourcing: the Art and Science of Open<br />

Innovation (http://videocast.nih.gov/summary.asp?live=10366).<br />

She can be reached at: brazhnik@mail.nih.gov<br />


PANEL 3: THE FUTURE OF SCIENTIFIC DATA BASES<br />

Anastasia Ailamaki is a Professor of Computer Sciences at the<br />

Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland.<br />

Her research interests are in designing robust systems to support<br />

data-intensive applications, and in particular (a) in maximizing the<br />

potential of multicore hardware and solid-state drive storage for<br />

scalable query and transaction processing, and (b) in automating<br />

physical design to support demanding scientific applications.<br />

She has received a European Young Investigator Award from the<br />

European Science Foundation (2007), a Finmeccanica endowed<br />

chair from the Computer Science Department at Carnegie Mellon<br />

(2007), an Alfred P. Sloan Research Fellowship (2005), seven<br />

best-paper awards at top conferences (2001-2011), and an NSF<br />

CAREER award (2002). She earned her Ph.D. in Computer Science<br />

from the University of Wisconsin-Madison in 2000. She is a member<br />

of IEEE and ACM, and has also been a CRA-W mentor.<br />

Jeremy Kepner received a B.A. with distinction in Astrophysics<br />

from Pomona College (Claremont, CA). After receiving a DoE<br />

Computational Science Graduate Fellowship in 1994 he obtained his<br />

Ph.D. from the Dept. of Astrophysics at Princeton University in<br />

1998 and then joined MIT. His research is focused on the development<br />

of advanced libraries for the application of massively<br />

parallel computing to a variety of data intensive signal processing<br />

problems on which he has published many articles. Jeremy<br />

is most proud of the opportunity he has had to be the principal<br />

architect, PI or otherwise co-lead several very talented teams.<br />

These teams have produced a number of innovative technologies<br />

that have broken new ground in several domains.<br />

Alexander Szalay is the Alumni Centennial Professor of<br />

Astronomy at the Johns Hopkins University, and Professor in the<br />

Department of Computer Science. He is a cosmologist, working<br />

on the statistical measures of the spatial distribution of galaxies<br />

and galaxy formation. He was born and educated in Hungary. He<br />

is the architect for the Science Archive of the Sloan Digital Sky<br />

Survey. His papers cover areas from theoretical cosmology to<br />

observational astronomy, spatial statistics and computer science.<br />

He is a Corresponding Member of the Hungarian Academy of<br />

Sciences, and a Fellow of the American Academy of Arts and Sciences.<br />

In 2004 he received an Alexander Von Humboldt Award in<br />

Physical Sciences, in 2007 the Microsoft Jim Gray Award. In 2008<br />

he became Doctor Honoris Causa of the Eötvös University.<br />

Michael Stonebraker has been a pioneer of data base<br />

research and technology for more than a quarter of a century.<br />

He was the main architect of the INGRES relational DBMS, and<br />

the object-relational DBMS, POSTGRES. These prototypes were<br />

developed at the University of California at Berkeley where<br />

Stonebraker was a Professor of Computer Science for twenty five<br />

years. More recently at M.I.T. he was a co-architect of the Aurora/<br />

Borealis stream processing engine, the C-Store column-oriented<br />

DBMS, and the H-Store transaction processing engine. Currently,<br />

he is working on science-oriented DBMSs, OLTP DBMSs, and<br />

search engines for accessing the deep web. He is the founder of<br />

five venture-capital backed startups, which commercialized his<br />

prototypes. Presently he serves as Chief Technology Officer of<br />

VoltDB, Paradigm4, Inc. and Goby.com.<br />

Professor Stonebraker is the author of scores of research papers<br />

on data base technology, operating systems and the architecture<br />

of system software services. He was awarded the ACM System<br />

Software Award in 1992, for his work on INGRES. Additionally,<br />

he was awarded the first annual Innovation award by the ACM<br />

SIGMOD special interest group in 1994, and was elected to the<br />

National Academy of Engineering in 1997. He was awarded the<br />

IEEE John Von Neumann award in 2005, and is presently an Adjunct<br />

Professor of Computer Science at M.I.T.<br />



Awards<br />

Influential Paper Award<br />

Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das:<br />

DBXplorer: A System for Keyword-Based Search over Relational Databases. ICDE 2002.<br />

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan:<br />

Keyword Searching and Browsing in Databases using BANKS. ICDE 2002.<br />

Citation<br />

Together, these two papers from ICDE 2002 laid the foundations for keyword search<br />

over relational databases, paving the way for a significant body of follow-on work in the<br />

area of Information Retrieval and Databases. The solutions presented in these papers<br />

are elegant and highly effective.<br />


Best Paper Award<br />

Winner<br />

“Temporal Analytics on Big Data for Web Advertising”<br />

Badrish Chandramouli (Microsoft Research) Jonathan Goldstein (Microsoft Corporation)<br />

Songyun Duan (IBM T. J. Watson Research Center)<br />

Citation<br />

The paper beautifully combines the Map-Reduce framework and ideas from data-stream<br />

management systems for scalable temporal analytics on big data for effective behavioral<br />

targeting on the web.<br />

Runner-Up<br />

“Recomputing Materialized Instances after Changes to Mappings and Data”<br />

Todd J. Green (University of California, Davis) Zachary G. Ives (University of Pennsylvania)<br />

Citation<br />

The paper elegantly applies novel ideas for optimizing queries with materialized views<br />

to the practical problem of incrementally adapting declarative schema mappings in collaborative<br />

data sharing systems.<br />



Abstracts<br />

Session 1: Privacy<br />

Privacy in Social Networks: How Risky is Your Social Graph?<br />

Cuneyt Gurcan Akcora (University of Insubria)<br />

Barbara Carminati (University of Insubria)<br />

Elena Ferrari (University of Insubria)<br />

Several efforts have been made toward more privacy-aware Online Social Networks<br />

(OSNs) to protect personal data against various privacy threats. However, despite<br />

the relevance of these proposals, we believe there is still a lack of a conceptual<br />

model on top of which privacy tools have to be designed. Central to this model<br />

should be the concept of risk. Therefore, in this paper, we propose a risk measure for<br />

OSNs. The aim is to associate a risk level with social network users in order to provide<br />

other users with a measure of how risky it might be, in terms of disclosure<br />

of private information, to have interactions with them. We compute risk levels based<br />

on similarity and benefit measures, by also taking into account the user risk attitudes.<br />

In particular, we adopt an active learning approach for risk estimation, where<br />

user risk attitude is learned from few required user interactions. The risk estimation<br />

process discussed in this paper has been developed into a Facebook application and<br />

tested on real data. The experiments show the effectiveness of our proposal.<br />


Differentially Private Spatial Decompositions<br />

Graham Cormode (AT&T Labs – Research)<br />

Cecilia Procopiuc (AT&T Labs – Research)<br />

Entong Shen (North Carolina State University)<br />

Divesh Srivastava (AT&T Labs – Research)<br />

Ting Yu (North Carolina State University)<br />

Differential privacy has recently emerged as the de facto standard for private data<br />

release. This makes it possible to provide strong theoretical guarantees on the<br />

privacy and utility of released data. While it is well-understood how to release data<br />

based on counts and simple functions under this guarantee, it remains to provide<br />

general purpose techniques to release data that is useful for a variety of queries. In<br />

this paper, we focus on spatial data such as locations and more generally any multidimensional<br />

data that can be indexed by a tree structure. Directly applying existing<br />

differential privacy methods to this type of data simply generates noise. We<br />

propose instead the class of “private spatial decompositions”: these adapt standard<br />

spatial indexing methods such as quadtrees and kd-trees to provide a private description<br />

of the data distribution. Equipping such structures with differential privacy<br />

requires several steps to ensure that they provide meaningful privacy guarantees.<br />

Various basic steps, such as choosing splitting points and describing the distribution<br />

of points within a region, must be done privately, and the guarantees of the<br />

different building blocks composed to provide an overall guarantee. Consequently,<br />

we expose the design space for private spatial decompositions, and analyze some<br />

key examples. A major contribution of our work is to provide new techniques for<br />

parameter setting and post-processing the output to improve the accuracy of query<br />

answers. Our experimental study demonstrates that it is possible to build such<br />

decompositions efficiently, and use them to answer a variety of queries privately<br />

with high accuracy.<br />
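
As a rough illustration of the idea in the abstract above, the following Python sketch builds a fixed-depth quadtree over the unit square and releases a Laplace-perturbed count at every node. It is not the authors' method: the uniform split of the privacy budget across levels, the function names (`laplace_noise`, `private_quadtree`, `build`), and the dictionary node layout are all assumptions made for this example. Splitting cells at their midpoints is data-independent, so in this simplified setting only the counts need noise.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_quadtree(points, box, depth, eps_level, rng):
    """Recursively build a quadtree node with a noisy point count.

    `box` is (x0, y0, x1, y1); cells are half-open, so points are
    assumed to lie in [0, 1) x [0, 1) (an assumption of this sketch).
    """
    x0, y0, x1, y1 = box
    node = {
        "box": box,
        # A single point changes this count by at most 1 (sensitivity 1).
        "count": len(points) + laplace_noise(1.0 / eps_level, rng),
        "children": [],
    }
    if depth > 0:
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                     (x0, ym, xm, y1), (xm, ym, x1, y1)]
        for q in quadrants:
            inside = [p for p in points
                      if q[0] <= p[0] < q[2] and q[1] <= p[1] < q[3]]
            node["children"].append(
                private_quadtree(inside, q, depth - 1, eps_level, rng))
    return node


def build(points, depth, epsilon, seed=None):
    # Each point appears in exactly one node per level, so the levels
    # compose sequentially; here the budget is split evenly (assumed).
    rng = random.Random(seed)
    eps_level = epsilon / (depth + 1)
    return private_quadtree(points, (0.0, 0.0, 1.0, 1.0), depth, eps_level, rng)
```

A call such as `build(points, depth=3, epsilon=0.5)` returns the root node; a range query can then be answered by summing the noisy counts of the nodes that cover the query region. The abstract mentions parameter setting and post-processing as ways to improve accuracy; neither is attempted here.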

Differentially Private Histogram Publication<br />

Jia Xu (Northeastern University, China)<br />

Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />

Xiaokui Xiao (Nanyang Technological University)<br />

Yin Yang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />

Ge Yu (Northeastern University, China)<br />

Differential privacy (DP) is a promising scheme for releasing the results of statistical<br />

queries on sensitive data, with strong privacy guarantees against adversaries with<br />

arbitrary background knowledge. Existing studies on DP mostly focus on simple aggregations<br />

such as counts. This paper investigates the publication of DP-compliant<br />

histograms, which are an important analytical tool for showing the distribution of a<br />

random variable, e.g., hospital bill size for certain patients. Compared to simple aggregations<br />

whose results are purely numerical, a histogram query is inherently more<br />

complex, since it must also determine its structure, i.e., the ranges of the bins. As<br />

we demonstrate in the paper, a DP-compliant histogram with finer bins may actually<br />

lead to significantly lower accuracy than a coarser one, since the former requires<br />

stronger perturbations in order to satisfy DP. Moreover, the histogram structure itself<br />

may reveal sensitive information, which further complicates the problem. Motivated<br />

by this, we propose two novel algorithms, namely NoiseFirst and StructureFirst, for<br />

computing DP-compliant histograms. Their main difference lies in the relative order<br />

of the noise injection and the histogram structure computation steps. NoiseFirst<br />

has the additional benefit that it can improve the accuracy of an already published<br />

DP-compliant histogram computed using a naive method. Going one step further,<br />

we extend both solutions to answer arbitrary range queries. Extensive experiments,<br />

using several real data sets, confirm that the proposed methods output highly accurate<br />

query answers, and consistently outperform existing competitors.<br />
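
The trade-off between bin granularity and accuracy described in the abstract above can be illustrated with the generic Laplace mechanism. The Python sketch below is not NoiseFirst or StructureFirst; it is a minimal illustration under stated assumptions: every bin of a histogram receives independent Laplace noise of scale 1/epsilon, and a query that sums many bins therefore accumulates more noise on a fine histogram than on a coarse one. The function names and the fixed-width merging rule are hypothetical choices for this example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def merge_bins(counts, width):
    # Coarsen a histogram by summing every `width` adjacent unit bins.
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


def noisy_histogram(counts, epsilon, rng):
    # One record falls into exactly one bin, so the count vector has
    # sensitivity 1 and each bin gets Laplace(1/epsilon) noise.
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]


def mean_total_error(counts, width, epsilon, trials=2000, seed=0):
    """Average absolute error of a query summing all bins.

    The histogram is first coarsened to bins of `width` unit bins,
    then perturbed; the error is measured against the true total.
    """
    rng = random.Random(seed)
    bins = merge_bins(counts, width)
    true_total = sum(bins)
    total_error = 0.0
    for _ in range(trials):
        noisy = noisy_histogram(bins, epsilon, rng)
        total_error += abs(sum(noisy) - true_total)
    return total_error / trials
```

Comparing `mean_total_error(counts, 1, 1.0)` with `mean_total_error(counts, 16, 1.0)` on a 64-bin histogram shows the effect: the fine histogram sums 64 noise terms, the coarse one only 4. The coarse histogram pays for this with lost resolution inside each merged bin, which is the structural choice the abstract says the two algorithms address.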

Privacy-Preserving and Content-Protecting Location Based Queries<br />

Russell Paulet (Victoria University)<br />

Md. Golam Kaosar (Victoria University)<br />

Xun Yi (Victoria University)<br />

Elisa Bertino (Purdue University)<br />

In this paper we present a solution to one of the location-based query problems.<br />

This problem is defined as follows: (i) a user wants to query a database of location<br />

data, known as Points Of Interest (POI), and does not want to reveal his/her location<br />

to the server due to privacy concerns; (ii) the owner of the location data, that<br />

is, the location server, does not want to simply distribute its data to all users. The<br />

location server desires to have some control over its data, since the data is its asset.<br />

Previous solutions have used a trusted anonymiser to address privacy, but introduced<br />

the impracticality of trusting a third party. More recent solutions have used<br />

homomorphic encryption to remove this weakness. Briefly, the user submits his/her<br />

encrypted coordinates to the server and the server would determine the user’s location<br />

homomorphically, and then the user would acquire the corresponding record<br />

using Private Information Retrieval techniques. We propose a major enhancement<br />

upon this result by introducing a similar two-stage approach, where the homomorphic<br />

comparison step is replaced with Oblivious Transfer to achieve a more secure<br />

solution for both parties. The solution we present is efficient and practical in many<br />

scenarios. We also include the results of a working prototype to illustrate the efficiency<br />

of our protocol.<br />

SESSION 2: WEB 2.0 APPLICATIONS

GeoFeed: A Location-Aware News Feed

Jie Bao (University of Minnesota at Twin Cities)

Mohamed F. Mokbel (University of Minnesota at Twin Cities)

Chi-Yin Chow (City University of Hong Kong)

This paper presents the GeoFeed system, a location-aware news feed system that provides a new platform for its users to get spatially related message updates from either their friends or favorite news sources. GeoFeed distinguishes itself from all existing news feed systems in that it takes into account the spatial extents of messages and user locations when deciding upon the selected news feed. GeoFeed is equipped with three different approaches for delivering the news feed to its users, namely, spatial pull, spatial push, and shared push. The main challenge of GeoFeed is then to decide when to use each of these three approaches, and for which users. GeoFeed is equipped with a smart decision model that chooses among these approaches in a way that: (a) minimizes the system overhead for delivering the location-aware news feed, and (b) guarantees a certain response time for each user to obtain the requested location-aware news feed. Experimental results, based on real and synthetic data, show that GeoFeed is favorable over existing news feed systems, with a minimal system overhead.

Temporal Analytics on Big Data for Web Advertising

Badrish Chandramouli (Microsoft Research)

Jonathan Goldstein (Microsoft Corp.)

Songyun Duan (IBM T. J. Watson Research Center)

“Big Data” in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing; (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with an M-R framework. Users write and submit analysis algorithms as temporal queries; these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.

Entity Search Strategies for Mashup Applications

Stefan Endrullis (University of Leipzig)

Andreas Thor (University of Leipzig)

Erhard Rahm (University of Leipzig)

Programmatic data integration approaches such as mashups have become a viable approach to dynamically integrate web data at runtime. Key data sources for mashups include entity search engines and hidden databases that need to be queried via source-specific search interfaces or web forms. Current mashups are typically restricted to simple query approaches such as using keyword search. Such approaches may need a high number of queries if many objects have to be found. Furthermore, the effectiveness of the queries may be limited, i.e., they may miss relevant results. We therefore propose more advanced search strategies that aim at finding a set of entities with high efficiency and high effectiveness. Our strategies use different kinds of queries that are determined by source-specific query generators. Furthermore, the queries are selected based on the characteristics of input entities. We introduce a flexible model for entity search strategies that includes a ranking of candidate queries determined by different query generators. We describe different query generators and outline their use within four entity search strategies. These strategies apply different query ranking and selection approaches to optimize efficiency and effectiveness. We evaluate our search strategies in detail for two domains: product search and publication search. The comparison with a standard keyword search shows that the proposed search strategies provide significant improvements in both domains.

CI-Rank: Ranking Keyword Search Results Based on Collective Importance

Xiaohui Yu (York University & Shandong University)

Huxia Shi (York University)

Keyword search over databases, popularized by keyword search in WWW, allows ordinary users to access database information without the knowledge of structured query languages and database schemas. Most of the previous studies in this area use IR-style ranking, which fails to consider the importance of the query answers. In this paper, we propose CI-Rank, a new approach for keyword search in databases, which considers the importance of individual nodes in a query answer and the cohesiveness of the result structure in a balanced way. CI-Rank is built upon a carefully designed model called Random Walk with Message Passing that helps capture the relationships between different nodes in the query answer. We develop a branch and bound algorithm to support the efficient generation of top-k query answers. Indexing methods are also introduced to further speed up the run-time processing of queries. Extensive experiments conducted on two real data sets with a real user query log confirm the effectiveness and efficiency of CI-Rank.

SESSION 3: STORAGE MANAGEMENT

Lookup Tables: Fine-Grained Partitioning for Distributed Databases

Aubrey L. Tatarowicz (MIT)

Carlo Curino (MIT)

Evan P. C. Jones (MIT)

Sam Madden (MIT)

The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally partition data across several nodes. Ideally, this partitioning will result in each query being executed at just one node, to avoid the overheads of distributed transactions and allow nodes to be added without increasing the amount of required coordination. For some applications, simple strategies, such as hashing on primary key, provide this property. Unfortunately, for many applications, including social networking and order-fulfillment, many-to-many relationships cause simple strategies to result in a large fraction of distributed queries. Instead, what is needed is a fine-grained partitioning, where related individual tuples (e.g., cliques of friends) are co-located in the same partition. Maintaining such a fine-grained partitioning requires the database to store a large amount of metadata about which partition each tuple resides in. We call such metadata a lookup table, and present the design of a data distribution layer that efficiently stores these tables and maintains them in the presence of inserts, deletes, and updates. We show that such tables can provide scalability for several difficult-to-partition database workloads, including Wikipedia, Twitter, and TPC-E. Our implementation provides 40% to 300% better performance on these workloads than either simple range or hash partitioning and shows greater potential for further scale-out.
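The routing idea behind a lookup table can be sketched in a few lines. This is a minimal illustration rather than the data distribution layer described in the abstract: `LookupTableRouter`, `hash_partition`, the modulo hash, and the fallback to hashing for keys without an entry are all hypothetical choices made for this example.

```python
def hash_partition(key, num_partitions):
    # Stand-in for hash partitioning on an integer primary key.
    return key % num_partitions

class LookupTableRouter:
    """Maps individual tuple keys to partitions (hypothetical sketch)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.table = {}  # key -> partition id: the "lookup table" metadata

    def place(self, key, partition):
        # Record an explicit, fine-grained placement for one tuple.
        self.table[key] = partition

    def partition_of(self, key):
        # Assumption: keys without an entry fall back to hash partitioning.
        return self.table.get(key, hash_partition(key, self.num_partitions))

    def partitions_touched(self, keys):
        # A query touching more than one partition is a distributed query.
        return {self.partition_of(k) for k in keys}
```

With two partitions, a clique of friends with keys 1, 2, and 3 spans both partitions under the modulo hash, while placing all three keys on partition 0 through the router lets a query over the clique run at a single node.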

Temporal Support for Persistent Stored Modules

Richard T. Snodgrass (University of Arizona)

Dengfeng Gao (IBM Silicon Valley Lab)

Rui Zhang (University of Arizona)

Stephen W. Thomas (Queen’s University, Kingston)

We show how to extend temporal support of SQL to the Turing-complete portion of SQL, that of persistent stored modules (PSM). Our approach requires minor new syntax beyond that already in SQL/Temporal to define and to invoke PSM routines, thereby extending the current, sequenced, and non-sequenced semantics of queries to PSM routines. Temporal upward compatibility (existing applications work as before when one or more tables are rendered temporal) is ensured. We provide a transformation that converts Temporal SQL/PSM to conventional SQL/PSM. To support sequenced evaluation of PSM routines, we define two different slicing approaches, maximal slicing and per-statement slicing. We compare these approaches empirically using a comprehensive benchmark and provide a heuristic for choosing between them.

Energy Efficient Storage Management Cooperated with Large Data Intensive Applications

Norifumi Nishikawa (The University of Tokyo)

Miyuki Nakano (The University of Tokyo)

Masaru Kitsuregawa (The University of Tokyo)

Power, especially that consumed for storing data, and cooling costs for datacenters have increased rapidly. The main applications running at datacenters are data-intensive applications such as large file servers or database systems. Recently, power management of data-intensive applications has been emphasized in the literature. Such reports discuss the importance of power savings. However, these reports lack research on power management models that exploit the I/O behaviors of data-intensive applications. This paper proposes a novel energy-efficient storage management system that monitors both application- and device-level I/O patterns at run time, and uses not only the device-level I/O pattern but also application-level patterns. First, the design of the proposed model combined with such large data-intensive applications is presented. The key features of the model are i) classifying application-level I/O into four patterns using run-time access behaviors such as the length of idle time and read/write frequency, and ii) adopting an appropriate power-saving method based on these application-level I/O patterns. Next, the proposed method is quantitatively evaluated with typical data-intensive applications such as file servers, OLTP, and DSS. It is shown that energy-efficient storage management is effective in achieving large power savings compared with traditional approaches while an application is running.


ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression

Eric R. Schendel (North Carolina State University)

Ye Jin (North Carolina State University)

Neil Shah (North Carolina State University)

Jackie Chen (Sandia National Laboratory)

C.S. Chang (Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA)

Seung-Hoe Ku (New York University)

Stephane Ethier (Princeton Plasma Physics Laboratory)

Scott Klasky (Oak Ridge National Laboratory)

Robert Latham (Argonne National Laboratory)

Robert Ross (Argonne National Laboratory)

Nagiza F. Samatova (North Carolina State University & Oak Ridge National Laboratory)

Efficient handling of large volumes of data is a necessity for exascale scientific applications and database systems. To address the growing imbalance between the amount of available storage and the amount of data being produced by high speed (FLOPS) processors on the system, data must be compressed to reduce the total amount of data placed on the file systems. General-purpose lossless compression frameworks, such as zlib and bzlib2, are commonly used on datasets requiring lossless compression. Quite often, however, scientific data sets compress poorly because of the highly entropic content represented within the data; these are referred to as hard-to-compress datasets. An important problem in better lossless data compression is to identify the hard-to-compress information and subsequently optimize the compression techniques at the byte-level. To address this challenge, we introduce the In-Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR-compress) methodology as a preconditioner of lossless compression to identify and optimize the compression efficiency and throughput of hard-to-compress datasets.

SESSION 4: DATA STREAMS PROCESSING

Physically Independent Stream Merging

Badrish Chandramouli (Microsoft Research)

David Maier (Portland State University)

Jonathan Goldstein (Microsoft Corp.)

A facility for merging equivalent data streams can support multiple capabilities in a data stream management system (DSMS), such as query-plan switching and high availability. One can logically view a data stream as a temporal table of events, each associated with a lifetime (time interval) over which the event contributes to output. In many applications, the “same” logical stream may present itself in multiple physical forms, for example, due to disorder arising in transmission or from combining multiple sources, or due to modifications of earlier events. Merging such streams correctly is challenging when the streams may differ physically in timing, order, and composition. This paper introduces a new stream operator called Logical Merge (LMerge) that takes multiple logically consistent streams as input and outputs a single stream that is compatible with all of them. LMerge can handle the dynamic attachment and detachment of input streams. We present a range of algorithms for LMerge that can exploit compile-time stream properties for efficiency. Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes orders-of-magnitude more efficient than enforcing determinism on inputs, and that there is benefit to using specialized algorithms when stream variability is limited. We also show that LMerge and its extensions can provide performance benefits in several real-world applications.

On Computing Correlated Aggregates over a Data Stream

Srikanta Tirthapura (Iowa State University)

David P. Woodruff (IBM Almaden Research Center)

On a stream of two-dimensional data items (x,y) where x is an item identifier, and y is a numerical attribute, a correlated aggregate query requires us to first apply a selection predicate along the second (y) dimension, followed by an aggregation along the first (x) dimension. For selection predicates of the form (y < c) or (y > c), where parameter c is provided at query time, we present new streaming algorithms and lower bounds for estimating statistics of the resulting substream of elements that satisfy the predicate. We provide the first sublinear space algorithms for a large family of statistics in this model, including frequency moments. We experimentally validate our algorithms, showing that their memory requirements are significantly smaller than existing linear storage schemes for large datasets, while simultaneously achieving fast per-record processing time. We also study the problem when the items have weights. Allowing negative weights allows for analyzing values which occur in the symmetric difference of two datasets. We give a strong space lower bound which holds even if the algorithm is allowed up to a logarithmic number of passes over the data (before the query is presented). We complement this with a small space algorithm which uses a logarithmic number of passes.
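To make the query semantics concrete, the sketch below answers a correlated aggregate exactly by keeping the whole stream. This is the kind of linear-storage baseline the abstract compares against, not one of the sublinear algorithms; the function name `correlated_f2` and the choice of the second frequency moment as the aggregate are assumptions for illustration.

```python
from collections import Counter

def correlated_f2(stream, c):
    """Exact second frequency moment of the items x whose y satisfies y < c.

    `stream` is an iterable of (x, y) pairs. Because c is only known at
    query time, this baseline must retain every pair.
    """
    # Selection along the y dimension, then counting along the x dimension.
    freq = Counter(x for x, y in stream if y < c)
    # F2 is the sum of squared item frequencies in the selected substream.
    return sum(f * f for f in freq.values())
```

For the stream `[("a", 1), ("b", 5), ("a", 2), ("c", 9), ("a", 7)]`, the predicate y < 6 selects a, b, a, so the frequencies are 2 and 1 and the result is 5.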

Accuracy-Aware Uncertain Stream Databases

Tingjian Ge (University of Kentucky)

Fujun Liu (University of Kentucky)

Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A missing element is how accurate these probability distributions are. This has a profound impact on the accuracy of query results presented to end users. While there is some previous work that studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge, we are the first to consider an uncertain stream database in which accuracy is taken into consideration all the way from the learned distributions based on raw data samples to the query results. We perform an initial study of various components in an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain query results’ accuracy. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.



On Discovery of Traveling Companions from Streaming Trajectories

Lu-An Tang (UIUC)

Yu Zheng (MSRA)

Jing Yuan (MSRA)

Jiawei Han (UIUC)

Alice Leung (BBN)

Chih-Chieh Hung (Yahoo!)

Wen-Chih Peng (NCTU)

The advance of object tracking technologies leads to huge volumes of spatio-temporal data collected in the form of trajectory data streams. In this study, we investigate the problem of discovering object groups that travel together (i.e., traveling companions) from trajectory streams. Such a technique has broad applications in the areas of scientific study, transportation management and military surveillance. To discover traveling companions, the monitoring system should cluster the objects of each snapshot and intersect the clustering results to retrieve moving-together objects. Since both clustering and intersection steps involve high computational overhead, the key issue of companion discovery is to improve the algorithm’s efficiency. We propose the models of closed companion candidates and smart intersection to accelerate data processing. A new data structure termed traveling buddy is designed to facilitate scalable and flexible companion discovery on trajectory streams. The traveling buddies are micro-groups of objects that are tightly bound together. By only storing the object relationships rather than their spatial coordinates, the buddies can be dynamically maintained along the trajectory stream with low cost. Based on traveling buddies, the system can discover companions without accessing the object details. The proposed methods are evaluated with extensive experiments on both real and synthetic datasets. The buddy-based method is an order of magnitude faster than existing methods. It also outperforms other competitors with higher precision and recall in companion discovery.
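The cluster-then-intersect step that the abstract describes as the costly baseline can be sketched as follows. The sketch assumes the per-snapshot clustering has already been done and is given as lists of object-id sets; `discover_companions` and `min_size` are hypothetical names, and the closed-candidate, smart-intersection, and traveling-buddy techniques are not implemented here.

```python
def discover_companions(snapshots, min_size):
    """Intersect per-snapshot clusters and keep groups that stay together.

    `snapshots` is a list of snapshots in time order; each snapshot is a
    list of clusters, and each cluster is a set of object ids.
    """
    if not snapshots:
        return set()
    # Every sufficiently large cluster of the first snapshot is a candidate.
    candidates = {frozenset(c) for c in snapshots[0] if len(c) >= min_size}
    for clusters in snapshots[1:]:
        survivors = set()
        for candidate in candidates:
            for cluster in clusters:
                # Objects of the candidate that are still clustered together.
                common = candidate & frozenset(cluster)
                if len(common) >= min_size:
                    survivors.add(common)
        candidates = survivors
    return candidates
```

Each new snapshot compares every candidate with every cluster, which is the overhead the abstract sets out to reduce. The sketch also makes no attempt to prune candidates that are subsets of other candidates.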

SESSION 5: GRAPHS

Iterative Graph Feature Mining for Graph Indexing

Dayu Yuan (Penn State University)

Prasenjit Mitra (Penn State University)

Huiwen Yu (Penn State University)

C. Lee Giles (Penn State University)

Subgraph search is a popular query scenario on graph databases. Given a query graph q, the subgraph search algorithm returns all database graphs having q as a subgraph. In order to quickly process the subgraph search, subgraph features are mined to index the graph database. Many subgraph feature mining approaches have been proposed. They are all “mine-at-once” algorithms in which the whole feature set is mined with one run of the mining before building a stable graph index. However, due to changes in the environment (such as updates to the graph database and increases in available memory), the index needs to be updated to accommodate those changes. Since most of the “mine-at-once” algorithms involve frequent subgraph or subtree mining over the whole graph database, and constructing and deploying a new index involve expensive disk operations, it is not efficient to re-mine the features and rebuild the index from scratch. We observe that, in most cases, it is sufficient to update a small part of the graph index. In this paper, we propose an “iterative subgraph mining” algorithm, finding one feature to insert into (or remove from) the index iteratively. Since the majority of indexing features and the index structure are not changed, the algorithm can be frequently invoked. We first introduce the objective function that guides the feature mining. Then, a basic branch and bound algorithm is proposed to mine the features. Finally, we design an advanced search algorithm, which quickly finds a near-optimum subgraph feature and reduces the search space. Experiments show that our feature mining algorithm is 5 times faster than GIndex on updating the graph index, and features mined by the iterative algorithm have a high filtering rate on the subgraph search problem.

An Efficient Graph Indexing Method

Xiaoli Wang (National University of Singapore)

Xiaofeng Ding (Huazhong University of Science and Technology)

Anthony K.H. Tung (National University of Singapore)

Shanshan Ying (National University of Singapore)

Hai Jin (Huazhong University of Science and Technology)

Graphs are popular models for representing complex structure data and similarity search for graphs has become a fundamental research problem. Many techniques have been proposed to support similarity search based on the graph edit distance. However, they all suffer from certain drawbacks: high computational complexity, poor scalability in terms of database size, or not taking full advantage of indexes. To address these problems, in this paper, we propose SEGOS, an indexing and query processing framework for graph similarity search. First, an effective two-level index is constructed off-line based on sub-unit decomposition of graphs. Then, a novel search strategy based on the index is proposed. Two algorithms adapted from TA and CA methods are seamlessly integrated into the proposed strategy to enhance graph search. More specifically, the proposed framework can easily be pipelined to support continuous graph pruning. Extensive experiments are conducted on two real datasets to evaluate the effectiveness and scalability of our approaches.

PRAGUE: Towards Blending Practical Visual Subgraph Query Formulation and Query Processing

Changjiu Jin (Nanyang Technological University)

Sourav S. Bhowmick (Nanyang Technological University)

Byron Choi (Hong Kong Baptist University)

Shuigeng Zhou (Fudan University)

In a previous paper, we laid out the vision of a novel graph query processing paradigm where instead of processing a visual query graph after its construction, it interleaves visual query formulation and processing by exploiting the latency offered by the GUI to filter irrelevant matches and prefetch partial query results [8]. Our first attempt at implementing this vision, called GBLENDER [8], shows significant improvement in system response time (SRT) for subgraph containment queries. However, GBLENDER suffers from two key drawbacks, namely inability to handle visual subgraph similarity queries and inefficient support for visual query modification, limiting its usage in practical environments. In this paper, we propose a novel algorithm called PRAGUE (PRactical visuAl Graph QUery blEnder), that addresses these limitations by exploiting a novel data structure called spindle-shaped graphs (SPIG). A SPIG succinctly records various information related to the set of supergraphs of a newly added edge in the visual query fragment. Specifically, PRAGUE realizes a unified visual framework to support SPIG-based processing of modification-efficient subgraph containment and similarity queries. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of PRAGUE.

Ego-centric Graph Pattern Census<br />

Walaa Eldin Moustafa (University of Maryland, college Park)<br />

Amol Deshpande (University of Maryland, college Park)<br />

Lise Getoor (University of Maryland, college Park)<br />

There is increasing interest in analyzing networks of all types including social, biological, sensor, computer, and transportation networks. Broadly speaking, we may be interested in global network-wide analysis (e.g., centrality analysis, community detection) where the properties of the entire network are of interest, or local ego-centric analysis where the focus is on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs. In this paper we propose and study ego-centric pattern census queries, a new type of graph analysis query, where a given structural pattern is searched for in every node’s neighborhood and the counts are reported or used in further analysis. This kind of analysis is useful in many domains in social network analysis including opinion leader identification, node classification, link prediction, and role identification. We propose an SQL-based declarative language to support this class of queries, and develop a series of efficient query evaluation algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs. We also show an application of our language in a real-world scenario for predicting future collaborations from DBLP data.
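
As a rough illustration of what a pattern census computes, the sketch below counts triangles inside each node's 1-hop neighborhood of a small undirected graph. The choice of pattern (a triangle), the neighborhood radius, and the names `ego_triangle_census`, `adj`, and `graph` are assumptions made for this example; the SQL-based language and the evaluation algorithms of the paper are not reproduced here.

```python
from itertools import combinations

def ego_triangle_census(adj):
    """Count triangles inside each node's 1-hop ego network.

    adj maps every node to the set of its neighbors (undirected graph).
    The ego network of v is the subgraph induced by v and its neighbors.
    """
    census = {}
    for ego, neighbors in adj.items():
        members = sorted(neighbors | {ego})
        count = 0
        # Brute force: test every 3-subset of the ego network for a triangle.
        for a, b, c in combinations(members, 3):
            if b in adj[a] and c in adj[a] and c in adj[b]:
                count += 1
        census[ego] = count
    return census

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
census = ego_triangle_census(graph)
```

Here nodes 1, 2, and 3 each see one triangle in their neighborhood, while node 4 sees none. The brute-force enumeration is only meant to make the query semantics concrete.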

SESSION 6: UNCERTAIN AND PROBABILISTIC DATABASES

Searching Uncertain Data Represented by Non-Axis Parallel Gaussian Mixture Models

Katrin Haegler (University of Munich)

Frank Fiedler (University of Munich)

Christian Böhm (University of Munich)

Efficient similarity search in uncertain data is a central problem in many modern applications such as biometric identification, stock market analysis, sensor networks, medical imaging, etc. In such applications, the feature vector of an object is not exactly known but is rather defined by a probability density function like a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions; hence, correlations between different features are not considered in the similarity search. In this paper, we propose a novel, efficient similarity search technique for general GMMs without independence assumption for the attributes, named SUDN, which approximates the actual components of a GMM in a conservative but tight way. A filter-refinement architecture guarantees no false dismissals, due to conservativity, as well as a good filter selectivity, due to the tightness of our approximations. An extensive experimental evaluation of SUDN demonstrates a considerable speed-up of similarity queries on general GMMs and an increase in accuracy compared to existing approaches.

Aggregate Query Answering on Possibilistic Data with Cardinality Constraints

Graham Cormode (AT&T Labs – Research)

Entong Shen (North Carolina State University)

Divesh Srivastava (AT&T Labs – Research)

Ting Yu (North Carolina State University)

Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information or has been deliberately perturbed or coarsened to remove sensitive details. An important case which arises in many real applications is when the data describes a set of possibilities, but with cardinality constraints. These constraints represent correlations between tuples, encoding, e.g., that at most two possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations, beyond simple mutual exclusion and co-existence constraints. Vitally, they have little support for efficiently handling aggregate queries on such data. In this paper, we aim to address some of these deficiencies by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, to show that it enables modeling and querying such data, which was not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.
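
To make "exact bounds for aggregate queries" under a cardinality constraint concrete, the following sketch enumerates every possible world of a tiny relation and reports the minimum and maximum of a SUM aggregate. It is a brute-force stand-in, not LICM itself: the abstract describes the use of integer programming solvers, whereas this example simply enumerates 0/1 assignments. The function name `aggregate_bounds` and the sample values are hypothetical.

```python
from itertools import product

def aggregate_bounds(values, max_true):
    """Exact (min, max) of SUM over all possible worlds.

    values[i] is the attribute value of uncertain tuple i. A possible world
    assigns a 0/1 indicator to each tuple; the cardinality constraint allows
    at most max_true tuples to be present.
    """
    sums = [
        sum(v for v, present in zip(values, world) if present)
        for world in product((0, 1), repeat=len(values))
        if sum(world) <= max_true
    ]
    return min(sums), max(sums)

# Hypothetical example: three candidate records, at most two are correct.
bounds = aggregate_bounds([10, 25, 40], max_true=2)
```

For these values the bounds are 0 (no record present) and 65 (the two largest records present). Enumeration is exponential in the number of tuples, which is why a solver-based formulation is attractive for realistic sizes.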

Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data

Yongxin Tong (Hong Kong University of Science and Technology)

Lei Chen (Hong Kong University of Science and Technology)

Bolin Ding (University of Illinois at Urbana-Champaign)

In recent years, many new applications, such as sensor network monitoring and moving object search, have shown the growing importance of uncertain data management and mining. In this paper, we study the problem of discovering threshold-based frequent closed itemsets over probabilistic data. Frequent itemset mining over probabilistic databases has attracted much attention recently. However, existing solutions may lead to an exponential number of results due to the downward closure property over probabilistic data. Moreover, it is hard to directly extend the successful experiences from mining exact data to a probabilistic environment due to the inherent uncertainty of data. Thus, in order to obtain a reasonable result set with small size, we study discovering frequent closed itemsets over probabilistic data. We prove that even a sub-problem of this problem, computing the frequent closed probability of an itemset, is #P-hard. Therefore, we develop an efficient mining algorithm based on a depth-first search strategy to obtain all probabilistic frequent closed itemsets. To reduce the search space and avoid redundant computation, we further design several probabilistic pruning and bounding techniques. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments.

Ranking Query Answers in Probabilistic Databases: Complexity and Efficient Algorithms

Dan Olteanu (Oxford)

Hongkai Wen (Oxford)

In many applications of probabilistic databases, the probabilities are mere degrees of uncertainty in the data and are not otherwise meaningful to the user. Often, users care only about the ranking of answers in decreasing order of their probabilities or about a few most likely answers. In this paper, we investigate the problem of ranking query answers in probabilistic databases. We give a dichotomy for ranking in case of conjunctive queries without repeating relation symbols: it is either in polynomial time or #P-hard. Surprisingly, our syntactic characterisation of tractable queries is not the same as for probability computation. The key observation is that there are queries for which probability computation is #P-hard, yet ranking can be computed in polynomial time. This is possible whenever probability computation for distinct answers has a common factor that is hard to compute but irrelevant for ranking. We complement this tractability analysis with an effective ranking technique for conjunctive queries. Given a query, we construct a share plan, which exposes subqueries whose probability computation can be shared or ignored across query answers. Our technique combines share plans with incremental approximate probability computation of subqueries. We implemented our technique in the SPROUT query engine and report on performance gains of orders of magnitude over Monte Carlo simulation using FPRAS and exact probability computation based on knowledge compilation.
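
The observation about a shared factor can be written in one line. Assume, purely for illustration, that the probability of each answer $a_i$ factorizes as $P(a_i) = c \cdot p_i$, where $c > 0$ is a factor common to all answers and hard to compute, and $p_i$ is an answer-specific factor that is easy to compute. The symbols $a_i$, $c$, and $p_i$ are introduced here and are not taken from the paper.

```latex
P(a_i) \ge P(a_j)
\iff c \, p_i \ge c \, p_j
\iff p_i \ge p_j
\qquad (c > 0)
```

Under this assumption the answers can be ordered by $p_i$ alone, so the ranking is obtained without ever evaluating $c$.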

SESSION 7: DATA INTEGRATION AND EXTRACTION

Joint Entity Resolution

Steven Euijong Whang (Stanford University)

Hector Garcia-Molina (Stanford University)

Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can impact the resolution of other types of records. In this paper we propose a flexible, modular resolution framework where existing ER algorithms developed for a given record type can be plugged in and used in concert with other ER algorithms. Our approach also makes it possible to run ER on subsets of similar records at a time, important when the full data is too large to resolve together. We study the scheduling and coordination of the individual ER algorithms in order to resolve the full data set. We then evaluate our joint ER techniques on synthetic and real data and show the scalability of our approach.

A Self-Configuring Schema Matching System

Eric Peukert (SAP Research Dresden)

Julian Eberius (Dresden University of Technology)

Erhard Rahm (University of Leipzig)

Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment or model management. To speed up the generation of such mappings, automatic matching systems were developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort by matching experts as well as correct mappings to evaluate generated mappings. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology and model management domains. The evaluation shows that our system is able to robustly return good quality mappings across different mapping problems and domains.

Incremental Detection of Inconsistencies in Distributed Data

Wenfei Fan (University of Edinburgh)

Jianzhong Li (Harbin Institute of Technology)

Nan Tang (University of Edinburgh & Qatar Computing Research Institute)

Wenyuan Yu (University of Edinburgh)

This paper investigates the problem of incremental detection of errors in distributed data. Given a distributed database D, a set Σ of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, changes ΔV to V in response to ΔD. The need for the study is evident since real-life data is often dirty, distributed and frequently updated. It is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for D partitioned either vertically or horizontally, even when Σ and D are fixed. Nevertheless, we show that it is bounded and, better still, actually optimal: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts even when ΔV is reasonably large.
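
The idea of computing ΔV from ΔD without rescanning D can be sketched for a single site and a single CFD. The example below handles insertions only and does not model partitioning or data shipment, so it is far simpler than the distributed algorithms described above. The class name `IncrementalCFDChecker`, the attribute names, and the sample rows are all hypothetical.

```python
from collections import defaultdict

class IncrementalCFDChecker:
    """Tracks violations of one CFD: (cond_attr = cond_val) => lhs -> rhs."""

    def __init__(self, cond_attr, cond_val, lhs, rhs):
        self.cond_attr, self.cond_val = cond_attr, cond_val
        self.lhs, self.rhs = lhs, rhs
        # groups[lhs value][rhs value] = ids of tuples seen with that pair
        self.groups = defaultdict(lambda: defaultdict(set))
        self.violations = set()  # V: ids of tuples involved in a violation

    def insert(self, tid, row):
        """Apply one insertion from delta-D and return the change delta-V."""
        if row[self.cond_attr] != self.cond_val:
            return set()  # the CFD's pattern does not apply to this tuple
        group = self.groups[row[self.lhs]]
        was_violating = len(group) >= 2  # group already had conflicting rhs
        group[row[self.rhs]].add(tid)
        if len(group) < 2:
            return set()  # every tuple in the group still agrees on rhs
        if was_violating:
            delta_v = {tid}  # older members are already recorded in V
        else:
            delta_v = set().union(*group.values())  # whole group turns bad
        self.violations |= delta_v
        return delta_v

checker = IncrementalCFDChecker("country", "UK", "zip", "city")
d1 = checker.insert(1, {"country": "UK", "zip": "EH8", "city": "Edinburgh"})
d2 = checker.insert(2, {"country": "UK", "zip": "EH8", "city": "London"})
d3 = checker.insert(3, {"country": "UK", "zip": "EH8", "city": "Edinburgh"})
d4 = checker.insert(4, {"country": "US", "zip": "EH8", "city": "Boston"})
```

Each insertion touches only the group that shares the new tuple's left-hand-side value, and the work done is proportional to the size of the returned ΔV, not to the size of the stored data.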

Recomputing Materialized Instances after Changes to Mappings and Data

Todd J. Green (University of California, Davis)

Zachary G. Ives (University of Pennsylvania)

A major challenge faced by today’s information systems is that of evolution as data usage evolves or new data resources become available. Modern organizations sometimes exchange data with one another via declarative mappings among their databases, as in data exchange and collaborative data sharing systems. Such mappings are frequently revised and refined as new data becomes available, new cross-reference tables are created, and corrections are made. A fundamental question is how to handle changes to these mapping definitions, when the organizations each materialize the results of applying the mappings to the available data. We consider how to incrementally recompute these database instances in this setting, reusing (if possible) previously computed instances to speed up computation. We develop a principled solution that performs cost-based exploration of recomputation versus reuse, and simultaneously handles updates to source data and mapping definitions through a single, unified mechanism. Our solution also takes advantage of provenance information, when present, to speed up computation even further. We present an implementation that takes advantage of an off-the-shelf DBMS’s query processing system, and we show experimentally that our approach provides substantial performance benefits.

SESSION 8: SPATIO-TEMPORAL DATA MANAGEMENT

SWST: A Disk Based Index for Sliding Window Spatio-Temporal Data

Manish Singh (University of Michigan, Ann Arbor)

Qiang Zhu (University of Michigan, Dearborn)

H.V. Jagadish (University of Michigan, Ann Arbor)

Numerous applications such as wireless communication and telematics need to keep track of the evolution of spatio-temporal data for a limited past. Limited retention may even be required by regulations. In general, each data entry can have its own user-specified lifetime. It is desired that expired entries are automatically removed by the system through some garbage collection mechanism. This kind of limited retention can be achieved by using sliding window semantics similar to those of stream data processing. However, due to the large volume and relatively long lifetime of data in the aforementioned applications (in contrast to the real-time transient streaming data), the sliding window here needs to be maintained for data on disk rather than in memory. It is a new challenge to provide fast access to the information from the recent past and, at the same time, facilitate efficient deletion of the expired entries. In this paper, we propose a disk based, two-layered, sliding window indexing scheme for discretely moving spatio-temporal data. Our index can support efficient processing of standard timeslice and interval queries and delete expired entries with almost no overhead. In existing historical spatio-temporal indexing techniques, deletion is either infeasible or very inefficient. Our sliding window based processing model can support both current and past entries, while many existing historical spatio-temporal indexing techniques cannot keep these two types of data together in the same index. Our experimental comparison with the best known historical index (i.e., the MV3R tree) for discretely moving spatio-temporal data shows that our index is about five times faster in terms of insertion time and comparable in terms of search performance. MV3R follows a partial persistency model, whereas our index can support very efficient deletion and update.

Querying Uncertain Spatio-Temporal Data

Tobias Emrich (Ludwig-Maximilians-Universität München)

Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)

Nikos Mamoulis (University of Hong Kong)

Matthias Renz (Ludwig-Maximilians-Universität München)

Andreas Züfle (Ludwig-Maximilians-Universität München)

The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia and sensor databases. There exists a wide range of work covering spatial uncertainty in the static (snapshot) case, where only one point of time is considered. In contrast, the problem of modeling and querying uncertain spatio-temporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. In this work, we present a framework for efficiently modeling and querying uncertain spatio-temporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First, it allows answering queries in accordance with the possible worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. Third, it is possible to reduce all queries on this model to simple matrix multiplications. Based on these concepts we propose efficient solutions for different probabilistic spatio-temporal queries. In an experimental evaluation we show that our approaches are several orders of magnitude faster than state-of-the-art competitors.
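
A small sketch may help show how a query over a stochastic trajectory model turns into matrix multiplication. It assumes, for illustration only, a first-order Markov chain over a discretized space: the location distribution after t steps is the initial row vector multiplied t times by a transition matrix. The abstract says only "stochastic processes", so the Markov assumption, the names `step` and `location_distribution`, and the two-cell matrix `M` are all hypothetical.

```python
def step(dist, transition):
    """Advance one time step: row vector times transition matrix."""
    n = len(transition)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

def location_distribution(initial, transition, steps):
    """Distribution over cells after the given number of time steps."""
    dist = list(initial)
    for _ in range(steps):
        dist = step(dist, transition)
    return dist

# Hypothetical two-cell space. From cell 0 the object stays or moves to
# cell 1 with equal probability; once in cell 1 it remains there.
M = [[0.5, 0.5],
     [0.0, 1.0]]
dist = location_distribution([1.0, 0.0], M, steps=2)
```

Starting from a certain observation in cell 0, the object is in cell 1 two steps later with probability 0.75. Because each step depends on the previous one, the dependency between consecutive timestamps is part of the model rather than ignored.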

The Min-dist Location Selection Query

Jianzhong Qi (University of Melbourne)

Rui Zhang (University of Melbourne)

Lars Kulik (University of Melbourne)

Dan Lin (Missouri University of Science and Technology)

Yuan Xue (University of Melbourne)

We propose and study a new type of location optimization problem: given a set of clients and a set of existing facilities, we select a location from a given set of potential locations for establishing a new facility so that the average distance between a client and her nearest facility is minimized. We call this problem the min-dist location selection problem, which has a wide range of applications in urban development simulation, massively multiplayer online games, and decision support systems. We explore two common approaches to location optimization problems and propose methods based on those approaches for solving this new problem. However, those methods either need to maintain an extra index or fall short in efficiency. To address their drawbacks, we propose a novel method (named MND), which has very close performance to the fastest method but does not need an extra index. We provide a detailed comparative cost analysis of the various algorithms. We also perform extensive experiments to evaluate their empirical performance and validate the efficiency of the MND method.
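
The problem statement translates directly into a brute-force baseline, sketched below. This is not the MND method, which is not described in enough detail here to reproduce; it simply evaluates every candidate. Euclidean distance, the function name `min_dist_location`, and the sample coordinates are assumptions made for this example.

```python
import math

def min_dist_location(clients, facilities, candidates):
    """Return the candidate that minimizes the average distance between each
    client and its nearest facility once the candidate is added."""
    # Distance from each client to its nearest existing facility.
    nearest = [min(math.dist(c, f) for f in facilities) for c in clients]

    def average_with(candidate):
        # A client switches to the new facility only if it is closer.
        total = sum(min(d, math.dist(c, candidate))
                    for c, d in zip(clients, nearest))
        return total / len(clients)

    return min(candidates, key=average_with)

# Hypothetical instance: one facility near the first client, two far clients.
clients = [(0, 0), (10, 0), (10, 2)]
facilities = [(0, 1)]
candidates = [(1, 0), (10, 1)]
best = min_dist_location(clients, facilities, candidates)
```

In this instance the candidate at (10, 1) wins because it serves both distant clients at distance 1. The baseline costs one distance computation per client and candidate pair, which is the kind of cost an index-based or pruning method aims to avoid.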

Bi-level Locality Sensitive Hashing for K-Nearest Neighbor Computation

Jia Pan (UNC Chapel Hill)

Dinesh Manocha (UNC Chapel Hill)

We present a new Bi-level LSH algorithm to perform approximate k-nearest neighbor search in high dimensional spaces. Our formulation is based on a two-level scheme. In the first level, we use an RP-tree that divides the dataset into sub-groups with bounded aspect ratios and is used to distinguish well-separated clusters. During the second level, we compute a single LSH hash table for each sub-group along with a hierarchical structure based on space-filling curves. Given a query, we first determine the sub-group that it belongs to and perform k-nearest neighbor search within the suitable buckets in the LSH hash table corresponding to the sub-group. Our algorithm also maps well to current GPU architectures and can improve the quality of approximate KNN queries as compared to prior LSH-based algorithms. We highlight its performance on two large, high-dimensional image datasets. Given a runtime budget, Bi-level LSH can provide better accuracy in terms of recall or error ratio. Moreover, our formulation reduces the variation in runtime cost or the quality of results.

SESSION 9: QUERY PROCESSING

Learning-based Query Performance Modeling and Prediction

Mert Akdere (Brown University)

Ugur Cetintemel (Brown University)

Matteo Riondato (Brown University)

Eli Upfal (Brown University)

Stanley B. Zdonik (Brown University)

Accurate query performance prediction (QPP) is central to effective resource management, query optimization and query scheduling. Analytical cost models, used in the current generation of query optimizers, have been successful in comparing the costs of alternative query plans, but they are poor predictors of execution latency. As a more promising approach to QPP, this paper studies the practicality and utility of sophisticated learning-based models, which have recently been applied to a variety of predictive tasks with great success, in both static (i.e., fixed) and dynamic query workloads. We propose and evaluate predictive modeling techniques that learn query execution behavior at different granularities, ranging from coarse-grained plan-level models to fine-grained operator-level models. We demonstrate that these two extremes offer a tradeoff between high accuracy for static workload queries and generality to unforeseen queries in dynamic workloads, respectively, and introduce a hybrid approach that combines their respective strengths by selectively composing them in the process of QPP. We discuss how we can use a training workload to (i) pre-build and materialize such models offline, so that they are readily available for future predictions, and (ii) build new models online as new predictions are needed. All prediction models are built using only static features (available prior to query execution) and the performance values obtained from the offline execution of the training workload. We fully implemented all these techniques and extensions on top of PostgreSQL and evaluated them experimentally by quantifying their effectiveness over analytical workloads, represented by well-established TPC-H data and queries. The results provide quantitative evidence that learning-based modeling for QPP is both feasible and effective for both static and dynamic workload scenarios.

Parametric Plan Caching Using Density-Based Clustering

Gunes Aluc (University of Waterloo)

David E. DeHaan (Sybase, an SAP Company)

Ivan T. Bowman (Sybase, an SAP Company)

Query plan caching eliminates the need for repeated query optimization; hence, it has strong practical implications for relational database management systems (RDBMSs). Unfortunately, existing approaches consider only the query plan generated at the expected values of parameters that characterize the query, data and the current state of the system, while these parameters may take different values during the lifetime of a cached plan. A better alternative is to harvest the optimizer’s plan choice for different parameter values, populate the cache with promising query plans, and select a cached plan based upon current parameter values. To address this challenge, we propose a parametric plan caching (PPC) framework that uses an online plan space clustering algorithm. The clustering algorithm is density-based, and it exploits locality-sensitive hashing as a pre-processing step so that clusters in the plan spaces can be efficiently stored in database histograms and queried in constant time. We experimentally validate that our approach is precise, efficient in space-and-time and adaptive, requiring no eager exploration of the plan spaces of the optimizer.

Effective and Robust Pruning for Top-Down Join Enumeration Algorithms

Pit Fender (Mannheim University)

Guido Moerkotte (Mannheim University)

Thomas Neumann (Technical University of Munich)

Viktor Leis (Technical University of Munich)

Finding the optimal execution order of join operations is a crucial task of today’s cost-based query optimizers. There are two approaches to identify the best plan: bottom-up and top-down join enumeration. For both optimization strategies efficient algorithms have been published. However, only the top-down approach allows for branch-and-bound pruning. Two pruning techniques can be found in the literature. We add six new ones. Combined, they improve performance roughly by an average factor of 2-5. Even more important, our techniques improve the worst case by two orders of magnitude. Additionally, we introduce a new, very efficient, and easy to implement top-down join enumeration algorithm. This algorithm, together with our improved pruning techniques, yields performance that is higher by an average factor of 6-9 than that of the original top-down enumeration algorithm with the original pruning methods.

Towards Preference-aware Relational Databases

Anastasios Arvanitis (National Technical University of Athens)

Georgia Koutrika (IBM Almaden Research Center)

In implementing preference-aware query processing, a straightforward option is to build a plug-in on top of the database engine. However, treating the DBMS as a black box affects both the expressivity and performance of queries with preferences. In this paper, we argue that preference-aware query processing needs to be pushed closer to the DBMS. We present a preference-aware relational data model that extends database tuples with preferences and an extended algebra that captures the essence of processing queries with preferences. A key novelty of our preference model itself is that it defines a preference in three dimensions: the tuples affected, their preference scores, and the credibility of the preference. Our query processing strategies push preference evaluation inside the query plan and leverage its algebraic properties for finer-grained query optimization. We experimentally evaluate the proposed strategies. Finally, we compare our framework to a pure plug-in implementation and we show its feasibility and advantages.

SESSION 10: LOCATION AWARE DATA PROCESSING

A Foundation for Efficient Indoor Distance-Aware Query Processing

Hua Lu (Aalborg University)
Xin Cao (Nanyang Technological University)
Christian S. Jensen (Aarhus University)

Indoor spaces accommodate large numbers of spatial objects, e.g., points of interest (POIs), and moving populations. A variety of services, e.g., location-based services and security control, are relevant to indoor spaces. Such services can be improved substantially if they are capable of utilizing indoor distances. However, existing indoor space models do not account well for indoor distances. To address this shortcoming, we propose a data management infrastructure that captures indoor distance and facilitates distance-aware query processing. In particular, we propose a distance-aware indoor space model that integrates indoor distance seamlessly. To enable the use of the model as a foundation for query processing, we develop accompanying, efficient algorithms that compute indoor distances for different indoor entities like doors as well as locations. We also propose an indexing framework that accommodates indoor distances that are pre-computed using the proposed algorithms. On top of this foundation, we develop efficient algorithms for typical indoor, distance-aware queries. The results of an extensive experimental evaluation demonstrate the efficacy of the proposals.

LARS: A Location-Aware Recommender System

Justin J. Levandoski (Microsoft Research)
Mohamed Sarwat (University of Minnesota)
Ahmed Eldawy (University of Minnesota)
Mohamed F. Mokbel (University of Minnesota)

This paper proposes LARS, a location-aware recommender system that uses location-based ratings to produce recommendations. Traditional recommender systems do not consider spatial properties of users or items; LARS, on the other hand, supports a taxonomy of three novel classes of location-based ratings, namely, spatial ratings for non-spatial items, non-spatial ratings for spatial items, and spatial ratings for spatial items. LARS exploits user rating locations through user partitioning, a technique that influences recommendations with ratings spatially close to querying users in a manner that maximizes system scalability while not sacrificing recommendation quality. LARS exploits item locations using travel penalty, a technique that favors recommendation candidates closer in travel distance to querying users in a way that avoids exhaustive access to all spatial items. LARS can apply these techniques separately, or in concert, depending on the type of location-based rating available. Experimental evidence using large-scale real-world data from both the Foursquare location-based social network and the MovieLens movie recommendation system reveals that LARS is efficient, scalable, and capable of producing recommendations twice as accurate as those of existing recommendation approaches.

Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme

Miao Qiao (The Chinese University of Hong Kong)
Hong Cheng (The Chinese University of Hong Kong)
Lijun Chang (The Chinese University of Hong Kong)
Jeffrey Xu Yu (The Chinese University of Hong Kong)

The shortest distance query between two nodes is a fundamental operation in large-scale networks. Most existing methods in the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and computes the shortest distances from each landmark to all nodes as an embedding. To handle a shortest distance query between two nodes, the precomputed distances from the landmarks to the query nodes are used to compute an approximate shortest distance based on the triangle inequality. In this paper, we analyze the factors that affect the accuracy of the distance estimation in the landmark embedding approach. In particular, we find that a globally selected, query-independent landmark set plus the triangulation-based distance estimation introduces a large relative error, especially for nearby query nodes. To address this issue, we propose a query-dependent local landmark scheme, which identifies a local landmark close to the specific query nodes and provides a more accurate distance estimation than the traditional global landmark approach. Specifically, a local landmark is defined as the least common ancestor of the two query nodes in the shortest path tree rooted at a global landmark. We propose efficient local landmark indexing and retrieval techniques, which are crucial to achieve low offline indexing complexity and online query complexity. Two optimization techniques on graph compression and graph online search are also proposed, with the goal to further reduce index size and improve query accuracy. Our experimental results on large-scale social networks and road networks demonstrate that the local landmark scheme reduces the shortest distance estimation error significantly when compared with global landmark embedding.
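To make the difference between the two estimates concrete, the following is a minimal sketch, assuming an unweighted, undirected graph stored as an adjacency dictionary and a single global landmark. The function names (`bfs_tree`, `global_estimate`, `local_estimate`) are illustrative and not taken from the paper, and the sketch omits the indexing, compression, and online search techniques the abstract mentions.

```python
from collections import deque


def bfs_tree(graph, root):
    """Build a shortest path tree from `root` (unweighted graph assumed).

    Returns (dist, parent): hop distance from the root and the tree parent
    of every reachable node.
    """
    dist, parent = {root: 0}, {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent


def global_estimate(dist, s, t):
    """Triangle-inequality upper bound through the global landmark."""
    return dist[s] + dist[t]


def local_estimate(dist, parent, s, t):
    """Upper bound through the least common ancestor of s and t in the tree."""
    ancestors = set()
    node = s
    while node is not None:          # collect s and all of its tree ancestors
        ancestors.add(node)
        node = parent[node]
    node = t
    while node not in ancestors:     # climb from t until the paths meet
        node = parent[node]
    return dist[s] + dist[t] - 2 * dist[node]


# Hypothetical toy graph: landmark "L" is attached to "a"; "b" and "c" hang off "a".
graph = {"L": ["a"], "a": ["L", "b", "c"], "b": ["a"], "c": ["a"]}
dist, parent = bfs_tree(graph, "L")
```

In this toy graph the true distance between `b` and `c` is 2. The global estimate routes through the landmark and overshoots, while the local estimate routes through their common ancestor `a`.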

Desks: Direction-Aware Spatial Keyword Search

Guoliang Li (Tsinghua University)
Jianhua Feng (Tsinghua University)
Jing Xu (Tsinghua University)

Location-based services (LBS) have been widely accepted by mobile users. Many LBS users have a direction-aware search requirement: answers must lie in the search direction. However, to the best of our knowledge, there is not yet any research available that investigates direction-aware search. A straightforward method first finds candidates without considering the direction constraint, and then generates the answers by pruning those candidates that violate the direction constraint. However, this method is rather expensive, as it involves a lot of useless computation on many unnecessary directions. To address this problem, we propose a direction-aware spatial keyword search method which inherently supports direction-aware search. We devise novel direction-aware indexing structures to prune unnecessary directions. We develop effective pruning techniques and search algorithms to efficiently answer a direction-aware query. As users may dynamically change their search directions, we propose to incrementally answer a query. Experimental results on real datasets show that our method achieves high performance and outperforms existing methods significantly.

SESSION 11: MAP-REDUCE BASED DATA PROCESSING

Extending Map-Reduce for Efficient Predicate-Based Sampling

Raman Grover (University of California, Irvine)
Michael Carey (University of California, Irvine)

In this paper we address the problem of using MapReduce to sample a massive data set in order to produce a fixed-size sample whose contents satisfy a given predicate. While it is simple to express this computation using MapReduce, its default Hadoop execution is dependent on the input size and is wasteful of cluster resources. This is unfortunate, as sampling queries are fairly common (e.g., for exploratory data analysis at Facebook), and the resulting waste can significantly impact the performance of a shared cluster. To address such use cases, we present the design, implementation and evaluation of a Hadoop execution model extension that supports incremental job expansion. Under this model, a job consumes input as required and can dynamically govern its resource consumption while producing the required results. The proposed mechanism is able to support a variety of policies regarding job growth rates as they relate to cluster capacity and current load. We have implemented the mechanism in Hadoop, and we present results from an experimental performance study of different job growth policies under both single- and multi-user workloads.
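The idea of consuming input only as required can be illustrated with a small in-memory sketch. It is not the Hadoop extension described in the abstract: the function `sample_with_predicate`, its `initial` and `growth` parameters, and the geometric growth policy are all assumptions chosen for illustration, and input splits are modelled as plain Python lists.

```python
def sample_with_predicate(splits, predicate, k, initial=1, growth=2):
    """Collect k records satisfying `predicate`, reading splits in growing batches.

    Returns (sample, splits_read). Reading stops as soon as k matches are
    found, so the work depends on predicate selectivity, not on input size.
    """
    sample = []
    splits_read = 0
    pos, batch = 0, initial
    while pos < len(splits) and len(sample) < k:
        for split in splits[pos:pos + batch]:
            splits_read += 1
            for record in split:
                if predicate(record):
                    sample.append(record)
                    if len(sample) == k:
                        break
            if len(sample) == k:
                break
        pos += batch
        batch *= growth  # hypothetical growth policy: double the batch each round
    return sample, splits_read


# Hypothetical input: four splits of two records each.
splits = [[1, 2], [3, 4], [5, 6], [7, 8]]
sample, splits_read = sample_with_predicate(splits, lambda r: r % 2 == 0, k=2)
```

Here the sample is complete after two of the four splits, so the remaining input is never touched. A real policy would also weigh cluster capacity and current load, as the abstract notes.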

Fuzzy Joins Using MapReduce

Foto Afrati (National Technical University Athens)
Anish Das Sarma (Google, Inc. - work initiated at Yahoo! Research)
David Menestrina (Google, Inc.)
Aditya Parameswaran (Stanford University)
Jeffrey D. Ullman (Stanford University)

Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.
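The requirement that each output pair be produced by exactly one reduce task can be sketched with a pigeonhole-style splitting scheme for Hamming distance. This is an illustrative, in-memory simulation of a single map/reduce round, assuming equal-length strings; it is one well-known approach and is not claimed to match any specific algorithm evaluated in the paper. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def segments(s, d):
    """Split s into d + 1 contiguous segments of near-equal length."""
    k, n = d + 1, len(s)
    bounds = [i * n // k for i in range(k + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(k)]


def map_phase(strings, d):
    """Emit ((segment index, segment value), string) for every segment."""
    for s in strings:
        for i, seg in enumerate(segments(s, d)):
            yield (i, seg), s


def reduce_phase(key, values, d):
    """Compare strings sharing a segment; emit each pair in one reducer only."""
    i, _ = key
    for a, b in combinations(sorted(set(values)), 2):
        if hamming(a, b) > d:
            continue
        segs_a, segs_b = segments(a, d), segments(b, d)
        # The pair meets in every reducer whose segment they share, so only
        # the reducer of the first shared segment is allowed to output it.
        first = next(j for j in range(d + 1) if segs_a[j] == segs_b[j])
        if first == i:
            yield a, b


def fuzzy_join(strings, d):
    """Simulate the shuffle and run every reducer; returns sorted pairs."""
    groups = defaultdict(list)
    for key, value in map_phase(strings, d):
        groups[key].append(value)
    pairs = []
    for key, values in groups.items():
        pairs.extend(reduce_phase(key, values, d))
    return sorted(pairs)


pairs = fuzzy_join(["0000", "0001", "0011", "1111"], d=1)
```

Two strings within Hamming distance d differ in at most d positions, so at least one of the d + 1 segments is identical in both, which guarantees the pair meets in some reducer. Each string is replicated d + 1 times, which is the communication cost this scheme pays.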

Parallel Top-K Similarity Join Algorithms Using MapReduce

Younghoon Kim (Seoul National University)
Kyuseok Shim (Seoul National University)

There is a wide range of applications that require finding the top-k most similar pairs of records in a given database. However, computing such top-k similarity joins is a challenging problem today, as there is an increasing trend of applications that expect to deal with vast amounts of data. For such data-intensive applications, parallel executions of programs on a large cluster of commodity machines using the MapReduce paradigm have recently received a lot of attention. In this paper, we investigate how top-k similarity join algorithms can benefit from the popular MapReduce framework. We first develop the divide-and-conquer and branch-and-bound algorithms. We next propose the all pair partitioning and essential pair partitioning methods to minimize the amount of data transferred between map and reduce functions. We finally perform experiments with not only synthetic but also real-life data sets. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Benjamin Gufler (Technische Universität München)
Nikolaus Augsten (Free University of Bolzano-Bozen)
Angelika Reiser (Technische Universität München)
Alfons Kemper (Technische Universität München)

MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution, while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is further amplified by the high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires gathering statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets, approximating the global data distribution.

SESSION 12: SOCIAL MEDIA

Community Detection with Edge Content in Social Media Networks

Guo-Jun Qi (University of Illinois at Urbana-Champaign)
Charu C. Aggarwal (IBM T. J. Watson Research Center)
Thomas S. Huang (University of Illinois at Urbana-Champaign)

The problem of community detection in social media has been widely studied in the social networking community in the context of the structure of the underlying graphs. Most community detection algorithms use the links between the nodes in order to determine the dense regions in the graph. These dense regions are the communities of social media in the graph. Such methods are typically based purely on the linkage structure of the underlying social media network. However, in many recent applications, edge content is available in order to provide better supervision to the community detection process. Many representations of edges in social interactions, such as shared images and videos, user tags, and comments, are naturally associated with content on the edges. While some work has been done on utilizing node content for community detection, the presence of edge content presents unprecedented opportunities and flexibility for the community detection process. We will show that such edge content can be leveraged in order to greatly improve the effectiveness of the community detection process in social media networks. We present experimental results illustrating the effectiveness of our approach.

Cross Domain Search by Exploiting Wikipedia

Chen Liu (National University of Singapore)
Sai Wu (National University of Singapore)
Shouxu Jiang (Harbin Institute of Technology)
Anthony K.H. Tung (National University of Singapore)

The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross-modal resource search requirement, in which a query is a resource in one modality and the results are closely related resources in other modalities. With cross-modal search, we can better exploit existing resources. Tags associated with Web 2.0 resources are an intuitive medium for linking resources of different modalities together. However, tagging is by nature an ad hoc activity. Tags often contain noise and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.

Provenance-based Indexing Support in Micro-blog Platforms

Junjie Yao (Peking University)
Bin Cui (Peking University)
Zijun Xue (Peking University)
Qingyun Liu (Peking University)

Recently, many micro-blog message sharing applications have emerged on the web. Users can publish short messages freely and are notified instantly through subscriptions. Prominent examples include Twitter, Facebook’s statuses, and Sina Weibo in China. The micro-blog platform has become a useful service for real-time information creation and propagation. However, these messages’ short length and dynamic character have posed great challenges for effective content understanding. Additionally, the noise and fragments make it difficult to discover the temporal propagation trail to explore the development of micro-blog messages. In this paper, we propose a provenance model to capture connections between micro-blog messages. Provenance refers to data origin identification and transformation logging, which has demonstrated great value in recent database and workflow systems. To cope with the real-time micro-message deluge, we utilize a novel message grouping approach to encode and maintain the provenance information. Furthermore, we adopt a summary index and several adaptive pruning strategies to implement efficient provenance updating. Based on the index, our provenance solution can support rich query retrieval and intuitive message tracking for effective message organization. Experiments conducted on a real dataset verify the effectiveness and efficiency of our approach.

Provenance refers to data origin identification and transformation monitoring, which has been demonstrated to be of great value in database and workflow systems. In this paper, we propose a provenance model in micro-blog platforms, and design an indexing scheme to support provenance-based message discovery and maintenance, which can capture the interactions of messages for effective message organization. To cope with the real-time micro-message tornadoes, we introduce a novel virtual annotation grouping approach to encode and maintain the provenance information. Furthermore, we design a summary index and adaptive pruning strategies to facilitate efficient message update. Based on this provenance index, our approach can support query and message tracking in micro-blog systems. Experiments conducted on real datasets verify the effectiveness and efficiency of our approach.

Learning Stochastic Models of Information Flow

Luke Dickens (Imperial College London)
Ian Molloy (IBM T. J. Watson Research Center)
Jorge Lobo (IBM T. J. Watson Research Center)
Pau-Chen Cheng (IBM T. J. Watson Research Center)
Alessandra Russo (Imperial College London)

An understanding of information flow has many applications, including maximizing marketing impact on social media, limiting malware propagation, and managing undesired disclosure of sensitive information. This paper presents scalable methods both for learning models of information flow in networks from data, based on the Independent Cascade Model, and for predicting probabilities of unseen flow from these models. Our approach is based on a principled probabilistic construction, and results compare favourably with existing methods in terms of accuracy of prediction and scalable evaluation, with the addition that we are able to evaluate a broader range of queries than previously shown, including probability of joint and/or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow probabilities is exponential in the number of edges and naive sampling can also be expensive, so we propose sampling in an efficient Markov-Chain Monte-Carlo fashion using the Metropolis-Hastings algorithm; details are described in the paper. We identify two types of data: attributed data, where the paths of past flows are known, and unattributed data, where only the endpoints are known. Both data types are addressed in this paper, including training methods, example real world data sets, and experimental evaluation. In particular, we investigate flow data from the Twitter micro-blogging service, exploring the flow of messages through retweets (tweet forwards) for the attributed case, and the propagation of hashtags (metadata tags) and urls for the unattributed case.
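For readers unfamiliar with the Independent Cascade Model, the following minimal sketch shows the naive sampling baseline that the abstract describes as potentially expensive: each newly activated node gets one chance to activate each inactive neighbour with the probability on that edge, and the flow probability is estimated as the fraction of simulated cascades that reach the target. It does not implement the Metropolis-Hastings sampler of the paper; the function names and the edge-list representation are assumptions.

```python
import random


def simulate_cascade(edges, source, rng):
    """Run one independent cascade from `source`.

    `edges` maps a node to a list of (neighbour, activation_probability).
    Returns the set of nodes activated in this run.
    """
    active = {source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in edges.get(u, []):
                # Each edge is tried exactly once, when u first becomes active.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_flow(edges, source, target, trials=1000, seed=0):
    """Naive Monte Carlo estimate of P(information flows source -> target)."""
    rng = random.Random(seed)
    hits = sum(target in simulate_cascade(edges, source, rng) for _ in range(trials))
    return hits / trials


# Hypothetical network: a certain edge followed by an impossible one.
edges = {"a": [("b", 1.0)], "b": [("c", 0.0)]}
p_ab = estimate_flow(edges, "a", "b")
p_ac = estimate_flow(edges, "a", "c")
```

The number of trials needed for a tight estimate grows as the flow probability shrinks, which is one reason a more efficient sampler is attractive.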

SESSION 13: P2P AND DISTRIBUTED PROCESSING

BestPeer++: A Peer-to-Peer based Large-scale Data Processing

Gang Chen (NetEase.com Inc. & Zhejiang University)
Tianlei Hu (NetEase.com Inc. & Zhejiang University)
Dawei Jiang (National University of Singapore)
Peng Lu (National University of Singapore)
Kian-Lee Tan (National University of Singapore)
Hoang Tam Vo (National University of Singapore)
Sai Wu (BestPeer Pte. Ltd. & National University of Singapore)

The corporate network is often used for sharing information among the participating companies and facilitating collaboration in a certain industry sector where companies share a common interest. It can effectively help the companies to reduce their operational costs and increase their revenues. However, inter-company data sharing and processing poses unique challenges to such a data management system, including scalability, performance, throughput, and security. In this paper, we present BestPeer++, a system which delivers elastic data sharing services for corporate network applications in the cloud based on BestPeer, a peer-to-peer (P2P) based data management platform. By integrating cloud computing, database, and P2P technologies into one system, BestPeer++ provides an economical, flexible and scalable platform for corporate network applications and delivers data sharing services to participants based on the widely accepted pay-as-you-go business model. We evaluate BestPeer++ on the Amazon EC2 Cloud platform. The benchmarking results show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale data processing system, when both systems are employed to handle typical corporate network workloads. The benchmarking results also demonstrate that BestPeer++ achieves near linear scalability for throughput with respect to the number of peer nodes.

Effective Data Density Estimation in Ring-based P2P Networks

Minqi Zhou (East China Normal University)
Heng Tao Shen (The University of Queensland)
Xiaofang Zhou (The University of Queensland)
Weining Qian (East China Normal University)
Aoying Zhou (East China Normal University)

Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue and has yet to be well addressed. It can benefit many P2P applications, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model named distribution-free data density estimation for dynamic ring-based P2P networks to achieve high estimation accuracy with low estimation cost regardless of the distribution models of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms for computing and sampling the global cumulative distribution function, based on which the global data distribution is estimated, are introduced with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.
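The inversion method that inspires the model can be sketched in a few lines: draw a uniform number u and return the smallest value whose cumulative probability is at least u. The sketch below works on a centralized list of histogram buckets, which is an assumption made purely for illustration; it does not model peers, the ring overlay, or the estimation algorithms of the paper, and the names `build_cdf` and `invert` are hypothetical.

```python
import bisect
import random


def build_cdf(buckets):
    """Turn (upper_bound, count) buckets, sorted by bound, into a discrete CDF.

    Returns (bounds, cdf) where cdf[i] is the fraction of items <= bounds[i].
    """
    total = sum(count for _, count in buckets)
    bounds, cdf, running = [], [], 0
    for bound, count in buckets:
        running += count
        bounds.append(bound)
        cdf.append(running / total)
    return bounds, cdf


def invert(bounds, cdf, u):
    """Inverse-CDF lookup: first bucket whose cumulative probability >= u."""
    i = bisect.bisect_left(cdf, u)
    return bounds[min(i, len(bounds) - 1)]


def draw(bounds, cdf, rng):
    """Generate one random variate by inverting a uniform draw."""
    return invert(bounds, cdf, rng.random())


# Hypothetical histogram: 8 items spread over four buckets, one of them empty.
bounds, cdf = build_cdf([(10, 1), (20, 3), (30, 0), (40, 4)])
value = draw(bounds, cdf, random.Random(0))
```

Because the lookup is driven by the cumulative function alone, the same procedure applies to any distribution, and empty buckets are never returned.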

Processing of Rank Joins in Highly Distributed Systems

Christos Doulkeridis (Norwegian University of Science and Technology (NTNU))
Akrivi Vlachou (Norwegian University of Science and Technology (NTNU))
Kjetil Nørvåg (Norwegian University of Science and Technology (NTNU))
Yannis Kotidis (Athens University of Economics and Business (AUEB))
Neoklis Polyzotis (UC Santa Cruz (UCSC))

In this paper, we study efficient processing of rank joins in highly distributed<br />

systems, where servers store fragments of relations in an autonomous manner.<br />

Existing rank-join algorithms exhibit poor performance in this setting due to excessive<br />

communication costs or high latency. We propose a novel distributed rank-join<br />

framework that employs data statistics, maintained as histograms, to determine the<br />

subset of each relational fragment that needs to be fetched to generate the top-k<br />

join results. At the heart of our framework lies a distributed score bound estimation<br />

algorithm that produces sufficient score bounds for each relation, that guarantee<br />

the correctness of the rank-join result set, when the histograms are accurate. Furthermore,<br />

we propose a generalization of our framework that supports approximate<br />

statistics, in the case that the exact statistical information is not available. An extensive<br />

experimental study validates the efficiency of our framework and demonstrates<br />

its advantages over existing methods.<br />

Load Balancing for MapReduce-based Entity Resolution<br />

Lars Kolb (University of Leipzig)<br />

Andreas Thor (University of Leipzig)<br />

Erhard Rahm (University of Leipzig)<br />

The effectiveness and scalability of MapReduce-based implementations of complex data-intensive<br />

tasks depend on an even redistribution of data between map and reduce<br />

tasks. In the presence of skewed data, sophisticated redistribution approaches thus<br />

become necessary to achieve load balancing among all reduce tasks to be executed<br />

in parallel. For the complex problem of entity resolution, we propose and evaluate<br />

two approaches for such skew handling and load balancing. The approaches support<br />

blocking techniques to reduce the search space of entity resolution, utilize a preprocessing<br />

MapReduce job to analyze the data distribution, and distribute the entities of<br />

large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure<br />

shows the value and effectiveness of the proposed load balancing approaches.<br />
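
One way to picture how the comparisons of a large block can be spread over several reduce tasks is the sketch below. It is an assumption-laden illustration, not the authors' two approaches: the helper `pair_ranges` is hypothetical, and it enumerates all candidate pairs in memory, whereas a MapReduce job would derive the ranges from block-size statistics gathered in a preprocessing step.

```python
from itertools import combinations


def pair_ranges(blocks, reducers):
    """Split all within-block entity pairs into near-equal contiguous ranges.

    blocks maps a blocking key to the list of entities sharing that key.
    Returns one list of pairs per reduce task (hypothetical layout).
    """
    pairs = [pair
             for key in sorted(blocks)
             for pair in combinations(blocks[key], 2)]
    size = -(-len(pairs) // reducers)  # ceiling division
    return [pairs[i * size:(i + 1) * size] for i in range(reducers)]


blocks = {"x": ["a", "b", "c", "d"], "y": ["e", "f"]}
assignment = pair_ranges(blocks, 3)
```

Here the skewed block `x` contributes six of the seven comparisons, yet no reduce task receives more than three of them.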

SESSION 14: XML AND RDF DATA MANAGEMENT<br />

Mapping XML to a Wide Sparse Table<br />

Liang Jeff Chen (UCSD)<br />

Philip A. Bernstein (Microsoft Corp.)<br />

Peter Carlin (Microsoft Corp.)<br />

Dimitrije Filipovic (Microsoft Corp.)<br />

Michael Rys (Microsoft Corp.)<br />

Nikita Shamgunov (Facebook Inc.)<br />

James F. Terwilliger (Microsoft Corp.)<br />

Milos Todic (Microsoft Corp.)<br />

Sasa Tomasevic (Microsoft Corp.)<br />

Dragan Tomic (Microsoft Corp.)<br />

XML is commonly supported by SQL database systems. However, existing mappings<br />

of XML to tables can only deliver satisfactory query performance for limited use<br />

cases. In this paper, we propose a novel mapping of XML data into one wide table<br />

whose columns are sparsely populated. This mapping provides good performance<br />

for document types and queries that are observed in enterprise applications but are<br />

not supported efficiently by existing work. XML queries are evaluated by translating<br />

them into SQL queries over the wide sparsely-populated table. We show how to<br />

translate full XPath 1.0 into SQL. Based on the characteristics of the new mapping,<br />

we present rewriting optimizations that minimize the number of joins. Experiments<br />

demonstrate that query evaluation over the new mapping delivers considerable<br />

improvements over existing techniques for the target use cases.<br />

Querying XML Data: As You Shape It<br />

Curtis E. Dyreson (Utah State University)<br />

Sourav S. Bhowmick (Nanyang Technological University)<br />

A limitation of XQuery is that a programmer has to be familiar with the shape of the<br />

data to query it effectively. And if that shape changes, or if the shape is other than<br />

what the programmer expects, the query may fail. One way to avoid this limitation<br />

is to transform the data into a desired shape. A data transformation is a rearrangement<br />

of data into a new shape. In this paper, we present the semantics and implementation<br />

of XMorph 2.0, a shape-polymorphic data transformation language for<br />

XML. An XMorph program can act as a query guard. The guard both transforms<br />

data to the shape needed by the query and determines whether and how the transformation<br />

potentially loses information; a transformation that loses information<br />

may lead to a query yielding an inaccurate result. This paper describes how to use<br />

XMorph as a query guard, gives a formal semantics for shape-to-shape transformations,<br />

documents how XMorph determines how a transformation potentially loses<br />

information, and describes the XMorph implementation.<br />

Branch Code: A Labeling Scheme for Efficient Query Answering on Trees<br />

Yanghua Xiao (Fudan University)<br />

Ji Hong (Fudan University)<br />

Wanyun Cui (Fudan University)<br />

Zhenying He (Fudan University)<br />

Wei Wang (Fudan University)<br />

Guodong Feng (Fudan University)<br />

Labeling schemes lie at the core of query processing for many tree-structured data<br />

such as XML data that is flooding the web. A labeling scheme that can simultaneously<br />

and efficiently support various relationship queries on trees (such as parent/<br />

children, descendant/ancestor, etc.), computation of lowest common ancestors<br />

(LCA) and update of trees, is desired for effective and efficient management of<br />

tree-structured data. Although a variety of labeling schemes such as prefix-based<br />

labeling, interval-based labeling and prime-based labeling as well as their variants<br />

have been available to us for encoding static and dynamic trees, these labeling<br />

schemes usually show weakness in one aspect or another. In this paper, we propose<br />

an integer-based labeling scheme branch code as well as its compressed version<br />

as our major solution to simultaneously support efficient query processing on both<br />

static and dynamic ordered trees with affordable storage cost. The proposed branch<br />

code can answer common queries on ordered trees in constant time, which comes<br />

at the cost of consuming O(N log N) storage. To reduce storage cost to O(N), a compressed<br />

branch code is further developed. We also give a relationship determination<br />

algorithm purely using compressed branch code, which has a low probability of<br />

producing false positive results as verified by experimental results. With the support<br />

of splay trees, branch code can also support dynamic trees so that updates and<br />

queries can be implemented with O(log N) amortized cost. All the results above are<br />

either theoretically proved or verified by experimental studies.<br />
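
For readers unfamiliar with labeling schemes, the sketch below shows interval-based labeling, one of the existing schemes the abstract names. It does not implement branch code itself. The function names and the dictionary-of-children tree layout are assumptions chosen for illustration.

```python
def interval_labels(tree, root):
    """Assign each node a (start, end) pair from a depth-first traversal.

    tree maps a node to the ordered list of its children.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        labels[node] = (start, counter)
        counter += 1

    visit(root)
    return labels


def is_ancestor(labels, a, d):
    """a is a proper ancestor of d iff a's interval strictly contains d's."""
    start_a, end_a = labels[a]
    start_d, end_d = labels[d]
    return start_a < start_d and end_d < end_a


tree = {"r": ["a", "b"], "a": ["c"]}
labels = interval_labels(tree, "r")
```

Ancestor tests then need only two integer comparisons, but inserting a node can force relabeling, which is the kind of weakness on dynamic trees that the abstract alludes to.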

Scalable Multi-Query Optimization for SPARQL<br />

Wangchao Le (University of Utah)<br />

Anastasios Kementsietsidis (IBM T. J. Watson Research Center)<br />

Songyun Duan (IBM T. J. Watson Research Center)<br />

Feifei Li (University of Utah)<br />

This paper revisits the classical problem of multi-query optimization in the context<br />

of RDF/SPARQL. We show that the techniques developed for relational and<br />

semi-structured data/query languages are hard, if not impossible, to extend<br />

to account for the RDF data model and graph query patterns expressed in SPARQL. In<br />

light of the NP-hardness of the multi-query optimization for SPARQL, we propose<br />

heuristic algorithms that partition the input batch of queries into groups such that<br />

each group of queries can be optimized together. An essential component of the<br />

optimization incorporates an efficient algorithm to discover the common substructures<br />

of multiple SPARQL queries and an effective cost model to compare<br />

candidate execution plans. Since our optimization techniques do not make any<br />

assumption about the underlying SPARQL query engine, they have the advantage<br />

of being portable across different RDF stores. The extensive experimental studies,<br />

performed on three popular RDF stores, show that the proposed techniques are<br />

effective, efficient and scalable.<br />

SESSION 15: PERFORMANCE<br />

GSLPI: a Cost-based Query Progress Indicator<br />

Jiexing Li (University of Wisconsin-Madison)<br />

Rimma V. Nehme (Microsoft Jim Gray Systems Lab)<br />

Jeffrey Naughton (University of Wisconsin-Madison)<br />

Progress indicators for SQL queries were first published in 2004 with the simultaneous<br />

and independent proposals from Chaudhuri et al. and Luo et al. In this paper,<br />

we implement both progress indicators in the same commercial RDBMS to investigate<br />

their performance. We summarize common cases in which they are both accurate<br />

and cases in which they fail to provide reliable estimates. Although there are<br />

differences in their performance, much more striking is the similarity in the errors<br />

they make due to a common simplifying uniform future speed assumption. While<br />

the developers of these progress indicators were aware that this assumption could<br />

cause errors, they neither explored how large the errors might be nor did they<br />

investigate the feasibility of removing the assumption. To rectify this we propose a<br />

new query progress indicator, similar to these early progress indicators but without<br />

the uniform speed assumption. Experiments show that on the TPC-H benchmark,<br />

on queries for which the original progress indicators have errors up to 30X the<br />

query running time, the new progress indicator is accurate to within 10 percent. We<br />

also discuss the sources of the errors that still remain and shed some light on what<br />

would need to be done to eliminate them.<br />
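
The uniform future speed assumption discussed above amounts to simple arithmetic, sketched here. The function name and the choice of abstract "work units" (for example, tuples processed) are assumptions for illustration; neither the earlier indicators nor GSLPI is reproduced here.

```python
def estimate_remaining_seconds(work_done, total_work, elapsed_seconds):
    """Estimate remaining time assuming future speed equals observed speed."""
    if work_done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need some observed progress to estimate speed")
    speed = work_done / elapsed_seconds      # work units per second so far
    return (total_work - work_done) / speed  # uniform future speed assumption


# 25 of 100 units done in 10 s: the estimate is 30 s more at the same speed.
remaining = estimate_remaining_seconds(25, 100, 10)
```

If the remaining 75 units belong to a slower operator, the true remaining time can be far larger than this estimate, which is the kind of error the abstract attributes to the assumption.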

Micro-Specialization in DBMSes<br />

Rui Zhang (The University of Arizona)<br />

Richard T. Snodgrass (The University of Arizona)<br />

Saumya Debray (The University of Arizona)<br />

Relational database management systems are general in the sense that they can<br />

handle arbitrary schemas, queries, and modifications; this generality is implemented<br />

using runtime metadata lookups and tests that ensure that control is channelled<br />

to the appropriate code in all cases. Unfortunately, these lookups and tests are<br />

carried out even when information is available that renders some of these operations<br />

superfluous, leading to unnecessary runtime overheads. This paper introduces<br />

micro-specialization, an approach that uses relation- and query-specific<br />

information to specialize the DBMS code at runtime and thereby eliminate some of<br />

these overheads. We develop a taxonomy of approaches and specialization times<br />

and propose a general architecture that isolates most of the creation and execution<br />

of the specialized code sequences in a separate DBMS-independent module.<br />

Through three illustrative types of micro-specializations applied to PostgreSQL,<br />

we show that this approach requires minimal changes to a DBMS and can improve<br />

the performance simultaneously across a wide range of queries, modifications, and<br />

bulk-loading, in terms of storage, CPU usage, and I/O time of the TPC-H and TPC-C<br />

benchmarks.<br />

Towards Multi-Tenant Performance SLOs<br />

Willis Lang (University of Wisconsin-Madison)<br />

Srinath Shankar (Microsoft Jim Gray Systems Lab)<br />

Jignesh M. Patel (University of Wisconsin-Madison)<br />

Ajay Kalhan (Microsoft Corp.)<br />

As traditional and mission-critical relational database workloads migrate to the<br />

cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation<br />

to provide performance goals in Service Level Objectives (SLOs). Providing<br />

such performance goals is challenging for DaaS providers as they must balance the<br />

performance that they can deliver to tenants and the data center’s operating costs.<br />

In general, aggressively aggregating tenants on each server reduces the operating<br />

costs but degrades performance for the tenants, and vice versa. In this paper, we<br />

present a framework that takes as input the tenant workloads, their performance<br />

SLOs, and the server hardware that is available to the DaaS provider, and outputs<br />

a cost-effective recipe that specifies how much hardware to provision and how<br />

to schedule the tenants on each hardware resource. We evaluate our method and<br />

show that it produces effective solutions that can reduce the costs for the DaaS<br />

provider while meeting performance goals.<br />

Multi-Version Concurrency via Timestamp Range Conflict Management<br />

David Lomet (Microsoft Research)<br />

Alan Fekete (University of Sydney)<br />

Rui Wang (Microsoft Research)<br />

Peter Ward (University of Sydney)<br />

A database supporting multiple versions of records may use the versions to support<br />

queries of the past or to increase concurrency by enabling reads and writes to<br />

be concurrent. We introduce a new concurrency control approach that enables all<br />

SQL isolation levels including serializability to utilize multiple versions to increase<br />

concurrency while also supporting transaction time database functionality. The<br />

key insight is to manage a range of possible timestamps for each transaction that<br />

captures the impact of conflicts that have occurred. Using these ranges as constraints<br />

often permits concurrent access where lock based concurrency control<br />

would block. This can also allow blocking instead of some aborts that are common<br />

in earlier multi-version concurrency techniques. Also, timestamp ranges can be<br />

used to conservatively find deadlocks without graph based cycle detection. Thus,<br />

our multi-version support can enhance performance of current time data access via<br />

improved concurrency, while supporting transaction time functionality.<br />

SESSION 16: DATA EXTRACTION AND QUALITY<br />

Automatic Extraction of Structured Web Data with Domain Knowledge<br />

Nora Derouiche (Télécom ParisTech – CNRS LTCI)<br />

Bogdan Cautis (Télécom ParisTech – CNRS LTCI)<br />

Talel Abdessalem (Télécom ParisTech – CNRS LTCI)<br />

We present in this paper a novel approach for extracting structured data from the<br />

Web, whose goal is to harvest real-world items from template-based HTML pages<br />

(the structured Web). It illustrates a two-phase querying of the Web, in which an<br />

intentional description of the data that is targeted is first provided, in a flexible and<br />

widely applicable manner. The extraction process then leverages both the input<br />

description and the source structure. Our approach is domain-independent, in the<br />

sense that it applies to any relation, either flat or nested, describing real-world<br />

items. Extensive experiments on five different domains and comparison with the<br />

main state-of-the-art extraction systems from the literature illustrate its flexibility and<br />

precision. We advocate via our technique that automatic extraction and integration<br />

of complex structured data can be done fast and effectively, when the redundancy<br />

of the Web meets knowledge over the to-be-extracted data.<br />

Discovering Conservation Rules<br />

Lukasz Golab (University of Waterloo)<br />

Howard Karloff (AT&T Labs–Research)<br />

Flip Korn (AT&T Labs–Research)<br />

Barna Saha (AT&T Labs–Research)<br />

Divesh Srivastava (AT&T Labs–Research)<br />

Many applications process data in which there exists a “conservation law” between<br />

related quantities. For example, in traffic monitoring, every incoming event, such as<br />

a packet’s entering a router or a car’s entering an intersection, should ideally have<br />

an immediate outgoing counterpart. We propose a new class of constraints, Conservation<br />

Rules, that express the semantics and characterize the data quality of<br />

such applications. We give confidence metrics that quantify how strongly a conservation<br />

rule holds and present approximation algorithms (with error guarantees) for<br />

the problem of discovering a concise summary of subsets of the data that satisfy a<br />

given conservation rule. Using real data, we demonstrate the utility of conservation<br />

rules and we show order-of-magnitude performance improvements of our discovery<br />

algorithms over naive approaches.<br />
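
To make the idea of a confidence metric concrete, the sketch below scores how closely outgoing events keep up with incoming ones. The metric is a hypothetical stand-in, not one of the paper's definitions: it compares the areas under the two cumulative count curves and assumes outgoing events never run ahead of incoming ones.

```python
from itertools import accumulate


def conservation_confidence(incoming, outgoing):
    """Ratio of cumulative outgoing area to cumulative incoming area.

    incoming and outgoing are per-time-step event counts of equal length.
    A value of 1.0 means every arrival is matched without delay
    (hypothetical metric chosen for illustration).
    """
    cum_in = list(accumulate(incoming))
    cum_out = list(accumulate(outgoing))
    if any(o > i for i, o in zip(cum_in, cum_out)):
        raise ValueError("outgoing events exceed incoming events so far")
    area_in = sum(cum_in)
    if area_in == 0:
        return 1.0
    return sum(cum_out) / area_in


score = conservation_confidence([2, 0, 1], [1, 1, 1])
```

In this example one of the two initial arrivals leaves a step late, so the score drops below 1.0 even though the totals match by the end.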

Answering Why-not Questions on Top-k Queries<br />

Zhian He (Hong Kong Polytechnic University)<br />

Eric Lo (Hong Kong Polytechnic University)<br />

After decades of effort working on database performance, the quality and the<br />

usability of database systems have received more attention in recent years. In<br />

particular, the feature of explaining missing tuples in a query result, or the so-called<br />

“why-not” questions, has recently become an active topic. In this paper, we study<br />

the problem of answering why-not questions on top-k queries. Our motivation is<br />

that we know many users love to use top-k queries when they are making multi-criteria<br />

decisions. However, they often feel frustrated when they are asked to quantify<br />

their feeling as a set of numeric weightings, and feel even more frustrated after they<br />

see the query results do not include their expected answers. In this paper, we use<br />

the query-refinement method to approach the problem. Given as inputs the original<br />

top-k query and a set of missing tuples, our algorithm returns to the user a refined<br />

top-k query that includes the missing tuples. A case study and experimental results<br />

show that our approach returns high quality explanations to users efficiently.<br />
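
A deliberately simplified sketch of query refinement follows. It keeps the user's weights fixed and only enlarges k until the missing tuples appear, which is an assumption made for brevity; how the authors' algorithm chooses the refined query is not specified in the abstract. The names `score`, `rank_of`, and `refine_k` are hypothetical.

```python
def score(tuple_values, weights):
    """Weighted sum used to rank a tuple."""
    return sum(v * w for v, w in zip(tuple_values, weights))


def rank_of(tuples, weights, target):
    """1-based rank of target: one plus the number of strictly better tuples."""
    s = score(target, weights)
    return 1 + sum(1 for t in tuples if score(t, weights) > s)


def refine_k(tuples, weights, k, missing):
    """Smallest k' >= k whose top-k' result contains every missing tuple."""
    return max(k, max(rank_of(tuples, weights, m) for m in missing))


tuples = [(9, 1), (8, 8), (2, 9), (4, 4)]
new_k = refine_k(tuples, (0.5, 0.5), 2, [(4, 4)])
```

With equal weights the tuple `(4, 4)` ranks fourth, so the original top-2 query has to grow to a top-4 query to include it.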

An Efficient Trie-based Method for Approximate Entity Extraction with<br />

Edit-Distance Constraints<br />

Dong Deng (Tsinghua University)<br />

Guoliang Li (Tsinghua University)<br />

Jianhua Feng (Tsinghua University)<br />

Dictionary-based entity extraction has attracted much attention from the database<br />

community recently. It locates substrings in a document that match predefined entities<br />

(e.g., person names or locations). To improve extraction recall, a recent trend is<br />

to provide approximate matching between substrings of the document and entities<br />

by tolerating minor errors. In this paper we study dictionary-based approximate<br />

entity extraction with edit-distance constraints. Existing methods have several<br />

limitations. First, they need to tune many parameters to achieve high performance.<br />

Second, they are inefficient for large edit-distance thresholds. We propose a trie-based<br />

method to address these problems. We first partition each entity into a set of<br />

segments, and then use a trie structure to index segments. To extract similar entities,<br />

we search segments from the document, and extend the matching segments<br />

in both entities and the document to find similar pairs. We develop an extension-based<br />

method to efficiently find similar string pairs by extending the matching<br />

segments. We optimize our partition scheme and select the best partition strategy<br />

to improve the extraction performance. Experimental results show that our method<br />

achieves much higher performance compared with state-of-the-art studies.<br />
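
The partitioning step rests on a pigeonhole argument: if an entity is cut into tau + 1 segments, any string within edit distance tau of it must contain at least one segment unchanged. The sketch below illustrates only this candidate filter. It swaps the trie for plain substring checks, uses an even partition as an assumed strategy, and assumes every entity is longer than tau characters; the function names are hypothetical.

```python
def partition(entity, tau):
    """Split entity into tau + 1 contiguous segments of near-equal length."""
    parts = tau + 1
    base, extra = divmod(len(entity), parts)
    segments, pos = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        segments.append(entity[pos:pos + length])
        pos += length
    return segments


def candidate_entities(document, dictionary, tau):
    """Entities with at least one segment occurring verbatim in the document.

    Candidates still need verification with a real edit-distance check.
    """
    hits = set()
    for entity in dictionary:
        if any(seg and seg in document for seg in partition(entity, tau)):
            hits.add(entity)
    return hits


doc = "papers by surajit chaudri"
found = candidate_entities(doc, ["surajit", "chaudhuri", "venkatesh"], 2)
```

The misspelled "chaudri" still yields "chaudhuri" as a candidate because the segment "cha" survives, while "venkatesh" is pruned without any edit-distance computation.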

SESSION 17: TOP-K PROCESSING<br />

On Top-k Structural Similarity Search<br />

Pei Lee (University of British Columbia)<br />

Laks V.S. Lakshmanan (University of British Columbia)<br />

Jeffrey Xu Yu (Chinese University of Hong Kong)<br />

Search for objects similar to a given query object in a network has numerous applications<br />

including web search and collaborative filtering. We use the notion of<br />

structural similarity to capture the commonality of two objects in a network, e.g.,<br />

if two nodes are referenced by the same node, they may be similar. Meeting-based<br />

methods including SimRank and P-Rank capture structural similarity very well.<br />

Deriving inspiration from PageRank, SimRank has gained popularity by a natural<br />

intuition and domain independence. Since it’s computationally expensive, subsequent<br />

work has focused on optimizing and approximating the computation of<br />

SimRank. In this paper, we approach SimRank from a top-k querying perspective<br />

where given a query node v, we are interested in finding the top-k nodes that have<br />

the highest SimRank score w.r.t. v. The only known approaches for answering such<br />

queries are either a naive algorithm of computing the similarity matrix for all node<br />

pairs or computing the similarity vector by comparing the query node v with each<br />

other node independently, and then picking the top-k. None of these approaches<br />

can handle top-k structural similarity search efficiently by scaling to very large<br />

graphs consisting of millions of nodes. We propose an algorithmic framework called<br />

TopSim based on transforming the top-k SimRank problem on a graph G to one<br />

of finding the top-k nodes with highest authority on the product graph G × G. We<br />

further accelerate TopSim by merging similarity paths and develop a more efficient<br />

algorithm called TopSim-SM. Two heuristic algorithms, Trun-TopSim-SM and<br />

Prio-TopSim-SM, are also proposed to approximate TopSim-SM on scale-free graphs to<br />

trade accuracy for speed, based on truncated random walk and prioritizing propagation<br />

respectively. We analyze the accuracy and performance of TopSim family<br />

algorithms and report the results of a detailed experimental study.<br />
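
The naive baseline that the abstract contrasts against, computing SimRank for all node pairs and then picking the top-k, can be sketched as follows. This is the standard iterative SimRank recurrence, not TopSim. The decay factor 0.8, the iteration count, and the in-neighbor dictionary layout are assumptions for illustration.

```python
def simrank(in_neighbors, c=0.8, iterations=10):
    """Iterate the SimRank recurrence over all node pairs.

    in_neighbors maps each node to the list of nodes pointing to it.
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


def top_k_similar(sim, query, k):
    """Return the k nodes with the highest SimRank score to the query node."""
    scored = [(s, b) for (a, b), s in sim.items() if a == query and b != query]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [b for _, b in scored[:k]]


graph = {"u": [], "a": ["u"], "b": ["u"], "c": []}
scores = simrank(graph)
nearest = top_k_similar(scores, "a", 1)
```

Every iteration touches all node pairs, so the cost grows at least quadratically with the number of nodes, which is why this baseline does not scale to graphs with millions of nodes.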

Relevance Matters: Capitalizing on Less (Top-k Matching in<br />

Publish/Subscribe)<br />

Mohammad Sadoghi (University of Toronto)<br />

Hans-Arno Jacobsen (University of Toronto)<br />

The efficient processing of large collections of Boolean expressions plays a central<br />

role in major data intensive applications ranging from user-centric processing<br />

and personalization to real-time data analysis. Emerging applications such<br />

as computational advertising and selective information dissemination demand<br />

determining and presenting to an end-user only the most relevant content that is<br />

both user-consumable and suitable for limited screen real estate of target devices.<br />

To retrieve the most relevant content, we present BE*-Tree, a novel indexing data<br />

structure designed for effective hierarchical top-k pattern matching, which as its<br />

by-product also reduces the operational cost of processing millions of patterns. To<br />

further reduce processing cost, BE*-Tree employs an adaptive and non-rigid space-cutting<br />

technique designed to efficiently index Boolean expressions over a high-dimensional<br />

continuous space. At the core of BE*-Tree lie two innovative ideas: (1)<br />

a bi-directional tree expansion built as a top-down (data and space clustering) and<br />

a bottom-up (space clustering) growth, which together enable indexing only non-empty<br />

continuous sub-spaces, and (2) an overlap-free splitting strategy. Finally, the<br />

performance of BE*-Tree is proven through a comprehensive experimental comparison<br />

against state-of-the-art index structures for matching Boolean expressions.<br />

Efficiently Monitoring Top-k Pairs over Sliding Windows<br />

Zhitao Shen (UNSW)<br />

Muhammad Aamir Cheema (UNSW)<br />

Xuemin Lin (UNSW & ECNU)<br />

Wenjie Zhang (UNSW)<br />

Haixun Wang (Microsoft Research Asia)<br />

Top-k pairs queries have received significant attention from the research community.<br />

k-closest pairs queries, k-furthest pairs queries and their variants are among the<br />

most well studied special cases of the top-k pairs queries. In this paper, we present<br />

the first approach to answer a broad class of top-k pairs queries over sliding<br />

windows. Our framework handles multiple top-k pairs queries and each query is<br />

allowed to use a different scoring function, a different value of k and a different size<br />

of the sliding window. Although the number of possible pairs in the sliding window<br />

is quadratic to the number of objects N in the sliding window, we efficiently answer<br />

the top-k pairs query by maintaining a small subset of pairs called K-skyband which<br />

is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same<br />

scoring function, we need to maintain only one K-skyband. We present efficient<br />

techniques for the K-skyband maintenance and query answering. We conduct a<br />

detailed complexity analysis and show that the expected cost of our approach is<br />

reasonably close to the lower bound cost. We experimentally verify this by comparing<br />

our approach with a specially designed supreme algorithm that assumes the<br />

existence of an oracle and meets the lower bound cost.<br />

Processing and Notifying Range Top-k Subscriptions<br />

Albert Yu (Duke University)<br />

Pankaj K. Agarwal (Duke University)<br />

Jun Yang (Duke University)<br />

We consider how to support a large number of users over a wide-area network<br />

whose interests are characterised by range top-k continuous queries. Given an<br />

object update, we need to notify users whose top-k results are affected. Simple<br />

solutions include using a content-driven network to notify all users whose interest<br />

ranges contain the update (ignoring top-k), or using a server to compute only the<br />

affected queries and notifying them individually. The former solution generates too<br />

much network traffic, while the latter overwhelms the server. We present a geometric<br />

framework for the problem that allows us to describe the set of affected queries<br />

succinctly with messages that can be efficiently disseminated using content-driven<br />

networks. We give fast algorithms to reformulate each update into a set of messages<br />

whose number is provably optimal, with or without knowing all user interests.<br />

We also present extensions to our solution, including an approximate algorithm that<br />

trades off between the cost of server-side reformulation and that of user-side post-processing,<br />

as well as efficient techniques for batch updates.<br />

SESSION 18: SIMILARITY<br />

Efficient Exact Similarity Searches using Multiple Token Orderings<br />

Jongik Kim (Chonbuk National University)<br />

Hongrae Lee (Google Inc.)<br />

Similarity searches are essential in many applications including data cleaning and near<br />

duplicate detection. Many similarity search algorithms first generate candidate records,<br />

and then identify true matches among them. A major focus of those algorithms has<br />

been on how to reduce the number of candidate records in the early stage of similarity<br />

query processing. One of the most commonly used techniques to reduce the candidate<br />

size is the prefix filtering principle, which exploits the document frequency ordering of<br />

tokens. In this paper, we propose a novel partitioning technique that considers multiple<br />

token orderings based on token co-occurrence statistics. Experimental results show<br />

that the proposed technique is effective in reducing the number of candidate records<br />

and as a result improves the performance of existing algorithms significantly.<br />
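
The prefix filtering principle mentioned in this abstract can be illustrated with a small sketch. It is not the authors' partitioning technique; it only shows the standard single-ordering baseline, assuming Jaccard similarity, records given as token lists, and rare-first document frequency ordering. The function names and the sample records are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prefix(tokens, order, threshold):
    """Prefix of a record under a global token order (rarest tokens first)."""
    ranked = sorted(set(tokens), key=lambda t: order[t])
    n = len(ranked)
    # A match must share at least ceil(threshold * n) tokens with this record,
    # so some shared token falls within the first n - ceil(threshold * n) + 1.
    # The small epsilon guards against floating-point round-up.
    required = math.ceil(threshold * n - 1e-9)
    return ranked[: n - required + 1]


def prefix_filter_join(records, threshold):
    """Return sorted pairs of record ids whose Jaccard similarity >= threshold."""
    df = Counter(t for toks in records.values() for t in set(toks))
    order = {t: i for i, t in enumerate(sorted(df, key=lambda t: (df[t], t)))}
    index = {}          # prefix token -> ids of records seen so far
    candidates = set()
    for rid in sorted(records):
        for tok in prefix(records[rid], order, threshold):
            for other in index.get(tok, []):
                candidates.add((other, rid))
            index.setdefault(tok, []).append(rid)
    # Verification step: only candidates are compared exactly.
    return sorted(
        p for p in candidates
        if jaccard(records[p[0]], records[p[1]]) >= threshold
    )


records = {
    "r1": ["data", "cleaning", "near", "duplicate", "detection"],
    "r2": ["data", "cleaning", "near", "duplicate", "search"],
    "r3": ["graph", "edit", "distance", "join", "search"],
}
print(prefix_filter_join(records, 0.6))
```

In this toy input only the pair sharing a prefix token is ever verified; the third record is never compared against the other two.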

Efficient Graph Similarity Joins with Edit Distance Constraints<br />

Xiang Zhao (The University of New South Wales & NICTA)<br />

Chuan Xiao (The University of New South Wales)<br />

Xuemin Lin (The University of New South Wales & East China Normal University)<br />

Wei Wang (The University of New South Wales)<br />

Graphs are widely used to model complicated data semantics in many applications<br />

in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend<br />

is to tolerate noise arising from various sources, such as erroneous data entry, and<br />

find similarity matches. In this paper, we study the graph similarity join problem that<br />

returns pairs of graphs such that their edit distances are no larger than a threshold.<br />

Inspired by the q-gram idea for the string similarity problem, our solution extracts<br />

paths from graphs as features for indexing. We establish a lower bound of common<br />

features to generate candidates. An efficient algorithm is proposed to exploit both<br />

matching and mismatching features to improve the filtering and verification on candidates.<br />

We demonstrate that the proposed algorithm significantly outperforms existing<br />

approaches with extensive experiments on publicly available datasets.<br />
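
The q-gram idea for strings that this abstract builds on can be sketched as follows. This is the well-known string count filter, not the path-based graph filter of the paper: two strings within edit distance tau must share at least max(|s|, |t|) - q + 1 - q * tau q-grams. The function names are illustrative assumptions.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def common_qgrams(s, t, q):
    """Size of the multiset intersection of the two q-gram profiles."""
    return sum((qgrams(s, q) & qgrams(t, q)).values())


def passes_count_filter(s, t, q, tau):
    """Necessary (not sufficient) condition for edit distance <= tau."""
    lower_bound = max(len(s), len(t)) - q + 1 - q * tau
    return common_qgrams(s, t, q) >= lower_bound


def edit_distance(s, t):
    """Classic dynamic-programming edit distance, used for verification."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def similar(s, t, q, tau):
    """Filter first, then verify only the surviving pair."""
    return passes_count_filter(s, t, q, tau) and edit_distance(s, t) <= tau
```

Pairs that fail the count filter are discarded without running the more expensive verification, which is the same filter-then-verify structure the abstract describes for graphs.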

Parameter-Free Determination of Distance Thresholds for Metric<br />

Distance Constraints<br />

Shaoxu Song (Tsinghua University)<br />

Lei Chen (The Hong Kong University of Science and Technology)<br />

Hong Cheng (The Chinese University of Hong Kong)<br />

The importance of introducing distance constraints to data dependencies, such as<br />

differential dependencies (DDs) [28], has recently been recognized. The metric distance<br />

constraints are tolerant to small variations, which enables them to apply to a wide range of<br />

data quality checking applications, such as detecting data violations. However, the<br />

determination of distance thresholds for the metric distance constraints is non-trivial.<br />

It often relies on a truth data instance which embeds the distance constraints.<br />

To find useful distance threshold patterns from data, there are several guidelines<br />

of statistical measures to specify, e.g., support, confidence and dependent quality.<br />

Unfortunately, given a data instance, users might not have any knowledge about<br />

the data distribution, thus it is very challenging to set the right parameters. In<br />

this paper, we study the determination of distance thresholds for metric distance<br />

constraints, in a parameter-free style. Specifically, we compute an expected utility<br />

based on the statistical measures from the data. According to our analysis as well<br />

as experimental verification, distance threshold patterns with higher expected<br />

utility could offer better usage in real applications, such as violation detection. We<br />

then develop efficient algorithms to determine the distance thresholds having the<br />

maximum expected utility. Finally, our extensive experimental evaluation demonstrates<br />

the effectiveness and efficiency of the proposed methods.<br />

Random Error Reduction in Similarity Search on Time Series:<br />

A Statistical Approach<br />

Wush Chi-Hsuan Wu (Academia Sinica)<br />

Mi-Yen Yeh (Academia Sinica)<br />

Jian Pei (Simon Fraser University)<br />

Errors in measurement can be categorized into two types: systematic errors that<br />

are predictable, and random errors that are inherently unpredictable and have null<br />

expected value. Random error is always present in a measurement. More often<br />

than not, readings in time series may contain inherent random errors due to causes<br />

like dynamic error, drift, noise, hysteresis, digitalization error and limited sampling<br />

frequency. Random errors may affect the quality of time series analysis substantially.<br />

Unfortunately, most of the existing time series mining and analysis methods,<br />

such as similarity search, clustering, and classification tasks, do not address random<br />

errors, possibly because random error in a time series, which can be modeled as<br />

a random variable of unknown distribution, is hard to handle. In this paper, we<br />

tackle this challenging problem. Taking similarity search as an example, which is an<br />

essential task in time series analysis, we develop MISQ, a statistical approach for<br />

random error reduction in time series analysis. The major intuition in our method is<br />

to use only the readings at different time instants in a time series to reduce random<br />

errors. We achieve a highly desirable property in MISQ: it can ensure that the recall<br />

is above a user-specified threshold. An extensive empirical study on 20 benchmark<br />

real data sets clearly shows that our method can lead to better performance than<br />

the baseline method without random error reduction in real applications such as<br />

classification. Moreover, MISQ achieves good quality in similarity search.<br />

SESSION 19: TEXT AND STRINGS<br />

Optimizing Statistical Information Extraction Programs Over<br />

Evolving Text<br />

Fei Chen (HP Labs China)<br />

Xixuan Feng (University of Wisconsin-Madison)<br />

Christopher Re (University of Wisconsin-Madison)<br />

Min Wang (HP Labs China)<br />

Statistical information extraction (IE) programs are increasingly used to build real-world<br />

IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical<br />

IE approaches consider the text corpora underlying the extraction program to be<br />

static. However, many real-world text corpora are dynamic (documents are inserted,<br />

modified, and removed). As the corpus evolves, IE programs must be applied<br />

repeatedly to consecutive corpus snapshots to keep extracted information up to<br />

date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive<br />

snapshots may change very little, but unaware of this, the program must<br />

run again from scratch. In this paper, we present CRFlex, a system that efficiently<br />

executes such repeated statistical IE, by recycling previous IE results to enable incremental<br />

update. We focus on statistical IE programs which use a leading statistical<br />

model, Conditional Random Fields (CRFs). We show how to model properties<br />

of the CRF inference algorithms for incremental update and how to exploit them<br />

to correctly recycle previous inference results. Then we show how to efficiently<br />

capture and store intermediate results of IE programs for subsequent recycling.<br />

We find that there is a tradeoff between the I/O cost spent on reading and writing<br />

intermediate results, and CPU cost we can save from recycling those intermediate<br />

results. Therefore we present a cost-based solution to determine the most efficient<br />

recycling approach for any given CRF-based IE program and an evolving corpus.<br />

We present extensive experiments with CRF-based IE programs for 3 IE tasks over<br />

a real-world data set to demonstrate the utility of our approach.<br />

Approximate String Membership Checking: A Multiple Filter,<br />

Optimization-Based Approach<br />

Chong Sun (University of Wisconsin-Madison)<br />

Jeffrey F. Naughton (University of Wisconsin-Madison)<br />

Siddharth Barman (University of Wisconsin-Madison)<br />

We consider the approximate string membership checking (ASMC) problem of extracting<br />

all the strings or substrings in a document that approximately match some<br />

string in a given dictionary. To solve this problem, the current state-of-the-art approach<br />

involves first applying an approximate, fast filter, then applying a more expensive<br />

exact verification algorithm to the strings that pass the filter. Correspondingly,<br />

many string filters have been proposed. We note that different filters are good at<br />

eliminating different strings, depending on the characteristics of the strings in both<br />

the documents and the dictionary. We suspect that no single filter will dominate all<br />

other filters everywhere. Given an ASMC problem instance and a set of string filters,<br />

we need to select the optimal filter to maximize the performance. Furthermore, in<br />

our experiments we found that in some cases a sequence of filters dominates any<br />

of the filters of the sequence in isolation, and that the best set of filters and their<br />

ordering depend upon the specific problem instance encountered. Accordingly, we<br />

propose that the approximate match problem be viewed as an optimization problem,<br />

and evaluate a number of techniques for solving this optimization problem.<br />

On Text Clustering with Side Information<br />

Charu C. Aggarwal (IBM T. J. Watson Research Center)<br />

Yuchen Zhao (University of Illinois at Chicago)<br />

Philip S. Yu (University of Illinois at Chicago)<br />

Text clustering has become an increasingly important problem in recent years<br />

because of the tremendous amount of unstructured data which is available in various<br />

forms in online forums such as the web, social networks, and other information<br />

networks. In most cases, the data is not purely available in text form. A lot of side-information<br />

is available along with the text documents. Such side-information may be of<br />

different kinds, such as the links in the document, user-access behavior from web logs,<br />

or other non-textual attributes which are embedded into the text document. Such<br />

attributes may contain a tremendous amount of information for clustering purposes.<br />

However, the relative importance of this side-information may be difficult to estimate,<br />

especially when some of the information is noisy. In such cases, it can be risky to<br />

incorporate side-information into the clustering process, because it can either improve<br />

the quality of the representation for clustering, or can add noise to the process. Therefore,<br />

we need a principled way to perform the clustering process, so as to maximize<br />

the advantages from using this side information. In this paper, we design an algorithm<br />

which combines classical partitioning algorithms with probabilistic models in order to<br />

create an effective clustering approach. We present experimental results on a number<br />

of real data sets in order to illustrate the advantages of using such an approach.<br />

Fast SLCA and ELCA Computation for XML Keyword Queries based on<br />

Set Intersection<br />

Junfeng Zhou (Yanshan University)<br />

Zhifeng Bao (National University of Singapore)<br />

Wei Wang (The University of New South Wales)<br />

Tok Wang Ling (National University of Singapore)<br />

Ziyang Chen (Yanshan University)<br />

Xudong Lin (Yanshan University)<br />

Jingfeng Guo (Yanshan University)<br />

In this paper, we focus on efficient keyword query processing for XML data based<br />

on the SLCA and ELCA semantics. We propose a novel form of inverted lists for keywords<br />

which include IDs of nodes that directly or indirectly contain a given keyword.<br />

We propose a family of efficient algorithms that are based on the set intersection operation<br />

for both semantics. We show that the problem of SLCA/ELCA computation<br />

becomes finding a set of nodes that appear in all involved inverted lists and satisfy<br />

certain conditions. We also propose several optimization techniques to further improve<br />

the query processing performance. We have conducted extensive experiments<br />

with many alternative methods. The results demonstrate that our proposed methods<br />

outperform previous methods by up to two orders of magnitude in many cases.<br />
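
A toy sketch of the set-intersection view of SLCA described in this abstract follows. It is an illustration under assumptions, not the paper's algorithms: nodes are identified by Dewey-style tuples (an assumed labeling), each inverted list holds a keyword's nodes plus all their ancestors, and only SLCA is covered, not ELCA.

```python
def inverted_list(direct_nodes):
    """IDs of nodes that directly or indirectly contain a keyword.

    Each node is a Dewey-style tuple such as (1, 2, 1); every non-empty
    prefix of it is an ancestor (or the node itself).
    """
    ids = set()
    for node in direct_nodes:
        for depth in range(1, len(node) + 1):
            ids.add(node[:depth])
    return ids


def slca(keyword_to_nodes, query):
    """Smallest lowest common ancestors of the query keywords."""
    lists = [inverted_list(keyword_to_nodes[k]) for k in query]
    # Nodes appearing in every list have all keywords in their subtree.
    common = set.intersection(*lists)
    # The intersection is closed under ancestors, so a common node has a
    # common descendant exactly when it is the parent of a common node.
    parents = {n[:-1] for n in common if len(n) > 1}
    return sorted(common - parents)


# Hypothetical document: keywords attached to leaf nodes.
occurrences = {
    "xml": [(1, 1, 1), (1, 2, 1)],
    "keyword": [(1, 1, 2), (1, 3, 1)],
}
print(slca(occurrences, ["xml", "keyword"]))
```

The extra condition applied after the intersection (no child in the intersection) plays the role of the "certain conditions" mentioned in the abstract for the SLCA case.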

SESSION 20: QUERY PROCESSING II<br />

Optimization of Massive Pattern Queries by Dynamic<br />

Configuration Morphing<br />

Nikolay Laptev (University of California, Los Angeles)<br />

Carlo Zaniolo (University of California, Los Angeles)<br />

Complex pattern queries play a critical role in many applications that must efficiently<br />

search databases and data streams. Current techniques support the search<br />

for multiple patterns using deterministic or non-deterministic automata. In practice,<br />

however, the static pattern representation does not fully utilize available system<br />

resources, subsequently suffering from poor performance. Therefore a low-overhead<br />

auto-reconfigurable automaton is needed that optimizes pattern matching<br />

performance. In this paper, we propose a dynamic system that entails the efficient<br />

and reliable evaluation of a very large number of pattern queries on a resource-constrained<br />

system under changing stress-load. Our system prototype, Morpheus, precomputes<br />

several query pattern representations, named templates, which are then<br />

morphed into a required form during run-time. Morpheus uses templates to speed<br />

up dynamic automaton reconfiguration. Results from empirical studies confirm the<br />

benefits of our approach, with three orders of magnitude improvement achieved in<br />

the overall pattern matching performance with the help of dynamic reconfiguration.<br />

This is accomplished only with a modest increase in amortized memory usage.<br />

Three-level Processing of Multiple Aggregate Continuous Queries<br />

Shenoda Guirguis (University of Pittsburgh)<br />

Mohamed A. Sharaf (The University of Queensland)<br />

Panos K. Chrysanthis (University of Pittsburgh)<br />

Alexandros Labrinidis (University of Pittsburgh)<br />

Aggregate Continuous Queries (ACQs) are both a very popular class of Continuous<br />

Queries (CQs) and also have a potentially high execution cost. As such, optimizing<br />

the processing of ACQs is imperative for Data Stream Management Systems<br />

(DSMSs) to reach their full potential in supporting (critical) monitoring applications.<br />

For multiple ACQs that vary in window specifications and pre-aggregation filters,<br />

existing multiple ACQs optimization schemes assume a processing model where<br />

each ACQ is computed as a final-aggregation of a sub-aggregation. In this paper,<br />

we propose a novel processing model for ACQs, called TriOps, with the goal of<br />

minimizing the repetition of operator execution at the sub-aggregation level. We<br />

also propose TriWeave, a TriOps-aware multi-query optimizer. We analytically and<br />

experimentally demonstrate the performance gains of our proposed schemes, which<br />

show their superiority over alternative schemes. Finally, we generalize TriWeave to<br />

incorporate the classical subsumption-based multi-query optimization techniques.<br />
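
The existing two-level model this abstract refers to (a final-aggregation over a sub-aggregation) can be sketched roughly as below. This is not TriOps or TriWeave; it assumes a count-based sliding window with a SUM aggregate and chooses the sub-aggregation unit as gcd(range, slide), which is one common choice. All names are hypothetical.

```python
from math import gcd


def sub_aggregate(stream, pane):
    """Level 1: one partial sum per non-overlapping pane of `pane` tuples."""
    return [
        sum(stream[i:i + pane])
        for i in range(0, len(stream) - pane + 1, pane)
    ]


def final_aggregate(partials, pane, rng, slide):
    """Level 2: combine partial sums into one result per window."""
    per_window = rng // pane   # panes covered by one window
    step = slide // pane       # panes between consecutive windows
    return [
        sum(partials[i:i + per_window])
        for i in range(0, len(partials) - per_window + 1, step)
    ]


def windowed_sum(stream, rng, slide):
    """SUM over a sliding window of `rng` tuples advancing by `slide`."""
    pane = gcd(rng, slide)
    return final_aggregate(sub_aggregate(stream, pane), pane, rng, slide)


def naive_windowed_sum(stream, rng, slide):
    """Reference implementation that re-reads every tuple per window."""
    return [
        sum(stream[i:i + rng])
        for i in range(0, len(stream) - rng + 1, slide)
    ]


readings = list(range(1, 13))
print(windowed_sum(readings, rng=6, slide=2))
```

Each input tuple is read once at the sub-aggregation level, and overlapping windows reuse the same partial sums instead of recomputing them.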

Accelerating Range Queries For Brain Simulations<br />

Farhan Tauheed (EPFL)<br />

Laurynas Biveinis (Aalborg University)<br />

Thomas Heinis (EPFL)<br />

Felix Schürmann (EPFL)<br />

Henry Markram (EPFL)<br />

Anastasia Ailamaki (EPFL)<br />

Neuroscientists increasingly use computational tools in building and simulating<br />

models of the brain. The amounts of data involved in these simulations are immense<br />

and efficiently managing this data is key. One particular problem in analyzing this<br />

data is the scalable execution of range queries on spatial models of the brain.<br />

Known indexing approaches do not perform well even on today’s small models<br />

which represent a small fraction of the brain, containing only a few million densely<br />

packed spatial elements. The problem of current approaches is that with the increasing<br />

level of detail in the models, the overlap in the tree structure also increases,<br />

ultimately slowing down query execution. The neuroscientists’ need to work<br />

with bigger and more detailed (denser) models thus motivates us to develop a new<br />

indexing approach. To this end we develop FLAT, a scalable indexing approach for<br />

dense data sets. We base the development of FLAT on the key observation that<br />

current approaches suffer from overlap in case of dense data sets. We hence design<br />

FLAT as an approach with two phases, each independent of density. In the first<br />

phase it uses a traditional spatial index to retrieve an initial object efficiently. In the<br />

second phase it traverses the initial object’s neighborhood to retrieve the remaining<br />

query result. Our experimental results show that FLAT not only outperforms R-Tree<br />

variants by a factor of two to eight but that it also achieves independence<br />

from data set size and density.<br />

Keyword Query Reformulation on Structured Data<br />

Junjie Yao (Peking University)<br />

Bin Cui (Peking University)<br />

Liansheng Hua (Peking University)<br />

Yuxin Huang (Peking University)<br />

Textual web pages dominate web search engines nowadays. However, there is<br />

also a striking increase of structured data on the web. Efficient keyword query<br />

processing on structured data has attracted enough attention, but effective query<br />

understanding has yet to be investigated. In this paper, we focus on the problem of<br />

keyword query reformulation in the structured data scenario. These reformulated<br />

queries provide alternative descriptions of original input. They could better capture<br />

users’ information need and guide users to explore related items in the target<br />

structured data. We propose an automatic keyword query reformulation approach<br />

by exploiting structural semantics in the underlying structured data sources. The<br />

reformulation solution is decomposed into two stages, i.e., offline term relation<br />

extraction and online query generation. We first utilize a heterogeneous graph to<br />

model the words and items in structured data, and design an enhanced Random<br />

Walk approach to extract relevant terms from the graph context. In the online query<br />

reformulation stage, we introduce an efficient probabilistic generation module to<br />

suggest substitutable reformulated queries. Extensive experiments are conducted<br />

on a real-life data set, and our approach yields promising results.<br />

SESSION 21: DATA MINING<br />

Predicting Approximate Protein-DNA Binding Cores Using<br />

Association Rule Mining<br />

Po-Yuen Wong (The Chinese University of Hong Kong)<br />

Tak-Ming Chan (The Chinese University of Hong Kong)<br />

Man-Hon Wong (The Chinese University of Hong Kong)<br />

Kwong-Sak Leung (The Chinese University of Hong Kong)<br />

The studies of protein-DNA bindings between transcription factors (TFs) and transcription<br />

factor binding sites (TFBSs) are important bioinformatics topics. High-resolution<br />

(length490) are shown promising in identifying<br />

accurate binding cores without using any 3D structures. While the current association<br />

rule mining method on this problem addresses exact sequences only, the most<br />

recent ad hoc method for approximation does not establish any formal model and is<br />

limited by experimentally known patterns. As biological mutations are common, it is<br />

desirable to formally extend the exact model into an approximate one. In this paper,<br />

we formalize the problem of mining approximate protein-DNA association rules<br />

from sequence data and propose a novel efficient algorithm to predict protein-DNA<br />

binding cores. Our two-phase algorithm first constructs two compact intermediate<br />

structures called frequent sequence tree (FS-Tree) and frequent sequence class tree<br />

(FSCTree). Approximate association rules are efficiently generated from the structures<br />

and bioinformatics concepts (position weight matrix and information content)<br />

are further employed to prune meaningless rules. Experimental results on real data<br />

show the performance and applicability of the proposed algorithm.<br />
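
The two bioinformatics concepts named in this abstract, position weight matrix and information content, can be sketched as below. This shows only the textbook definitions (per-column base frequencies, and log2 of the alphabet size minus the Shannon entropy, without small-sample correction or pseudocounts), not how the paper uses them for pruning. The sample binding sites are made up.

```python
import math


def position_weight_matrix(sites, alphabet="ACGT"):
    """Per-position base frequencies for equal-length aligned sites."""
    length = len(sites[0])
    matrix = []
    for i in range(length):
        column = [site[i] for site in sites]
        matrix.append({b: column.count(b) / len(sites) for b in alphabet})
    return matrix


def information_content(column):
    """Bits of information in one PWM column.

    Maximum is log2(alphabet size), i.e. 2 bits for DNA; a uniform
    column carries 0 bits.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(len(column)) - entropy


# Hypothetical aligned binding sites of length 2.
sites = ["AA", "AC", "AG", "AT"]
pwm = position_weight_matrix(sites)
for i, col in enumerate(pwm):
    print(i, round(information_content(col), 3))
```

A fully conserved position scores the maximum, while a position where every base is equally likely scores zero, which is why low-information patterns are natural candidates for pruning.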

Upgrading Uncompetitive Products Economically<br />

Hua Lu (Aalborg University)<br />

Christian S. Jensen (Aarhus University)<br />

The skyline of a multidimensional point set consists of the points that are not<br />

dominated by other points. In a scenario where product features are represented by<br />

multidimensional points, the skyline points may be viewed as representing competitive<br />

products. A product provider may wish to upgrade uncompetitive products to<br />

become competitive, but wants to take into account the upgrading cost. We study<br />

the top-k product upgrading problem. Given a set P of competitor products, a set<br />

T of products that are candidates for upgrade, and an upgrading cost function f<br />

that applies to T, the problem is to return the k products in T that can be upgraded<br />

to not be dominated by any products in P at the lowest cost. This problem is nontrivial<br />

due not only to the large data set sizes, but also to the many possibilities for<br />

upgrading a product. We identify and provide solutions for the different options for<br />

upgrading an uncompetitive product, and combine the solutions into a single solution.<br />

We also propose a spatial join-based solution that assumes P and T are indexed<br />

by an R-tree. Given a set of products in the same R-tree node, we derive three lower<br />

bounds on their upgrading costs. These bounds are employed by the join approach<br />

to prune upgrade candidates with uncompetitive upgrade costs. Empirical studies<br />

with synthetic and real data show that the join approach is efficient and scalable.<br />
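
The dominance and skyline notions that open this abstract can be made concrete with a short sketch. It assumes that smaller values are better in every dimension (for example price and weight), which is an assumption not stated in the abstract, and it uses a quadratic scan rather than any indexed method. The upgrading cost function and the R-tree join are not modeled.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and strictly better once.

    Assumes smaller is better in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Points not dominated by any other point (simple O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def is_competitive(product, competitors):
    """A candidate from T is competitive if no product in P dominates it."""
    return not any(dominates(c, product) for c in competitors)


# Hypothetical two-dimensional products, e.g. (price, weight).
P = [(1, 9), (3, 3), (9, 1), (4, 4), (5, 8)]
print(skyline(P))
print(is_competitive((2, 2), P), is_competitive((6, 6), P))
```

An uncompetitive candidate such as the second one would need some of its coordinates lowered until no competitor dominates it, and the abstract's problem is to find the candidates for which that is cheapest.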

Attribute-Based Subsequence Matching and Mining<br />

Yu Peng (The Hong Kong University of Science and Technology)<br />

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)<br />

Liangliang Ye (The Hong Kong University of Science and Technology)<br />

Philip S. Yu (University of Illinois at Chicago)<br />

Sequence analysis is very important in our daily life. Typically, each sequence is<br />

associated with an ordered list of elements. For example, in a movie rental application,<br />

a customer’s movie rental record containing an ordered list of movies is a<br />

sequence example. Most studies about sequence analysis focus on subsequence<br />

matching which finds all sequences stored in the database such that a given query<br />

sequence is a subsequence of each of these sequences. In many applications,<br />

elements are associated with properties or attributes. For example, each movie is<br />

associated with some attributes like “Director” and “Actors”. Unfortunately, to the<br />

best of our knowledge, all existing studies about sequence analysis do not consider<br />

the attributes of elements. In this paper, we propose two problems. The first problem<br />

is: given a query sequence and a set of sequences, considering the attributes of<br />

elements, we want to find all sequences which are matched by this query sequence.<br />

This problem is called attribute-based subsequence matching (ASM). All existing<br />

applications for the traditional subsequence matching problem can also be applied<br />

to our new problem provided that we are given the attributes of elements. We propose<br />

an efficient algorithm for problem ASM. The key idea to the efficiency of this<br />

algorithm is to compress each whole sequence with potentially many associated<br />

attributes into just a triplet of numbers. By dealing with these very compressed representations,<br />

we greatly speed up the attribute-based subsequence matching. The<br />

second problem is to find all frequent attribute-based subsequences. We also adapt<br />

an existing efficient algorithm for this second problem to show we can use the algorithm<br />

developed for the first problem. Empirical studies show that our algorithms<br />

are scalable in large datasets. In particular, our algorithms run at least an order of<br />

magnitude faster than a straightforward method in most cases. This work can stimulate<br />

a number of existing data mining problems which are fundamentally based on<br />

subsequence matching such as sequence classification, frequent sequence mining,<br />

motif detection and sequence matching in bioinformatics.<br />
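
The traditional subsequence matching problem that this abstract starts from can be sketched as follows. The sketch does not model element attributes or the triplet compression, so it is only the baseline problem, with a made-up movie rental database and hypothetical function names.

```python
def is_subsequence(query, sequence):
    """True if the query's elements appear in the sequence in the same order.

    Elements need not be adjacent. The shared iterator is consumed left to
    right, so each query element must be found after the previous one.
    """
    remaining = iter(sequence)
    return all(item in remaining for item in query)


def subsequence_matching(query, database):
    """Ids of all stored sequences that contain the query as a subsequence."""
    return sorted(
        sid for sid, sequence in database.items()
        if is_subsequence(query, sequence)
    )


# Hypothetical rental records: customer id -> ordered list of movies.
rentals = {
    "c1": ["Alien", "Brazil", "Casablanca"],
    "c2": ["Casablanca", "Alien"],
    "c3": ["Alien", "Casablanca", "Dune"],
}
print(subsequence_matching(["Alien", "Casablanca"], rentals))
```

An attribute-based variant would compare attributes such as “Director” rather than the movie identities themselves; how exactly the paper defines that match is not stated in the abstract, so it is left out here.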

Integrating Frequent Pattern Mining from Multiple Data Domains<br />

for Classification<br />

Dhaval Patel (National University of Singapore)<br />

Wynne Hsu (National University of Singapore)<br />

Mong Li Lee (National University of Singapore)<br />

Many frequent pattern mining algorithms have been developed for categorical,<br />

numerical, time series, or interval data. However, little attention has been given to<br />

integrating these algorithms so as to mine frequent patterns from multiple domain<br />

datasets for classification. In this paper, we introduce the notion of a heterogeneous<br />

pattern to capture the associations among different kinds of data. We propose a<br />

unified framework for mining multiple domain datasets and design an iterative algorithm<br />

called HTMiner. HTMiner discovers essential heterogeneous patterns for classification<br />

and performs instance elimination. This instance elimination step reduces<br />

the problem size progressively by removing training instances which are correctly<br />

covered by the discovered essential heterogeneous pattern. Experiments on two real<br />

world datasets show that HTMiner is efficient and can significantly improve the<br />

classification accuracy.<br />

SESSION 22:<br />

SCIENTIFIC DATA, ANALYSIS AND VISUALIZATION<br />

Efficient Versioning for Scientific Array Databases<br />

Adam Seering (MIT CSAIL)<br />

Philippe Cudre-Mauroux (University of Fribourg)<br />

Samuel Madden (MIT CSAIL)<br />

Michael Stonebraker (MIT CSAIL)<br />

In this paper, we describe a versioned database storage manager we are developing<br />

for the SciDB scientific database. The system is designed to efficiently store and<br />

retrieve array-oriented data, exposing a “no-overwrite” storage model in which<br />

each update creates a new “version” of an array. This makes it possible to perform<br />

comparisons of versions produced at different times or by different algorithms, and<br />

to create complex chains and trees of versions. We present algorithms to efficiently<br />

encode these versions, minimizing storage space while still providing efficient access<br />

to the data. Additionally, we present an optimal algorithm that, given a long<br />

sequence of versions, determines which versions to encode in terms of each other<br />

(using delta compression) to minimize total storage space or query execution cost.<br />

We compare the performance of these algorithms on real-world data sets from the<br />

National Oceanic and Atmospheric Administration (NOAA), OpenStreetMaps, and<br />

several other sources. We show that our algorithms provide better performance<br />

than existing version control systems not optimized for array data, both in terms of<br />

storage size and access time, and that our delta-compression algorithms are able to<br />

substantially reduce the total storage space when versions exist with a high degree<br />

of similarity.<br />
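
The delta-compression idea in this abstract can be illustrated with a minimal sketch. It assumes one-dimensional arrays of equal length held as Python lists, and the names `encode_delta` and `decode_delta` are hypothetical; it is not the encoding used by the SciDB storage manager, only an illustration of storing one version in terms of another.

```python
def encode_delta(base, version):
    """Return a sparse delta {index: new_value} for cells that differ from base."""
    if len(base) != len(version):
        raise ValueError("this sketch assumes both versions have the same shape")
    return {i: v for i, (b, v) in enumerate(zip(base, version)) if b != v}


def decode_delta(base, delta):
    """Rebuild a version by applying a sparse delta to a copy of its base."""
    out = list(base)
    for i, v in delta.items():
        out[i] = v
    return out


# Two similar versions: only the changed cells need to be stored for v2.
v1 = [10, 12, 15, 15, 20, 21]
v2 = [10, 12, 16, 15, 20, 25]
delta = encode_delta(v1, v2)
```

When two versions are highly similar the delta holds only a few cells, which is where the storage saving described above would come from; reading the newer version then costs one extra decoding step.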

Multidimensional Analysis of Atypical Events in Cyber-Physical Data<br />

Lu-An Tang (UIUC)<br />

Xiao Yu (UIUC)<br />

Sangkyum Kim (UIUC)<br />

Jiawei Han (UIUC)<br />

Wen-Chih Peng (National Chiao Tung University)<br />

Yizhou Sun (UIUC)<br />

Hector Gonzalez (Google)<br />

Sebastian Seith (Morning Star)<br />

A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras)<br />

with cyber (or informational) components to form a situation-integrated analytical<br />

system that may respond intelligently to dynamic changes in real-world situations.<br />

CPS claims many promising applications, such as traffic observation, battlefield<br />

surveillance and sensor-network-based monitoring. One important research<br />

topic in CPS is atypical event analysis, i.e., retrieving the events from<br />

large amounts of data and analyzing them with spatial, temporal and other multidimensional<br />

information. Many traditional approaches are not feasible for such<br />

analysis since they use numeric measures and cannot describe the complex atypical<br />

events. In this study, we propose a new model of atypical cluster to effectively<br />

represent those events and efficiently retrieve them from massive data. The micro-cluster<br />

is designed to summarize individual events, and the macro-cluster is used<br />

to integrate the information from multiple events. To facilitate scalable, flexible and<br />

online analysis, the concept of significant cluster is defined and a guided clustering<br />

algorithm is proposed to retrieve significant clusters in an efficient manner. We<br />

conduct experiments on real datasets with the size of more than 50 GB; the results<br />

show that the proposed method can provide more accurate information with only<br />

15% to 20% of the time cost of the baselines.<br />

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking<br />

Fabian Keller (Karlsruhe Institute of Technology)<br />

Emmanuel Müller (Karlsruhe Institute of Technology)<br />

Klemens Böhm (Karlsruhe Institute of Technology)<br />

Outlier mining is a major task in data analysis. Outliers are objects that highly deviate<br />

from regular objects in their local neighborhood. Density-based outlier ranking<br />

methods score each object based on its degree of deviation. In many applications,<br />

these ranking methods degenerate to random listings due to low contrast between<br />

outliers and regular objects. Outliers do not show up in the scattered full space;<br />

they are hidden in multiple high contrast subspace projections of the data. Measuring<br />

the contrast of such subspaces for outlier rankings is an open research challenge.<br />

In this work, we propose a novel subspace search method that selects high<br />

contrast subspaces for density-based outlier ranking. It is designed as a pre-processing<br />

step to outlier ranking algorithms. It searches for high contrast subspaces with<br />

a significant amount of conditional dependence among the subspace dimensions.<br />

With our approach, we propose a first measure for the contrast of subspaces. Thus,<br />

we enhance the quality of traditional outlier rankings by computing outlier scores in<br />

high contrast projections only. The evaluation on real and synthetic data shows that<br />

our approach outperforms traditional dimensionality reduction techniques, naive<br />

random projections as well as state-of-the-art subspace search techniques and<br />

provides enhanced quality for outlier ranking.<br />

Extracting, Analyzing and Visualizing Triangle K-Core Motifs<br />

within Networks<br />

Yang Zhang (The Ohio State University)<br />

Srinivasan Parthasarathy (The Ohio State University)<br />

Cliques are topological structures that usually provide important information<br />

for understanding the structure of a graph or network. However, detecting and<br />

extracting cliques efficiently is known to be very hard. In this paper, we define and<br />

introduce the notion of a Triangle K-Core, a simpler topological structure and one<br />

that is more tractable and can moreover be used as a proxy for extracting clique-like<br />

structure from large graphs. Based on this definition we first develop a localized<br />

algorithm for extracting Triangle K-Cores from large graphs. Subsequently we<br />

extend the simple algorithm to accommodate dynamic graphs (where edges can<br />

be dynamically added and deleted). Finally, we extend the basic definition to support<br />

various template pattern cliques with applications to network visualization and<br />

event detection on graphs and networks. Our empirical results reveal the efficiency<br />

and efficacy of the proposed methods on many real-world datasets.<br />

SESSION 23: SIMILARITY SEARCH AND DETECTION<br />

Horizontal Reduction: Instance-Level Dimensionality Reduction for<br />

Similarity Search in Large Document Databases<br />

Min Soo Kim (KAIST)<br />

Kyu-Young Whang (KAIST)<br />

Yang-Sae Moon (Kangwon National University)<br />

Dimensionality reduction is essential in text mining since the dimensionality of text<br />

documents could easily reach several tens of thousands. Most recent efforts on<br />

dimensionality reduction, however, are not adequate for large document databases<br />

due to a lack of scalability. We hence propose a new type of simple but effective<br />

dimensionality reduction, called horizontal (dimensionality) reduction, for large<br />

document databases. Horizontal reduction converts each text document to a few<br />

bitmap vectors and provides tight lower bounds of inter-document distances using<br />

those bitmap vectors. Bitmap representation is very simple and extremely fast, and<br />

its instance-based nature makes it suitable for large and dynamic document databases.<br />

Using the proposed horizontal reduction, we develop an efficient k-nearest<br />

neighbor (k-NN) search algorithm for text mining such as classification and clustering,<br />

and we formally prove its correctness. The proposed algorithm decreases I/O<br />

and CPU overheads simultaneously since horizontal reduction (1) reduces the number<br />

of accesses to documents significantly by exploiting the bitmap-based lower<br />

bounds in filtering dissimilar documents at an early stage, and accordingly, (2)<br />

decreases the number of CPU-intensive computations for obtaining a real distance<br />

between high-dimensional document vectors. Extensive experimental results show<br />

that horizontal reduction improves the performance of the reduction (preprocessing)<br />

process by one to two orders of magnitude compared with existing reduction<br />

techniques, and our k-NN search algorithm significantly outperforms the existing<br />

ones by one to three orders of magnitude.<br />

Adaptive Windows for Duplicate Detection<br />

Uwe Draisbach (Hasso-Plattner-Institute)<br />

Felix Naumann (Hasso-Plattner-Institute)<br />

Sascha Szott (Zuse Institute)<br />

Oliver Wonneberg (R. Lindner GmbH & Co. KG)<br />

Duplicate detection is the task of identifying all groups of records within a data set<br />

that each represent the same real-world entity. This task is difficult because<br />

(i) representations might differ slightly, so some similarity measure must be defined<br />

to compare pairs of records and (ii) data sets might have a high volume, making a<br />

pair-wise comparison of all records infeasible. To tackle the second problem, many<br />

algorithms have been suggested that partition the data set and compare all record<br />

pairs only within each partition. One well-known approach of this kind is the Sorted Neighborhood<br />

Method (SNM), which sorts the data according to some key and then advances<br />

a window over the data comparing only records that appear within the same<br />

window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that<br />

uses a varying window size. It is based on the intuition that there might be regions of<br />

high similarity suggesting a larger window size and regions of lower similarity suggesting<br />

a smaller window size. Next to the basic variant of DCS, we also propose and<br />

thoroughly evaluate a variant called DCS++ which is provably better than the original<br />

SNM in terms of efficiency (same results with fewer comparisons).<br />
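
The fixed-window Sorted Neighborhood Method that this abstract builds on can be sketched as follows. This is only the baseline SNM as described above (sort by a key, compare records inside a sliding window), not DCS or DCS++; the function name `sorted_neighborhood`, the similarity threshold and the sample records are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, similar, window=3):
    """Return index pairs flagged as duplicates by a fixed-window SNM pass.

    Records are sorted by `key`; each record is compared only with the
    `window - 1` records that follow it in sort order.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similar(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def similar(a, b, threshold=0.85):
    """Hypothetical similarity measure: character-level ratio from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


names = ["Jon Smith", "Maria Lopez", "John Smith", "Zoe Quinn", "Maria Lopes"]
dups = sorted_neighborhood(names, key=str.lower, similar=similar, window=2)
```

With `window=2` each record is compared with just one neighbor, so five records need four comparisons instead of the ten a full pair-wise pass would make. A varying window, as the abstract proposes, would change how far the inner loop looks ahead.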

Efficient Dual-Resolution Layer Indexing for Top-k Queries<br />

Jongwuk Lee (Pohang University of Science and Technology (POSTECH))<br />

Hyunsouk Cho (Pohang University of Science and Technology (POSTECH))<br />

Seung-won Hwang (Pohang University of Science and Technology (POSTECH))<br />

Top-k queries have gained considerable attention as an effective means for narrowing<br />

down the overwhelming amount of data. This paper studies the problem<br />

of constructing an indexing structure that efficiently supports top-k queries for<br />

varying scoring functions and retrieval sizes. The existing work can be categorized<br />

into three classes: list-, layer-, and view-based approaches. This paper focuses on<br />

the layer-based approach, pre-materializing tuples into consecutive multiple layers.<br />

The layer-based index enables us to return top-k answers efficiently by restricting<br />

access to tuples in the k layers. However, we observe that the number of tuples<br />

accessed in each layer can be reduced further. For this purpose, we propose a dual-resolution<br />

layer structure. Specifically, we iteratively build coarse-level layers using<br />

skylines, and divide each coarse-level layer into fine-level sublayers using convex<br />

skylines. The dual-resolution layer is able to leverage not only the dominance<br />

relationship between coarse-level layers, named forall-dominance, but also a relaxed<br />

dominance relationship between fine-level sublayers, named exists-dominance. Our<br />

extensive evaluation results demonstrate that our proposed method accesses<br />

significantly fewer tuples than the state-of-the-art methods.<br />
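
The coarse-level construction mentioned in this abstract, iteratively peeling skylines into layers, can be sketched as below. The sketch assumes smaller attribute values are preferred and uses a quadratic dominance check; `dominates` and `skyline_layers` are hypothetical names, and the fine-level convex-skyline sublayers are not covered.

```python
def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better in one.

    Assumes smaller values are better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_layers(points):
    """Peel skylines repeatedly: layer 1 is the skyline of all points,
    layer 2 is the skyline of what remains, and so on."""
    remaining = list(points)
    layers = []
    while remaining:
        layer = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers


points = [(1, 5), (2, 2), (5, 1), (3, 4), (4, 3), (5, 5)]
layers = skyline_layers(points)
# layers[0] holds the points that no other point dominates.
```

Every point in a later layer is dominated by some point in the layer before it, which is why a layer-based index can stop after the first k layers when answering a top-k query with a monotone scoring function.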

Evaluating Probabilistic Queries over Uncertain Matching<br />

Reynold Cheng (The University of Hong Kong)<br />

Jian Gong (The University of Hong Kong)<br />

David W. Cheung (The University of Hong Kong)<br />

Jiefeng Cheng (Shenzhen Institute of Advanced Technology)<br />

A matching between two database schemas, generated by machine learning<br />

techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema<br />

matching has recently raised a lot of research interest, because the quality of applications<br />

relies on the matching result. We study query evaluation over an inexact<br />

schema matching, which is represented as a set of “possible mappings”, as well<br />

as the probabilities that they are correct. Since the number of possible mappings<br />

can be large, evaluating queries through these mappings can be expensive. By<br />

observing the fact that the possible mappings between two schemas often exhibit<br />

a high degree of overlap, we develop two efficient solutions. We also present a fast<br />

algorithm to compute answers with the k highest probabilities. An extensive evaluation<br />

on real schemas shows that our approaches improve the query performance by<br />

almost an order of magnitude.<br />
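
To make the setting concrete, the sketch below shows the straightforward evaluation that the abstract calls expensive: the query is run once per possible mapping and each answer accumulates the probability of the mappings that produce it. The helper names, the single-attribute query and the sample data are all hypothetical; the paper's overlap-based solutions are not reproduced here.

```python
from collections import defaultdict
import heapq


def topk_answers(mappings, evaluate, k):
    """Naive baseline: evaluate the query under every possible mapping and
    sum, per answer, the probabilities of the mappings that return it."""
    prob = defaultdict(float)
    for mapping, p in mappings:
        for answer in evaluate(mapping):
            prob[answer] += p
    return heapq.nlargest(k, prob.items(), key=lambda kv: kv[1])


# Hypothetical source table and three possible mappings for the target
# attribute "phone", with probabilities that sum to 1.
rows = [{"home": "555-1", "office": "555-2", "mobile": "555-1"}]
mappings = [
    ({"phone": "home"}, 0.5),
    ({"phone": "office"}, 0.25),
    ({"phone": "mobile"}, 0.25),
]


def evaluate(mapping):
    """Answer the target query 'select phone' through one mapping."""
    return {row[mapping["phone"]] for row in rows}


best = topk_answers(mappings, evaluate, k=1)
```

The cost of this baseline grows with the number of possible mappings, since the query is re-evaluated for each one even when the mappings agree on most attributes.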

SESSION 24: SENSORS NETWORK AND TRAJECTORY<br />

Detecting Outliers in Sensor Networks using the Geometric Approach<br />

Sabbas Burdakis (Technical University of Crete)<br />

Antonios Deligiannakis (Technical University of Crete)<br />

The topic of outlier detection in sensor networks has received significant attention<br />

in recent years. Detecting when the measurements of a node become “abnormal”<br />

is interesting, because this event may help detect either a malfunctioning node, or a<br />

node that starts observing a local interesting phenomenon (e.g., a fire). In this paper<br />

we present a new algorithm for detecting outliers in sensor networks, based on the<br />

geometric approach. Unlike prior work, our algorithms perform a distributed monitoring<br />

of outlier readings, exhibit 100% accuracy in their monitoring (assuming no<br />

message losses), and require the transmission of messages only at a fraction of the<br />

epochs, thus allowing nodes to safely refrain from transmitting in many epochs. Our<br />

approach is based on transforming common similarity metrics in a way that admits<br />

the application of the recently proposed geometric approach. We then propose<br />

a general framework and suggest multiple modes of operation, which allow each<br />

sensor node to accurately monitor its similarity to other nodes. Our experiments<br />

demonstrate that our algorithms can accurately detect outliers at a fraction of the<br />

communication cost that a centralized approach would require (even in the case<br />

where the central node lies just one hop away from all sensor nodes). Moreover, we<br />

demonstrate that these bandwidth savings become even larger as we incorporate<br />

further optimizations in our proposed modes of operation.<br />

Efficient Threshold Monitoring for Distributed Probabilistic Data<br />

Mingwang Tang (University of Utah)<br />

Feifei Li (University of Utah)<br />

Jeff M. Phillips (University of Utah)<br />

Jeffrey Jestes (University of Utah)<br />

In distributed data management, a primary concern is monitoring the distributed<br />

data and generating an alarm when a user-specified constraint is violated. A particularly<br />

useful instance is the threshold-based constraint, which is commonly known<br />

as the distributed threshold monitoring problem. This work extends this useful and<br />

fundamental study to distributed probabilistic data that emerge in a lot of applications,<br />

where uncertainty naturally exists when massive amounts of data are<br />

produced at multiple sources in distributed, networked locations. Examples include<br />

distributed observing stations, large sensor fields, geographically separate scientific<br />

institutes/units and many more. When dealing with probabilistic data, there<br />

are two thresholds involved, the score and the probability thresholds. One must<br />

monitor both simultaneously; as such, techniques developed for deterministic data<br />

are no longer directly applicable. This work presents a comprehensive study of this<br />

problem. Our algorithms have significantly outperformed the baseline method in<br />

terms of both the communication cost (number of messages and bytes) and the<br />

running time, as shown by an extensive experimental evaluation using several large<br />

real datasets.<br />

Incorporating Duration Information for Trajectory Classification<br />

Dhaval Patel (National University of Singapore)<br />

Chang Sheng (DBS Bank)<br />

Wynne Hsu (National University of Singapore)<br />

Mong Li Lee (National University of Singapore)<br />

Trajectory classification has many useful applications. Existing works on trajectory<br />

classification do not consider the duration information of trajectories. In this<br />

paper, we extract duration-aware features from trajectories to build a classifier. Our<br />

method utilizes information theory to obtain regions where the trajectories have<br />

similar speeds and directions. Further, trajectories are summarized into a network<br />

based on the MDL principle that takes into account the duration difference among<br />

trajectories of different classes. A graph traversal is performed on this trajectory<br />

network to obtain the top-k covering path rules for each trajectory. Based on the<br />

discovered regions and top-k path rules, we build a classifier to predict the class<br />

labels of new trajectories. Experimental results on real-world datasets show that the<br />

proposed duration-aware classifier can obtain higher classification accuracy than<br />

the state-of-the-art trajectory classifier.<br />

Reducing Uncertainty of Low-Sampling-Rate Trajectories<br />

Kai Zheng (The University of Queensland)<br />

Yu Zheng (Microsoft Research Asia)<br />

Xing Xie (Microsoft Research Asia)<br />

Xiaofang Zhou (The University of Queensland)<br />

The increasing availability of GPS-embedded mobile devices has given rise to a new<br />

spectrum of location-based services, which have accumulated a huge collection of<br />

location trajectories. In practice, a large portion of these trajectories have a low sampling rate.<br />

For instance, the time interval between consecutive GPS points of<br />

some trajectories can be several minutes or even hours. With such a low sampling<br />

rate, most details of their movement are lost, which makes them difficult to process<br />

effectively. In this work, we investigate how to reduce the uncertainty in such<br />

trajectories. Specifically, given a low-sampling-rate trajectory, we aim to infer its<br />

possible routes. The methodology adopted in our work is to take full advantage<br />

of the rich information extracted from the historical trajectories. We propose a<br />

systematic solution, History based Route Inference System (HRIS), which covers a<br />

series of novel algorithms that can derive the travel pattern from historical data and<br />

incorporate it into the route inference process. To validate the effectiveness of the<br />

system, we apply our solution to the map-matching problem, which is an important<br />

application scenario of this work, and conduct extensive experiments on a real<br />

taxi trajectory dataset. The experiment results demonstrate that HRIS can achieve<br />

higher accuracy than the existing map-matching algorithms for low-sampling-rate<br />

trajectories.<br />

SESSION 25: ERROR REDUCTION AND DATA SECURITY<br />

Efficient Similarity Search over Encrypted Data<br />

Mehmet Kuzu (The University of Texas at Dallas)<br />

Mohammad Saiful Islam (The University of Texas at Dallas)<br />

Murat Kantarcioglu (The University of Texas at Dallas)<br />

In recent years, due to the appealing features of cloud computing, large amounts<br />

of data have been stored in the cloud. Although cloud-based services offer many<br />

advantages, privacy and security of the sensitive data are a big concern. To mitigate<br />

the concerns, it is desirable to outsource sensitive data in encrypted form. Encrypted<br />

storage protects the data against illegal access, but it complicates some basic,<br />

yet important functionality such as the search on the data. To achieve search over<br />

encrypted data without compromising the privacy, a considerable number of searchable<br />

encryption schemes have been proposed in the literature. However, almost all<br />

of them handle exact query matching but not similarity matching, a crucial requirement<br />

for real-world applications. Although some sophisticated secure multi-party<br />

computation based cryptographic techniques are available for similarity tests, they<br />

are computationally intensive and do not scale for large data sources. In this paper,<br />

we propose an efficient scheme for similarity search over encrypted data. To do so,<br />

we utilize a state-of-the-art algorithm for fast near neighbor search in high dimensional<br />

spaces called locality sensitive hashing. To ensure the confidentiality of the<br />

sensitive data, we provide a rigorous security definition and prove the security of<br />

the proposed scheme under the provided definition. In addition, we provide a real<br />

world application of the proposed scheme and verify the theoretical results with<br />

empirical observations on a real dataset.<br />
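
The locality sensitive hashing building block named in this abstract can be illustrated on plaintext with MinHash, one common LSH family for set similarity. Choosing MinHash, the salted SHA-1 hash functions and the function names are assumptions made for this sketch; it contains no encryption and does not represent the proposed secure scheme.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature of a non-empty token set: for each of `num_hashes`
    salted hash functions, keep the smallest hash value over the tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = {"similarity", "search", "over", "encrypted", "data"}
doc_b = {"similarity", "search", "over", "plain", "data"}
doc_c = {"sensor", "network", "outlier"}

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
close = estimated_jaccard(sig_a, sig_b)   # overlapping sets
far = estimated_jaccard(sig_a, sig_c)     # disjoint sets
```

Similar sets agree on many signature positions while dissimilar sets agree on few, so candidate matches can be found by comparing short signatures instead of the full items.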

Obfuscating the Topical Intention in Enterprise Text Search<br />

HweeHwa Pang (Singapore Management University)<br />

Xiaokui Xiao (Nanyang Technological University)<br />

Jialie Shen (Singapore Management University)<br />

The text search queries in an enterprise can reveal the users’ topics of interest, and<br />

in turn confidential staff or business information. To safeguard the enterprise from<br />

consequences arising from a disclosure of the query traces, it is desirable to obfuscate<br />

the true user intention from the search engine, without requiring it to be reengineered.<br />

In this paper, we advocate a unique approach to profile the topics that<br />

are relevant to the user intention. Based on this approach, we introduce an (ε₁,<br />

ε₂)-privacy model that allows a user to stipulate that topics relevant to her<br />

intention at ε₁ level should appear to any adversary to be innocuous at ε₂<br />

level. We then present a TopPriv algorithm to achieve the customized (ε₁,<br />

ε₂)-privacy requirement of individual users through injecting automatically<br />

formulated fake queries. The advantages of TopPriv over existing techniques are<br />

confirmed through benchmark queries on a real corpus, with experiment settings<br />

fashioned after an enterprise search application.<br />

Correlation Support for Risk Evaluation in Databases<br />

Katrin Eisenreich (SAP Research)<br />

Jochen Adamek (Technische Universität Berlin)<br />

Philipp Rösch (SAP Research)<br />

Volker Markl (Technische Universität Berlin)<br />

Gregor Hackenbroich (SAP Research)<br />

Investigating potential dependencies in data and their effect on future business<br />

developments can help experts to prevent misestimations of risks and chances. This<br />

makes correlation a highly important factor in risk analysis tasks. Previous research<br />

on correlation in uncertain data management addressed foremost the handling of<br />

dependencies between discrete rather than continuous distributions. Also, none of<br />

the existing approaches provides a clear method for extracting correlation structures<br />

from data and introducing assumptions about correlation to independently<br />

represented data. To enable risk analysis under correlation assumptions, we use<br />

an approximation technique based on copula functions. This technique enables<br />

analysts to introduce arbitrary correlation structures between arbitrary distributions<br />

and calculate relevant measures over thus correlated data. The correlation information<br />

can either be extracted at runtime from historic data or be accessed from a<br />

parametrically precomputed structure. We discuss the construction, application and<br />

querying of approximate correlation representations for different analysis tasks. Our<br />

experiments demonstrate the efficiency and accuracy of the proposed approach,<br />

and point out several possibilities for optimization.<br />
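
How a copula function can introduce a correlation structure between arbitrary distributions may be easier to see in a small sampling sketch. It assumes a Gaussian copula with a single correlation parameter `rho` and two made-up marginals (an exponential and a uniform); the function name and all parameters are hypothetical, and the approximate correlation representations discussed in the abstract are not modeled.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_samples(rho, inv_cdf_x, inv_cdf_y, n, seed=0):
    """Draw n pairs whose dependence comes from a Gaussian copula with
    correlation rho and whose marginals are given by inverse CDFs."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Correlated standard normals.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map to uniforms, clamped away from 0 and 1 to keep inverse CDFs finite.
        u1 = min(max(std_normal.cdf(z1), 1e-12), 1.0 - 1e-12)
        u2 = min(max(std_normal.cdf(z2), 1e-12), 1.0 - 1e-12)
        # Push the uniforms through the chosen marginals.
        samples.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return samples


def loss_inv(u):
    """Inverse CDF of an exponential distribution with rate 0.5 (assumed marginal)."""
    return -math.log(1.0 - u) / 0.5


def delay_inv(u):
    """Inverse CDF of a uniform distribution on [0, 10] (assumed marginal)."""
    return 10.0 * u


samples = gaussian_copula_samples(0.8, loss_inv, delay_inv, n=2000)
```

The marginals stay exactly as specified while the copula alone controls how strongly large values of one variable coincide with large values of the other, which is the separation that lets an analyst attach a correlation assumption to independently represented data.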

A Game-Theoretic Approach for High-Assurance of Data Trustworthiness<br />

in Sensor Networks<br />

Hyo-Sang Lim (Purdue University & Computer and Telecommunications Engineering<br />

Division, South Korea)<br />

Gabriel Ghinita (University of Massachusetts at Boston)<br />

Elisa Bertino (Purdue University)<br />

Murat Kantarcioglu (University of Texas at Dallas)<br />

Sensor networks are being increasingly deployed in many application domains<br />

ranging from environment monitoring to supervising critical infrastructure systems<br />

(e.g., the power grid). Due to their ability to continuously collect large amounts of<br />

data, sensor networks represent a key component in decision-making, enabling<br />

timely situation assessment and response. However, sensors deployed in hostile environments<br />

may be subject to attacks by adversaries who intend to inject false data<br />

into the system. In this context, data trustworthiness is an important concern, as<br />

false readings may result in wrong decisions with serious consequences (e.g., large-scale<br />

power outages). To defend against this threat, it is important to establish trust<br />

levels for sensor nodes and adjust node trustworthiness scores to account for malicious<br />

interferences. In this paper, we develop a game-theoretic defense strategy<br />

to protect sensor nodes from attacks and to guarantee a high level of trustworthiness<br />

for sensed data. We use a discrete time model, and we consider that there is a<br />

limited attack budget that bounds the capability of the attacker in each round. The<br />

defense strategy objective is to ensure that sufficient sensor nodes are protected in<br />

each round such that the discrepancy between the value accepted and the truthful<br />

sensed value is below a certain threshold. We model the attack-defense interaction<br />

as a Stackelberg game, and we derive the Nash equilibrium condition that is sufficient<br />

to ensure that the sensed data are truthful within a nominal error bound. We<br />

implement a prototype of the proposed strategy and we show through extensive<br />

experiments that our solution provides an effective and efficient way of protecting<br />

sensor networks from attacks.<br />

Industrial Session 1:<br />

SUPPORT FOR LARGE SCALE DATA ANALYTICS<br />

Exploiting Common Subexpressions for Cloud Query Processing<br />

Yasin N. Silva (Arizona State University)<br />

Per-Ake Larson (Microsoft Research)<br />

Jingren Zhou (Microsoft Corp.)<br />

Many companies now routinely run massive data analysis jobs – expressed in some<br />

scripting language – on large clusters of low-end servers. Many analysis scripts are<br />

complex and contain common subexpressions, that is, intermediate results that are<br />

subsequently joined and aggregated in multiple different ways. Applying conventional<br />

optimization techniques to such scripts will produce plans that execute a<br />

common subexpression multiple times, once for each consumer, which is clearly<br />

wasteful. Moreover, different consumers may have different physical requirements<br />

on the result: one consumer may want it partitioned on a column A and another<br />

one partitioned on column B. To find a truly optimal plan, the optimizer must trade<br />

off such conflicting requirements in a cost-based manner. In this paper we show<br />

how to extend a Cascade-style optimizer to correctly optimize scripts containing<br />

common subexpressions. The approach has been prototyped in SCOPE, Microsoft’s<br />

system for massive data analysis. Experimental analysis of both simple and large<br />

real-world scripts shows that the extended optimizer produces plans with 21 to 57%<br />

lower estimated costs.<br />

Vectorwise: a Vectorized Analytical DBMS<br />

Marcin Zukowski (Actian Netherlands)<br />

Mark van de Wiel (Actian Corp.)<br />

Peter Boncz (CWI)<br />

Vectorwise is a new entrant in the analytical database marketplace whose technology<br />

comes straight from innovations in the database research community in the past<br />

years. The product has since made waves due to its excellent performance in analytical<br />

customer workloads as well as benchmarks. We describe the history of Vectorwise, as<br />

well as its basic architecture and the experiences in turning a technology developed in<br />

an academic context into a commercial-grade product. Finally, we turn our attention to<br />

recent performance results, most notably on the TPC-H benchmark at various sizes.<br />

Scalable and Numerically Stable Descriptive Statistics in SystemML<br />

Yuanyuan Tian (IBM Almaden Research Center)<br />

Shirish Tatikonda (IBM Almaden Research Center)<br />

Berthold Reinwald (IBM Almaden Research Center)<br />

There has been a growing need for applying machine learning (ML) algorithms on<br />

very large datasets. SystemML is a declarative approach to scalable statistical ML.<br />

In SystemML, statistical ML algorithms are expressed as simple scripts in a high-level<br />

language. SystemML then compiles and optimizes the scripts, and eventually<br />

translates them into efficient runtime on MapReduce. As the basis of virtually<br />

every quantitative analysis, descriptive statistics provide powerful tools to explore<br />

data in SystemML. This paper describes our experience in implementing descriptive<br />

statistics in SystemML. In particular, we elaborate on how to overcome the two<br />

major challenges: (1) numerical stability while operating on large datasets in the<br />

distributed setting of MapReduce; (2) efficient implementation of order statistics in<br />

MapReduce.<br />
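
To make the numerical-stability challenge concrete, the Python sketch below shows two well-known techniques: compensated (Kahan) summation, and a mergeable mean/variance summary that can be computed per data partition and combined afterwards. This is an illustration under assumptions, not SystemML code. The abstract does not name the algorithms used, and the function names `kahan_sum`, `partial_stats` and `merge_stats` are hypothetical.

```python
def naive_sum(values):
    """Plain left-to-right floating-point summation, for comparison."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carries the rounding error of each addition."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


def partial_stats(values):
    """One-pass summary (count, mean, sum of squared deviations) of a partition."""
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return n, mean, m2


def merge_stats(a, b):
    """Combine two partition summaries without revisiting the raw data."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return n, mean, m2


# Two "partitions" of values with a large common offset.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
n, mean, m2 = merge_stats(partial_stats(data[:2]), partial_stats(data[2:]))
sample_variance = m2 / (n - 1)
```

The summaries are small and can be merged in any grouping, which maps naturally onto per-partition work followed by a combine step. The textbook formula E[x²] − E[x]² would instead subtract two nearly equal large numbers for data like this.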

Industrial Session 2:<br />

EVOLVING PLATFORMS FOR NEW APPLICATIONS<br />

Earlybird: Real-Time Search at Twitter<br />

Michael Busch (Twitter)<br />

Krishna Gade (Twitter)<br />

Brian Larson (Twitter)<br />

Patrick Lok (Twitter)<br />

Samuel Luckenbill (Twitter)<br />

Jimmy Lin (Twitter)<br />

The web today is increasingly characterized by social and real-time signals, which<br />

we believe represent two frontiers in information retrieval. In this paper, we present<br />

Earlybird, the core retrieval engine that powers Twitter’s real-time search service.<br />

Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval<br />

engines, its index structures differ from those built to support traditional web<br />

search. We describe these differences and present the rationale behind our design.<br />

A key requirement of real-time search is the ability to ingest content rapidly and<br />

make it searchable immediately, while concurrently supporting low-latency, high-throughput<br />

query evaluation. These demands are met with a single-writer, multiple-reader<br />

concurrency model and the targeted use of memory barriers. Earlybird represents<br />

a point in the design space of real-time search engines that has worked well<br />

for Twitter’s needs. By sharing our experiences, we hope to spur additional interest<br />

and innovation in this exciting space.<br />

Data Infrastructure at LinkedIn<br />

LinkedIn Data Infrastructure Team<br />

LinkedIn is among the largest social networking sites in the world. As the company<br />

has grown, our core data sets and request processing requirements have grown as<br />

well. In this paper, we describe a few selected data infrastructure projects at LinkedIn<br />

that have helped us accommodate this increasing scale. Most of those projects<br />

build on existing open source projects and are themselves available as open source.<br />

The projects covered in this paper include: (1) Voldemort: a scalable and fault tolerant<br />

key-value store; (2) Databus: a framework for delivering database changes to<br />

downstream applications; (3) Espresso: a distributed data store that supports flexible<br />

schemas and secondary indexing; (4) Kafka: a scalable and efficient messaging<br />

system for collecting various user activity events and log data.<br />

The Credit Suisse Meta-data Warehouse<br />

Claudio Jossen (Credit Suisse AG)<br />

Lukas Blunschi (ETH Zurich)<br />

Magdalini Mori (Credit Suisse AG)<br />

Donald Kossmann (ETH Zurich)<br />

Kurt Stockinger (Credit Suisse AG)<br />

This paper describes the meta-data warehouse of Credit Suisse, which has been in<br />

production since 2009. Like most other large organizations, Credit Suisse has a complex<br />

application landscape and several data warehouses in order to meet the information<br />

needs of its users. The problem addressed by the meta-data warehouse is to<br />

increase the agility and flexibility of the organization with regard to changes such<br />

as the development of a new business process, a new business analytics report, or<br />

the implementation of a new regulatory requirement. The meta-data warehouse<br />

supports these changes by providing services to search for information items in<br />

the data warehouses and to extract the lineage of information items. One difficulty<br />

in the design of such a meta-data warehouse is that there is no standard or well-known<br />

meta-data model that can be used to support such search services. Instead,<br />

the meta-data structures need to be flexible themselves and evolve with the changing<br />

IT landscape. This paper describes the current data structures and implementation<br />

of the Credit Suisse meta-data warehouse and shows how its services help to<br />

increase the flexibility of the whole organization. A series of example meta-data<br />

structures, use cases, and screenshots are given in order to illustrate the concepts<br />

used and the lessons learned based on feedback from real business and IT users<br />

within Credit Suisse.<br />

Industrial Session 3:<br />

INDEXING, UPDATES AND PROCESSING<br />

Efficient Support of XQuery Update Facility in XML Enabled RDBMS<br />

Zhen Hua Liu (Oracle)<br />

Hui J. Chang (Oracle)<br />

Balasubramanyam Sthanikam (Oracle)<br />

XQuery Update Facility (XQUF), which provides a declarative way of updating<br />

XML, has become a W3C recommendation. The SQL/XML standard, on the other<br />

hand, defines XMLType as a column data type in the RDBMS environment and defines<br />

standard SQL/XML operators, such as XMLQuery(), to embed XQuery for querying<br />

XMLType columns in an RDBMS. Based on this SQL/XML standard, XML-enabled RDBMSs<br />

have become industrial-strength platforms for hosting XML applications in a standards-compliant<br />

way by providing XML storage and query capability. However, support for updating<br />

XML remained proprietary in RDBMSs until XQUF became<br />

a recommendation. XQUF is agnostic of how XML is stored, so propagation<br />

of the actual update to any persistent XML store is beyond the scope of XQUF. In this<br />

paper, we show how XQUF can be incorporated into XMLQuery() to effectively<br />

update XML stored in an XMLType column in an XML-enabled RDBMS,<br />

such as Oracle XMLDB. We present various compile-time and run-time optimisation<br />

techniques to show how XQUF can be efficiently implemented to declaratively<br />

update XML stored in an RDBMS. We present our approaches to optimising XQUF<br />

for common physical XML storage models: the native binary XML storage model and<br />

the relational decomposition XML storage model. Although our study is done using<br />

Oracle XMLDB, all of the presented optimisation techniques are generic to XML<br />

stores that need to support updates of persistent XML and are not specific to the<br />

Oracle XMLDB implementation.<br />

Making Unstructured Data SPARQL Using Semantic Indexing in Oracle<br />

Database<br />

Souripriya Das (Oracle)<br />

Seema Sundara (Oracle)<br />

Matthew Perry (Oracle)<br />

Jagannathan Srinivasan (Oracle)<br />

Jayanta Banerjee (Oracle)<br />

Aravind Yalamanchi (Oracle)<br />

This paper describes the Semantic Indexing feature introduced in Oracle Database<br />

for indexing unstructured text (document) columns. This capability enables searching<br />

for concepts (such as people, places, organizations, and events), in addition to<br />

words or phrases, with further options for sense disambiguation and term expansion<br />

by consulting knowledge captured in OWL/RDF ontologies. The distinguishing<br />

aspects of our approach are: 1) Indexing: Instead of building a traditional inverted<br />

index of (annotated) token and/or named entity occurrences, we extract the entities,<br />

associations, and events present in text column data and store them as RDF<br />

named graphs in the Oracle Database Semantic Store. This base content can be<br />

further augmented with knowledge bases and inferred triples (obtained by applying<br />

domain-specific ontologies and rulebases). 2) Querying: Instead of relying on<br />

proprietary extensions for specifying a search, we allow users to specify a complete<br />

SPARQL query pattern that can capture arbitrarily complex relationships between<br />

query terms. We have implemented this feature by introducing a sem_contains<br />

SQL operator and the associated sem_indextype indexing scheme. The indexing<br />

scheme employs an extensible architecture that supports indexing of unstructured<br />

text using native as well as third party text extraction tools. The paper presents a<br />

model for the semantic index and querying, describes the feature, and outlines its<br />

implementation leveraging Oracle’s native support for RDF/OWL storage, inferencing,<br />

and querying. We also report a study involving use of this feature on a TREC<br />

collection of over 130,000 news articles.<br />

A meta-language for MDX queries in eLog Business Solution<br />

Sonia Bergamaschi (University of Modena and Reggio Emilia)<br />

Matteo Interlandi (University of Modena and Reggio Emilia)<br />

Mario Longo (eBilling S.p.A.)<br />

Laura Po (University of Modena and Reggio Emilia)<br />

Maurizio Vincini (University of Modena and Reggio Emilia)<br />

The adoption of business intelligence technology in industries is growing rapidly.<br />

Business managers are not satisfied with ad hoc and static reports and they ask for<br />

more flexible and easy to use data analysis tools. Recently, application interfaces<br />

that expand the range of operations available to the user, hiding the underlying<br />

complexity, have been developed. The paper presents eLog, a business intelligence<br />

solution designed and developed in collaboration between the database group of<br />

the University of Modena and Reggio Emilia and eBilling, an Italian SME supplier of<br />

solutions for the design, production and automation of documentary processes for<br />

top Italian companies. eLog enables business managers to define OLAP reports by<br />

means of a web interface and to customize analysis indicators adopting a simple<br />

meta-language. The framework translates the user’s reports into MDX queries and<br />

is able to automatically select the data cube suitable for each query. Over 140<br />

medium and large companies have exploited the technological services of eBilling<br />

S.p.A. to manage their document flows. In particular, eLog services have been used<br />

by the major Italian media and telecommunications companies and their foreign<br />

branches, such as Sky, Mediaset, H3G, Tim Brazil etc. The largest customer can provide<br />

up to 30 million mail pieces within 6 months (about 200 GB of data in the relational<br />

DBMS). In a period of 18 months, eLog could have to handle 150 million mail pieces (1<br />

TB of data).<br />

Demo Group 1:<br />

SMIX Live – A Self-Managing Index Infrastructure for Dynamic Workloads<br />

Thomas Kissinger (Dresden University of Technology)<br />

Hannes Voigt (Dresden University of Technology)<br />

Wolfgang Lehner (Dresden University of Technology)<br />

As databases accumulate growing amounts of data at an increasing rate, adaptive<br />

indexing becomes more and more important. At the same time, applications and<br />

their use get more agile and flexible, resulting in less steady and less predictable<br />

workload characteristics. Being inert and coarse-grained, state-of-the-art index tuning<br />

techniques become less useful in such environments. Especially the full-column<br />

indexing paradigm results in a lot of indexed but never queried data and prohibitively<br />

high memory and maintenance costs. In our demonstration, we present Self-Managing<br />

Indexes, a novel, adaptive, fine-grained, autonomous indexing infrastructure.<br />

At its core, our approach builds on a novel access path that automatically collects<br />

useful index information, discards useless index information, and competes with<br />

its kind for resources to host its index information. Compared to existing technologies<br />

for adaptive indexing, we are able to dynamically grow and shrink our indexes,<br />

instead of incrementally enhancing the index granularity. In the demonstration, we<br />

visualize performance and system measures for different scenarios and allow the<br />

user to interactively change several system parameters.<br />

Multi-Query Stream Processing on FPGAs<br />

Mohammad Sadoghi (University of Toronto)<br />

Rija Javed (University of Toronto)<br />

Naif Tarafdar (University of Toronto)<br />

Harsh Singh (University of Toronto)<br />

Rohan Palaniappan (University of Toronto)<br />

Hans-Arno Jacobsen (University of Toronto)<br />

We present an efficient multi-query event stream platform to support query processing<br />

over high-frequency event streams. Our platform is built over reconfigurable<br />

hardware (FPGAs) to achieve line-rate multi-query processing by exploiting<br />

unprecedented degrees of parallelism and potential for pipelining, only available<br />

through custom-built, application-specific and low-level logic design. Moreover, a<br />

multi-query event stream processing engine is at the core of a wide range of applications<br />

including real-time data analytics, algorithmic trading, targeted advertisement,<br />

and (complex) event processing.<br />

EUDEMON: A System for Online Video Frame Copy Detection by Earth<br />

Mover’s Distance<br />

Jia Xu (Northeastern University, China)<br />

Qiushi Bai (Northeastern University, China)<br />

Yu Gu (Northeastern University, China)<br />

Anthony K.H. Tung (National University of Singapore)<br />

Guoren Wang (Northeastern University, China)<br />

Ge Yu (Northeastern University, China)<br />

Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />

The Earth Mover’s Distance, or EMD for short, has been proven to be effective for<br />

content-based image retrieval. However, due to the cubic complexity of EMD computation,<br />

it remains difficult to use EMD in applications with stringent requirements<br />

for efficiency. In this paper, we present our new system, called EUDEMON, which<br />

utilizes new techniques to support fast Online Video Frame Copy Detection based<br />

on the EMD. Given a group of registered frames as queries and a set of targeted<br />

detection videos, EUDEMON is capable of identifying relevant frames from the<br />

video stream in real time. The significant improvement in efficiency mainly relies<br />

on the primal-dual theory in linear programming and well-designed B+ tree filters<br />

for adaptive candidate pruning. Generally speaking, our system includes a variety<br />

of new features crucial to the deployment of EUDEMON in real applications. First,<br />

EUDEMON achieves high throughput even when a large number of queries are registered<br />

in the system. Second, EUDEMON contains a self-optimization component to<br />

automatically enhance the effectiveness of the filters based on the recent content<br />

of the video stream. Finally, EUDEMON provides a user-friendly visualization interface,<br />

named EMD Flow Chart, to help users better understand the alarm from<br />

the perspective of the EMD.<br />

A Dataset Search Engine for the Research Document Corpus<br />

Meiyu Lu (National University of Singapore)<br />

Srinivas Bangalore (AT&T Labs–Research)<br />

Graham Cormode (AT&T Labs–Research)<br />

Marios Hadjieleftheriou (AT&T Labs–Research)<br />

Divesh Srivastava (AT&T Labs–Research)<br />

A key step in validating a proposed idea or system is to evaluate it over a suitable<br />

dataset. However, to date there have been no useful tools for researchers to<br />

understand which datasets have been used for what purpose, or in what prior work.<br />

Instead, they have to manually browse through papers to find the suitable datasets<br />

and their corresponding URLs, which is laborious and inefficient. To better aid the<br />

dataset discovery process, and provide a better understanding of how and where<br />

datasets have been used, we propose a framework to effectively identify datasets<br />

within the scientific corpus. The key technical challenges are identification of datasets,<br />

and discovery of the association between a dataset and the URLs where it<br />

can be accessed. Based on this, we have built a user-friendly web-based search<br />

interface for users to conveniently explore the dataset-paper relationships, and find<br />

relevant datasets and their properties.<br />

AskFuzzy: Attractive Visual Fuzzy Query Builder<br />

Keivan Kianmehr (University of Western Ontario)<br />

Negar Koochakzadeh (University of Calgary)<br />

Reda Alhajj (University of Calgary)<br />

The user-centric query interface is a very common application that allows expressing<br />

both the input and the output using fuzzy terms. This is becoming a need in the<br />

evolving internet-based era where web-based applications are very common and<br />

the number of users accessing structured databases is increasing rapidly. Restricting<br />

the user group to only experts in query coding must be avoided. The AskFuzzy<br />

system has been developed to address this vital issue which has social and industrial<br />

impact. It is an attractive and friendly visual user interface that facilitates<br />

expressing queries using both fuzziness and traditional methods. The fuzziness is<br />

not expressed explicitly inside the database; it is rather absorbed and effectively<br />

handled by an intermediate layer which is cleverly incorporated between the front-end<br />

visual user-interface and the back-end database.<br />

F2DB: The Flash-Forward Database System<br />

Ulrike Fischer (Dresden University of Technology)<br />

Frank Rosenthal (Dresden University of Technology)<br />

Wolfgang Lehner (Dresden University of Technology)<br />

Forecasts are important to decision-making and risk assessment in many domains.<br />

Since current database systems do not provide integrated support for forecasting,<br />

it is usually done outside the database system by specially trained experts using<br />

forecast models. However, integrating model-based forecasting as a first-class<br />

citizen inside a DBMS speeds up the forecasting process by avoiding exporting the<br />

data and by applying database-related optimizations like reusing created forecast<br />

models. It especially allows subsequent processing of forecast results inside the database.<br />

In this demo, we present our prototype F2DB based on PostgreSQL, which<br />

allows for transparent processing of forecast queries. Our system automatically<br />

takes care of model maintenance when the underlying dataset changes. In addition,<br />

we offer optimizations to save maintenance costs and increase accuracy by using<br />

derivation schemes for multidimensional data. Our approach reduces the required<br />

expert knowledge by enabling arbitrary users to apply forecasting in a declarative<br />

way.<br />

Provenance-Based Debugging and Drill-Down in Data-Oriented Workflows<br />

Robert Ikeda (Stanford University)<br />

Junsang Cho (Stanford University)<br />

Charlie Fang (Stanford University)<br />

Semih Salihoglu (Stanford University)<br />

Satoshi Torikai (Stanford University)<br />

Jennifer Widom (Stanford University)<br />

Panda (for Provenance and Data) is a system that supports the creation and execution<br />

of data-oriented workflows, with automatic provenance generation and built-in<br />

provenance tracing operations. Workflows in Panda are arbitrary acyclic graphs<br />

containing both relational (SQL) processing nodes and opaque processing nodes<br />

programmed in Python. For both types of nodes, Panda generates logical provenance<br />

(provenance information stored at the processing-node level) and uses<br />

the generated provenance to support record-level backward tracing and forward<br />

tracing operations. In our demonstration we use Panda to integrate, process, and<br />

analyze actual education data from multiple sources. We specifically demonstrate<br />

how Panda’s provenance generation and tracing capabilities can be very useful for<br />

workflow debugging, and for drilling down on specific results of interest.<br />

Demo Group 2:<br />

M³: Stream Processing on Main-Memory MapReduce<br />

Ahmed M. Aly (Purdue University)<br />

Asmaa Sallam (Purdue University)<br />

Bala M. Gnanasekaran (Purdue University)<br />

Long-Van Nguyen-Dinh (Purdue University)<br />

Walid G. Aref (Purdue University)<br />

Mourad Ouzzani (Qatar Computing Research Institute)<br />

Arif Ghafoor (Purdue University)<br />

The continuous growth of social web applications along with the development of<br />

sensor capabilities in electronic devices is creating countless opportunities to analyze<br />

the enormous amounts of data that are continuously streaming from these applications<br />

and devices. To process large scale data on large scale computing clusters,<br />

MapReduce has been introduced as a framework for parallel computing. However,<br />

most of the current implementations of the MapReduce framework support only<br />

the execution of fixed-input jobs. Such a restriction makes these implementations<br />

inapplicable for most streaming applications, in which queries are continuous in<br />

nature, and input data streams are continuously received at high arrival rates. In<br />

this demonstration, we showcase M³, a prototype implementation of the MapReduce<br />

framework in which continuous queries over streams of data can be efficiently<br />

answered. M³ extends Hadoop, the open source implementation of MapReduce, bypassing<br />

the Hadoop Distributed File System (HDFS) to support main-memory-only<br />

processing. Moreover, M³ supports continuous execution of the Map and Reduce<br />

phases where individual Mappers and Reducers never terminate.<br />

A Deep Embedding of Queries into Ruby<br />

Torsten Grust (University of Tübingen)<br />

Manuel Mayr (University of Tübingen)<br />

We dem<strong>on</strong>strate SWITCH, a deep embedding of relati<strong>on</strong>al queries into Ruby and<br />

Ruby <strong>on</strong> Rails. With SWITCH, there is no syntactic or stylistic difference between<br />

Ruby programs that operate over in-memory array objects or database-resident<br />

tables, even if these programs rely <strong>on</strong> array order or nesting. SWITCH’s built-in<br />

compiler and SQL code generator guarantee to emit few queries, addressing l<strong>on</strong>gstanding<br />

performance problems that trace back to Rails’ ActiveRecord database<br />

binding. “Looks like Ruby, but performs like handcrafted SQL,” is the ideal that<br />

drives the research and development effort behind SWITCH.<br />



Asking the Right Questi<strong>on</strong>s in Crowd <strong>Data</strong> Sourcing<br />

Rubi Boim (Tel-Aviv University)<br />

Ohad Greenshpan (Tel-Aviv University)<br />

Tova Milo (Tel-Aviv University)<br />

Slava Novgorodov (Tel-Aviv University)<br />

Neoklis Polyzotis (University of California, Santa Cruz)<br />

Wang-Chiew Tan (University of California, Santa Cruz)<br />


Crowd-based data sourcing is a new and powerful data procurement paradigm that<br />

engages Web users to collectively c<strong>on</strong>tribute informati<strong>on</strong>. In this work, we target<br />

the problem of gathering data from the crowd in an ec<strong>on</strong>omical and principled<br />

fashi<strong>on</strong>. We present AskIt!, a system that allows interactive data sourcing applicati<strong>on</strong>s<br />

to effectively determine which questi<strong>on</strong>s should be directed to which users<br />

for reducing the uncertainty about the collected data. AskIt! uses a set of novel<br />

algorithms for minimizing the number of probes (questions) required from the<br />

different users. We dem<strong>on</strong>strate the challenge and our soluti<strong>on</strong> in the c<strong>on</strong>text of a<br />

multiple-choice questi<strong>on</strong> game played by the <strong>ICDE</strong>’12 attendees, targeted to gather<br />

informati<strong>on</strong> <strong>on</strong> the c<strong>on</strong>ference’s publicati<strong>on</strong>s, authors and colleagues.<br />

LotusX: A Positi<strong>on</strong>-Aware XML Graphical Search System with<br />

Auto-Completi<strong>on</strong><br />

Chunbin Lin (Renmin University of China)<br />

Jiaheng Lu (Renmin University of China)<br />

Tok Wang Ling (National University of Singapore)<br />

Bogdan Cautis (Télécom ParisTech)<br />

The existing query languages for XML (e.g., XQuery) require professi<strong>on</strong>al programming<br />

skills to be formulated; moreover, such complex query languages burden the<br />

query processing. In additi<strong>on</strong>, when issuing an XML query, users are required to<br />

be familiar with the c<strong>on</strong>tent (including the structural and textual informati<strong>on</strong>) of<br />

the hierarchical XML, which is difficult for common users. The need for designing<br />

user-friendly interfaces to reduce the burden of query formulation is fundamental to<br />

the spread of the XML community. We present a twig-based XML graphical search<br />

system, called LotusX, that provides a graphical interface to simplify query<br />

processing without requiring knowledge of a query language, the data schemas, or the<br />

content of the XML document. The basic idea is that LotusX proposes<br />

“position-aware” and “auto-completion” features to help users create tree-modeled<br />

queries (twig patterns) by providing the possible candidates on the fly.<br />

In addition, complex twig queries (including order-sensitive queries) are supported<br />

in LotusX. Furthermore, a new ranking strategy and a query rewriting soluti<strong>on</strong> are<br />

implemented to rank and rewrite the query effectively.<br />

Efficient Top-k Keyword Search in Graphs with Polynomial Delay<br />

Mehdi Kargar (York University)<br />

Aijun An (York University)<br />

A system for efficient keyword search in graphs is dem<strong>on</strong>strated. The system has<br />

two components: a search through only the nodes containing the input keywords<br />


for a set of nodes that are close to each other and together cover the input keywords<br />

and an explorati<strong>on</strong> for finding how these nodes are related to each other. The<br />

system generates all or top-k answers in polynomial delay. Answers are presented<br />

to the user according to a ranking criteri<strong>on</strong> so that the answers with nodes closer to<br />

each other are presented before the <strong>on</strong>es with nodes farther away from each other.<br />

In addition, the set of answers produced by our system is duplication-free. The<br />

system uses two methods for presenting the final answer to the user. The presentati<strong>on</strong><br />

methods reveal relati<strong>on</strong>ships am<strong>on</strong>g the nodes in an answer through a tree or<br />

a multi-center graph. We will show that each method has its own advantages and<br />

disadvantages. The system is demonstrated using two challenging datasets: the very<br />

large DBLP and the highly cyclic Mondial. Challenges and difficulties in implementing<br />

an efficient keyword search system are also dem<strong>on</strong>strated.<br />

TEDAS: a Twitter-based Event Detecti<strong>on</strong> and Analysis System<br />

Rui Li (University of Illinois at Urbana-Champaign)<br />

Kin Hou Lei (Brigham Young University)<br />

Ravi Khadiwala (University of Illinois at Urbana-Champaign)<br />

Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign)<br />

Witnessing the emergence of Twitter, we propose a Twitter-based Event Detecti<strong>on</strong><br />

and Analysis System (TEDAS), which helps to (1) detect new events, to (2) analyze<br />

the spatial and temporal pattern of an event, and to (3) identify the importance of<br />

events. In this dem<strong>on</strong>strati<strong>on</strong>, we show the overall system architecture, explain in<br />

detail the implementati<strong>on</strong> of the comp<strong>on</strong>ents that crawl, classify, and rank tweets<br />

and extract locati<strong>on</strong> from tweets, and present some interesting results of our system.<br />

AutoDict: Automated Dicti<strong>on</strong>ary Discovery<br />

Fei Chiang (University of Toronto)<br />

Periklis Andritsos (University of Tor<strong>on</strong>to)<br />

Erkang Zhu (University of Tor<strong>on</strong>to)<br />

Renee J. Miller (University of Toronto)<br />

An attribute dicti<strong>on</strong>ary is a set of attributes together with a set of comm<strong>on</strong> values<br />

of each attribute. Such dicti<strong>on</strong>aries are valuable in understanding unstructured<br />

or loosely structured textual descripti<strong>on</strong>s of entity collecti<strong>on</strong>s, such as product<br />

catalogs. Dicti<strong>on</strong>aries provide the supervised data for learning product or entity<br />

descripti<strong>on</strong>s. In this dem<strong>on</strong>strati<strong>on</strong>, we will present AutoDict, a system that analyzes<br />

input data records and discovers high-quality dictionaries using<br />

information-theoretic techniques. To the best of our knowledge, AutoDict is the first end-to-end<br />

system for building attribute dicti<strong>on</strong>aries. Our dem<strong>on</strong>strati<strong>on</strong> will showcase the<br />

different informati<strong>on</strong> analysis and extracti<strong>on</strong> features within AutoDict, and highlight<br />

the process of generating high quality attribute dicti<strong>on</strong>aries.<br />



Demo Group 3:<br />


Trust & Share: Trusted Informati<strong>on</strong> Sharing in Online Social Networks<br />

Barbara Carminati (University of Insubria)<br />

Elena Ferrari (University of Insubria)<br />

Jacopo Girardi (University of Insubria)<br />

Trust & Share (T&S) aims at providing relati<strong>on</strong>ship-based access c<strong>on</strong>trol in the<br />

Facebook realm. T&S is a third-party Facebook applicati<strong>on</strong>, designed to support a<br />

flexible and controlled sharing of user data. It enables users to upload resources<br />

(i.e., any file) and specify for each of them which users have to be authorized by<br />

T&S to access them. To enforce this c<strong>on</strong>trolled informati<strong>on</strong> sharing, T&S relies <strong>on</strong><br />

the OSN access c<strong>on</strong>trol model proposed in \cite{tissec}, where social network relati<strong>on</strong>ships<br />

have richer semantics than the contacts in Facebook. According to<br />

\cite{tissec}, OSN users associate with each of their c<strong>on</strong>tacts a type, representing<br />

the nature of the relati<strong>on</strong>ship (e.g., friends, colleagues, parents). Moreover, the creator<br />

of the relationship can also assign it a trust level to represent the strength<br />

of the c<strong>on</strong>necti<strong>on</strong>. This graph enables users to specify more expressive rules for<br />

the c<strong>on</strong>trolled informati<strong>on</strong> sharing. Indeed, <strong>on</strong> top of this enhanced social graph,<br />

T&S users can specify access c<strong>on</strong>straints <strong>on</strong> the type, trust level and depth of the<br />

relationship that must exist with a given Facebook contact in order to access a certain<br />

resource.<br />

Evaluati<strong>on</strong> of Clusterings – Metrics and Visual Support<br />

Elke Achtert (Ludwig-Maximilians-Universität München)<br />

Sascha Goldhofer (Ludwig-Maximilians-Universität München)<br />

Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)<br />

Erich Schubert (Ludwig-Maximilians-Universität München)<br />

Arthur Zimek (Ludwig-Maximilians-Universität München)<br />

When comparing clustering results, any evaluati<strong>on</strong> metric breaks down the available<br />

information to a single number. However, many evaluation metrics are available<br />

that are neither always concordant nor easily interpretable in judging the agreement of<br />

a pair of clusterings. Here, we provide a tool to visually support the assessment of<br />

clustering results when comparing multiple clusterings. Along the way, the suitability of<br />

a couple of clustering comparis<strong>on</strong> measures can be judged in different scenarios.<br />

Hort<strong>on</strong>: Online Query Executi<strong>on</strong> Engine For Large Distributed Graphs<br />

Mohamed Sarwat (University of Minnesota)<br />

Sameh Elnikety (Microsoft Research)<br />

Yuxiong He (Microsoft Research)<br />

Gabriel Kliot (Microsoft Research)<br />

Graphs are used in many large-scale applicati<strong>on</strong>s, such as social networking. The<br />

management of these graphs poses new challenges as such graphs are too large<br />

for a single server to manage efficiently. Current distributed techniques such as<br />

map-reduce and Pregel are not well-suited to processing interactive ad-hoc queries<br />

against large graphs. In this paper we demonstrate Horton, a distributed interactive<br />

query execution engine for large graphs. Horton defines a query language that<br />

allows the expressi<strong>on</strong> of regular language reachability queries and provides a query<br />

executi<strong>on</strong> engine with a query optimizer that allows interactive executi<strong>on</strong> of queries<br />

<strong>on</strong> large distributed graphs in parallel. In the demo, we show the functi<strong>on</strong>ality of<br />

Hort<strong>on</strong> managing a large graph for a social networking applicati<strong>on</strong> called Codebook,<br />

whose graph represents data <strong>on</strong> software comp<strong>on</strong>ents, developers, development<br />

artifacts such as bug reports, and their interacti<strong>on</strong>s in large software projects.<br />
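As a rough illustration of what a regular language reachability query computes, the sketch below runs a breadth-first search over pairs of (graph node, automaton state) on a tiny in-memory graph. It is not Horton's query language or execution engine: the function name `regular_reachability`, the edge labels, and the sample data are all hypothetical, and the search is single-machine rather than distributed.

```python
from collections import deque


def regular_reachability(edges, dfa, start_state, accepting, source):
    """Return nodes reachable from `source` along a path whose
    sequence of edge labels is accepted by the given DFA.

    edges: {node: [(label, neighbor), ...]}   (hypothetical layout)
    dfa:   {(state, label): next_state}
    """
    seen = {(source, start_state)}
    queue = deque(seen)
    result = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            result.add(node)
        for label, neighbor in edges.get(node, []):
            next_state = dfa.get((state, label))
            if next_state is None:
                continue  # this label is not allowed in this state
            pair = (neighbor, next_state)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return result


# Hypothetical Codebook-like data: developers write files, files call files.
edges = {
    "alice": [("wrote", "file1")],
    "bob": [("wrote", "file3")],
    "file1": [("calls", "file2")],
    "file2": [("calls", "file3")],
}
# DFA for the pattern: wrote calls*
dfa = {(0, "wrote"): 1, (1, "calls"): 1}

print(sorted(regular_reachability(edges, dfa, 0, {1}, "alice")))
# prints ['file1', 'file2', 'file3']
```

Tracking the automaton state alongside the node keeps the search finite even on cyclic graphs, because each (node, state) pair is visited at most once.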

MXQuery With Hardware Accelerati<strong>on</strong><br />

Jens Teubner (ETH Zurich)<br />

Peter Fischer (University of Freiburg)<br />

We dem<strong>on</strong>strate MXQuery/H, a modified versi<strong>on</strong> of MXQuery that uses hardware<br />

accelerati<strong>on</strong> to speed up XML processing. The main goal of this dem<strong>on</strong>strati<strong>on</strong> is to<br />

give an interactive example of hardware/software co-design and show how system<br />

performance and energy efficiency can be improved by off-loading tasks to FPGA<br />

hardware. To this end, we equipped MXQuery/H with various hooks to inspect the<br />

different parts of the system. In addition, our system can finally leverage the<br />

idea of XML projection. Though the idea of projection has been around for a while,<br />

its effectiveness has always remained limited because of the unavoidable and high parsing<br />

overhead. By performing the task in hardware, we relieve the software part from<br />

this overhead and achieve processing speed-ups of several factors.<br />

Data³ – A Kinect Interface for OLAP using Complex Event Processing<br />

Steffen Hirte (Ilmenau University of Technology)<br />

Andreas Seifert (Ilmenau University of Technology)<br />

Stephan Baumann (Ilmenau University of Technology)<br />

Daniel Klan (Ilmenau University of Technology)<br />

Kai-Uwe Sattler (Ilmenau University of Technology)<br />

Moti<strong>on</strong> sensing input devices like Microsoft’s Kinect offer an alternative to traditi<strong>on</strong>al<br />

computer input devices like keyboards and mice. New applications using<br />

this interface appear daily. Most of them implement their own gesture detection. In our<br />

dem<strong>on</strong>strati<strong>on</strong> we show a new approach using the data stream engine AnduIN. The<br />

gesture detecti<strong>on</strong> is d<strong>on</strong>e based <strong>on</strong> AnduIN’s complex event processing functi<strong>on</strong>ality.<br />

This way we build a system that allows users to define new and complex gestures on<br />

the basis of a declarative programming interface. On this basis, our demonstration<br />

Data³ provides a basic natural-interaction OLAP interface for a sample star schema<br />

database using Microsoft’s Kinect.<br />

Analyzing Query Optimizati<strong>on</strong> Process: Portraits of Join<br />

Enumerati<strong>on</strong> Algorithms<br />

Anisoara Nica (Sybase, an SAP Company)<br />

Ian Charlesworth (University of Waterloo)<br />

Maysum Panju (University of Waterloo)<br />

Search spaces generated by query optimizers during the optimizati<strong>on</strong> process<br />

encapsulate characteristics of the join enumerati<strong>on</strong> algorithms, the cost models, as<br />


well as critical decisi<strong>on</strong>s made for pruning and choosing the best plan. We dem<strong>on</strong>strate<br />

the JoinEnumerationViewer, a tool designed for visualizing, mining,<br />

and comparing plan search spaces generated by different join enumeration algorithms<br />

when optimizing the same SQL statement. We have enhanced the Sybase SQL Anywhere<br />

relational database management system to log, in a very compact format,<br />

its search space during an optimization process. Such an optimization log can then<br />

be analyzed by the JoinEnumerationViewer, which internally builds the logical and<br />

physical plan graphs representing complete and partial plans c<strong>on</strong>sidered during the<br />

optimizati<strong>on</strong> process. The optimizati<strong>on</strong> logs also c<strong>on</strong>tain statistics of the resource<br />

c<strong>on</strong>sumpti<strong>on</strong> during the query optimizati<strong>on</strong> such as optimizati<strong>on</strong> time breakdown,<br />

for example, for logical join enumerati<strong>on</strong> versus costing physical plans, and memory<br />

allocati<strong>on</strong> for different optimizati<strong>on</strong> structures. The SQL Anywhere Optimizer<br />

implements a highly adaptable, self-managing, search space generati<strong>on</strong> algorithm<br />

by having several join enumerati<strong>on</strong> algorithms to choose from, each enhanced with<br />

different ordering and pruning techniques. The emphasis of the dem<strong>on</strong>strati<strong>on</strong> will<br />

be <strong>on</strong> comparing and c<strong>on</strong>trasting these join enumerati<strong>on</strong> algorithms by analyzing<br />

their optimizati<strong>on</strong> logs. The dem<strong>on</strong>strati<strong>on</strong> scenarios will include optimizing<br />

SQL statements under various c<strong>on</strong>diti<strong>on</strong>s which will exercise different algorithms,<br />

pruning and ordering techniques. These search spaces will then be visualized and<br />

compared using the JoinEnumerati<strong>on</strong>Viewer.<br />

DPCube: Releasing Differentially Private <strong>Data</strong> Cubes for Health Informati<strong>on</strong><br />

Yonghui Xiao (Emory University)<br />

James Gardner (Digital Reasoning Systems Inc.)<br />

Li Xi<strong>on</strong>g (Emory University)<br />

We propose to demonstrate DPCube, a component in our Health Information DE-identification<br />

(HIDE) framework, for releasing differentially private data cubes (or<br />

multidimensi<strong>on</strong>al histograms) for sensitive data. HIDE is a framework we developed<br />

for integrating heterogeneous structured and unstructured health information; it<br />

provides methods for privacy-preserving data publishing. The DPCube component<br />

provides the differentially private multidimensi<strong>on</strong>al data cube release. The DPCube<br />

algorithm uses the differentially private access mechanisms as provided by HIDE<br />

and guarantees differential privacy for the released data. It utilizes an innovative<br />

two-step multidimensi<strong>on</strong>al partiti<strong>on</strong>ing technique to publish a generalized data<br />

cube or multi-dimensional histogram that achieves good utility while satisfying the<br />

privacy requirement. We dem<strong>on</strong>strate that the released data cubes can serve as a<br />

sanitized synopsis of the raw database and, together with an opti<strong>on</strong>al synthesized<br />

dataset based <strong>on</strong> the data cubes, can support various Online Analytical Processing<br />

(OLAP) queries and learning tasks.<br />
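For readers unfamiliar with differentially private histograms, the sketch below shows the textbook Laplace mechanism applied to the cells of a small data cube. It is not the DPCube two-step partitioning algorithm; the function names, the cell keys, and the assumption that each record falls into exactly one cell (so each count has sensitivity 1) are illustrative only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(counts, epsilon, rng=None):
    """Add Laplace noise to every cell count.

    Assumes each record contributes to exactly one cell, so the
    L1 sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {cell: count + laplace_noise(scale, rng)
            for cell, count in counts.items()}


# Hypothetical two-dimensional cube: (age group, diagnosis) -> count.
cube = {("30-39", "flu"): 12, ("40-49", "flu"): 7, ("40-49", "asthma"): 3}
released = private_histogram(cube, epsilon=0.5, rng=random.Random(42))
for cell, value in released.items():
    print(cell, round(value, 2))
```

A smaller `epsilon` gives stronger privacy and noisier counts; partitioning techniques such as the one the abstract describes aim to reduce that loss of utility.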


Demo Group 4:<br />

Nyaya: a System Supporting the Uniform Management of Large Sets of<br />

Semantic <strong>Data</strong><br />

Roberto De Virgilio (Università Roma Tre)<br />

Giorgio Orsi (University of Oxford)<br />

Letizia Tanca (Politecnico di Milano)<br />

Riccardo Torlone (Università Roma Tre)<br />

We present Nyaya, a flexible system for the management of large-scale semantic<br />

data which couples a general-purpose storage mechanism with efficient <strong>on</strong>tological<br />

query answering. Nyaya rapidly imports semantic data expressed in different<br />

formalisms into semantic data kiosks. Each kiosk exposes the native <strong>on</strong>tological<br />

constraints in a uniform fashion using Datalog±, a very general rule-based language<br />

for the representati<strong>on</strong> of <strong>on</strong>tological c<strong>on</strong>straints. A group of kiosks forms a semantic<br />

data market where the data in each kiosk can be uniformly accessed using c<strong>on</strong>junctive<br />

queries and where users can specify user-defined c<strong>on</strong>straints over the data.<br />

Nyaya is easily extensible and robust to updates of both data and meta-data in the<br />

kiosk and can readily adapt to different logical organizati<strong>on</strong>s of the persistent storage.<br />

In the dem<strong>on</strong>strati<strong>on</strong>, we will show the capabilities of Nyaya over real-world<br />

case studies and dem<strong>on</strong>strate its efficiency over well-known benchmarks.<br />

R2DB: A System for Querying and Visualizing Weighted RDF Graphs<br />

S<strong>on</strong>gling Liu (Ariz<strong>on</strong>a State University)<br />

Juan P. Cedeno (Arizona State University)<br />

K. Selcuk Candan (Arizona State University)<br />

Maria Luisa Sapino (University of Torino)<br />

Shengyu Huang (Ariz<strong>on</strong>a State University)<br />

Xinsheng Li (Ariz<strong>on</strong>a State University)<br />

Existing RDF query languages and RDF stores fail to support a large class of<br />

knowledge applications which associate utilities or costs with the available knowledge<br />

statements. A recent proposal includes (a) a ranked RDF (R2DF) specification<br />

to enhance RDF triples with application-specific weights and (b) a SPARankQL<br />

query language specificati<strong>on</strong>, which provides novel primitives <strong>on</strong> top of the<br />

SPARQL language to express top-k queries using traditi<strong>on</strong>al query patterns as well<br />

as novel flexible path predicates. We introduce and dem<strong>on</strong>strate R2DB, a database<br />

system for querying weighted RDF graphs. R2DB relies <strong>on</strong> the AR2Q query processing<br />

engine, which leverages novel index structures to support efficient ranked<br />

path search and includes query optimizati<strong>on</strong> strategies based <strong>on</strong> proximity and<br />

sub-result inter-arrival times. In additi<strong>on</strong> to being the first data management system<br />

for the R2DF data model, R2DB also provides an innovative features-of-interest<br />

(FoI) based method for visualizing large sets of query results (i.e., subgraphs of the<br />

data graph).<br />



Project Dayt<strong>on</strong>a: <strong>Data</strong> Analytics as a Cloud Service<br />

Roger Barga (Microsoft)<br />

Jaliya Ekanayake (Microsoft Research)<br />

Wei Lu (Microsoft Research)<br />


Spreadsheets are established data collecti<strong>on</strong> and analysis tools in business, technical<br />

computing and academic research. Excel, for example, offers an attractive<br />

user interface, provides an easy to use data entry model, and offers substantial<br />

interactivity for what-if analysis. However, spreadsheets and other comm<strong>on</strong> client<br />

applications do not offer scalable computation for large-scale data analytics and<br />

exploration. Increasingly, researchers in domains ranging from the social sciences<br />

to envir<strong>on</strong>mental sciences are faced with a deluge of data, often sitting in spreadsheets<br />

such as Excel or other client applicati<strong>on</strong>s, and they lack a c<strong>on</strong>venient way to<br />

explore the data, to find related data sets, or to invoke scalable analytical models<br />

over the data. To address these limitati<strong>on</strong>s, we have developed a cloud data analytics<br />

service based <strong>on</strong> Dayt<strong>on</strong>a, which is an iterative MapReduce runtime optimized<br />

for data analytics. In our model, Excel and other existing client applicati<strong>on</strong>s provide<br />

the data entry and user interacti<strong>on</strong> surfaces, Dayt<strong>on</strong>a provides a scalable runtime<br />

<strong>on</strong> the cloud for data analytics, and our service seamlessly bridges the gap between<br />

the client and cloud. Any analyst can use our data analytics service to discover<br />

and import data from the cloud, invoke cloud-scale data analytics algorithms<br />

to extract informati<strong>on</strong> from large datasets, invoke data visualizati<strong>on</strong>, and then store<br />

the data back to the cloud, all through a spreadsheet or other client application they<br />

are already familiar with.<br />

Interactive User Feedback in Ontology Matching Using Signature Vectors<br />

Isabel F. Cruz (University of Illinois at Chicago)<br />

Cosmin Stroe (University of Illinois at Chicago)<br />

Matteo Palm<strong>on</strong>ari (University of Milano-Bicocca)<br />

When compared to a gold standard, the set of mappings that are generated by an<br />

automatic ontology matching process is not complete, nor are the individual<br />

mappings always correct. However, given the explosi<strong>on</strong> in the number, size, and<br />

complexity of available <strong>on</strong>tologies, domain experts no l<strong>on</strong>ger have the capability<br />

to create <strong>on</strong>tology mappings without c<strong>on</strong>siderable effort. We present a soluti<strong>on</strong><br />

to this problem that c<strong>on</strong>sists of making the <strong>on</strong>tology matching process interactive<br />

so as to incorporate user feedback in the loop. Our approach clusters mappings to<br />

identify where user feedback will be most beneficial in reducing the number of user<br />

interacti<strong>on</strong>s and system iterati<strong>on</strong>s. This feedback process has been implemented<br />

in the AgreementMaker system and is supported by visual analytic techniques that<br />

help users to better understand the matching process. Experimental results using<br />

the OAEI benchmarks show the effectiveness of our approach. We will dem<strong>on</strong>strate<br />

how users can interact with the <strong>on</strong>tology matching process through the AgreementMaker<br />

user interface to match real-world <strong>on</strong>tologies.<br />


DObjects+: Enabling Privacy-Preserving <strong>Data</strong> Federati<strong>on</strong> Services<br />

Pawel Jurczyk (Google Inc.)<br />

Li Xi<strong>on</strong>g (Emory University)<br />

Slawomir Goryczka (Emory University)<br />

The emergence of cloud computing implies and facilitates managing large collecti<strong>on</strong>s<br />

of highly distributed, aut<strong>on</strong>omous, and possibly private databases. While<br />

there is an increasing need for services that allow integrati<strong>on</strong> and sharing of various<br />

data repositories, it remains a challenge to ensure the privacy, interoperability, and<br />

scalability for such services. In this paper we dem<strong>on</strong>strate a scalable and extensible<br />

framework that aims to enable privacy-preserving data federations. The framework<br />

is built <strong>on</strong> top of a distributed mediator-wrapper architecture where nodes<br />

can form collaborative groups for secure an<strong>on</strong>ymizati<strong>on</strong> and secure query processing<br />

when private data need to be accessed. New an<strong>on</strong>ymizati<strong>on</strong> models and protocols<br />

will be dem<strong>on</strong>strated that counter potential attacks in the distributed setting.<br />

DRAGOON: An Informati<strong>on</strong> Accountability System for<br />

High-Performance <strong>Data</strong>bases<br />

Kyriacos E. Pavlou (The University of Ariz<strong>on</strong>a)<br />

Richard T. Snodgrass (The University of Arizona)<br />

Regulati<strong>on</strong>s and societal expectati<strong>on</strong>s have recently emphasized the need to mediate<br />

access to valuable databases, even access by insiders. Fraud occurs when a<br />

pers<strong>on</strong>, often an insider, tries to hide illegal activity. Companies would like to be<br />

assured that such tampering has not occurred, or if it has, that it will be quickly<br />

discovered and used to identify the perpetrator. At <strong>on</strong>e end of the compliance spectrum<br />

lies the approach of restricting access to information and at the other that of<br />

informati<strong>on</strong> accountability. We focus <strong>on</strong> effecting informati<strong>on</strong> accountability of data<br />

stored in high-performance databases. The dem<strong>on</strong>strated work ensures appropriate<br />

use and thus end-to-end accountability of database informati<strong>on</strong> via a c<strong>on</strong>tinuous<br />

assurance technology based <strong>on</strong> cryptographic hashing techniques. A prototype<br />

tamper detecti<strong>on</strong> and forensic analysis system named DRAGOON was designed and<br />

implemented to determine when tampering(s) occurred and what data were tampered<br />

with. DRAGOON is scalable, customizable, and intuitive. This work will show<br />

that informati<strong>on</strong> accountability is a viable alternative to informati<strong>on</strong> restricti<strong>on</strong> for<br />

ensuring the correct storage, use, and maintenance of databases <strong>on</strong> extant DBMSes.<br />
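To make the idea of hash-based tamper detection concrete, the sketch below chains SHA-256 digests over a sequence of records and later compares a recomputed chain with digests that were stored in a trusted place. This is a generic hash-chain illustration, not DRAGOON's actual algorithms or forensic analysis; the function names and the sample records are hypothetical.

```python
import hashlib


def chain_digests(records, seed=b""):
    """Return one hex digest per record; each digest also covers
    all earlier records because it hashes the previous digest."""
    digests = []
    previous = hashlib.sha256(seed).digest()
    for record in records:
        previous = hashlib.sha256(previous + record.encode("utf-8")).digest()
        digests.append(previous.hex())
    return digests


def first_tampered(records, trusted_digests):
    """Return the index of the first record whose chained digest no
    longer matches the trusted copy, or None if everything matches."""
    current = chain_digests(records)
    for index, (now, then) in enumerate(zip(current, trusted_digests)):
        if now != then:
            return index
    return None


# Hypothetical audit rows; the digests would be kept outside the database.
rows = ["acct=1;balance=100", "acct=2;balance=250", "acct=3;balance=40"]
trusted = chain_digests(rows)

rows[1] = "acct=2;balance=9999"  # simulated tampering by an insider
print(first_tampered(rows, trusted))
# prints 1
```

Because each digest depends on all earlier records, changing one row invalidates its digest and every later one, which bounds where the earliest change can have occurred.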

Intuitive Interacti<strong>on</strong> With Encrypted Query Executi<strong>on</strong> in <strong>Data</strong>Storm<br />

Ken Smith (MITRE)<br />

Ameet Kini (MITRE)<br />

William Wang (MITRE)<br />

Chris Wolf (MITRE)<br />

M. David Allen (MITRE)<br />

Andrew Sillers (MITRE)<br />

The encrypted execution of database queries promises powerful security protections;<br />

however, users are currently unlikely to benefit without significant expertise. In<br />

this dem<strong>on</strong>strati<strong>on</strong>, we illustrate a simple workflow enabling users to design secure<br />


executions of their queries. The DataStorm system demonstrated simplifies both the<br />

design and execution of encrypted execution plans, and represents progress toward<br />

the challenge of developing a general planner for encrypted query execution.<br />

Seminar 1: Data Management Issues on the Semantic Web<br />

Oktie Hassanzadeh (University of Toronto & IBM Research)<br />

Anastasios Kementsietsidis (IBM Research)<br />

Yannis Velegrakis (University of Trento)<br />

We provide an overview of the current data management research issues in the<br />

context of the Semantic Web. The objective is to introduce the audience to the<br />

area of the Semantic Web, and to highlight the fact that the area provides many<br />

interesting research opportunities for the data management community. A new<br />

model, the Resource Description Framework (RDF), coupled with a new query<br />

language, called SPARQL, leads us to revisit some classical data management problems,<br />

including efficient storage, query optimization, and data integration. These<br />

are problems that the Semantic Web community has only recently started to explore,<br />

and therefore the experience and long tradition of the database community<br />

can prove valuable. We target both experienced and novice researchers who are<br />

looking for a thorough presentation of the area and its key research topics.<br />

Seminar 2: Discovering Multiple Clustering Solutions: Grouping Objects<br />

in Different Views of the Data<br />

Emmanuel Müller (Karlsruhe Institute of Technology)<br />

Stephan Günnemann (RWTH Aachen University)<br />

Ines Färber (RWTH Aachen University)<br />

Thomas Seidl (RWTH Aachen University)<br />

Traditional clustering algorithms identify just a single clustering of the data. Today’s<br />

complex data, however, allow multiple interpretations leading to several valid<br />

groupings hidden in different views of the database. Each of these multiple clustering<br />

solutions is valuable and interesting as different perspectives on the same data<br />

and several meaningful groupings for each object are given. Especially for<br />

high-dimensional data, where each object is described by multiple attributes, alternative<br />

clusters in different attribute subsets are of major interest. In this tutorial, we<br />

describe several real-world application scenarios for multiple clustering solutions.<br />

We abstract from these scenarios and provide the general challenges in this emerging<br />

research area. We describe state-of-the-art paradigms, we highlight specific<br />

techniques, and we give an overview of this topic by providing a taxonomy of the<br />

existing clustering methods. By focusing on open challenges, we try to attract<br />

young researchers to participate in this emerging research field.<br />

Seminar 3: Detecting Clones, Copying and Reuse on the Web<br />

Xin Luna Dong (AT&T Labs–Research)<br />

Divesh Srivastava (AT&T Labs–Research)<br />

The Web has enabled the availability of a vast amount of useful information in<br />

recent years. However, the web technologies that have enabled sources to share<br />

their information have also made it easy for sources to copy from each other and<br />

often publish without proper attribution. Understanding the copying relationships<br />

between sources has many benefits, including helping data providers protect their<br />

own rights, improving various aspects of data integration, and facilitating<br />

in-depth analysis of information flow. The importance of copy detection has led to a<br />

substantial amount of research in many disciplines of Computer Science, based on<br />

the type of information considered, such as text, images, videos, software code, and<br />

structured data. This seminar explores the similarities and differences between the<br />

techniques proposed for copy detection across the different types of information.<br />

We also examine the computational challenges associated with large-scale copy<br />

detection, indicating how copying could be detected efficiently, and identify a range of<br />

open problems for the community.<br />

Seminar 4: Mining Knowledge from Data: An Information Network<br />

Analysis Approach<br />

Jiawei Han (University of Illinois at Urbana-Champaign)<br />

Yizhou Sun (University of Illinois at Urbana-Champaign)<br />

Xifeng Yan (University of California at Santa Barbara)<br />

Philip S. Yu (University of Illinois at Chicago)<br />

Most people consider a database to be merely a data repository that supports data<br />

storage and retrieval. Actually, a database contains rich, inter-related, multi-typed<br />

data and information, forming one or a set of gigantic, interconnected, heterogeneous<br />

information networks. Much knowledge can be derived from such information<br />

networks if we systematically develop an effective and scalable database-oriented<br />

information network analysis technology. In this tutorial, we systematically introduce<br />

database-oriented information network analysis methods and demonstrate how<br />

such a technology can be used to turn database data into useful knowledge and<br />

how such information networks can be used to enhance data quality, consistency,<br />

and the generation of interesting knowledge. This tutorial presents an organized<br />

picture of how to turn a database into one or a set of organized heterogeneous<br />

information networks, how such information networks can be used for data cleaning,<br />

data consolidation, and data quality improvement, how to perform OLAP in<br />

such information networks, how to discover various kinds of knowledge from such<br />

information networks, and how to transform database data into knowledge by<br />

information network analysis. Moreover, we present interesting case studies on real<br />

datasets, including DBLP and Flickr, and show how interesting and organized knowledge<br />

can be generated from such database-oriented information networks.<br />

Seminar 5: Emerging Graph Queries In Linked Data<br />

Arijit Khan (University of California, Santa Barbara)<br />

Yinghui Wu (University of California, Santa Barbara)<br />

Xifeng Yan (University of California, Santa Barbara)<br />

In a wide array of disciplines, data can be modeled as an interconnected network of<br />

entities, where various attributes could be associated with both the entities and the<br />

relations among them. Knowledge is often hidden in the complex structure and attributes<br />

inside these networks. While querying and mining these linked datasets are essential<br />

for various applications, traditional graph queries may not be able to capture<br />

the rich semantics in these networks. With the advent of complex information networks,<br />

new graph queries are emerging, including graph pattern matching and mining,<br />

similarity search, ranking and expert finding, graph aggregation and OLAP. These<br />

queries require both the topology and content information of the network data, and<br />

are hence different from classical graph algorithms such as shortest path, reachability<br />

and minimum cut, which depend only on the structure of the network. In this tutorial,<br />

we shall give an introduction to the emerging graph queries, their indexing and resolution<br />

techniques, the current challenges and the future research directions.<br />

Seminar 6: Boolean Matrix Decomposition Problem: Theory, Variations<br />

and Applications to Data Engineering<br />

Jaideep Vaidya (Rutgers University)<br />

With the ubiquitous nature and sheer scale of data collection, the problem of data<br />

summarization is most critical for effective data management. Classical matrix<br />

decomposition techniques have often been used for this purpose, and have been<br />

the subject of much study. In recent years, several other forms of decomposition,<br />

including Boolean Matrix Decomposition, have become of significant practical<br />

interest. Since much of the data collected is categorical in nature, it can be viewed<br />

in terms of a Boolean matrix. Boolean matrix decomposition (BMD), wherein a<br />

Boolean matrix is expressed as a product of two Boolean matrices, can be used<br />

to provide concise and interpretable representations of Boolean data sets. The<br />

decomposed matrices give the set of meaningful concepts and their combination<br />

which can be used to reconstruct the original data. Such decompositions are useful<br />

in a number of application domains including role engineering, text mining as<br />

well as knowledge discovery from databases. In this seminar, we look at the theory<br />

underlying the BMD problem, study some of its variants and solutions, and examine<br />

different practical applications.<br />
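
The Boolean product behind BMD can be sketched in a few lines. The example below is illustrative only and assumes a role-engineering setting: the matrices `user_role` and `role_perm` and the helper `boolean_product` are hypothetical names, and the sketch only verifies a given decomposition rather than searching for one.

```python
def boolean_product(a, b):
    """Boolean matrix product: c[i][j] = OR over k of (a[i][k] AND b[k][j])."""
    inner = len(b)
    cols = len(b[0])
    return [
        [int(any(a[i][k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for i in range(len(a))
    ]


# Hypothetical data: 4 users, 2 roles, 3 permissions.
user_role = [
    [1, 0],  # user 0 holds role 0
    [0, 1],  # user 1 holds role 1
    [1, 1],  # user 2 holds both roles
    [0, 0],  # user 3 holds no role
]
role_perm = [
    [1, 1, 0],  # role 0 grants permissions 0 and 1
    [0, 1, 1],  # role 1 grants permissions 1 and 2
]

# Reconstruct the user-permission matrix from the two factors.
user_perm = boolean_product(user_role, role_perm)
for row in user_perm:
    print(row)
```

Here the two factors play the role of the "concepts" (roles) and their combination per user; unlike an ordinary matrix product, overlapping roles do not add up, since 1 OR 1 stays 1.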

Co-Located Workshops<br />

ICDE Workshop on Data-Driven Decision Support<br />

and Guidance Systems (DGSS 2012)<br />

http://dgss.vse.gmu.edu/<br />

Decision support systems (DSS) are widely used to support business or organizational<br />

decision-making at the management, operations and planning levels of an organization.<br />

Decision guidance systems (DGS) are decision support systems that go beyond organizing<br />

and displaying information, providing actionable recommendations to and extracting<br />

knowledge from human decision-makers. This workshop will bring together DGSS<br />

researchers and practitioners to present novel methodologies, models, algorithms,<br />

systems, tools, applications and case studies of DGSS. Most importantly, the workshop<br />

will be a forum to discuss how to utilize advances from multiple disciplines for building<br />

DGSS that can intelligently merge human knowledge and expertise with formal<br />

mathematical models to make better decisions. The workshop will include both formal<br />

presentations and informal discussion of important research directions in DGSS, and<br />

their interactions with knowledge and data engineering.<br />

Program<br />

8:50 – 9 am Opening remarks<br />

9 – 10 am Paper session 1<br />

10 – 10:30 am Coffee break<br />

10:30 am – noon Paper session 2<br />

noon – 2 pm Lunch<br />

A MAUT Approach for Reusing Ontologies<br />

Antonio Jiménez, Mari Carmen Suárez-Figueroa, Alfonso<br />

Mateos, Mariano Fernández-López and Asunción Gómez-Pérez<br />

Online Optimization through Preprocessing for<br />

Multi-Stage Production Decision Guidance Queries<br />

Nathan Egge, Alexander Brodsky and Igor Griva<br />

A Decision-theoretic Model of Disease Surveillance<br />

and Control and a Prototype Implementation for the<br />

Disease Influenza<br />

Michael Wagner, Gregory Cooper, Fuchiang Tsui,<br />

Jeremy Espino, Hendrik Harkema, John Levander,<br />

Ricardo Villamarin, Nicholas Millett, Shawn Brown and<br />

Anthony Gallagher<br />

Personal Health Explorer: A Semantic Health Recommendation<br />

System<br />

Thomas Morrell and Larry Kerschberg<br />

Striving for Market Dominance in UK’s Private Healthcare<br />

Sector: A Case of Cygnet Healthcare<br />

Mlungisi Masilela, Fenio Annansingh and Shaofeng Liu<br />

2 – 3:30 pm Poster session: brief overview presentations followed by<br />

parallel poster presentations<br />

Towards a DGSS Prototype for Early Warning for Ski<br />

Injuries<br />

Boris Delibašić and Zoran Obradović


3:30 – 4 pm Coffee break<br />

4 – 5 pm Paper session 3<br />

Non-Parametric Synthesis Of Private Probabilistic<br />

Predictions<br />

Phan Giang<br />

Battle Management System: An Optimization for Military<br />

Decision Makers<br />

Richard Haberlin and Alexander Brodsky<br />

An explanation of decision-making under uncertainty –<br />

a qualitative research approach<br />

Eurico Lopes<br />

Agent Negotiation Strategies for Composing Service<br />

Workflows<br />

John Mcdowall and Larry Kerschberg<br />

A Scalable Data Warehouse Model based on Complex<br />

Semantic Event Processing in Distributed Systems<br />

Dingyu Yang and Jian Cao<br />

A Stigmergic Guiding System to Facilitate the Group<br />

Decision Process<br />

Constantin-Bala Zamfirescu and Ciprian Candea<br />

A Regression Based Algorithm for Optimizing Top-K<br />

Selection in Simulation Query Language<br />

Susan Farley, Alexander Brodsky and Chun-Hung Chen<br />

Towards a Training-Oriented Adaptive Decision Guidance<br />

and Support System<br />

Farhana Zulkernine, Patrick Martin, Sima Soltani, Wendy<br />

Powley, Serge Mankovskii and Mark Addleman<br />

5 – 5:30 pm Wrap-up session: Discussion on the future and<br />

organization of DGSS<br />

3rd International Workshop on Data Engineering<br />

Meets the Semantic Web (DESWeb 2012)<br />

https://sites.google.com/site/desweb2012/<br />

DESWeb brings together researchers and practitioners from Data Management and<br />

Semantic Web. On the one hand, the Semantic Web brings several new data management<br />

problems, while on the other hand, several Data Management problems can be<br />

solved with the help of Semantic Web technologies. DESWeb attracts papers on three<br />

broad areas: Semantics in Data Management, Management of Semantic Web Data, and<br />

Semantic Search and Linked Data. DESWeb 2012 features an invited talk by Prof. Tim<br />

Finin on how to make Semantic Web tools easier to use, as well as four regular contributions<br />

on the topics of improving query processing, benchmarking, schema matching,<br />

and challenges related to enabling Semantic Web tools within Dataspaces.<br />

Program<br />

9 – 10 am Session 1<br />

10 – 10:30 am Coffee break<br />

10:30 am – noon Invited talk<br />

noon – 2 pm Lunch<br />

2 – 3 pm Session 2<br />

Scientific SPARQL<br />

Andrej Andrejev and Tore Risch<br />

A Benchmark for RDF-Based metadata<br />

Ivan Subotic, Lukas Rosenthaler and Heiko Schuldt<br />

Making the Semantic Web Easier to Use<br />

Tim Finin<br />

Opaque Attribute Alignment<br />

Jennifer Sleeman, Rafael Alonso, Hua Li, Art Pope and<br />

Antonio Badia<br />

Linked Data and Live Querying for Enabling Support<br />

Platforms for Web Dataspaces<br />

Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira,<br />

Axel Polleres and Manfred Hauswirth<br />

3 – 3:30 pm Discussion and Wrap-up


1st International Workshop on Data Management<br />

in the Cloud (DMC 2012)<br />

http://www.nec-labs.com/dm/dmc2012/<br />

Cloud computing has emerged as a promising computing and business model. By providing<br />

on-demand scaling capabilities without any large upfront investment or long-term<br />

commitment, it is attracting a wide range of users. The database community has also shown<br />

great interest in exploiting this new platform for data management services in a highly<br />

scalable and cost-efficient manner. As a result, cloud computing presents challenges<br />

and opportunities for data management. The DMC workshop aims at bringing researchers<br />

and practitioners in cloud computing and data management systems together to discuss<br />

the research issues at the intersection of those areas, and also to draw more attention from<br />

the larger data management research community to this new and highly promising field.<br />

Program<br />

8:50 – 9 am Welcome<br />

9 – 10 am Keynote<br />

10 – 10:30 am Coffee break<br />

10:30 am – noon Session 1<br />

noon – 2:30 pm Lunch<br />

Supporting Extensible Performance SLAs for Cloud<br />

Databases<br />

Olga Papaemmanouil (Brandeis University)<br />

Application-Managed Database Replication on Virtualized<br />

Cloud Environments<br />

Liang Zhao (National ICT Australia), Sherif Sakr (National<br />

ICT Australia), Alan Fekete (University of Sydney, Australia),<br />

Hiroshi Wada (National ICT Australia), and Anna Liu<br />

(National ICT Australia)<br />

Efficient Updates for Web-scale Indexes over the Cloud<br />

Panagiotis Antonopoulos (Microsoft Corp), Ioannis<br />

Konstantinou (National Technical University of Athens),<br />

Dimitrios Tsoumakos, and Nectarios Koziris (National<br />

Technical University of Athens)<br />

2:30 – 3:30 pm Session 2<br />

3:30 – 4 pm Coffee break<br />

4 – 5 pm Session 3<br />

Secure Access for Healthcare Data in the Cloud Using<br />

Ciphertext-Policy Attribute-Based Encryption<br />

Suhair Alshehri (Rochester Inst. of Technology),<br />

Stanislaw Radziszowski, and Rajendra Raj (Rochester<br />

Inst. of Technology)<br />

Achieving Database Information Accountability in the<br />

Cloud<br />

Kyriacos Pavlou (The University of Arizona), and Richard<br />

Snodgrass (The University of Arizona)<br />

Building Large XML Stores in the Amazon Cloud<br />

Jesús Camacho-Rodríguez* (LRI, Université Paris-Sud 11),<br />

Dario Colazzo (LRI, Université Paris-Sud 11), and Ioana<br />

Manolescu (INRIA Saclay)<br />

Stream As You Go: The Case for Incremental Data<br />

Access and Processing in the Cloud<br />

Romeo Kienzler (ETH Zurich), Rémy Bruggmann (University<br />

of Berne), Anand Ranganathan (IBM Research), and<br />

Nesime Tatbul* (ETH Zurich)<br />

3rd International Workshop on Graph Data<br />

Management: Techniques and Applications<br />

(GDM 2012)<br />

http://www.cse.unsw.edu.au/~iwgdm/2012/<br />

Recently, there has been a lot of interest in the application of graphs in different domains.<br />

They have been widely used for data modeling of different application domains<br />

such as chemical compounds, multimedia databases, protein networks, social networks<br />

and the Semantic Web. With the continued emergence and increase of massive and complex<br />

structural graph data, a graph database that efficiently supports elementary data<br />

management mechanisms is crucially required to effectively understand and utilize any<br />

collection of graphs. The overall goal of the workshop is to bring people from different<br />

fields together, exchange research ideas and results, and encourage discussion about<br />

how to provide efficient graph data management techniques in different application<br />

domains and to understand the research challenges of this area.


Program<br />

9 – 10 am Welcome and keynote presentation<br />

10 – 10:30 am Coffee break<br />

10:30 – noon Research session<br />

noon – 2 pm Lunch break<br />

Keynote Speaker<br />

Prof. Jiawei Han - Univ. of Illinois at Urbana-Champaign<br />

A Comparison of Current Graph Database Models<br />

Renzo Angles<br />

Design of Declarative Graph Query Languages: On the<br />

Choice between Value, Pattern and Object-based Representations<br />

for Graphs<br />

Hasan M Jamil<br />

Benchmarking traversal operations over graph databases<br />

Marek Ciglan, Alex Averbuch, and Ladislav Hluchý<br />

Mining Associations Using Directed Hypergraphs<br />

Ramanuja Simha, Rahul Tripathi, and Mayur Thakur<br />

2 – 3:30 pm Research session (Invited papers)<br />

3:30 – 4 pm Coffee break<br />

Finding Skyline Nodes in Large Networks<br />

Arijit Khan, Vishwakarma Singh, and Jian Wu<br />

Partitioning Social Networks for Fast Retrieval of<br />

Time-dependent Queries<br />

Mindi Yuan, David Stein, Berenice Carrasco, Joana M. F.<br />

Trindade, and Yi Lu<br />

Will Graph Data Management Techniques Contribute<br />

to the Successful Large-Scale Deployment of Semantic<br />

Web Technologies?<br />

Philippe Cudre-Mauroux<br />

4 – 6 pm Industrial session<br />

Virtuoso 7 - Column Store and Adaptive Techniques for<br />

Graph<br />

Orri Erling<br />

HyperGraphDB: Model and Applications<br />

Borislav Iordanov<br />

The Bigdata(r) parallel graph database<br />

Bryan Thompson<br />

RDF Graph Stores<br />

Christopher J. Matheus<br />

ICDE Workshop on Secure Data Management on<br />

Smartphones and Mobiles (SDMSM 2012)<br />

http://dig.csail.mit.edu/2012/ICDE-SDMSM/<br />

There has been a widespread adoption of powerful mobile devices such as smartphones<br />

and tablets within the enterprise in the recent past. This widespread adoption of mobile<br />

devices raises serious data management challenges around data privacy and security of<br />

personal and enterprise data on these devices. The further adoption of mobile devices<br />

within the enterprise depends on strong guarantees that the enterprise is still in control<br />

of its sensitive data on mobile endpoints in the wild, and no data leakage or unauthorized<br />

modifications to the data can happen through these devices. Popular mobile platforms such<br />

as Android and iOS allow users to download apps from respective marketplaces, and enterprises<br />

can host their own marketplaces to distribute their own apps. However, given the<br />

personal nature of these devices, most users run both enterprise and personal apps<br />

on the same device simultaneously. Since most apps on the public marketplaces are not<br />

security certified, and existing platform security solutions are lacking, for example by being<br />

coarse grained or being checked only at application install time, it is possible for malicious<br />

apps to steal/modify enterprise sensitive information that is resident on these devices.<br />

Similarly, given the compact dimensions of mobile devices such as smartphones, users<br />

could potentially lose their phones, which carry sensitive data. Furthermore, most devices<br />

come packed with an array of sensors and communication capabilities such as GPS, cameras,<br />

near field communication (NFC), accelerometers, WiFi and Bluetooth. These myriad<br />

on-device sensors generate large amounts of raw sensor data and managing this data to<br />

infer high-level events about the user and the end device remains a challenge. Additionally,<br />

devices like iPads and Internet tablets are now being increasingly used in a multi-user environment<br />

where continuous and secure authentication and authorization for data access are<br />

critical. In this workshop, we focus on the data management challenges that arise from the<br />

use of enterprise and other privacy sensitive data on mobile devices such as smartphones.


Program<br />

9 – 9:15 am Opening address & speaker introduction<br />

9:15 – 10 am Invited Talk: “Privacy in Mobile, Collaborative,<br />

Context-aware Systems”<br />

Prof. Tim Finin<br />

10 – 10:30 am Break<br />

10:30 – noon Research papers (3 papers, 30 mins each)<br />

noon – 2 pm Lunch break<br />

2 – 2:45 pm Invited talk<br />

2:45 – 3:30 pm Panel: “Managing Data on Smart Phones: Enterprises and Beyond”<br />

3:30 – 4 pm Break<br />

4 – 5 pm Research papers (3 papers, 30 mins each)<br />

5 pm – 5:15 pm Group Discussion<br />

5:15 – 5:30 pm Closing remarks<br />

7th International Workshop on Self-Managing<br />

Database Systems (SMDB 2012)<br />

http://smdb2012.dvs.informatik.tu-darmstadt.de/<br />

Autonomic, or self-managing, systems are a promising approach to achieve the goal of<br />

systems that are easier to use and maintain in the face of growing system complexity. A<br />

system is considered to be autonomic if it is self-configuring, self-optimizing, self-healing<br />

and/or self-protecting. The aim of the SMDB workshop is to provide a forum for<br />

researchers from both industry and academia to present and discuss ideas and experiences<br />

related to self-management and self-organization in all areas of Information Management<br />

(IM) in general. SMDB targets not only classical databases but also the new<br />

generation of storage engines such as column stores, key-value stores and in-memory<br />

databases. Beyond databases, SMDB aims to cover autonomic aspects of data intensive<br />

systems represented by large-scale map-reduce (e.g., Hadoop) and cloud environments<br />

where much work on self-management is needed. Last but not least, SMDB wants to<br />

expand its horizons to include self-management of non-traditional, new areas of IM<br />

such as social networks and peer-to-peer systems.<br />

Program<br />

9 – 10 am Session 1<br />

10 – 10:30 am Break<br />

10:30 – 12:30 pm Session 2<br />

12:30 – 2 pm Lunch break<br />

Opening<br />

Alejandro Buchmann (TU Darmstadt), Malu Castellanos<br />

(HP Labs)<br />

Keynote: Quantitative Methods for Workload Management<br />

in Integrated Large Scale Data Platforms<br />

Nachum Shacham (eBay)<br />

Discovering Indicators for Congestion in DBMSs<br />

Mingyi Zhang (Queen’s University, Canada), Pat Martin<br />

(Queen’s University), Wendy Powley (Queen’s University),<br />

Paul Bird (IBM Toronto Lab), and Keith McDonald (IBM<br />

Toronto Lab)<br />

Online load balancing in parallel database queries with<br />

model predictive control<br />

Anastasios Gounaris (Aristotle University of Thessaloniki),<br />

and Christos Yfoulis (ATEI of Thessaloniki)<br />

Same Queries, Different Data: Can we Predict Runtime<br />

Performance?<br />

Adrian Daniel Popescu (EPFL), Vuk Ercegovac (IBM<br />

Almaden), Andrey Balmin (IBM Almaden), Miguel Branco<br />

(EPFL), and Anastasia Ailamaki (EPFL)<br />

Elastic Scale-out for Partition-Based Database Systems<br />

Umar Farooq Minhas (University of Waterloo), Rui Liu<br />

(University of Waterloo), Ashraf Aboulnaga (University of<br />

Waterloo), Ken Salem (University of Waterloo), Jonathan<br />

Ng (University of Waterloo), and Sean Robertson<br />

(University of Waterloo)


2 – 3:30 pm Session 3<br />
3:30 – 4 pm Break<br />
4 – 6 pm Session 4<br />


Adaptive class-based scheduling of continuous queries<br />
Lory Al Moakar (University of Pittsburgh), Alexandros Labrinidis (University of Pittsburgh), and Panos Chrysanthis (University of Pittsburgh)<br />

Adaptive Provisioning of Stream Processing Systems in the Cloud<br />
Javier Cerviño (Universidad Politécnica de Madrid), Evangelia Kalyvianaki (Imperial College London), Joaquín Salvachúa (Universidad Politécnica de Madrid), and Peter Pietzuch (Imperial College London)<br />

Lifting the burden of history in adaptive ordering of pipelined stream filters<br />
Efthymia Tsamoura (Aristotle University of Thessaloniki), Anastasios Gounaris (Aristotle University of Thessaloniki), and Yannis Manolopoulos (Aristotle University of Thessaloniki)<br />

Adaptive Index Buffer<br />
Hannes Voigt (TU Dresden), Tobias Jaekel (TU Dresden), Thomas Kissinger (TU Dresden), and Wolfgang Lehner (TU Dresden)<br />

Application of Micro-Specialization to Query Evaluation Operators<br />
Rui Zhang (University of Arizona), Richard Snodgrass (University of Arizona), and Saumya Debray (University of Arizona)<br />

Automatic Data Placement in MPP Databases<br />
Carlos Garcia-Alvarado (University of Houston), Venkatesh Raghavan (Greenplum EMC), Sivaramakrishnan Narayanan (Greenplum EMC), and Florian Waas (Greenplum EMC)<br />

Discussion & closing<br />
Alejandro Buchmann (TU Darmstadt), and Malu Castellanos (HP Labs)<br />


International Workshop on Spatio Temporal Data Integration and Retrieval (STIR 2012)<br />

http://research.ihost.com/stir12/<br />

The increasing world population is putting higher demands on the planet’s limited<br />
resources due to shifting life-styles. Consequently, we not only need to monitor how we<br />
consume resources but also optimize resource usage. Some examples of the planet’s<br />
limited resources are water, energy, land, food and air. Today, significant challenges exist<br />
for reducing usage of these resources, while maintaining quality of life. The challenges<br />
range from understanding regionally varied impacts of global environmental change,<br />
through tracking diffusion of avian flu and responding to natural disasters, to adapting<br />
business practice to dynamically changing resources, markets and geopolitical situations.<br />

This workshop is focused on making the research in information integration<br />
and retrieval more relevant to the challenges in systems with significant spatial and<br />
temporal components. The workshop will build upon traditional themes of interest,<br />
namely integration architectures, information extraction, record linkage, named entity<br />
extraction, source meta-data learning, query execution and optimization. However, we<br />
give special emphasis to how these can be applied to integrating information arising<br />
from systems that are (likely to be) deployed over wide geographic spaces, and that collect<br />
and use data that changes over time.<br />

Program<br />

8:30 – 10 am Session 1<br />
10 – 10:30 am Coffee break<br />
10:30 am – noon Session 2<br />


Opening and Welcome<br />

Invited Talk: “On the Roles of Spatio-Temporal Data in Web Search”<br />
Prof. Christian S Jensen, ACM & IEEE Fellow (Aarhus University, Denmark)<br />

TNeT: Tensor-based Neighborhood Discovery in Traffic Networks<br />
Yanan Sun, Vandana P Janeja, Aryya Gangopadhyay (University of Maryland, Baltimore County, USA) and Michael P McGuire (Towson University, USA)<br />


noon – 1:30 pm Lunch<br />
1:30 – 3:30 pm Session 3<br />
3:30 – 4 pm Coffee break<br />
4 – 5:30 pm Session 4<br />


A Study of the Correlation between the Spatial Attributes on Twitter<br />
Bumsuk Lee and Byung-Yeon Hwang (The Catholic University of Korea, Korea)<br />

Multi-representation Lens for Visual Analytics<br />
Sandro Danilo Gatto and Andre Santanche (UNICAMP, Brazil)<br />

Invited Talk/Panel Discussion - TBD<br />

Who was Where, When? Spatiotemporal Analysis of Researcher Mobility in Nuclear Science<br />
Miray Kas, Kathleen M Carley, and L. Richard Carley (Carnegie Mellon University, USA)<br />

Architecting the Database Access for a IT Infrastructure and Data Center Monitoring tool<br />
Pradeep Unde, Harrick Vin, Maitreya Natu, Vaishali Kulkarni, Dilys Thomas, Sreeram Vasudevan, Amruta Dhondage, Chinmay Jog, Shivam Sahai, and Rekha Pathak (Tata Research Development and Design Center, Pune, India)<br />

Moving Objects and KML Files<br />
Karine Reis Ferreira, Lúbia Vinhas, Antônio Miguel Vieira Monteiro and Gilberto Camara (National Institute of Space Research, Brazil)<br />

Closing Remarks<br />



Local Information<br />

Washington, DC: Capital of the USA<br />

The city, which is located on the north bank of the Potomac River, is bordered by the<br />
states of Virginia to the southwest and Maryland to the other sides. The District has a<br />
resident population of 599,657; because of commuters from the surrounding suburbs, its<br />
population rises to over one million during the workweek. The Washington Metropolitan<br />
Area, of which the District is a part, has a population of 5.3 million, the ninth-largest metropolitan<br />
area in the country. The District has a total area of 68.3 square miles (177 km²),<br />
of which 61.4 square miles (159 km²) is land and 6.9 square miles (18 km²) (10.16%) is<br />
water. The District has three major natural flowing streams: the Potomac River and its<br />
tributaries, the Anacostia River and Rock Creek. Tiber Creek, a watercourse that once<br />
passed through the National Mall, was fully enclosed underground during the 1870s.<br />

The highest natural point in the District of Columbia is Point Reno, located in Fort<br />
Reno Park, in the Tenleytown neighborhood, at 409 feet (125 m) above sea level. The<br />
lowest point is sea level at the Potomac River. The geographic center of Washington is<br />
located near the intersection of 4th and L Streets NW.<br />

Approximately 19.4% of Washington, D.C. is parkland, which ties New York City for<br />
largest percentage of parkland among high-density U.S. cities. The U.S. National Park<br />
Service manages most of the natural habitat in Washington, D.C., including Rock Creek<br />
Park, the Chesapeake and Ohio Canal National Historical Park, the National Mall,<br />
Theodore Roosevelt Island, the Constitution Gardens, Meridian Hill Park, and Anacostia<br />
Park. The only significant area of natural habitat not managed by the National Park<br />


Service is the U.S. National Arboretum, which is operated by the U.S. Department of<br />
Agriculture. The Great Falls of the Potomac River are located upstream (northwest) of<br />
Washington. During the 19th century, the Chesapeake and Ohio Canal, which starts in<br />
Georgetown, was used to allow barge traffic to bypass the falls.<br />

Washington is located in the humid subtropical climate zone, exhibiting four distinct<br />
seasons. Its climate is typical of Mid-Atlantic U.S. areas removed from bodies of water.<br />
Spring and fall are warm, with low humidity, while winter is cool, with annual snowfall<br />
averaging 14.7 inches (370 mm). Average winter lows tend to be around 30 °F (-1 °C) from<br />
mid-December to mid-February. Blizzards affect Washington on average once every four<br />

to six years. The most violent storms are called “nor’easters”, which typically feature high<br />
winds, heavy rains, and occasional snow. These storms often affect large sections of the<br />
U.S. East Coast. Summers are hot and humid, with highs averaging in the upper 80s °F<br />
(lower 30s °C) and lows averaging in the upper 60s °F (lower 20s °C). The combination of<br />
heat and humidity in the summer brings very frequent thunderstorms, some of which<br />
occasionally produce tornadoes in the area. While hurricanes (or their remnants) occasionally<br />
track through the area in late summer and early fall, they have often weakened<br />
by the time they reach Washington, partly due to the city’s inland location. Flooding of<br />
the Potomac River, however, caused by a combination of high tide, storm surge, and<br />
runoff, has been known to cause extensive property damage in Georgetown.<br />

History<br />

An Algonquian people known as the Nacotchtank inhabited the area around the Anacostia<br />
River where Washington now lies when the first Europeans arrived in the 17th<br />
century; however, Native American people had largely relocated from the area by the<br />
early 18th century. Georgetown was chartered by the Province of Maryland on the north<br />
bank of the Potomac River in 1751. The town would be included within the new federal<br />
territory established nearly 40 years later. The City of Alexandria, Virginia, founded in<br />
1749, was also originally included within the District.<br />

James Madison expounded the need for a federal district on January 23, 1788, in his<br />
“Federalist No. 43”, arguing that the national capital needed to be distinct from the<br />


states in order to provide for its own maintenance and safety. An attack on the Congress<br />
at Philadelphia by a mob of angry soldiers, known as the Pennsylvania Mutiny of 1783,<br />
had emphasized the need for the government to see to its own security. Therefore, the<br />
authority to establish a federal capital was provided in Article One, Section Eight, of the<br />
United States Constitution, which permits a “District (not exceeding ten miles square) as may,<br />
by cession of particular states, and the acceptance of Congress, become the seat of<br />
the government of the United States”. The Constitution does not, however, specify a<br />
location for the new capital. In what later became known as the Compromise of 1790,<br />
Madison, Alexander Hamilton, and Thomas Jefferson came to an agreement that the<br />
federal government would assume war debt carried by the states, on the condition that<br />
the new national capital would be located in the South.<br />

On July 16, 1790, the Residence Act provided for a new permanent capital to be located<br />
on the Potomac River, the exact area to be selected by President Washington. As permitted<br />
by the U.S. Constitution, the initial shape of the federal district was a square,<br />
measuring 10 miles (16 km) on each side, totaling 100 square miles (260 km²). During<br />
1791–1792, Andrew Ellicott and several assistants, including Benjamin Banneker,<br />
surveyed the border of the District with both Maryland and Virginia, placing boundary<br />
stones at every mile point; many of the stones are still standing. A new “federal city”<br />
was then constructed on the north bank of the Potomac, to the east of the established<br />
settlement at Georgetown. On September 9, 1791, the federal city was named in honor<br />
of George Washington, and the district was named the Territory of Columbia, Columbia<br />
being a poetic name for the United States in use at that time. Congress held its first session<br />
in Washington on November 17, 1800.<br />

The Organic Act of 1801 officially organized the District of Columbia and placed the<br />
entire federal territory, including the cities of Washington, Georgetown, and Alexandria,<br />
under the exclusive control of Congress. Further, the unincorporated territory within the<br />
District was organized into two counties: the County of Washington to the east of the<br />
Potomac and the County of Alexandria to the west. Following this Act, citizens located<br />
in the District were no longer considered residents of Maryland or Virginia, thus ending<br />
their representation in Congress.<br />

On August 24–25, 1814, in a raid known as the Burning of Washington, British forces<br />
invaded the capital during the War of 1812, following the sacking and burning of York<br />
(modern-day Toronto). The Capitol, Treasury, and White House were burned and gutted<br />
during the attack. Most government buildings were quickly repaired, but the Capitol,<br />
which was at the time largely under construction, was not completed in its current form<br />
until 1868.<br />

Since 1800, the District’s residents have protested their lack of voting representation<br />
in Congress. To correct this, various proposals have been offered to return the land<br />
ceded to form the District back to Maryland and Virginia. This process is known as<br />
retrocession. However, such efforts failed to earn enough support until the 1830s when<br />
the District’s southern county of Alexandria went into economic decline due to neglect<br />
by Congress. Alexandria was also a major market in the American slave trade, and<br />


rumors circulated that abolitionists in Congress were attempting to end slavery in the<br />
District; such an action would have further depressed Alexandria’s economy. Unhappy<br />
with Congressional authority over Alexandria, in 1840 the people began to petition for<br />
the retrocession of the District’s southern territory back to Virginia. The state legislature<br />
complied in February 1846, partly because the return of Alexandria provided two<br />
additional pro-slavery delegates to the Virginia General Assembly. On July 9, 1846,<br />
Congress agreed to return all of the District’s territory south of the Potomac River to the<br />
Commonwealth of Virginia.<br />

Confirming the fears of pro-slavery Alexandrians, the Compromise of 1850 outlawed the<br />
slave trade in the District, though not slavery itself. By 1860, approximately 80% of the<br />
city’s African American residents were free blacks. The outbreak of the American Civil<br />
War in 1861 led to notable growth in the District’s population due to the expansion of the<br />
federal government and a large influx of freed slaves. In 1862, President Abraham Lincoln<br />
signed the Compensated Emancipation Act, which ended slavery in the District of Columbia<br />
and freed about 3,100 enslaved persons, nine months prior to the Emancipation<br />
Proclamation. By 1870, the District’s population had grown to nearly 132,000. Despite the<br />
city’s growth, Washington still had dirt roads and lacked basic sanitation; the situation was<br />
so bad that some members of Congress proposed moving the capital elsewhere.<br />

With the Organic Act of 1871, Congress created a new government for the entire federal<br />
territory. This Act effectively combined the City of Washington, Georgetown, and Washington<br />
County into a single municipality officially named the District of Columbia. Even<br />
though the City of Washington legally ceased to exist after 1871, the name continued<br />
in use and the whole city became commonly known as Washington, D.C. In the same<br />
Organic Act, Congress also appointed a Board of Public Works charged with modernizing<br />
the city. In 1873, President Grant appointed the board’s most influential member,<br />
Alexander Shepherd, to the new post of governor. That year, Shepherd spent $20 million<br />
on public works ($357 million in 2007), which modernized Washington but also<br />
bankrupted the city. In 1874, Congress abolished Shepherd’s office in favor of direct<br />
rule. Additional projects to renovate the city were not executed until the McMillan Plan<br />
in 1901.<br />

The District’s population remained relatively stable until the Great Depression in the<br />
1930s when President Franklin D. Roosevelt’s New Deal legislation expanded the bureaucracy<br />
in Washington. World War II further increased government activity, adding to the<br />
number of federal employees in the capital; by 1950, the District’s population had reached<br />
a peak of 802,178 residents. The Twenty-third Amendment to the United States Constitution<br />
was ratified in 1961, granting the District three votes in the Electoral College.<br />

After the assassination of civil rights leader Dr. Martin Luther King, Jr., on April 4,<br />
1968, riots broke out in the District, primarily in the U Street, 14th Street, 7th Street,<br />
and H Street corridors, centers of black residential and commercial areas. The riots<br />
raged for three days until over 13,000 federal and National Guard troops managed to<br />
quell the violence. Many stores and other buildings were burned; rebuilding was not<br />
complete until the late 1990s. In 1973, Congress enacted the District of Columbia<br />


Home Rule act, providing for an elected mayor and city council for the District. In 1975,<br />

Walter Washingt<strong>on</strong> became the first elected and first black mayor of the District. However,<br />

Board during to oversee the later all municipal 1980s spending and 1990s, and rehabilitate city administrati<strong>on</strong>s the city government. were The District criticized for mismanagement<br />

regained c<strong>on</strong>trol and over waste. its finances In 1995, in September c<strong>on</strong>gress 2001 and created the oversight the District board's operati<strong>on</strong>s of columbia were financial<br />

Board suspended. to oversee all municipal spending and rehabilitate the city government. The District<br />

c<strong>on</strong>trol Board to oversee all municipal spending and rehabilitate the city government.<br />

regained c<strong>on</strong>trol over its finances in September 2001 and the oversight board's operati<strong>on</strong>s were<br />

the suspended. District regained c<strong>on</strong>trol over its finances in September 2001 and the oversight<br />

board’s Attracti<strong>on</strong>s operati<strong>on</strong>s in Washingt<strong>on</strong>, were D.C. suspended.<br />

Attracti<strong>on</strong>s White House in Washingt<strong>on</strong>, D.C.<br />

Attracti<strong>on</strong>s The White House in Washingt<strong>on</strong>, is the official residence D.C. and principal workplace of the President of the<br />

United States. Located at 1600 Pennsylvania Avenue NW in Washingt<strong>on</strong>, D.C., it was<br />

White<br />

White designed House<br />

House by Irish-born James Hoban and built<br />

The between White House 1792 and is 1800 the official in the late residence Georgian and style. principal workplace of the President of the<br />

the United It White has States. been House the Located residence is the at 1600 of official every Pennsylvania U.S. residence President Avenue NW in Washingt<strong>on</strong>, D.C., it was<br />

and designed since principal John by Irish-born Adams. workplace In 1814, James of during Hoban the president the and War built of of<br />

the between 1812, United 1792 the States. mansi<strong>on</strong> and 1800 Located was set in the ablaze at late 1600 by Georgian the pennsyl- British style.<br />

vania It has Army<br />

avenue been in the Burning residence of Washingt<strong>on</strong>,<br />

NW in Washingt<strong>on</strong>, of every U.S. destroying<br />

D.c., President it<br />

was<br />

since the<br />

designed<br />

John interior Adams. and charring<br />

by Irish-born<br />

In 1814, much during of the<br />

James<br />

the exterior.<br />

Hoban<br />

War of<br />

Rec<strong>on</strong>structi<strong>on</strong> began almost immediately, and<br />

1812, the mansi<strong>on</strong> was set ablaze by the British<br />

and President built between James M<strong>on</strong>roe 1792 moved and 1800 into the in partially the late<br />

Army in the Burning of Washingt<strong>on</strong>, destroying<br />

Georgian rec<strong>on</strong>structed style. house It has in been October the 1817. residence Under<br />

the Harry interior S. Truman, and charring the interior much rooms of the were exterior.<br />

of Rec<strong>on</strong>structi<strong>on</strong> every completely U.S. dismantled president began almost and since a new immediately, John internal adams. load- and<br />

In President 1814, bearing during James steel frame M<strong>on</strong>roe the c<strong>on</strong>structed War moved of 1812, inside into the man- partially walls.<br />

si<strong>on</strong> rec<strong>on</strong>structed Once was this set work ablaze house was in by completed, October the British 1817. the interior army Under rooms in<br />

the Harry were Burning S. rebuilt. Truman, of Today, Washingt<strong>on</strong>, the interior the White rooms House destroying were Complex the includes interior the and Executive charring Residence, much West of the exterior.<br />

Rec<strong>on</strong>structi<strong>on</strong><br />

completely Wing, Cabinet dismantled Room,<br />

began<br />

and Roosevelt<br />

almost<br />

a new Room,<br />

immediately,<br />

internal East load- Wing, and the Old Executive Office<br />

and president James M<strong>on</strong>roe moved into<br />

bearing<br />

Building,<br />

steel<br />

which<br />

frame<br />

houses<br />

c<strong>on</strong>structed<br />

the executive<br />

inside<br />

offices<br />

the walls.<br />

of the President and Vice President.<br />

the partially reconstructed house in October 1817. Under Harry S. Truman, the interior<br />

rooms were completely dismantled and a new internal load-bearing steel frame<br />

constructed inside the walls. Once this work was completed, the interior rooms were<br />

rebuilt. Today, the White House Complex includes the Executive Residence, West Wing,<br />

Cabinet Room, Roosevelt Room, East Wing, and the Old Executive Office Building,<br />

which houses the executive offices of the President and Vice President.<br />

Washington Monument<br />

The Washington Monument is an obelisk near the west end of the National Mall in<br />

Washington, D.C., built to commemorate the first U.S. president, General George<br />

Washington. The monument is both the world’s tallest stone structure and the world’s<br />

tallest obelisk, standing 555 feet 5⅛ inches (169.294 m). There are taller monumental<br />

columns, but they are neither all stone nor true obelisks. The cornerstone was laid on<br />

July 4, 1848. The same trowel was used that George Washington used to lay the<br />

cornerstone of the Capitol way back in 1793.<br />

Lincoln Memorial<br />

The Lincoln Memorial commemorates the life of Abraham Lincoln, the 16th President<br />

of the United States. It is located in Potomac Park, Washington, D.C. The Memorial was<br />

designed by Henry Bacon; the style is that of a Greek Doric temple with 36 enormous<br />

columns. Inside the building is a huge statue of a sitting Lincoln. Also in the Memorial<br />

are two murals, and stone engravings of Lincoln’s second inaugural address and the<br />

Gettysburg address.<br />

On August 28, 1963, Martin Luther King, Jr., made his “I Have a Dream” speech on the<br />

steps of the Lincoln Memorial (the speech was delivered on the landing 18 steps below<br />

Lincoln’s statue); there is now an inscription on the step where Dr. King stood,<br />

commemorating that historic event. Dr. King was speaking at the March on Washington for<br />

Jobs and Freedom.<br />

National World War II Memorial<br />

The World War II Memorial honors the 16 million who served in the armed forces of<br />

the U.S., the more than 400,000 who died, and all who supported the war effort from<br />

home. Symbolic of the defining event of the 20th century, the memorial is a monument<br />

to the spirit, sacrifice, and commitment of the American people. The Second World War<br />

is the only 20th century event commemorated on the National Mall’s central axis.<br />

Japanese Cherry Blossom Trees<br />

The National Cherry Blossom Festival is a spring celebration in Washington, D.C.<br />

commemorating the March 27, 1912, gift of Japanese cherry trees from Mayor Yukio<br />

Ozaki of Tokyo to the city of Washington. Mayor Ozaki donated the trees in an effort<br />

to enhance the growing friendship between the United States and Japan and also<br />

celebrate the continued close relationship between the two nations.<br />

In 1994 the Festival was expanded to two weeks to accommodate the many activities<br />

that happen during the trees’ blooming. Today the National Cherry Blossom Festival is<br />

coordinated by the National Cherry Blossom Festival, Inc., an umbrella organization<br />

consisting of representatives of business, civic, and governmental organizations.<br />

More than 700,000 people visit Washington each year to admire the blossoming cherry<br />

trees that herald the beginning of spring in the nation’s capital.<br />

This year’s Festival (100th Anniversary of the Gift of Trees) will be March 31 – April 15;<br />

with the parade on Saturday, April 14. (www.nationalcherryblossomfestival.org)<br />



Franklin Delano Roosevelt Memorial<br />

Located along the famous Cherry Tree Walk on the western edge of the Tidal Basin near<br />

the National Mall, this is a memorial not only to FDR, but also to the era he represents.<br />

The memorial traces twelve years of American history through a sequence of four outdoor<br />

rooms - each one devoted to one of FDR’s terms of office. Sculptures inspired by<br />

photographs depict the 32nd president: a 10-foot statue shows him in a wheeled chair;<br />

a bas-relief depicts him riding in a car during his first inaugural. At the very beginning<br />

of the memorial in a prologue room there is a statue with FDR seated in a wheelchair<br />

much like the one he actually used.<br />

Jefferson Memorial<br />

This presidential memorial is dedicated to Thomas Jefferson, an American founding<br />

father and the third president of the United States. The neoclassical building was<br />

designed by John Russell Pope. Construction began in 1939, the building was completed<br />

in 1943, and the bronze statue of Jefferson was added in 1947. When completed,<br />

the memorial occupied one of the last significant sites left in the city. In 2007, it was<br />

ranked fourth on the List of America’s Favorite Architecture by the American Institute<br />

of Architects.<br />

Smithsonian<br />

This is an educational foundation chartered by Congress in 1846 that maintains most of<br />

the nation’s official museums and galleries in Washington, D.C. The U.S. government<br />

partially funds the Smithsonian, thus making its collections open to the public free of<br />

charge. The most visited of the Smithsonian museums in 2007 was the National Museum<br />

of Natural History located on the National Mall. Other Smithsonian Institution<br />

museums and galleries located on the mall are: the National Air and Space Museum;<br />

the National Museum of African Art; the National Museum of American History; the<br />

National Museum of the American Indian; the Sackler and Freer galleries, which both<br />

focus on Asian art and culture; the Hirshhorn Museum and Sculpture Garden; the Arts<br />

and Industries Building; the S. Dillon Ripley Center; and the Smithsonian Institution<br />

Building (also known as “The Castle”), which serves as the institution’s headquarters.<br />

The Smithsonian American Art Museum (formerly known as the National Museum of<br />

American Art) and the National Portrait Gallery are located in the same building, the<br />

Donald W. Reynolds Center, near Washington’s Chinatown. The Reynolds Center is<br />

also known as the Old Patent Office Building. The Renwick Gallery is officially part of<br />

the Smithsonian American Art Museum but is located in a separate building near the<br />

White House. Other Smithsonian museums and galleries include: the Anacostia Community<br />

Museum in Southeast Washington; the National Postal Museum near Union<br />

Station; and the National Zoo in Woodley Park.<br />

National Gallery of Art<br />

The National Gallery is located on the National Mall near the Capitol, but is not a part<br />

of the Smithsonian Institution. It is instead wholly owned by the U.S. government;<br />

thus admission to the gallery is free. The gallery’s West Building features the nation’s<br />

collection of American and European art through the 19th century. The East Building,<br />

designed by architect I. M. Pei, features works of modern art. The Smithsonian American<br />

Art Museum and the National Portrait Gallery are often confused with the National<br />

Gallery of Art when they are in fact entirely separate institutions. The National Building<br />

Museum occupies the former Pension Building located near Judiciary Square, and was<br />

chartered by Congress as a private institution to host exhibits on architecture, urban<br />

planning, and design. There are many private art museums in the District of Columbia,<br />

which house major collections and exhibits open to the public such as: the National Museum<br />

of Women in the Arts; the Corcoran Gallery of Art, the largest private museum in<br />

Washington; and the Phillips Collection in Dupont Circle, the first museum of modern<br />

art in the United States. Other private museums in Washington include the Newseum,<br />

the International Spy Museum, the National Geographic Society Museum, and the<br />

Marian Koshland Science Museum. The United States Holocaust Memorial Museum<br />

located near the National Mall maintains exhibits, documentation, and artifacts related<br />

to the Holocaust.<br />

Performing Arts and Music<br />

Washington, D.C. is a national center for the arts. The John F. Kennedy Center for the<br />

Performing Arts, which is located along the Potomac River, is home to the National<br />

Symphony Orchestra, the Washington National Opera, and the Washington Ballet. The<br />

Kennedy Center Honors are awarded each year to those in the performing arts who<br />

have contributed greatly to the cultural life of the United States. The President and First<br />

Lady typically attend the Honors ceremony, as the First Lady is the honorary chair of<br />

the Kennedy Center Board of Trustees. Washington also has a local independent theater<br />

tradition. Institutions such as Arena Stage, the Shakespeare Theatre Company, and the<br />

Studio Theatre feature classic works and new American plays.<br />

The U Street Corridor in Northwest D.C., known as “Washington’s Black Broadway”,<br />

is home to institutions like Bohemian Caverns and the Lincoln Theatre, which hosted<br />

music legends such as Washington-native Duke Ellington, John Coltrane, and Miles<br />

Davis. Other jazz venues feature modern blues such as Madam’s Organ in Adams Morgan<br />

and Blues Alley in Georgetown. D.C. has its own native music genre called go-go,<br />

a post-funk, percussion-driven flavor of R&B that blends live sets with relentless dance<br />

rhythms. The most accomplished practitioner was D.C. band leader Chuck Brown, who<br />

brought go-go to the brink of national recognition with his 1979 LP Bustin’ Loose.<br />

Green Initiatives<br />

• 70 percent of land in Washington, DC is controlled by the National Park Service.<br />

There are 250,000 acres of parkland in the Greater Washington Metropolitan area.<br />

• In 2007, DC was named the most walkable city in the US in a study by the Brookings<br />

Institution.<br />

• In late 2006, City Council passed an initiative making the nation’s capital the first<br />

major city to require developers to adhere to guidelines established by the U.S. Green<br />

Building Council.<br />

• The Washington Nationals Ballpark is striving to be the country’s first green-certified<br />

ballpark.<br />

• The Walter E. Washington Convention Center is a green meeting facility, with<br />

earth-friendly features like low emission glass that controls heat gain and loss and<br />


maximizes natural lighting; energy-conserving heating, ventilation and air conditioning<br />

systems that operate in zones; high-efficiency lighting; automatic controls on<br />

restroom fixtures; plus recycling programs and easy public transportation access.<br />

• DC’s hotels have implemented green initiatives, including wind power, renewable<br />

energy credits, recycling and adopt-a-park programs with neighborhood green spaces.<br />

International DC<br />

• 84,000 DC residents (15%) speak a language other than English at home.<br />

• 74,000 DC residents (12%) are foreign-born.<br />

• The Greater Washington region is home to 400 international associations, 700 internationally<br />

owned companies and more than 150 embassies and international cultural<br />

centers.<br />



Platinum Sponsors<br />

Gold Sponsors<br />

Silver Sponsors<br />

Bronze Sponsor<br />

Supported By
