28th IEEE International Conference on Data Engineering - ICDE 2012
28th IEEE International Conference on Data Engineering (ICDE)
Washington, DC • April 1-5, 2012
Conference Program
COVER PHOTOS: Copyright © 2012 by Tasos Kementsietsidis
Table of Contents
Message from the ICDE 2012 Program Chairs and the General Chair
Conference Organization
Conference Venue
Program at a Glance
Session Contents
Keynotes
Seminars
Panels
Awards
Abstracts
Co-Located Workshops
Local Information
Message from the ICDE 2012 Program Chairs and the General Chair
Since 1984, ICDE has established itself as a premier forum in the area of data management,
providing a unique opportunity for database researchers, users, practitioners,
and developers to exchange new ideas. The 28th IEEE International Conference on
Data Engineering takes place in the city of Washington, United States, from April 1 to 5,
2012. We are proud to present its program in these proceedings.
Each of the main days of the conference starts out with a keynote by a distinguished
scientist: Serge Abiteboul from INRIA in France on April 2; Surajit Chaudhuri from
Microsoft Research in the United States on April 3; and Peter Druschel from the Max-Planck
Institute for Software Systems in Germany on April 4.
We thank all the authors who submitted their work to ICDE for making the conference
happen. We received 413 paper submissions for the research track, 22 submissions for
the industrial track, and 68 demo proposals. The program committee was organized
into fifteen topic-based tracks. Each track was headed by a vice-chair who formed a committee
to evaluate the papers assigned to that track. This resulted in a research program
committee consisting of 188 members for the research tracks, 12 members for the
industrial track, and 30 members for the demo track. The evaluation process consisted
of three distinct phases: initial reviews of the papers by PC members with some initial
discussions; author responses to these reviews; and then further discussion by the PC and
fine-tuning of the reviews.
The research program features 100 papers, the industrial program 9 papers, and the
demonstration program 28 demos. The conference program also includes 6 seminar
tutorials and one panel. As a feature of ICDE conferences in recent years, all papers are
presented at a poster session. Accompanying the main conference are seven workshops.
The success of ICDE 2012 is a result of collegial teamwork from many individuals who
worked tirelessly to make the conference a success. We thank Nico Bruno and Ken Ross
who served as Industrial Chairs; Christof Bornhoevd, Richard Goodwin, and Mirek Riedewald
who served as Demo Chairs; Aryya Gangopadhyay who served as Seminar Chair;
Michael Gertz and Alex Tuzhilin who served as Panel Chairs; Anupam Joshi and Sharad
Mehrotra who served as Workshop Chairs; and also the organizers of the accompanying
workshops. We also express our deep appreciation of the outstanding work put in over
many months by the organization team: Nabil Adam, Alex Brodsky and Vijay Atluri
served as general (vice-)chairs, Carlotta Domeniconi and Huzefa Rangwala were the
Local Organization and Sponsorship Chairs, Hui Xiong served as Finance Chair, Soon
Ae Chun served as Publicity Chair, Anastasios Kementsietsidis and Marcos Vaz Salles
as Proceedings Chairs, and Micah Sherr as Web Chair. We thank Carmen Saliba and
Alkenia Winston from the IEEE Computer Society’s Conference Support Services for
helping secure the various necessary contracts in a timely manner, and Beth Grohnke of
GMU’s Office of Event Management for helping with many local arrangement issues.
The Best Paper Award Committee included Minos Garofalakis (chair), Anthony Tung,
and Ugur Cetintemel. Without the contributions of all of these excellent conference officers,
this conference would not have been a success. We are also thankful to the many
student volunteers from George Mason University.
We also thank the Microsoft CMT Team and the IEEE Conference Publications Team
for their assistance and quick replies to our multitude of requests.
We also gratefully acknowledge the financial support of our sponsors: Microsoft and
the National Science Foundation as Platinum Sponsors, EMC and Greenplum as Gold
Sponsors, HP and IBM Research as Silver Sponsors, and Google as a Bronze Sponsor.
Finally, we thank all the authors, presenters, and participants of the conference. We
hope that all of you enjoy the conference!
ICDE 2012 PC Chairs
Johannes Gehrke (Cornell University, USA)
Beng Chin Ooi (National University of Singapore, Singapore)
Evaggelia Pitoura (University of Ioannina, Greece)
ICDE 2012 General Chair
X. Sean Wang (Fudan University, China)
Conference Organization
Organizing Committee
General Chairs
X. Sean Wang (Fudan University)
Nabil R. Adam (US DHS S&T, Rutgers University)
General Vice Chairs
Alex Brodsky (George Mason University)
Vijay Atluri (Rutgers University)
Program Chairs
Johannes Gehrke (Cornell University)
Beng Chin Ooi (National University of Singapore)
Evaggelia Pitoura (University of Ioannina)
Industrial Program Chairs
Nicolas Bruno (Microsoft Research)
Liang-Jie Zhang (IBM Research)
Kenneth Ross (Columbia University)
Seminar/Tutorial Chair
Aryya Gangopadhyay (UMBC)
Workshop Chairs
Sharad Mehrotra (Univ. of California, Irvine)
Anupam Joshi (UMBC)
Panel Chairs
Alex Tuzhilin (New York University)
Michael Gertz (University of Heidelberg)
Poster Chairs
Jaideep Vaidya (Rutgers University)
Zachary Ives (University of Pennsylvania)
Demo Chairs
Christof Bornhoevd (SAP Research)
Richard Goodwin (IBM)
Mirek Riedewald (Northeastern University)
Proceedings Chairs
Anastasios Kementsietsidis (IBM)
Marcos Vaz Salles (University of Copenhagen)
Local Organization Chairs and Sponsorship Chairs
Carlotta Domeniconi (George Mason University)
Huzefa Rangwala (George Mason University)
Finance Chair
Hui Xiong (Rutgers University)
Publicity Chair
Soon Ae Chun (City University of New York)
Web Chair
Micah Sherr (Georgetown University)
Program Committee
Program Committee Area Vice Chairs
Cloud, data warehousing, and large data
Volker Markl (TU Berlin, Germany)
Data Integration, metadata management, interoperability
Erhard Rahm (Univ. of Leipzig, Germany)
Data mining and knowledge discovery
Anthony Tung (National University of Singapore, Singapore)
Distributed, peer-to-peer, grid, and mobile data management
Aoying Zhou (East China Normal University, China)
Indexing and storage
Lei Chen (University of Science and Technology, Hongkong)
Privacy and security
Elena Ferrari (University of Insubria, Italy)
Query processing and query optimization
Kaushik Chakrabarti (Microsoft Research, USA)
Scientific data and data visualization
Zachary Ives (University of Pennsylvania, USA)
Semistructured data, XML
Ioana Manolescu (INRIA, France)
Social networks, web, and personal information management
Aris Gionis (Yahoo! Research, Spain)
Spatial, temporal, and multimedia data
Heng Tao Shen (University of Queensland, Australia)
Streams, sensor networks, and complex events processing
Ugur Cetintemel (Brown University, USA)
Systems, performance, and transaction management
Bettina Kemme (McGill University, Canada)
Text, graphs, and search
Venkatesh Ganti (Google)
Uncertain and probabilistic data
Minos Garofalakis (Technical University of Crete, Greece)
Research Program Committee Members<br />
Yanif Ahmad, Johns Hopkins University
Aris Anagnostopoulos, Sapienza University of Rome
Walid Aref, Purdue University
Ismail Ari, Ozyegin University
Soeren Auer, Leipzig School of Media
Shivnath Babu, Duke University
Roger Barga, Microsoft
Zohra Bellahsene, University of Montpellier II
Elisa Bertino, Purdue University
Claudio Bettini, University of Milan
Michael Bohlen, University of Zurich
Paolo Boldi, University of Milan
Francesco Bonchi, Yahoo! Research
Peter Boncz, CWI
Angela Bonifati, ICAR-CNR, Italy
Vinayak Borkar, University of California, Irvine
Christof Bornhoevd, SAP
Randal Burns, Johns Hopkins University
Andrea Cali, University of Oxford
Selcuk Candan, Arizona State University
Barbara Carminati, University of Insubria, Italy
Deepayan Chakrabarti, Yahoo! Research
Chee Yong Chan, National University of Singapore
Badrish Chandramouli, Microsoft
Gang Chen, Zhejiang University, China
Shimin Chen, Intel Labs Pittsburgh
Su Chen, National University of Singapore
Yi Chen, Arizona State University
Reynold Cheng, University of Hong-Kong
Sarah Cohen-Boulakia, LRI Orsay
Gao Cong, Nanyang Technological University, Singapore
Stefan Conrad, University of Dortmund
Mariano Consens, University of Toronto
Graham Cormode, AT&T Research
Isabel Cruz, University of Illinois at Chicago
Bin Cui, Beijing University, China
Alfredo Cuzzocrea, ICAR-CNR and University of Calabria, Italy
Colazzo Dario, University Paris Sud
Gautam Das, University of Texas-Arlington
Anish Das Sarma, Google
Khuzaima Daudjee, University of Waterloo
Antonios Deligiannakis, Technical University of Crete
Stefan Dessloch, University of Kaiserslautern
Zhiming Ding, Institute of Software, Chinese Academy of Science
Jens Dittrich, Universitaet Saarland
Anhai Doan, University of Wisconsin
Eduard Dragut, Purdue University
Sameh Elnikety, Microsoft
Vuk Ercegovac, IBM Almaden
Wenfei Fan, University of Edinburgh
Alan Fekete, University of Sydney
Alvaro Fernandes, University of Manchester
Johann Christoph Freytag, University of Berlin
Avigdor Gal, Technion
Helena Galhardas, Instituto Superior Tecnico, Portugal
Tingjian Ge, University of Kentucky
Bugra Gedik, IBM
Floris Geerts, University of Edinburgh
Sreenivas Gollapudi, Microsoft Research
Le Gruenwald, University of Oklahoma
Torsten Grust, University of Tuebingen
Amarnath Gupta, San Diego Supercomputing Center
Peter Haas, IBM Almaden
Jeff Hammerbacher, Cloudera
Wook-Shin Han, Korean National University
Oktie Hassanzadeh, University of Toronto
Magnus Lie Hetland, NTNU, Norway
Vagelis Hristidis, Florida International University
Zi Huang, University of Queensland
Seung-won Hwang, POSTECH, Korea
Stratos Idreos, CWI
Yoshiharu Ishikawa, Nagoya University
Ryan Johnson, University of Toronto
Theodore Johnson, AT&T Research
Panos Kalnis, King Abdullah University of Science and Technology (KAUST)
Murat Kantarcioglu, University of Texas-Dallas
Panagiotis Karras, National University of Singapore
Alfons Kemper, TU Muenchen
Eamonn Keogh, University of California, Riverside
Christoph Koch, EPFL
George Kollios, Boston University
Nick Koudas, University of Toronto
Tim Kraska, University of California Berkeley
Wang-Chien Lee, Penn State University
Ulf Leser, Humboldt University Berlin
Jure Leskovec, Stanford
Guiping Li, Renmin University of China
Feifei Li, Florida State University
Guoliang Li, Tsinghua University
Ninghui Li, Purdue University
Xiang Lian, Hong Kong University of Science and Technology
Xueming Lin, University of South Wales
Kun Liu, Yahoo! Labs
Ling Liu, Georgia Tech
Eric Lo, Hong Kong Polytechnic University
Boon Thau Loo, University of Pennsylvania
Alexander Losup, TU Delft
Hua Lu, Aalborg University
Bertram Ludaescher, University of California Davis
Bradley Malin, Vanderbilt University
Nikos Mamoulis, The University of Hong Kong
Stefan Manegold, CWI
Sebastian Maneth, NICTA, Australia
Ioana Manolescu, Inria
Alexandra Meliou, University of Washington
Paolo Missier, Newcastle University
Mohamed F. Mokbel, University of Minnesota
Mirella Moro, Universidade Federal de Minas Gerais, Brazil
Vivek Narasayya, Microsoft Research
Thomas Neumann, Technical University Munich
Silvia Nittel, University of Maine
Dan Olteanu, Oxford University
Tamer Ozsu, University of Waterloo
Thanasis Papaioannou, EPFL
Marta Patino-Martinez, Technical University of Madrid
Glenn Paulley, Sybase
Dino Pedreschi, University of Pisa
Jian Pei, Simon Fraser University
Peter Pietzuch, Imperial College London
Neoklis Polyzotis, University of California Santa Cruz
Rachel Pottinger, UBC
Sunil Prabhakar, Purdue University
Weining Qian, East China Normal University
Christoph Quix, RWTH Aachen
Ravi Ramamurthy, Microsoft
Vijayshankar Raman, IBM
Vibhor Rastogi, Yahoo! Research
Indrakshi Ray, Colorado State University
Christopher Re, University of Wisconsin-Madison
Matthias Renz, Ludwig-Maximilians-University Munich
Marcos Vaz Salles, University of Copenhagen
Jagan Sankaranarayanan, NEC Labs America
Ralf Schenkel, Saarland University
Heiko Schuldt, University of Basel
Sudipta Sengupta, Microsoft Research
Jayavel Shanmugasundaram, Google
Jie Shao, University of Melbourne
Jialie Shen, Singapore Management University
Elaine Shi, UC Berkeley
Kyuseok Shim, Seoul National University
Pavel Shvaiko, Informatica Trentina
Claudio Silva, University of Utah
Mauro Sozio, Max Planck Institute for Computer Science, Germany
Divesh Srivastava, AT&T Research
Jessica Staddon, Google
S Sudarshan, IIT Bombay
Torsten Suel, Polytechnic Institute of NYU
Kian-Lee Tan, National University of Singapore
Yufei Tao, Chinese University of Hong Kong
James Terwilliger, Microsoft
Evimaria Terzi, Boston University
Jens Teubner, ETH Zurich
Hannu Toivonen, University of Helsinki
Panayiotis Tsaparas, Microsoft Research
Antti Ukkonen, Yahoo! Research
Shivakumar Vaithyanathan, IBM Almaden
Vasilis Vassalos, Athens University of Economics and Business
Yannis Velegrakis, University of Trento
Quang Hieu Vu, EBTIC
Daisy Zhe Wang, University of Florida
Guoren Wang, Northeastern University of China
Haixun Wang, Microsoft Research
Jianyong Wang, Tsinghua University
Junhu Wang, Griffith University, Australia
Wei Wang, UNC
Kyu-Young Whang, KAIST
Andrew Witkowski, Oracle
Raymond Wong, Hong Kong University of Science and Technology
Sai Wu, National University of Singapore
Tianyi Wu, Microsoft
Xiaokui Xiao, Nanyang Technological University, Singapore
Dong Xin, Google
Jianliang Xu, Hong Kong Baptist University
Linhao Xu, IBM Research
Xifeng Yan, UCSB
Bin Yang, Max-Planck-Institut für Informatik
Jun Yang, Duke University
Linjun Yang, Microsoft Research Asia
Ke Yi, Hong-Kong University of Science and Technology
Ge Yu, Northeastern University, China
Hwanjo Yu, POSTECH
Carlo Zaniolo, UCLA
Dongxiang Zhang, National University of Singapore
Rui Zhang, University of Melbourne
Zhenjie Zhang, NUS
Minqi Zhou, East China Normal University
Xiangmin Zhou, CSIRO
Freida Zhu, Singapore Management University
Industrial Program Committee Members<br />
Bishwaranjan Bhattacharjee, IBM
Philippe Bonnet, IT University of Copenhagen
John Cieslewicz, Aster Data
Amol Deshpande, University of Maryland
Cesar Galindo-Legaria, Microsoft
Leo Giakoumakis, Microsoft
Masaru Kitsuregawa, The University of Tokyo
Harumi Kuno, HP
Jun Rao, LinkedIn
Rajeev Rastogi, Yahoo!
Florian Waas, EMC
Mohammed Zait, Oracle
Demo Program Committee Members<br />
Sihem Amer-Yahia, Qatar Computing Research Institute
Arvind Arasu, Microsoft Research
Sunil Arvindam, SAP Research, India
Magdalena Balazinska, University of Washington
Fabio Casati, University of Trento, Italy
Malu Castellanos, HP Labs, USA
Mariano Cilia, Intel Corporation, Argentina
Brian F Cooper, Google
Adina Crainiceanu, US Naval Academy
Abhinandan Das, Google
Alin Dobra, University of Florida
Javier Garcia-Garcia, UNAM University, Mexico
Pablo Guerrero, TU Darmstadt, Germany
Melanie Herschel, Tubingen University
Christian Konig, Microsoft Research
Georgia Koutrika, IBM Almaden Research Center
Wolfgang Lehner, TU Dresden, Germany
Feifei Li, Florida State University
Ashwin Machanavajjhala, Yahoo Research
Thomas Neumann, TU Munchen
Dan Olteanu, University of Oxford
Carlos Ordonez, University of Houston
Peter Pietzuch, Imperial College London
Lin Qiao, IBM Almaden
Berthold Reinwald, IBM Almaden, USA
Vladislav Shkapenyuk, ATT Research
Adam Silberstein, Yahoo Research
Alkis Simitsis, HP Labs
Ioana R Stanoi, IBM Almaden
Ming-Chuan Wu, Microsoft, USA
External Reviewers<br />
Albert Angel
Pantelis Aravogliadis
Vassilis Athitsos
Evandrino Barros
Nicole Bidoit
Nicolas Bonvin
Daniele Braga
Lorenz Buehmann
Ruichu Cai
Xin Cao
Bogdan Cautis
Yi-Ling Chen
Songting Chen
Shiwen Cheng
Fei Chiang
Byron Choi
Juan Da Cruz Pinto
Maria Daltayanni
Mahashweta Das
David DeHaan
Bolin Ding
Marius Dumitru
Santiago Ezcurra
Ju Fan
Wei Feng
Chuancong Gao
Shen Ge
Haris Georgiadis
Christan Grant
Nitin Gupta
Yeye He
Arvid Heise
Haibo Hu
Heng Huang
Lili Jiang
Xin Jin
Alekh Jindal
Matti Järvisalo
Abhijith Kashyap
Asterios Katsifodimos
Batya Kenig
Arijit Khan
Julien Leblay
Jae-Gil Lee
Aurelien Lemay
Jianxin Li
Nan Li
Xingjie Liu
Shuai Ma
Vincenzo Maltese
Bruno Martins
Michael Mathioudakis
Manuel Mayr
Giansalvatore Mecca
Gengxin Miao
Pablo Michelis
Nabeel Mohamed
Miyuki Nakano
Akash Nanavati
Axel Ngonga
Anisoara Nica
Bart Niechweij
Tomasz Nykiel
Matteo Palmonari
Panagiotis Papadimitriou
Charalampos Papamanthou
Xu Pu
Jianzhong Qi
Hongda Ren
Astrid Rheinlaender
Daniele Riboni
Jan Rittinger
Senjuti Basu Roy
Eduardo Ruiz
Michael Rys
Tomer Sagi
Simonas Saltenis
Carlo Sartiani
Jörg Schad
Stefan Schuh
Pierre Senellart
Chih-Ya Shen
Reza Sherkat
Kelvin Sim
Guojie Son
Claus Stadler
Johannes Starlinger
Yizhou Sun
Andrej Taliun
Takayuki Tamura
Nan Tang
Saravanan Thirumuruganathan
Andreas Thor
Xinmei Tian
Masashi Toyoda
Frederico Ulliana
Jörg Unbehauen
Jiannan Wang
Gerhard Weikum
Zeyi Wen
Raymond Chi-Wing Wong
Yinghui Wu
Mao Ye
Peifeng Ying
Man Lung Yiu
Wenyuan Yu
Ning Zhang
Qijun Zhu
Bo Zong
Conference Venue
The conference will take place in the Renaissance Arlington Capital View Hotel
located at 2800 South Potomac Avenue, Arlington, Virginia 22202 USA. If using a
GPS navigator, you may try to search for the address 2899 Jefferson Davis Highway,
Arlington, VA 22202 as an alternative address for locating the destination.
The nearest Metro station is Crystal City (Blue and Yellow Lines).
To take the Metro to the hotel, get off at the Crystal City Metro station. A complimentary
hotel shuttle runs to and from the Crystal City Metro station every 20 minutes between
7am and 11pm (call 1-703-413-1300 if there is a problem).
A complimentary hotel shuttle runs to and from Reagan (DCA) airport every (20) thirty minutes
between 5am and 11pm. Pick-up and drop-off are at Terminal A (hotel shuttle area) or Gates 5 and
9 on Level 1 of Terminals B and C.
[Map labels: Nation’s Capital (Washington, DC); ICDE Hotel; Historic Old Towne (Alexandria, VA)]
The conference will be held on the second floor of the hotel.
[Floor plan labels: Registration; Internet Room]
Program at a Glance<br />
8AM Breakfast each day (Prefunction Area)

Sunday, April 1: Workshops
9AM-10AM
DGSS (Studio F), SMDB (Studio B), STIR (Studio D), and DESWEB (Studio E)
10AM-10:30AM
Coffee Break
10:30AM-noon
DGSS (Studio F), SMDB (Studio B), STIR (Studio D), and DESWEB (Studio E)
noon-2PM
Lunch Break (on your own)
2PM-3:30PM
DGSS (Studio F), SMDB (Studio B), STIR (Studio D), and DESWEB (Studio E)
3:30PM-4PM
Coffee Break
4PM-5:30PM
DGSS (Studio F), SMDB (Studio B), STIR (Studio D), and DESWEB (Studio E)
Afternoon & evening
Reception 5:30-8 (Salon 4567)

Monday, April 2
9AM-10AM
Keynote 1 (Salon 4567) Serge Abiteboul
10AM-10:30AM
Coffee Break
10:30AM-noon
Session 1 (Studio F) Privacy
Session 2 (Studio B) Web 2.0 Applications
Session 3 (Studio C) Storage Management
Session 4 (Studio D) Data Streams Processing
Seminar 1 (Salon 123)
Demo Group 1 (Studio E)
noon-2PM
Business Lunch and Award Ceremony (Salon 4567)
2PM-3:30PM
Session 5 (Studio F) Graphs
Session 6 (Studio B) Uncertain and Probabilistic Databases
Session 7 (Studio C) Data Integration and Extraction
Session 8 (Studio D) Spatio-Temporal Data Management
Seminar 2 (Salon 123)
Demo Group 2 (Studio E)
3:30PM-4PM
Coffee Break
4PM-5:30PM
Session 9 (Studio F) Query Processing
Session 10 (Studio B) Location Aware Data Processing
Session 11 (Studio C) Map-Reduce based Data Processing
Session 12 (Studio D) Social Media
Seminar 3 (Salon 123)
Demo Group 3 (Studio E)
Afternoon & evening
NSF ICDE 2012 Career Panel 7:30-9PM (Salon 123)

Tuesday, April 3
9AM-10AM
Keynote 2 (Salon 4567) Surajit Chaudhuri
10AM-10:30AM
Coffee Break
10:30AM-noon
Session 13 (Studio F) P2P and Distributed Processing
Session 14 (Studio B) XML and RDF Data Management
Session 15 (Studio C) Performance
Industrial Session 1 (Studio D) Support for Large Scale Data Analytics
Seminar 4 (Salon 123)
Demo Group 4 (Studio E)
noon-2PM
Funders session with lunch (Salon 4567)
2PM-3:30PM
Session 16 (Studio F) Data Extraction and Quality
Session 17 (Studio B) Top-K Processing
Industrial Session 2 (Studio C) Evolving Platforms for New Applications
Seminar 5 (Studio 123)
Panel (Studio D) The Future<br />
of Scientific <strong>Data</strong> Bases<br />
Demo Group 1 (Studio E)<br />
Coffee Break<br />
Posters (Sal<strong>on</strong> 4567)<br />
cruiSe and Banquet<br />
5:30PM (Bus leaves hotel)<br />
WedneSday, april 4<br />
Keynote 3 (Sal<strong>on</strong> 4567)<br />
Peter Druschel<br />
Coffee Break<br />
Sessi<strong>on</strong> 18 (Studio F)<br />
Similarity<br />
Sessi<strong>on</strong> 19 (Studio B)<br />
Text and Strings<br />
Sessi<strong>on</strong> 20 (Studio C)<br />
Query Processing II<br />
Industrial Sessi<strong>on</strong> 3<br />
(Studio D) Indexing,<br />
Updates and Processing<br />
Seminar 6 (Sal<strong>on</strong> 123)<br />
Demo Group 2 (Studio E)<br />
Lunch (Sal<strong>on</strong> 4567)<br />
Sessi<strong>on</strong> 21 (Studio F)<br />
<strong>Data</strong> Mining<br />
Sessi<strong>on</strong> 22 (Studio B)<br />
Scientific <strong>Data</strong>, Analysis<br />
and Visualizati<strong>on</strong><br />
Sessi<strong>on</strong> 23 (Studio D)<br />
Similarity Search and<br />
Detecti<strong>on</strong><br />
Demo Group 3 (Studio E)<br />
Coffee Break<br />
Sessi<strong>on</strong> 24 (Studio B)<br />
Sensors Network and<br />
Trajectory<br />
Sessi<strong>on</strong> 25 (Studio D)<br />
Error Reducti<strong>on</strong> and<br />
<strong>Data</strong> Security<br />
Demo Group 4 (Studio E)<br />
Program at a Glance<br />
tHurSday, april 5<br />
WOrKSHOpS<br />
DMC (Studio B),<br />
GDM (Studio D), and<br />
SDMSM (Studio F)<br />
Coffee Break<br />
DMC (Studio B),<br />
GDM (Studio D), and<br />
SDMSM (Studio F)<br />
Lunch Break<br />
DMC (Studio B),<br />
GDM (Studio D), and<br />
SDMSM (Studio F)<br />
Coffee Break<br />
DMC (Studio B),<br />
GDM (Studio D), and<br />
SDMSM (Studio F)<br />
Session Contents<br />
Sunday, April 1<br />
8AM - 9AM Breakfast (Prefunction)<br />
9AM - 5:30PM Workshops<br />
Studio F: Data-Driven Decision Guidance and Support Systems (DGSS)<br />
Studio B: Self-Managing Database Systems (SMDB)<br />
Studio D: Spatio Temporal Data Integration and Retrieval (STIR)<br />
Studio E: Data Engineering Meets the Semantic Web (DESWEB)<br />
5:30PM - 8PM Conference Reception (Salon 4567)<br />
Monday, April 2<br />
8AM - 9AM Breakfast (Prefunction)<br />
9AM - 10AM Keynote 1 (Salon 4567): Serge Abiteboul - Viewing the Web as a Distributed Knowledge Base<br />
Session Chair: Evaggelia Pitoura<br />
10AM - 10:30AM Coffee Break<br />
10:30AM - Noon Sessions 1-4, Seminar 1, Demo Group 1<br />
Session 1: Privacy (Studio F)<br />
Session Chair: Murat Kantarcioglu<br />
Privacy in Social Networks: How Risky is Your Social Graph?<br />
Cuneyt Gurcan Akcora (University of Insubria)<br />
Barbara Carminati (University of Insubria)<br />
Elena Ferrari (University of Insubria)<br />
Differentially Private Spatial Decompositions<br />
Graham Cormode (AT&T Labs – Research)<br />
Cecilia Procopiuc (AT&T Labs – Research)<br />
Entong Shen (North Carolina State University)<br />
Divesh Srivastava (AT&T Labs – Research)<br />
Ting Yu (North Carolina State University)<br />
Differentially Private Histogram Publication<br />
Jia Xu (Northeastern University, China)<br />
Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />
Xiaokui Xiao (Nanyang Technological University)<br />
Yin Yang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />
Ge Yu (Northeastern University, China)<br />
Privacy-Preserving and Content-Protecting Location Based Queries<br />
Russell Paulet (Victoria University)<br />
Md. Golam Kaosar (Victoria University)<br />
Xun Yi (Victoria University)<br />
Elisa Bertino (Purdue University)<br />
Session 2: Web 2.0 Applications (Studio B)<br />
Session Chair: Kyuseok Shim<br />
GeoFeed: A Location-Aware News Feed<br />
Jie Bao (University of Minnesota at Twin Cities)<br />
Mohamed F. Mokbel (University of Minnesota at Twin Cities)<br />
Chi-Yin Chow (City University of Hong Kong)<br />
Entity Search Strategies for Mashup Applications<br />
Stefan Endrullis (University of Leipzig)<br />
Andreas Thor (University of Leipzig)<br />
Erhard Rahm (University of Leipzig)<br />
CI-Rank: Ranking Keyword Search Results Based on Collective Importance<br />
Xiaohui Yu (York University & Shandong University)<br />
Huxia Shi (York University)<br />
Temporal Analytics on Big Data for Web Advertising<br />
Badrish Chandramouli (Microsoft Research)<br />
Jonathan Goldstein (Microsoft Corporation)<br />
Songyun Duan (IBM T. J. Watson Research Center)<br />
Session 3: Storage Management (Studio C)<br />
Session Chair: Alfons Kemper<br />
Lookup Tables: Fine-Grained Partitioning for Distributed Databases<br />
Aubrey L. Tatarowicz (MIT)<br />
Carlo Curino (MIT)<br />
Evan P. C. Jones (MIT)<br />
Sam Madden (MIT)<br />
Temporal Support for Persistent Stored Modules<br />
Richard T. Snodgrass (University of Arizona)<br />
Dengfeng Gao (IBM Silicon Valley Lab)<br />
Rui Zhang (University of Arizona)<br />
Stephen W. Thomas (Queen’s University, Kingston)<br />
Energy Efficient Storage Management Cooperated with Large Data Intensive Applications<br />
Norifumi Nishikawa (The University of Tokyo)<br />
Miyuki Nakano (The University of Tokyo)<br />
Masaru Kitsuregawa (The University of Tokyo)<br />
ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression<br />
Eric R. Schendel (North Carolina State University)<br />
Ye Jin (North Carolina State University)<br />
Neil Shah (North Carolina State University)<br />
Jackie Chen (Sandia National Laboratory)<br />
C.S. Chang (Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA)<br />
Seung-Hoe Ku (New York University)<br />
Stephane Ethier (Princeton Plasma Physics Laboratory)<br />
Scott Klasky (Oak Ridge National Laboratory)<br />
Robert Latham (Argonne National Laboratory)<br />
Robert Ross (Argonne National Laboratory)<br />
Nagiza F. Samatova (North Carolina State University & Oak Ridge National Laboratory)<br />
Session 4: Data Streams Processing (Studio D)<br />
Session Chair: Bugra Gedik<br />
Physically Independent Stream Merging<br />
Badrish Chandramouli (Microsoft Research)<br />
David Maier (Portland State University)<br />
Jonathan Goldstein (Microsoft Corporation)<br />
On Computing Correlated Aggregates over a Data Stream<br />
Srikanta Tirthapura (Iowa State University)<br />
David P. Woodruff (IBM Almaden Research Center)<br />
Accuracy-Aware Uncertain Stream Databases<br />
Tingjian Ge (University of Kentucky)<br />
Fujun Liu (University of Kentucky)<br />
On Discovery of Traveling Companions from Streaming Trajectories<br />
Lu-An Tang (UIUC)<br />
Yu Zheng (MSRA)<br />
Jing Yuan (MSRA)<br />
Jiawei Han (UIUC)<br />
Alice Leung (BBN)<br />
Chih-Chieh Hung (Yahoo!)<br />
Wen-Chih Peng (NCTU)<br />
Seminar 1 (Salon 123)<br />
Data Management Issues on the Semantic Web<br />
Oktie Hassanzadeh (University of Toronto & IBM Research)<br />
Anastasios Kementsietsidis (IBM Research)<br />
Yannis Velegrakis (University of Trento)<br />
Demo Group 1 (Studio E)<br />
SMIX Live – A Self-Managing Index Infrastructure for Dynamic Workloads<br />
Thomas Kissinger (Dresden University of Technology)<br />
Hannes Voigt (Dresden University of Technology)<br />
Wolfgang Lehner (Dresden University of Technology)<br />
Multi-Query Stream Processing on FPGAs<br />
Mohammad Sadoghi (University of Toronto)<br />
Rohan Palaniappan (University of Toronto)<br />
Rija Javed (University of Toronto)<br />
Naif Tarafdar (University of Toronto)<br />
Harsh Singh (University of Toronto)<br />
Hans-Arno Jacobsen (University of Toronto)<br />
EUDEMON: A System for Online Video Frame Copy Detection by Earth Mover Distance<br />
Jia Xu (Northeastern University, China)<br />
Qiushi Bai (Northeastern University, China)<br />
Yu Gu (Northeastern University, China)<br />
Anthony Tung (National University of Singapore)<br />
Guoren Wang (Northeastern University, China)<br />
Ge Yu (Northeastern University, China)<br />
Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />
A Dataset Search Engine for the Research Document Corpus<br />
Meiyu Lu (National Univ. of Singapore)<br />
Srinivas Bangalore (AT&T Research Labs)<br />
Graham Cormode (AT&T Labs – Research)<br />
Marios Hadjieleftheriou (AT&T Labs – Research)<br />
Divesh Srivastava (AT&T Labs – Research)<br />
AskFuzzy: Attractive Visual Fuzzy Query Builder<br />
Keivan Kianmehr (University of Western Ontario)<br />
Negar Koochakzadeh (University of Calgary)<br />
Reda Alhajj (University of Calgary)<br />
F2DB: The Flash-Forward Database System<br />
Ulrike Fischer (Dresden University of Technology)<br />
Frank Rosenthal (Dresden University of Technology)<br />
Wolfgang Lehner (Dresden University of Technology)<br />
Provenance-Based Debugging and Drill-Down in Data-Oriented Workflows<br />
Robert Ikeda (Stanford University)<br />
Junsang Cho (Stanford University)<br />
Charlie Fang (Stanford University)<br />
Semih Salihoglu (Stanford University)<br />
Satoshi Torikai (Stanford University)<br />
Jennifer Widom (Stanford University)<br />
Noon - 2PM Business Lunch & Award Ceremony (Salon 4567)<br />
2PM - 3:30PM Sessions 5-8, Seminar 2, Demo Group 2<br />
Session 5: Graphs (Studio F)<br />
Session Chair: Sameh Elnikety<br />
Iterative Graph Feature Mining for Graph Indexing<br />
Dayu Yuan (Penn State University)<br />
Prasenjit Mitra (Penn State University)<br />
Huiwen Yu (Penn State University)<br />
C. Lee Giles (Penn State University)<br />
An Efficient Graph Indexing Method<br />
Xiaoli Wang (National University of Singapore)<br />
Xiaofeng Ding (Huazhong University of Science and Technology)<br />
Anthony K.H. Tung (National University of Singapore)<br />
Shanshan Ying (National University of Singapore)<br />
Hai Jin (Huazhong University of Science and Technology)<br />
PRAGUE: Towards Blending Practical Visual Subgraph Query Formulation and Query Processing<br />
Changjiu Jin (Nanyang Technological University)<br />
Sourav S Bhowmick (Nanyang Technological Univ)<br />
Byron Choi (Hong Kong Baptist University)<br />
Shuigeng Zhou (Fudan University)<br />
Ego-centric Graph Pattern Census<br />
Walaa Eldin Moustafa (University of Maryland, College Park)<br />
Amol Deshpande (University of Maryland, College Park)<br />
Lise Getoor (University of Maryland, College Park)<br />
Session 6: Uncertain and Probabilistic Databases (Studio B)<br />
Session Chair: Elena Ferrari<br />
Searching Uncertain Data Represented by Non-Axis Parallel Gaussian Mixture Models<br />
Katrin Haegler (University of Munich)<br />
Frank Fiedler (University of Munich)<br />
Christian Boehm (University of Munich)<br />
Aggregate Query Answering on Possibilistic Data with Cardinality Constraints<br />
Graham Cormode (AT&T Labs – Research)<br />
Entong Shen (North Carolina State University)<br />
Divesh Srivastava (AT&T Labs – Research)<br />
Ting Yu (North Carolina State University)<br />
Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data<br />
Yongxin Tong (Hong Kong University of Science and Technology)<br />
Lei Chen (Hong Kong University of Science and Technology)<br />
Bolin Ding (University of Illinois at Urbana-Champaign)<br />
Ranking Query Results in Probabilistic Databases: Complexity and Efficient Algorithms<br />
Dan Olteanu (University of Oxford)<br />
Hongkai Wen (University of Oxford)<br />
Session 7: Data Integration and Extraction (Studio C)<br />
Session Chair: Daisy Zhe Wang<br />
Joint Entity Resolution<br />
Steven Whang (Stanford University)<br />
Hector Garcia-Molina (Stanford University)<br />
A Self-Configuring Schema Matching System<br />
Eric Peukert (SAP Research Dresden)<br />
Julian Eberius (Dresden University of Technology)<br />
Erhard Rahm (University of Leipzig)<br />
Incremental Detection of Inconsistencies in Distributed Data<br />
Wenfei Fan (University of Edinburgh)<br />
Jianzhong Li (Harbin Institute of Technology)<br />
Nan Tang (University of Edinburgh & Qatar Computing Research Institute)<br />
Wenyuan Yu (University of Edinburgh)<br />
Recomputing Materialized Instances after Changes to Mappings and Data<br />
Todd J. Green (University of California, Davis)<br />
Zachary G. Ives (University of Pennsylvania)<br />
Session 8: Spatio-Temporal Data Management (Studio D)<br />
Session Chair: Lei Chen<br />
SWST: A Disk Based Index for Sliding Window Spatio-Temporal Data<br />
Manish Singh (University of Michigan, Ann Arbor)<br />
Qiang Zhu (University of Michigan, Dearborn)<br />
H.V. Jagadish (University of Michigan, Ann Arbor)<br />
Querying Uncertain Spatio-Temporal Data<br />
Tobias Emrich (Ludwig-Maximilians-Universität München)<br />
Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)<br />
Nikos Mamoulis (University of Hong Kong)<br />
Matthias Renz (Ludwig-Maximilians-Universität München)<br />
Andreas Züfle (Ludwig-Maximilians-Universität München)<br />
The Min-dist Location Selection Query<br />
Jianzhong Qi (University of Melbourne)<br />
Rui Zhang (University of Melbourne)<br />
Lars Kulik (University of Melbourne)<br />
Dan Lin (Missouri University of Science and Technology)<br />
Yuan Xue (University of Melbourne)<br />
Bi-level Locality Sensitive Hashing for K-Nearest Neighbor Computation<br />
Jia Pan (UNC Chapel Hill)<br />
Dinesh Manocha (UNC Chapel Hill)<br />
Seminar 2 (Salon 123)<br />
Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data<br />
Emmanuel Müller (Karlsruhe Institute of Technology)<br />
Stephan Günnemann (RWTH Aachen University)<br />
Ines Färber (RWTH Aachen University)<br />
Thomas Seidl (RWTH Aachen University)<br />
Demo Group 2 (Studio E)<br />
M3: Stream Processing on Main-Memory MapReduce<br />
Ahmed M. Aly (Purdue University)<br />
Asmaa Sallam (Purdue University)<br />
Bala M. Gnanasekaran (Purdue University)<br />
Long-Van Nguyen-Dinh (Purdue University)<br />
Walid G. Aref (Purdue University)<br />
Mourad Ouzzani (Qatar Computing Research Institute)<br />
Arif Ghafoor (Purdue University)<br />
A Deep Embedding of Queries into Ruby<br />
Torsten Grust (University of Tübingen)<br />
Manuel Mayr (University of Tübingen)<br />
Asking the Right Questions in Crowd Data Sourcing<br />
Rubi Boim (Tel-Aviv University)<br />
Ohad Greenshpan (Tel-Aviv University)<br />
Tova Milo (Tel-Aviv University)<br />
Slava Novgorodov (Tel-Aviv University)<br />
Neoklis Polyzotis (University of California, Santa Cruz)<br />
Wang-Chiew Tan (University of California, Santa Cruz)<br />
LotusX: A Position-Aware XML Graphical Search System with Auto-Completion<br />
Chunbin Lin (Renmin University of China)<br />
Jiaheng Lu (Renmin University of China)<br />
Tok Wang Ling (National University of Singapore)<br />
Bogdan Cautis (Télécom ParisTech)<br />
Efficient Top-k Keyword Search in Graphs with Polynomial Delay<br />
Mehdi Kargar (York University)<br />
Aijun An (York University)<br />
TEDAS: a Twitter Based Event Detection and Analysis System<br />
Rui Li (University of Illinois at Urbana-Champaign)<br />
Kin Hou Lei (Brigham Young University)<br />
Ravi Khadiwala (University of Illinois at Urbana-Champaign)<br />
Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign)<br />
AutoDict: Automated Dictionary Discovery<br />
Fei Chiang (University of Toronto)<br />
Periklis Andritsos (University of Toronto)<br />
Erkang Zhu (University of Toronto)<br />
Renee Miller (University of Toronto)<br />
3:30PM - 4PM Coffee Break<br />
4PM - 5:30PM Sessions 9-12, Seminar 3, Demo Group 3<br />
Session 9: Query Processing (Studio F)<br />
Session Chair: Walid G. Aref<br />
Learning-based Query Performance Modeling and Prediction<br />
Mert Akdere (Brown University)<br />
Ugur Cetintemel (Brown University)<br />
Matteo Riondato (Brown University)<br />
Eli Upfal (Brown University)<br />
Stanley B. Zdonik (Brown University)<br />
Parametric Plan Caching Using Density-Based Clustering<br />
Gunes Aluc (University of Waterloo)<br />
David E. DeHaan (Sybase, an SAP Company)<br />
Ivan T. Bowman (Sybase, an SAP Company)<br />
Effective and Robust Pruning for Top-Down Join Enumeration Algorithms<br />
Pit Fender (Mannheim University)<br />
Guido Moerkotte (Mannheim University)<br />
Thomas Neumann (Technical University of Munich)<br />
Viktor Leis (Technical University of Munich)<br />
Towards Preference-aware Relational Databases<br />
Anastasios Arvanitis (National Technical University of Athens)<br />
Georgia Koutrika (IBM Almaden Research Center)<br />
Session 10: Location Aware Data Processing (Studio B)<br />
Session Chair: Oktie Hassanzadeh<br />
A Foundation for Efficient Indoor Distance-Aware Query Processing<br />
Hua Lu (Aalborg University)<br />
Xin Cao (Nanyang Technological University)<br />
Christian S. Jensen (Aarhus University)<br />
LARS: A Location-Aware Recommender System<br />
Justin J. Levandoski (Microsoft Research)<br />
Mohamed Sarwat (University of Minnesota)<br />
Ahmed Eldawy (University of Minnesota)<br />
Mohamed F. Mokbel (University of Minnesota)<br />
Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme<br />
Miao Qiao (The Chinese University of Hong Kong)<br />
Hong Cheng (The Chinese University of Hong Kong)<br />
Lijun Chang (The Chinese University of Hong Kong)<br />
Jeffrey Xu Yu (The Chinese University of Hong Kong)<br />
Desks: Direction-Aware Spatial Keyword Search<br />
Guoliang Li (Tsinghua University)<br />
Jianhua Feng (Tsinghua University)<br />
Jing Xu (Tsinghua University)<br />
Session 11: Map-Reduce based Data Processing (Studio C)<br />
Session Chair: Minqi Zhou<br />
Extending Map-Reduce for Efficient Predicate-Based Sampling<br />
Raman Grover (University of California, Irvine)<br />
Michael Carey (University of California, Irvine)<br />
Fuzzy Joins Using MapReduce<br />
Foto Afrati (National Technical University Athens)<br />
Anish Das Sarma (Google, Inc. - work initiated at Yahoo! Research)<br />
David Menestrina (Google, Inc.)<br />
Aditya Parameswaran (Stanford University)<br />
Jeffrey D. Ullman (Stanford University)<br />
Parallel Top-K Similarity Join Algorithms Using MapReduce<br />
Younghoon Kim (Seoul National University)<br />
Kyuseok Shim (Seoul National University)<br />
Load Balancing in MapReduce Based on Scalable Cardinality Estimates<br />
Benjamin Gufler (Technische Universität München)<br />
Nikolaus Augsten (Free University of Bolzano-Bozen)<br />
Angelika Reiser (Technische Universität München)<br />
Alfons Kemper (Technische Universität München)<br />
Session 12: Social Media (Studio D)<br />
Session Chair: Zack Ives<br />
Community Detection with Edge Content in Social Media Networks<br />
Guo-Jun Qi (University of Illinois at Urbana-Champaign)<br />
Charu C. Aggarwal (IBM T. J. Watson Research Center)<br />
Thomas S. Huang (University of Illinois at Urbana-Champaign)<br />
Cross Domain Search by Exploiting Wikipedia<br />
Chen Liu (National University of Singapore)<br />
Sai Wu (National University of Singapore)<br />
Shouxu Jiang (Harbin Institute of Technology)<br />
Anthony K.H. Tung (National University of Singapore)<br />
Provenance-based Indexing Support in Micro-blog Platforms<br />
Junjie Yao (Peking University)<br />
Bin Cui (Peking University)<br />
Zijun Xue (Peking University)<br />
Qingyun Liu (Peking University)<br />
Learning Stochastic Models of Information Flow<br />
Luke Dickens (Imperial College London)<br />
Ian Molloy (IBM T. J. Watson Research Center)<br />
Jorge Lobo (IBM T. J. Watson Research Center)<br />
Pau-Chen Cheng (IBM T. J. Watson Research Center)<br />
Alessandra Russo (Imperial College London)<br />
Seminar 3 (Salon 123)<br />
Detecting Clones, Copying and Reuse on the Web<br />
Xin Luna Dong (AT&T Labs – Research)<br />
Divesh Srivastava (AT&T Labs – Research)<br />
Demo Group 3 (Studio E)<br />
Trust & Share: Trusted Information Sharing in Online Social Networks<br />
Barbara Carminati (University of Insubria)<br />
Elena Ferrari (University of Insubria)<br />
Jacopo Girardi (University of Insubria)<br />
Evaluation of Clusterings – Metrics and Visual Support<br />
Elke Achtert (Ludwig-Maximilians-Universität München)<br />
Sascha Goldhofer (Ludwig-Maximilians-Universität München)<br />
Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)<br />
Erich Schubert (Ludwig-Maximilians-Universität München)<br />
Arthur Zimek (Ludwig-Maximilians-Universität München)<br />
Horton: Online Query Execution Engine For Large Distributed Graphs<br />
Mohamed Sarwat (University of Minnesota)<br />
Sameh Elnikety (Microsoft Research)<br />
Yuxiong He (Microsoft Research)<br />
Gabriel Kliot (Microsoft Research)<br />
MXQuery With Hardware Acceleration<br />
Jens Teubner (ETH Zurich)<br />
Peter Fischer (University of Freiburg)<br />
Data3 – A Kinect Interface for OLAP using Complex Event Processing<br />
Steffen Hirte (Ilmenau University of Technology)<br />
Andreas Seifert (Ilmenau University of Technology)<br />
Stephan Baumann (Ilmenau University of Technology)<br />
Daniel Klan (Ilmenau University of Technology)<br />
Kai-Uwe Sattler (Ilmenau University of Technology)<br />
Analyzing Query Optimization Process: Portraits of Join Enumeration Algorithms<br />
Anisoara Nica (Sybase, an SAP Company)<br />
Ian Charlesworth (University of Waterloo)<br />
Maysum Panju (University of Waterloo)<br />
DPCube: Releasing Differentially Private Data Cubes for Health Information<br />
Yonghui Xiao (Emory University)<br />
James Gardner (Digital Reasoning Systems Inc.)<br />
Li Xiong (Emory University)<br />
7:30PM - 9PM NSF ICDE 2012 Career Panel (Salon 123)<br />
Panel Moderator: Philip Bernstein (Microsoft Research)<br />
Panelists: Alexandros Labrinidis (CS, UPitt), James M. Kang (NGA), Srinivasan Parthasarathy (CS, OSU), and Yuanyuan Tian (IBM Research)<br />
Tuesday, April 3<br />
8AM - 9AM Breakfast (Prefunction)<br />
9AM - 10AM Keynote 2 (Salon 4567): Surajit Chaudhuri - How Different Is Big Data?<br />
Session Chair: Beng Chin Ooi<br />
10AM - 10:30AM Coffee Break<br />
10:30AM - Noon Sessions 13-15, Industrial Session 1, Seminar 4, Demo Group 4<br />
Session 13: P2P and Distributed Processing (Studio F)<br />
Session Chair: Guoliang Li<br />
BestPeer++: A Peer-to-Peer based Large-scale Data Processing<br />
Gang Chen (NetEase.com Inc. & Zhejiang University)<br />
Tianlei Hu (NetEase.com Inc. & Zhejiang University)<br />
Dawei Jiang (National University of Singapore)<br />
Peng Lu (National University of Singapore)<br />
Kian-Lee Tan (National University of Singapore)<br />
Hoang Tam Vo (National University of Singapore)<br />
Sai Wu (BestPeer Pte. Ltd. & National University of Singapore)<br />
Effective Data Density Estimation in Ring-based P2P Networks<br />
Minqi Zhou (East China Normal University)<br />
Heng Tao Shen (The University of Queensland)<br />
Xiaofang Zhou (The University of Queensland)<br />
Weining Qian (East China Normal University)<br />
Aoying Zhou (East China Normal University)<br />
Processing of Rank Joins in Highly Distributed Systems<br />
Christos Doulkeridis (Norwegian University of Science and Technology (NTNU))<br />
Akrivi Vlachou (Norwegian University of Science and Technology (NTNU))<br />
Kjetil Nørvåg (Norwegian University of Science and Technology (NTNU))<br />
Yannis Kotidis (Athens University of Economics and Business (AUEB))<br />
Neoklis Polyzotis (UC Santa Cruz (UCSC))<br />
Load Balancing for MapReduce-based Entity Resolution<br />
Lars Kolb (University of Leipzig)<br />
Andreas Thor (University of Leipzig)<br />
Erhard Rahm (University of Leipzig)<br />
Session 14: XML and RDF Data Management (Studio B)
Session Chair: Dan Olteanu
Mapping XML to a Wide Sparse Table
Liang Jeff Chen (UCSD)
Philip A. Bernstein (Microsoft Corp.)
Peter Carlin (Microsoft Corp.)
Dimitrije Filipovic (Microsoft Corp.)
Michael Rys (Microsoft Corp.)
Nikita Shamgunov (Facebook Inc.)
James F. Terwilliger (Microsoft Corp.)
Milos Todic (Microsoft Corp.)
Sasa Tomasevic (Microsoft Corp.)
Dragan Tomic (Microsoft Corp.)
Querying XML Data: As You Shape It
Curtis E. Dyreson (Utah State University)
Sourav S. Bhowmick (Nanyang Technological University)
Branch Code: A Labeling Scheme for Efficient Query Answering on Trees
Yanghua Xiao (Fudan University)
Ji Hong (Fudan University)
Wanyun Cui (Fudan University)
Zhenying He (Fudan University)
Wei Wang (Fudan University)
Guodong Feng (Fudan University)
Scalable Multi-Query Optimization for SPARQL
Wangchao Le (University of Utah)
Anastasios Kementsietsidis (IBM T. J. Watson Research Center)
Songyun Duan (IBM T. J. Watson Research Center)
Feifei Li (University of Utah)
Session 15: Performance (Studio C)
Session Chair: Eric Lo
GSLPI: a Cost-based Query Progress Indicator
Jiexing Li (University of Wisconsin-Madison)
Rimma V. Nehme (Microsoft Jim Gray Systems Lab)
Jeffrey Naughton (University of Wisconsin-Madison)
Micro-Specialization in DBMSes
Rui Zhang (University of Arizona)
Richard T. Snodgrass (University of Arizona)
Saumya Debray (University of Arizona)
Towards Multi-Tenant Performance SLOs
Willis Lang (University of Wisconsin-Madison)
Srinath Shankar (Microsoft Jim Gray Systems Lab)
Jignesh M. Patel (University of Wisconsin-Madison)
Ajay Kalhan (Microsoft Corp.)
Multi-Version Concurrency via Timestamp Range Conflict Management
David Lomet (Microsoft Research)
Alan Fekete (University of Sydney)
Rui Wang (Microsoft Research)
Peter Ward (University of Sydney)
Industrial Session 1: Support for Large Scale Data Analytics (Studio D)
Session Chair: Arbee L.P. Chen
Exploiting Common Subexpressions for Cloud Query Processing
Yasin N. Silva (Arizona State University)
Per-Ake Larson (Microsoft Research)
Jingren Zhou (Microsoft Corp.)
Vectorwise: a Vectorized Analytical DBMS
Marcin Zukowski (Actian Netherlands)
Mark van de Wiel (Actian Corp.)
Peter Boncz (CWI)
Scalable and Numerically Stable Descriptive Statistics in SystemML
Yuanyuan Tian (IBM Almaden Research Center)
Shirish Tatikonda (IBM Almaden Research Center)
Berthold Reinwald (IBM Almaden Research Center)
Seminar 4 (Salon 123)
Mining Knowledge from Data: An Information Network Analysis Approach
Jiawei Han (University of Illinois at Urbana-Champaign)
Yizhou Sun (University of Illinois at Urbana-Champaign)
Xifeng Yan (University of California at Santa Barbara)
Philip S. Yu (University of Illinois at Chicago)
Demo Group 4 (Studio E)
Nyaya: a System Supporting the Uniform Management of Large Sets of Semantic Data
Roberto De Virgilio (Università Roma Tre)
Giorgio Orsi (University of Oxford)
Letizia Tanca (Politecnico di Milano)
Riccardo Torlone (Università Roma Tre)
R2DB: A System for Querying and Visualizing Weighted RDF Graphs
Songling Liu (Arizona State University)
Juan Cedeno (Arizona State University)
Selcuk Candan (Arizona State University)
Maria Luisa Sapino (University of Turin)
Shengyu Huang (Arizona State University)
Xinsheng Li (Arizona State University)
Project Daytona: Data Analytics as a Cloud Service
Roger Barga (Microsoft)
Jaliya Ekanayake (Microsoft Research)
Wei Lu (Microsoft Research)
Interactive User Feedback in Ontology Matching Using Signature Vector
Isabel Cruz (University of Illinois at Chicago)
Cosmin Stroe (University of Illinois at Chicago)
Matteo Palmonari (University of Milano-Bicocca)
DObjects+: Enabling Privacy-Preserving Data Federation Services
Pawel Jurczyk (Google Inc.)
Li Xiong (Emory University)
Slawomir Goryczka (Emory University)
Dragoon: An Information Accountability System for High-Performance Databases
Kyriacos Pavlou (University of Arizona)
Richard Snodgrass (University of Arizona)
Intuitive Interaction With Encrypted Query Execution in DataStorm
Ken Smith (MITRE)
Ameet Kini (MITRE)
William Wang (MITRE)
Chris Wolf (MITRE)
M. David Allen (MITRE)
Andrew Sillers (MITRE)
Noon - 2PM Funders Session and Lunch (Salon 4567)
Panel Organizer: Frank Olken (Consultant)
Panelists: Le Gruenwald (National Science Foundation), Ceren Sust (Department of Energy), and Olga Brazhnik (National Institutes of Health)
2PM - 3:30PM Sessions 16-17, Industrial Session 2, Seminar 5, Panel, Demo Group 1
Session 16: Data Extraction and Quality (Studio F)
Session Chair: Anish Das Sarma
Automatic Extraction of Structured Web Data with Domain Knowledge
Nora Derouiche (Télécom ParisTech – CNRS LTCI)
Bogdan Cautis (Télécom ParisTech – CNRS LTCI)
Talel Abdessalem (Télécom ParisTech – CNRS LTCI)
Discovering Conservation Rules
Lukasz Golab (University of Waterloo)
Howard Karloff (AT&T Labs–Research)
Flip Korn (AT&T Labs–Research)
Barna Saha (AT&T Labs–Research)
Divesh Srivastava (AT&T Labs–Research)
Answering Why-not Questions on Top-k Queries
Zhian He (Hong Kong Polytechnic University)
Eric Lo (Hong Kong Polytechnic University)
An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints
Dong Deng (Tsinghua University)
Guoliang Li (Tsinghua University)
Jianhua Feng (Tsinghua University)
Session 17: Top-K Processing (Studio B)
Session Chair: Tingjian Ge
On Top-k Structural Similarity Search
Pei Lee (University of British Columbia)
Laks V.S. Lakshmanan (University of British Columbia)
Jeffrey Xu Yu (Chinese University of Hong Kong)
Relevance Matters: Capitalizing on Less (Top-k Matching in Publish/Subscribe)
Mohammad Sadoghi (University of Toronto)
Hans-Arno Jacobsen (University of Toronto)
Efficiently Monitoring Top-k Pairs over Sliding Windows
Zhitao Shen (UNSW)
Muhammad Aamir Cheema (UNSW)
Xuemin Lin (UNSW & ECNU)
Wenjie Zhang (UNSW)
Haixun Wang (Microsoft Research Asia)
Processing and Notifying Range Top-k Subscriptions
Albert Yu (Duke University)
Pankaj K. Agarwal (Duke University)
Jun Yang (Duke University)
Industrial Session 2: Evolving Platforms for New Applications (Studio C)
Session Chair: Rui Zhang
Earlybird: Real-Time Search at Twitter
Michael Busch (Twitter)
Krishna Gade (Twitter)
Brian Larson (Twitter)
Patrick Lok (Twitter)
Samuel Luckenbill (Twitter)
Jimmy Lin (Twitter)
Data Infrastructure at LinkedIn
LinkedIn Data Infrastructure Team
The Credit Suisse Meta-data Warehouse
Claudio Jossen (Credit Suisse AG)
Lukas Blunschi (ETH Zurich)
Magdalini Mori (Credit Suisse AG)
Donald Kossmann (ETH Zurich)
Kurt Stockinger (Credit Suisse AG)
Panel: The Future of Scientific Data Bases (Studio D)
Panel Moderator: Michael Stonebraker (MIT)
Panelists: Anastasia Ailamaki (EPFL), Jeremy Kepner (MIT), and Alex Szalay (Johns Hopkins University)
Seminar 5 (Salon 123)
Emerging Graph Queries In Linked Data
Arijit Khan (University of California, Santa Barbara)
Yinghui Wu (University of California, Santa Barbara)
Xifeng Yan (University of California, Santa Barbara)
Demo Group 1 (Studio E)
See “Demo Group 1” listing above
3:30PM - 4PM Coffee Break
4PM - 5:30PM Poster Session, all papers (Salon 4567)
5:30PM Departure for cruise and conference banquet
Wednesday, April 4
8AM - 9AM Breakfast (Prefunction)
9AM - 10AM Keynote 3 (Salon 4567): Peter Druschel — Accountability and Trust in Cooperative Information Systems
Session Chair: Johannes Gehrke
10AM - 10:30AM Coffee Break
10:30AM - Noon Sessions 18-20, Industrial Session 3, Seminar 6, Demo Group 2
Session 18: Similarity (Studio F)
Session Chair: Matthias Renz
Efficient Exact Similarity Searches using Multiple Token Orderings
Jongik Kim (Chonbuk National University)
Hongrae Lee (Google Inc.)
Efficient Graph Similarity Joins with Edit Distance Constraints
Xiang Zhao (The University of New South Wales & NICTA)
Chuan Xiao (The University of New South Wales)
Xuemin Lin (The University of New South Wales & East China Normal University)
Wei Wang (The University of New South Wales)
Parameter-Free Determination of Distance Thresholds for Metric Distance Constraints
Shaoxu Song (Tsinghua University)
Lei Chen (The Hong Kong University of Science and Technology)
Hong Cheng (The Chinese University of Hong Kong)
Random Error Reduction in Similarity Search on Time Series: A Statistical Approach
Wush Chi-Hsuan Wu (Academia Sinica)
Mi-Yen Yeh (Academia Sinica)
Jian Pei (Simon Fraser University)
Session 19: Text and Strings (Studio B)
Session Chair: Feifei Li
Optimizing Statistical Information Extraction Programs Over Evolving Text
Fei Chen (HP Labs China)
Xixuan Feng (University of Wisconsin-Madison)
Christopher Re (University of Wisconsin-Madison)
Min Wang (HP Labs China)
Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach
Chong Sun (University of Wisconsin-Madison)
Jeffrey F. Naughton (University of Wisconsin-Madison)
Siddharth Barman (University of Wisconsin-Madison)
On Text Clustering with Side Information
Charu C. Aggarwal (IBM T. J. Watson Research Center)
Yuchen Zhao (University of Illinois at Chicago)
Philip S. Yu (University of Illinois at Chicago)
Fast SLCA and ELCA Computation for XML Keyword Queries based on Set Intersection
Junfeng Zhou (Yanshan University)
Zhifeng Bao (National University of Singapore)
Wei Wang (The University of New South Wales)
Tok Wang Ling (National University of Singapore)
Ziyang Chen (Yanshan University)
Xudong Lin (Yanshan University)
Jingfeng Guo (Yanshan University)
Session 20: Query Processing II (Studio C)
Session Chair: Volker Markl
Optimization of Massive Pattern Queries by Dynamic Configuration Morphing
Nikolay Laptev (University of California, Los Angeles)
Carlo Zaniolo (University of California, Los Angeles)
Three-level Processing of Multiple Aggregate Continuous Queries
Shenoda Guirguis (University of Pittsburgh)
Mohamed A. Sharaf (The University of Queensland)
Panos K. Chrysanthis (University of Pittsburgh)
Alexandros Labrinidis (University of Pittsburgh)
Accelerating Range Queries For Brain Simulations
Farhan Tauheed (EPFL)
Laurynas Biveinis (Aalborg University)
Thomas Heinis (EPFL)
Felix Schürmann (EPFL)
Henry Markram (EPFL)
Anastasia Ailamaki (EPFL)
Keyword Query Reformulation on Structured Data
Junjie Yao (Peking University)
Bin Cui (Peking University)
Liansheng Hua (Peking University)
Yuxin Huang (Peking University)
Industrial Session 3: Indexing, Updates and Processing (Studio D)
Efficient Support of XQuery Update Facility in XML Enabled RDBMS
Zhen Hua Liu (Oracle)
Hui Chang (Oracle)
Balasubramanyam Sthanikam (Oracle)
Making Unstructured Data SPARQL Using Semantic Indexing in Oracle Database
Souripriya Das (Oracle)
Seema Sundara (Oracle)
Matthew Perry (Oracle)
Jagannathan Srinivasan (Oracle)
Jayanta Banerjee (Oracle)
Aravind Yalamanchi (Oracle)
A meta-language for MDX queries in eLog Business Solution
Sonia Bergamaschi (University of Modena and Reggio Emilia)
Matteo Interlandi (University of Modena and Reggio Emilia)
Mario Longo (eBilling S.p.A.)
Laura Po (University of Modena and Reggio Emilia)
Maurizio Vincini (University of Modena and Reggio Emilia)
Seminar 6 (Salon 123)
Boolean Matrix Decomposition Problem: Theory, Variations and Applications to Data Engineering
Jaideep Vaidya (Rutgers University)
Demo Group 2 (Studio E)
See “Demo Group 2” listing above
Noon - 2PM Lunch (Provided by Conference with Salon 4567)
2PM - 3:30PM Sessions 21-23, Demo Group 3
Session 21: Data Mining (Studio F)
Session Chair: Anthony Tung
Predicting Approximate Protein-DNA Binding Cores Using Association Rule Mining
Po-Yuen Wong (The Chinese University of Hong Kong)
Tak-Ming Chan (The Chinese University of Hong Kong)
Man-Hon Wong (The Chinese University of Hong Kong)
Kwong-Sak Leung (The Chinese University of Hong Kong)
Upgrading Uncompetitive Products Economically
Hua Lu (Aalborg University)
Christian S. Jensen (Aarhus University)
Attribute-Based Subsequence Matching and Mining
Yu Peng (The Hong Kong University of Science and Technology)
Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)
Liangliang Ye (The Hong Kong University of Science and Technology)
Philip S. Yu (University of Illinois at Chicago)
Integrating Frequent Pattern Mining from Multiple Data Domains for Classification
Dhaval Patel (National University of Singapore)
Wynne Hsu (National University of Singapore)
Mong Li Lee (National University of Singapore)
Session 22: Scientific Data, Analysis and Visualization (Studio B)
Session Chair: Christopher Re
Efficient Versioning for Scientific Array Databases
Adam Seering (MIT CSAIL)
Philippe Cudre-Mauroux (University of Fribourg)
Samuel Madden (MIT CSAIL)
Michael Stonebraker (MIT CSAIL)
Multidimensional Analysis of Atypical Events in Cyber-Physical Data
Lu-An Tang (UIUC)
Xiao Yu (UIUC)
Sangkyum Kim (UIUC)
Jiawei Han (UIUC)
Wen-Chih Peng (National Chiao Tung University)
Yizhou Sun (UIUC)
Hector Gonzalez (Google)
Sebastian Seith (Morning Star)
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking
Fabian Keller (Karlsruhe Institute of Technology)
Emmanuel Müller (Karlsruhe Institute of Technology)
Klemens Böhm (Karlsruhe Institute of Technology)
Extracting, Analyzing and Visualizing Triangle K-Core Motifs within Networks
Yang Zhang (The Ohio State University)
Srinivasan Parthasarathy (The Ohio State University)
Session 23: Similarity Search and Detection (Studio D)
Session Chair: Xuemin Lin
Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document Databases
Min Soo Kim (KAIST)
Kyu-Young Whang (KAIST)
Yang-Sae Moon (Kangwon National University)
Adaptive Windows for Duplicate Detection
Uwe Draisbach (Hasso-Plattner-Institute)
Felix Naumann (Hasso-Plattner-Institute)
Sascha Szott (Zuse Institute)
Oliver Wonneberg (R. Lindner GmbH & Co. KG)
Efficient Dual-Resolution Layer Indexing for Top-k Queries
Jongwuk Lee (Pohang University of Science and Technology (POSTECH))
Hyunsouk Cho (Pohang University of Science and Technology (POSTECH))
Seung-won Hwang (Pohang University of Science and Technology (POSTECH))
Evaluating Probabilistic Queries over Uncertain Matching
Reynold Cheng (The University of Hong Kong)
Jian Gong (The University of Hong Kong)
David W. Cheung (The University of Hong Kong)
Jiefeng Cheng (Shenzhen Institute of Advanced Technology)
Demo Group 3 (Studio E)
See “Demo Group 3” listing above
3:30PM - 4PM Coffee Break
4PM - 5:30PM Sessions 24-25, Demo Group 4
Session 24: Sensors Network and Trajectory (Studio B)
Session Chair: Flip Korn
Detecting Outliers in Sensor Networks using the Geometric Approach
Sabbas Burdakis (Technical University of Crete)
Antonios Deligiannakis (Technical University of Crete)
Efficient Threshold Monitoring for Distributed Probabilistic Data
Mingwang Tang (University of Utah)
Feifei Li (University of Utah)
Jeff M. Phillips (University of Utah)
Jeffrey Jestes (University of Utah)
Incorporating Duration Information for Trajectory Classification
Dhaval Patel (National University of Singapore)
Chang Sheng (DBS Bank)
Wynne Hsu (National University of Singapore)
Mong Li Lee (National University of Singapore)
Reducing Uncertainty of Low-Sampling-Rate Trajectories
Kai Zheng (The University of Queensland)
Yu Zheng (Microsoft Research Asia)
Xing Xie (Microsoft Research Asia)
Xiaofang Zhou (The University of Queensland)
Session 25: Error Reduction and Data Security (Studio D)
Session Chair: Graham Cormode
Efficient Similarity Search over Encrypted Data
Mehmet Kuzu (The University of Texas at Dallas)
Mohammad Saiful Islam (The University of Texas at Dallas)
Murat Kantarcioglu (The University of Texas at Dallas)
Obfuscating the Topical Intention in Enterprise Text Search
HweeHwa Pang (Singapore Management University)
Xiaokui Xiao (Nanyang Technological University)
Jialie Shen (Singapore Management University)
Correlation Support for Risk Evaluation in Databases
Katrin Eisenreich (SAP Research)
Jochen Adamek (Technische Universität Berlin)
Philipp Rösch (SAP Research)
Volker Markl (Technische Universität Berlin)
Gregor Hackenbroich (SAP Research)
A Game-Theoretic Approach for High-Assurance of Data Trustworthiness in Sensor Networks
Hyo-Sang Lim (Purdue University & Computer and Telecommunications Engineering Division, South Korea)
Gabriel Ghinita (University of Massachusetts at Boston)
Elisa Bertino (Purdue University)
Murat Kantarcioglu (University of Texas at Dallas)
Demo Group 4 (Studio E)
See “Demo Group 4” listing above
Thursday, April 5
8AM - 9AM Breakfast (Prefunction)
9AM - 5:30PM Workshops
Studio B: Data Management in the Cloud (DMC)
Studio D: Graph Data Management: Techniques and Applications (GDM)
Studio F: Secure Data Management on Smartphones and Mobiles (SDMSM)
Keynotes
Keynote 1: Monday, April 2
Viewing the Web as a Distributed Knowledge Base
Serge Abiteboul (Professor at Collège de France and Senior Researcher at INRIA Saclay)
Abstract: Information of interest may be found on the Web in a variety of forms, in many systems, and with different access protocols. A typical user may have information on many devices (smartphone, laptop, TV box, etc.), many systems (mailers, blogs, Web sites, etc.), many social networks (Facebook, Picasa, etc.). This same user may have access to more information from family, friends, associations, companies, and organizations. Today, the control and management of the diversity of data and tasks in this setting are beyond the skills of casual users. Facing similar issues, companies see the cost of managing and integrating information skyrocketing.
We are interested here in the management of such data. Our focus is not on harvesting all the data of a particular user or a group of users and then managing it in a centralized manner. Instead, we are concerned with the management of Web data in place in a distributed manner, with a possibly large number of autonomous, heterogeneous systems collaborating to support certain tasks.
Our thesis is that managing the richness and diversity of user-centric data residing on the Web can be tamed using a holistic approach based on a distributed knowledge base. All Web information is represented as logical facts, and Web data management tasks as logical rules. We discuss Webdamlog, a variant of datalog for distributed data management that we use for this purpose. The automatic reasoning provided by its inference engine, operating over the Web knowledge base, greatly benefits a variety of complex data management tasks that currently require intense work and deep expertise.
This work is part of the Webdam European project, http://webdam.inria.fr/.
Bio: Serge Abiteboul, Telecom Paris, PhD computer science, USC Los Angeles, and Thèse d’Etat, University of Paris Sud. He has held professor positions at Stanford and Ecole Polytechnique. He is one of the co-authors of Foundations of Databases, and, recently, of Web Data Management. He co-founded in 2000 a start-up, named Xyleme. He received the 1998 ACM SIGMOD Innovation Award. He has been program chair of a number of conferences including ACM PODS-95, ICALP-94, ICDT-90, ECDL-99 and VLDB-09, ICDE-11, track of WWW-12. He was awarded in 2008 an ERC Advanced Grant, namely Webdam, on Foundations of Web Data Management. He has been a member of the French Academy of Sciences since 2008.
Keynote 2: Tuesday, April 3
How Different is Big Data?<br />
Surajit Chaudhuri (Microsoft Corp)<br />
Talk Abstract: One buzzword that has been popular in the last couple of years is<br />
Big Data. In simplest terms, Big Data symbolizes the aspiration to build platforms<br />
and tools to ingest, store and analyze data that can be voluminous, diverse, and<br />
possibly fast changing. In this talk, I will try to reflect on a few of the technical<br />
problems presented by the exploration of Big Data. Some of these challenges in data<br />
analytics have been addressed by our community in the past in a more traditional<br />
relational database context but only with mixed results. I will review these quests<br />
and study some of the key lessons learned. At the same time, significant<br />
developments such as the emergence of cloud infrastructure and availability of data<br />
rich web services hold the potential for transforming our industry. I will discuss the<br />
unique opportunities they present for Big Data Analytics.<br />
Biographical Sketch: Surajit Chaudhuri is a Distinguished Scientist at Microsoft<br />
Research. His current areas of interest are enterprise data analytics,<br />
self-manageability and multi-tenant technology for cloud database services. Surajit is<br />
an ACM Fellow, a recipient of the ACM SIGMOD Edgar F. Codd Innovations Award,<br />
ACM SIGMOD Contributions Award, and a VLDB 10 year Best Paper Award. Surajit<br />
received his Ph.D. from Stanford University and B.Tech from the Indian Institute of<br />
Technology, Kharagpur.<br />
Keynote 3: Wednesday, April 4<br />
Accountability and Trust in Cooperative Information Systems<br />
Peter Druschel (Max Planck Institute for Software Systems (MPI-SWS),<br />
Kaiserslautern and Saarbrücken, Germany)<br />
Cooperation and trust play an increasingly important role in today’s information<br />
systems. For instance, peer-to-peer systems like BitTorrent, Sopcast and Skype are<br />
powered by resource contributions from participating users; federated systems like<br />
the Internet have to respect the interests, policies and laws of participating<br />
organizations and countries; in the Cloud, users entrust their data and<br />
computation to third-party infrastructure.<br />
In this talk, we consider accountability as a way to facilitate transparency and trust<br />
in cooperative systems. We look at practical techniques to account for the integrity<br />
of distributed, cooperative computations, and look at some of the difficulties and<br />
open problems in accountability.<br />
This talk describes joint work with Paarijaat Aditya, Ioannis Avramopoulos, Michael<br />
Backes, Andreas Haeberlen, Petr Kuznetsov, Yin Lin, Bruce Maggs, Jennifer Rexford,<br />
Rodrigo Rodrigues, Dominique Unruh, Bill Wishon and Mingchen Zhao.<br />
Bio: Peter Druschel is the founding director of the Max Planck Institute for Software<br />
Systems (MPI-SWS) in Germany. Previously, he was a Professor of Computer<br />
Science and Electrical and Computer Engineering at Rice University in Houston,<br />
Texas. He received the Dipl.-Ing. (FH) in Data Systems Engineering from Fachhochschule<br />
Munich, Germany in 1986 and the Ph.D. degree in Computer Science from the<br />
University of Arizona in 1994. His research interests include distributed systems and<br />
operating systems. He is the recipient of an NSF CAREER Award, Alfred P. Sloan<br />
Fellowship and the ACM SIGOPS Mark Weiser Award, and a member of Academia<br />
Europaea and the German Academy of Sciences Leopoldina.<br />
Seminars<br />
Seminar 1: Data Management Issues on the Semantic Web<br />
Oktie Hassanzadeh is a Research Staff Member at IBM T.J. Watson Research Center.<br />
His research interests are in the areas of data cleaning and integration, Web data<br />
management and online data analytics. He has received the IBM PhD fellowship in<br />
2010, and is a recipient of the 2010 Yahoo! Key Scientific Challenges award. He is a<br />
two-time recipient of the first prize at the Triplification Challenge, an annual contest<br />
that awards prizes to the most promising projects in the area of Linked Data. He is a<br />
graduate of the University of Toronto (M.Sc., Ph.D.) and Sharif University of<br />
Technology (B.Sc.).<br />
Dr. Anastasios Kementsietsidis is a Research Staff Member at IBM T.J. Watson<br />
Research Center at Hawthorne, NY. Anastasios has a PhD in computer science from<br />
the University of Toronto. He is currently interested in various aspects of RDF data<br />
management (including querying, storing and benchmarking RDF data). In the past,<br />
he worked (and is still interested in continuing working) on data integration,<br />
cleaning, provenance and annotation, security, as well as (distributed) query<br />
evaluation and optimization on relational or semi-structured data. He has several<br />
publications in the leading database conferences, including a best paper award in<br />
ICDE 2007, a best demo award in EDBT 2006, and his CIKM 2009 paper was a<br />
runner-up for a best paper award. He has served on the program committee of<br />
several leading conferences and workshops.<br />
Yannis Velegrakis is a faculty member of the Department of Information Engineering<br />
and Computer Science of the University of Trento. He holds a PhD degree in<br />
Computer Science from the University of Toronto. His research areas of expertise<br />
include information integration, mappings across heterogeneous data sources,<br />
interoperability, keyword searching, semantic web, social applications, and<br />
large-scale data management. Prior to joining the University of Trento, he held a<br />
researcher position at AT&T Research Labs in the US. He has also spent time as a<br />
visitor at the University of California, Santa Cruz, the IBM Almaden Research<br />
Center, and the Center of Advanced Studies of the IBM Toronto Lab. He was a<br />
member of the committee for the CIMI cultural profile of the ANSI/NISO Z39.50<br />
standard. He has served in many program committees of national and international<br />
conferences and as reviewer for numerous international journals. He is a general<br />
co-chair for VLDB 2013 and a PC co-chair for WebDB 2012. He has also been a<br />
general co-chair for DESWEB 2010 and 2011 and for SWAE2007. He holds 2 US<br />
patents and has been a Marie Curie Fellow for the period 2006-2008.<br />
Seminar 2: Discovering Multiple Clustering Solutions: Grouping Objects in<br />
Different Views of the Data<br />
Emmanuel Müller is a senior researcher at the Institute for Program Structures and<br />
Data Organization at the Karlsruhe Institute of Technology (KIT), Germany. In the<br />
past years, he was a research assistant in computer science at the data management<br />
and data exploration group at RWTH Aachen University, Germany. His research<br />
interests cover efficient data mining in high dimensional data, detection of clusters<br />
in subspace projections and outlier detection. Leading the open-source initiative<br />
OpenSubspace, he provides a general contribution to the research community,<br />
especially by a repeatable and comparable evaluation study on recent data mining<br />
approaches. Dr. Müller received his Diplom (MSc) in 2007 and his PhD in 2010 from<br />
RWTH Aachen University. He is an active member of program committees such as<br />
SDM, ECML PKDD, and recent MultiClust-Workshops.<br />
Stephan Günnemann is a PhD student and research assistant in computer science at<br />
the data management and data exploration group at RWTH Aachen University,<br />
Germany. His research interests include the mining of non-redundant and multiple<br />
clustering solutions for high dimensional and structured data. He contributes to the<br />
open source initiative OpenSubspace for the evaluation and exploration of subspace<br />
clustering algorithms. Stephan Günnemann received his Diplom (MSc) in 2008 from<br />
RWTH Aachen University.<br />
Ines Färber is a PhD student and research assistant in computer science at the data<br />
management and data exploration group at RWTH Aachen University, Germany. Her<br />
research interests include mining of alternative and multi-view clustering solutions<br />
for high dimensional data. She contributes to the OpenSubspace initiative for<br />
evaluation and exploration of multiple clustering solutions. Ines Färber received her<br />
Diplom (MSc) in 2009 from RWTH Aachen University.<br />
Thomas Seidl is a professor for computer science and head of the data management<br />
and data exploration group at RWTH Aachen University, Germany. His research<br />
interests include data mining and database technology for multimedia and<br />
spatio-temporal databases in engineering, communication and life science<br />
applications. Prof. Seidl received his Diplom (MSc) in 1992 from TU Muenchen and<br />
his PhD (1997) and venia legendi (2001) from LMU Muenchen. He is an active<br />
member of several program committees including ACM SIGKDD, IEEE ICDE, SDM,<br />
recent MultiClust-Workshops and others. He is a member of the editorial board of<br />
The VLDB Journal as associate editor.<br />
Seminar 3: Emerging Graph Queries in Linked Data<br />
Arijit Khan is a PhD student of the Department of Computer Science, University of<br />
California, Santa Barbara (UCSB). He is currently working with Professor Xifeng Yan<br />
in Graph Mining. Arijit received his Bachelor degree in Computer Science and<br />
Engineering from Jadavpur University, India in 2008. He is the recipient of the<br />
prestigious CITRIX GO-TO fellowship award for the academic year 2008-2009 and<br />
P1 fellowship award for the Spring Quarter in 2009-10 from the Department of<br />
Computer Science, UCSB. He was also awarded gold medals by Tata Consultancy<br />
Services Ltd for being the best student of the Department of Computer Science &<br />
Engineering, Jadavpur University, for 2008-2009. He published papers in<br />
SIGMOD’10 and ’11.<br />
Yinghui Wu is a research scientist of the Department of Computer Science,<br />
University of California, Santa Barbara (UCSB). He is currently working with<br />
Professor Xifeng Yan in graph data management. Yinghui got his PhD from the<br />
University of Edinburgh, UK in 2010. His research interests lie in the area of<br />
database theory and graph database management, with emphasis on graph database<br />
models and query languages. He published papers in SIGMOD, VLDB, ICDE and ICDT.<br />
Xifeng Yan is an assistant professor at the University of California at<br />
Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science.<br />
He received his Ph.D. degree in Computer Science from the University of Illinois<br />
at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J.<br />
Watson Research Center between 2006 and 2008. He has been working on modeling,<br />
managing, and mining large-scale graphs in bioinformatics, social networks,<br />
information networks, and computer systems. His works were extensively<br />
referenced, with over 5,000 citations per Google Scholar. He received NSF CAREER<br />
Award, IBM Invention Achievement Award, ACM-SIGMOD Dissertation Runner-Up<br />
Award, and IEEE ICDM 10-year Highest Impact Paper Award.<br />
Seminar 4: Boolean Matrix Decomposition Problem: Theory, Variations and<br />
Applications to Data Engineering<br />
Dr. Jaideep Vaidya is an Associate Professor of Computer Information Systems<br />
at Rutgers University. He received his Masters and Ph.D. from Purdue University<br />
and his Bachelors degree from the University of Mumbai. His research interests<br />
are in Data Mining, Data Management, Privacy, and Security. He has published<br />
over 60 papers in international conferences and archival journals, and has<br />
received three best paper awards from the premier conferences in data mining,<br />
databases, and digital government. He is also the recipient of an NSF Career<br />
Award and a Rutgers Board of Trustees Research Fellowship for Scholarly<br />
Excellence.<br />
Seminar 5: Mining Knowledge from Data: An Information Network Analysis Approach<br />
Jiawei Han is Abel Bliss Professor in Engineering, in the Department of<br />
Computer Science at the University of Illinois. He has been researching into<br />
data mining, information network analysis, and database systems, with over 600<br />
publications. He served as the founding Editor-in-Chief of ACM Transactions on<br />
Knowledge Discovery from Data (TKDD) and on the editorial boards of several<br />
other journals. Jiawei has received IBM Faculty Awards, HP Innovation Awards,<br />
ACM SIGKDD Innovation Award (2004), IEEE Computer Society Technical Achievement<br />
Award (2005), and IEEE Computer Society W. Wallace McDowell Award (2009), and<br />
Daniel C. Drucker Eminent Faculty Award (2011). He is a Fellow of ACM and a<br />
Fellow of IEEE. He is currently the Director of Information Network Academic<br />
Research Center (INARC) supported by the Network Science-Collaborative<br />
Technology Alliance (NS-CTA) program of U.S. Army Research Lab. His book with<br />
Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques” (Morgan<br />
Kaufmann) has been used worldwide as a textbook.<br />
Yizhou Sun is a Ph.D. candidate at the University of Illinois at<br />
Urbana-Champaign. Her principal research interest is in large-scale information<br />
and social networks, and more generally in data mining, database systems,<br />
applied statistics, machine learning, information retrieval, and network<br />
science, with a focus on modeling novel problems and proposing scalable<br />
algorithms for large-scale, real-world applications. Yizhou has over 30<br />
publications in book chapters, journals, and major conferences such as SIGKDD,<br />
SIGMOD, VLDB, NIPS and so on, and tutorials on “mining heterogeneous<br />
information networks” in premier conferences.<br />
Xifeng Yan is an assistant professor at the University of California at<br />
Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science.<br />
He received his Ph.D. degree in Computer Science from the University of Illinois<br />
at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J.<br />
Watson Research Center between 2006 and 2008. He has been working on modeling,<br />
managing, and mining large-scale graphs in bioinformatics, social networks,<br />
information networks, and computer systems. His works were extensively<br />
referenced, with over 5,000 citations per Google Scholar. He received NSF CAREER<br />
Award, IBM Invention Achievement Award, ACM-SIGMOD Dissertation Runner-Up<br />
Award, and IEEE ICDM 10-year Highest Impact Paper Award.<br />
Philip S. Yu received his Ph.D. degree in E.E. from Stanford University. He is<br />
a Professor in Computer Science at the University of Illinois at Chicago and<br />
also holds the Wexler Chair in Information Technology. Dr. Yu spent most of his<br />
career at IBM, where he was manager of the Software Tools and Techniques group<br />
at the Watson Research Center. His research interests include data mining,<br />
database and privacy. He has published more than 650 papers in refereed<br />
journals and conferences. He holds or has applied for more than 350 US patents.<br />
Dr. Yu is a Fellow of the ACM and the IEEE. He is the Editor-in-Chief of ACM<br />
Transactions on Knowledge Discovery from Data. He was the Editor-in-Chief of<br />
IEEE Transactions on Knowledge and Data Engineering (2001-2004). He received a<br />
Research Contributions Award from IEEE Intl. Conference on Data Mining (2003).<br />
Seminar 6: Detecting Clones, Copying and Reuse on the Web<br />
Xin Luna Dong is a researcher at AT&T Labs-Research. She received her Ph.D.<br />
from University of Washington in 2007, received a Master’s Degree from Peking<br />
University in China in 2001, and received a Bachelor’s Degree from Nankai<br />
University in China in 1998. Her research interests include databases,<br />
information retrieval and machine learning, with an emphasis on data<br />
integration, data cleaning, personal information management, and Web search.<br />
She has led the Solomon project, whose goal is to detect copying between<br />
structured sources and to leverage the results in various aspects of data<br />
integration, and the Semex personal information management system, which won<br />
the Best Demo award (one of top-3) in Sigmod’05. She has co-chaired WebDB’10<br />
and has served in the program committees of Sigmod’12, Sigmod’11, VLDB’11,<br />
PVLDB’10, WWW’10, ICDE’10, VLDB’09, etc.<br />
Divesh Srivastava is the head of the Database Research Department at AT&T<br />
Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison,<br />
and his B.Tech from the Indian Institute of Technology, Bombay. His research<br />
interests span a variety of topics in data management.
Panels<br />
Panel 1: NSF ICDE 2012 Career Panel<br />
Philip A. Bernstein, Ph.D. (Microsoft Research) is a Distinguished<br />
Scientist at Microsoft Corporation. Over the past 35<br />
years, he has been a product architect at Microsoft and Digital<br />
Equipment Corp., a professor at Harvard University and Wang<br />
Institute of Graduate Studies, and a VP Software at Sequoia Systems.<br />
During that time, he has published over 150 papers and two<br />
books on the theory and implementation of database systems,<br />
especially on transaction processing and metadata management.<br />
The second edition of his book “Transaction Processing” with Eric<br />
Newcomer was published in June 2009. His latest work focuses<br />
on database systems for cloud computing, on web search over<br />
structured data, and on object-to-relational mappings. He is an<br />
Editor-in-Chief of the VLDB Journal, an ACM Fellow, a winner<br />
of the ACM SIGMOD Innovations Award, and a member of the<br />
Washington State Academy of Sciences and the National Academy<br />
of Engineering. He received a B.S. degree from Cornell and<br />
M.Sc. and Ph.D. from University of Toronto.<br />
James M. Kang, Ph.D. (National Geospatial-Intelligence Agency)<br />
received his B.S. at Purdue University in 2000, his M.S. at Rochester<br />
Institute of Technology in 2005, and his Ph.D. at the University<br />
of Minnesota in 2010; all of his degrees were in Computer<br />
Science. His research interests are in the areas of Spatio-Temporal<br />
Data Mining and Databases. From 2005 to 2007, he was an NSF<br />
IGERT (Integrative Graduate Education and Research Traineeship)<br />
Fellow at the University of Minnesota. He is currently a project<br />
scientist in the Basic and Applied Research Office at the National<br />
Geospatial-Intelligence Agency (NGA) in Springfield, Virginia.<br />
Yuanyuan Tian, Ph.D. (IBM Almaden Research) is a Research<br />
Staff Member at IBM Almaden Research Center. Her primary<br />
research area is large-scale data processing and analytics. She<br />
received her PhD in Computer Science & Engineering from the University<br />
of Michigan in 2008. She is the recipient of a Distinguished<br />
Achievement Award from the University of Michigan in 2008 for her<br />
research and academic accomplishments.<br />
Srinivasan Parthasarathy, Ph.D. (Ohio State University)<br />
Dr. Srinivasan Parthasarathy (PhD, University of Rochester) is<br />
currently a Professor of Computer Science and Engineering at the<br />
Ohio State University (OSU). His research interests are broadly in<br />
the areas of Data Mining, Databases, Bioinformatics and Parallel<br />
and Distributed Computing. He is a recipient of an Ameritech<br />
Faculty Fellowship in 2001, a US National Science Foundation<br />
CAREER award in 2003, a US Department of Energy Early Career<br />
Award in 2004, multiple IBM Faculty Awards in 2007 and 2010,<br />
and a Google Research Award in 2009. His papers have received<br />
six best paper awards or similar honors from among ten nominations<br />
in leading conferences in the field, including ones at the SIAM<br />
International Conference on Data Mining (SDM), the IEEE International<br />
Conference on Data Mining (ICDM), Intelligent Systems for<br />
Molecular Biology (ISMB), the Very Large Databases Conference<br />
(VLDB) and the ACM Knowledge Discovery and Data Mining conference<br />
(SIGKDD). He has served on the program, organizational and<br />
steering committees of leading conferences in the fields of data<br />
mining, databases, and high performance computing. He currently<br />
serves on the editorial boards of several journals including the<br />
Data Mining and Knowledge Discovery Journal (DMKDJ), the Distributed<br />
and Parallel Databases Journal (DAPDJ), the Journal of<br />
Parallel and Distributed Computing (JPDC), and the ACM Transactions<br />
on Knowledge Discovery and Data Mining (ACM-TKDD).
Panels
Dr. Alexandros Labrinidis received his Ph.D. degree in Computer
Science from the University of Maryland, College Park in
2002. He is currently an associate professor at the Department of
Computer Science of the University of Pittsburgh and co-director
of the Advanced Data Management Technologies Lab. He is also
an adjunct associate professor at Carnegie Mellon University
(CS Dept). Dr. Labrinidis’ research focuses on user-centric data
management for network-centric applications, including web-databases,
data stream management systems, sensor networks,
and scientific data management (with an emphasis on big data).
He has published over 60 papers in peer-reviewed journals, conferences,
and workshops; he received an NSF CAREER
award in 2008. Dr. Labrinidis is currently the Secretary/Treasurer
for ACM SIGMOD, and has served as the Editor of SIGMOD
Record, and on numerous program committees of international
conferences/workshops.
PANEL 2: FUNDERS SESSION
Dr. Frank Olken (Consultant, Panel Organizer) is a veteran database researcher.
He has a Ph.D. in Computer Science from Univ. of California Berkeley. He has
worked on a variety of topics in scientific and statistical databases including random
sampling from relational databases, bioinformatics, building energy management
systems, power grid informatics, workflow management, file migration, metadata
registries, etc. Most of his 35-year career was at Lawrence Berkeley National Laboratory.
He has also worked on standards development for metadata registries, RDF
and XML schema languages, etc.
From 2006 to 2010 he was detailed to the U.S. National Science Foundation
as a program director in the Computer and Information Science and Engineering
(CISE) Directorate, Information and Intelligent Systems (IIS) Division, Information
Integration and Informatics (III) program, where he managed proposal reviews
and awards in the areas of database management, graph database and mining,
data intensive computing, etc. His current research interests include semantic web
technologies, rule systems, graph data management and mining, electronic health
records, and social science data management and analytics.
He can be reached at: frankolken@gmail.com, @frankolken on Twitter, and on LinkedIn, Facebook
and Google+.
Dr. Le Gruenwald (National Science Foundation) is a Program Director and the
Cluster Lead of the Information Integration and Informatics (III) Program, in the
Information and Intelligent Systems (IIS) Division of the Computer and Information Science
and Engineering (CISE) Directorate at the National Science Foundation (NSF).
The IIS program supports research in areas such as Databases, Data Mining, Informatics,
Information Retrieval, and Social Media. She is also the Presidential and Dr.
David W. Franke Professor in the School of Computer Science at The University of
Oklahoma (OU). She received her Ph.D. in Computer Science from Southern Methodist
University in 1990. Prior to joining OU, she was a Member of Technical Staff in
the Database Management Group at the Advanced Switching Laboratory of NEC,
America, a Software Engineer at WRT, and a Lecturer in the Computer Science and
Engineering Department at Southern Methodist University.
Dr. Gruenwald’s major research interests include Mobile and Sensor Databases, Data
Security, Privacy and Confidentiality, Stream Data Management, Data Mining, Real-Time
Distributed Databases, Autonomic Data Management, Multimedia Databases
and Web Databases. She has published numerous technical papers in these areas.
She can be reached at: lgruenwa@nsf.gov
Dr. Ceren Susut (Department of Energy) joined the Department of Energy (DOE)
Office of Advanced Scientific Computing Research (ASCR) in January 2011 after
completing a National Research Council Postdoctoral Fellowship at the Center for
Nanoscale Science and Technology (CNST) at the National Institute of Standards
and Technology (NIST).
She has diverse research experience in chemistry, chemical engineering, materials
science and applied physics. At ASCR, she currently manages the Scientific Discovery
through Advanced Computing (SciDAC) portfolio.
She can be reached at: ceren.susut-bennett@science.doe.gov
Dr. Olga Brazhnik (National Institutes of Health) has a professional career of over
30 years in computational sciences and health, biomedical and clinical informatics.
She started as a physicist, applying theoretical and computational methods
in biology and medicine, and earned a Ph.D. in Computational Physics from Moscow
State University, Russia. Researching and developing technologies for transforming
data into knowledge, she worked at the University of Chicago, Virginia Tech,
Virginia Bioinformatics Institute, and the US Air Force Surgeon General Office.
She joined the National Institutes of Health (NIH) in 2004. Over her years with NIH,
she managed grants, cooperative agreements and contracts in areas of health and
biomedical informatics, semantics, visualization, multi-media, knowledge engineering,
social network analysis and collaborative technologies. Among her other duties,
she currently directs the Small Business Innovation Research (SBIR) program at the
National Center for Advancing Translational Sciences (NCATS).
Following her passion for developing collective intelligence about human health,
Dr. Brazhnik keeps exploring ways in which ubiquitous computing and cutting edge
technology enable us to employ individual creativity, wisdom of the crowds, art,
holistic approaches and solid science to benefit human health and wellbeing. She
introduced numerous novel informatics and collaborative technologies to the NIH
community and recently organized Crowdsourcing: the Art and Science of Open
Innovation (http://videocast.nih.gov/summary.asp?live=10366).
She can be reached at: brazhnik@mail.nih.gov
PANEL 3: THE FUTURE OF SCIENTIFIC DATA BASES
Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique
Federale de Lausanne (EPFL) in Switzerland. Her research interests are in designing robust
systems to support data-intensive applications, and in particular (a) in maximizing the
potential of multicore hardware and solid-state drive storage for scalable query and
transaction processing, and (b) in automating physical design to support demanding
scientific applications. She has received a European Young Investigator Award from the
European Science Foundation (2007), a Finmeccanica endowed chair from the Computer
Science Department at Carnegie Mellon (2007), an Alfred P. Sloan Research Fellowship
(2005), seven best-paper awards at top conferences (2001-2011), and an NSF CAREER
award (2002). She earned her Ph.D. in Computer Science from the University of
Wisconsin-Madison in 2000. She is a member of IEEE and ACM, and has also been a CRA-W mentor.
Jeremy Kepner received a B.A. with distinction in Astrophysics from Pomona College
(Claremont, CA). After receiving a DoE Computational Science Graduate Fellowship in 1994
he obtained his Ph.D. from the Dept. of Astrophysics at Princeton University in 1998 and
then joined MIT. His research is focused on the development of advanced libraries for the
application of massively parallel computing to a variety of data intensive signal processing
problems on which he has published many articles. Jeremy is most proud of the opportunity
he has had to be the principal architect, PI or otherwise co-lead several very talented teams.
These teams have produced a number of innovative technologies that have broken new
ground in several domains.
Alexander Szalay is the Alumni Centennial Professor of Astronomy at the Johns Hopkins
University, and Professor in the Department of Computer Science. He is a cosmologist,
working on the statistical measures of the spatial distribution of galaxies and galaxy
formation. He was born and educated in Hungary. He is the architect for the Science
Archive of the Sloan Digital Sky Survey. His papers cover areas from theoretical
cosmology to observational astronomy, spatial statistics and computer science. He is a
Corresponding Member of the Hungarian Academy of Sciences, and a Fellow of the
American Academy of Arts and Sciences. In 2004 he received an Alexander Von Humboldt
Award in Physical Sciences, in 2007 the Microsoft Jim Gray Award. In 2008 he became
Doctor Honoris Causa of the Eötvös University.
Michael Stonebraker has been a pioneer of data base
research and technology for more than a quarter of a century.
He was the main architect of the INGRES relational DBMS, and
the object-relational DBMS, POSTGRES. These prototypes were
developed at the University of California at Berkeley where
Stonebraker was a Professor of Computer Science for twenty-five
years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis
stream processing engine, the C-Store column-oriented
DBMS, and the H-Store transaction processing engine. Currently,
he is working on science-oriented DBMSs, OLTP DBMSs, and
search engines for accessing the deep web. He is the founder of
five venture-capital backed startups, which commercialized his
prototypes. Presently he serves as Chief Technology Officer of
VoltDB, Paradigm4, Inc. and Goby.com.
Professor Stonebraker is the author of scores of research papers
on data base technology, operating systems and the architecture
of system software services. He was awarded the ACM System
Software Award in 1992, for his work on INGRES. Additionally,
he was awarded the first annual Innovation award by the ACM
SIGMOD special interest group in 1994, and was elected to the
National Academy of Engineering in 1997. He was awarded the
IEEE John Von Neumann award in 2005, and is presently an Adjunct
Professor of Computer Science at M.I.T.
Awards
INFLUENTIAL PAPER AWARD
Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das:
DBXplorer: A System for Keyword-Based Search over Relational Databases. ICDE 2002.
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan:
Keyword Searching and Browsing in Databases using BANKS. ICDE 2002.
Citation
Together, these two papers from ICDE 2002 laid the foundations for keyword search
over relational databases, paving the way for a significant body of follow-on work in the
area of Information Retrieval and Databases. The solutions presented in these papers
are elegant and highly effective.
BEST PAPER AWARD
Winner
“Temporal Analytics on Big Data for Web Advertising”
Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Corporation),
Songyun Duan (IBM T. J. Watson Research Center)
Citation
The paper beautifully combines the Map-Reduce framework and ideas from data-stream
management systems for scalable temporal analytics on big data for effective behavioral
targeting on the web.
Runner-Up
“Recomputing Materialized Instances after Changes to Mappings and Data”
Todd J. Green (University of California, Davis), Zachary G. Ives (University of Pennsylvania)
Citation
The paper elegantly applies novel ideas for optimizing queries with materialized views
to the practical problem of incrementally adapting declarative schema mappings in collaborative
data sharing systems.
Abstracts
SESSION 1: PRIVACY
Privacy in Social Networks: How Risky is Your Social Graph?
Cuneyt Gurcan Akcora (University of Insubria)
Barbara Carminati (University of Insubria)
Elena Ferrari (University of Insubria)
Several efforts have been made for more privacy aware Online Social Networks
(OSNs) to protect personal data against various privacy threats. However, despite
the relevance of these proposals, we believe there is still a lack of a conceptual
model on top of which privacy tools have to be designed. Central to this model
should be the concept of risk. Therefore, in this paper, we propose a risk measure for
OSNs. The aim is to associate a risk level with social network users in order to provide
other users with a measure of how risky it might be, in terms of disclosure
of private information, to have interactions with them. We compute risk levels based
on similarity and benefit measures, by also taking into account the user risk attitudes.
In particular, we adopt an active learning approach for risk estimation, where
user risk attitude is learned from a few required user interactions. The risk estimation
process discussed in this paper has been developed into a Facebook application and
tested on real data. The experiments show the effectiveness of our proposal.
Differentially Private Spatial Decompositions
Graham Cormode (AT&T Labs – Research)
Cecilia Procopiuc (AT&T Labs – Research)
Entong Shen (North Carolina State University)
Divesh Srivastava (AT&T Labs – Research)
Ting Yu (North Carolina State University)
Differential privacy has recently emerged as the de facto standard for private data
release. This makes it possible to provide strong theoretical guarantees on the
privacy and utility of released data. While it is well-understood how to release data
based on counts and simple functions under this guarantee, it remains to provide
general purpose techniques to release data that is useful for a variety of queries. In
this paper, we focus on spatial data such as locations and more generally any multidimensional
data that can be indexed by a tree structure. Directly applying existing
differential privacy methods to this type of data simply generates noise. We
propose instead the class of “private spatial decompositions”: these adapt standard
spatial indexing methods such as quadtrees and kd-trees to provide a private description
of the data distribution. Equipping such structures with differential privacy
requires several steps to ensure that they provide meaningful privacy guarantees.
Various basic steps, such as choosing splitting points and describing the distribution
of points within a region, must be done privately, and the guarantees of the
different building blocks composed to provide an overall guarantee. Consequently,
we expose the design space for private spatial decompositions, and analyze some
key examples. A major contribution of our work is to provide new techniques for
parameter setting and post-processing the output to improve the accuracy of query
answers. Our experimental study demonstrates that it is possible to build such
decompositions efficiently, and use them to answer a variety of queries privately
with high accuracy.
Differentially Private Histogram Publication
Jia Xu (Northeastern University, China)
Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)
Xiaokui Xiao (Nanyang Technological University)
Yin Yang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)
Ge Yu (Northeastern University, China)
Differential privacy (DP) is a promising scheme for releasing the results of statistical
queries on sensitive data, with strong privacy guarantees against adversaries with
arbitrary background knowledge. Existing studies on DP mostly focus on simple aggregations
such as counts. This paper investigates the publication of DP-compliant
histograms, which are an important analytical tool for showing the distribution of a
random variable, e.g., hospital bill size for certain patients. Compared to simple aggregations
whose results are purely numerical, a histogram query is inherently more
complex, since it must also determine its structure, i.e., the ranges of the bins. As
we demonstrate in the paper, a DP-compliant histogram with finer bins may actually
lead to significantly lower accuracy than a coarser one, since the former requires
stronger perturbations in order to satisfy DP. Moreover, the histogram structure itself
may reveal sensitive information, which further complicates the problem. Motivated
by this, we propose two novel algorithms, namely NoiseFirst and StructureFirst, for
computing DP-compliant histograms. Their main difference lies in the relative order
of the noise injection and the histogram structure computation steps. NoiseFirst
has the additional benefit that it can improve the accuracy of an already published
DP-compliant histogram computed using a naive method. Going one step further,
we extend both solutions to answer arbitrary range queries. Extensive experiments,
using several real data sets, confirm that the proposed methods output highly accurate
query answers, and consistently outperform existing competitors.
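The trade-off between finer and coarser bins mentioned in this abstract can be illustrated with a minimal sketch. It is not the NoiseFirst or StructureFirst algorithm from the paper; it uses the plain Laplace mechanism, and it assumes disjoint bins where adding or removing one record changes a single count by 1. The function names and the example counts are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with mean 0 and the given scale.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_histogram(counts, epsilon, rng):
    # Assumption of this sketch: bins are disjoint and adding or removing
    # one record changes exactly one count by 1 (sensitivity 1).
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]


def range_query_noise_variance(bins_covered, epsilon):
    # Each Laplace(1/epsilon) draw has variance 2/epsilon^2, and the
    # independent per-bin noises add up when bin counts are summed.
    return bins_covered * 2.0 / epsilon ** 2


rng = random.Random(0)
noisy = laplace_histogram([120, 80, 45, 15], epsilon=0.5, rng=rng)

# A range spanning half the domain covers 8 of 16 fine bins
# but only 2 of 4 coarse bins.
fine_variance = range_query_noise_variance(8, epsilon=0.5)
coarse_variance = range_query_noise_variance(2, epsilon=0.5)
```

Under these assumptions the same range query accumulates four times as much noise variance on the finer histogram. The coarser histogram pays instead by hiding how records are distributed inside each bin, which is the other side of the trade-off.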
Privacy-Preserving and Content-Protecting Location Based Queries
Russell Paulet (Victoria University)
Md. Golam Kaosar (Victoria University)
Xun Yi (Victoria University)
Elisa Bertino (Purdue University)
In this paper we present a solution to one of the location-based query problems.
This problem is defined as follows: (i) a user wants to query a database of location
data, known as Points Of Interest (POI), and does not want to reveal his/her location
to the server due to privacy concerns; (ii) the owner of the location data, that
is, the location server, does not want to simply distribute its data to all users. The
location server desires to have some control over its data, since the data is its asset.
Previous solutions have used a trusted anonymiser to address privacy, but introduced
the impracticality of trusting a third party. More recent solutions have used
homomorphic encryption to remove this weakness. Briefly, the user submits his/her
encrypted coordinates to the server and the server would determine the user’s location
homomorphically, and then the user would acquire the corresponding record
using Private Information Retrieval techniques. We propose a major enhancement
upon this result by introducing a similar two stage approach, where the homomorphic
comparison step is replaced with Oblivious Transfer to achieve a more secure
solution for both parties. The solution we present is efficient and practical in many
scenarios. We also include the results of a working prototype to illustrate the efficiency
of our protocol.
SESSION 2: WEB 2.0 APPLICATIONS
GeoFeed: A Location-Aware News Feed
Jie Bao (University of Minnesota at Twin Cities)
Mohamed F. Mokbel (University of Minnesota at Twin Cities)
Chi-Yin Chow (City University of Hong Kong)
This paper presents the GeoFeed system, a location-aware news feed system that
provides a new platform for its users to get spatially related message updates from
either their friends or favorite news sources. GeoFeed distinguishes itself from all
existing news feed systems in that it takes into account the spatial extents of messages
and user locations when deciding upon the selected news feed. GeoFeed
is equipped with three different approaches for delivering the news feed to its
users, namely, spatial pull, spatial push, and shared push. The main challenge
of GeoFeed is then to decide when to use each of these three approaches and for which
users. GeoFeed is equipped with a smart decision model that decides about using
these approaches in a way that: (a) minimizes the system overhead for delivering
the location-aware news feed, and (b) guarantees a certain response time for each
user to obtain the requested location-aware news feed. Experimental results, based
on real and synthetic data, show that GeoFeed is favorable over existing news feed
systems, with a minimal system overhead.
Temporal Analytics on Big Data for Web Advertising<br />
Badrish Chandramouli (Microsoft Research)<br />
Jonathan Goldstein (Microsoft Corp.)<br />
Songyun Duan (IBM T. J. Watson Research Center)<br />
“Big Data” in map-reduce (M-R) clusters is often fundamentally temporal in nature,<br />
as are many analytics tasks over such data. For instance, display advertising uses<br />
Behavioral Targeting (BT) to select ads for users based on prior searches, page<br />
views, etc. Previous work on BT has focused on techniques that scale well for offline<br />
data using M-R. However, this approach has limitations for BT-style applications that<br />
deal with temporal data: (1) many queries are temporal and not easily expressible in<br />
M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not<br />
suitable for temporal processing; (2) as commercial systems mature, they may need<br />
to also directly analyze and react to real-time data feeds since a high turnaround<br />
time can result in missed opportunities, but it is difficult for current solutions to<br />
naturally also operate over real-time streams. Our contributions are twofold. First,<br />
we propose a novel framework called TiMR (pronounced timer) that combines a<br />
time-oriented data processing system with an M-R framework. Users write and submit<br />
analysis algorithms as temporal queries; these queries are succinct, scale-out-agnostic,<br />
and easy to write. They scale well on large-scale offline data using TiMR,<br />
and can work unmodified over real-time streams. We also propose new cost-based<br />
query fragmentation and temporal partitioning schemes for improving efficiency<br />
with TiMR. Second, we show the feasibility of this approach for BT, with new temporal<br />
algorithms that exploit new targeting opportunities. Experiments using real data<br />
from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude<br />
lower development effort. Our BT solution is easy and succinct, and<br />
performs up to several times better than current schemes in terms of memory,<br />
learning time, and click-through-rate/coverage.<br />
Entity Search Strategies for Mashup Applications<br />
Stefan Endrullis (University of Leipzig)<br />
Andreas Thor (University of Leipzig)<br />
Erhard Rahm (University of Leipzig)<br />
Programmatic data integration approaches such as mashups have become a viable<br />
approach to dynamically integrate web data at runtime. Key data sources<br />
for mashups include entity search engines and hidden databases that need to be<br />
queried via source-specific search interfaces or web forms. Current mashups are<br />
typically restricted to simple query approaches such as using keyword search. Such<br />
approaches may need a high number of queries if many objects have to be found.<br />
Furthermore, the effectiveness of the queries may be limited, i.e., they may miss<br />
relevant results. We therefore propose more advanced search strategies that aim at<br />
finding a set of entities with high efficiency and high effectiveness. Our strategies<br />
use different kinds of queries that are determined by source-specific query generators.<br />
Furthermore, the queries are selected based on the characteristics of input<br />
entities. We introduce a flexible model for entity search strategies that includes<br />
a ranking of candidate queries determined by different query generators. We<br />
describe different query generators and outline their use within four entity search<br />
strategies. These strategies apply different query ranking and selection approaches<br />
to optimize efficiency and effectiveness. We evaluate our search strategies in detail<br />
for two domains: product search and publication search. The comparison with a<br />
standard keyword search shows that the proposed search strategies provide significant<br />
improvements in both domains.<br />
CI-Rank: Ranking Keyword Search Results Based on Collective Importance<br />
Xiaohui Yu (York University & Shandong University)<br />
Huxia Shi (York University)<br />
Keyword search over databases, popularized by keyword search on the Web, allows<br />
ordinary users to access database information without the knowledge of structured<br />
query languages and database schemas. Most of the previous studies in this area<br />
use IR-style ranking, which fails to consider the importance of the query answers. In<br />
this paper, we propose CI-Rank, a new approach for keyword search in databases,<br />
which considers the importance of individual nodes in a query answer and the<br />
cohesiveness of the result structure in a balanced way. CI-Rank is built upon a carefully<br />
designed model called Random Walk with Message Passing that helps capture<br />
the relationships between different nodes in the query answer. We develop a branch<br />
and bound algorithm to support the efficient generation of top-k query answers.<br />
Indexing methods are also introduced to further speed up the run-time processing<br />
of queries. Extensive experiments conducted on two real data sets with a real user<br />
query log confirm the effectiveness and efficiency of CI-Rank.<br />
SESSION 3: STORAGE MANAGEMENT<br />
Lookup Tables: Fine-Grained Partitioning for Distributed Databases<br />
Aubrey L. Tatarowicz (MIT)<br />
Carlo Curino (MIT)<br />
Evan P. C. Jones (MIT)<br />
Sam Madden (MIT)<br />
The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally<br />
partition data across several nodes. Ideally, this partitioning will result in each query<br />
being executed at just one node, to avoid the overheads of distributed transactions<br />
and allow nodes to be added without increasing the amount of required coordination.<br />
For some applications, simple strategies, such as hashing on primary key, provide<br />
this property. Unfortunately, for many applications, including social networking<br />
and order-fulfillment, many-to-many relationships cause simple strategies to result<br />
in a large fraction of distributed queries. Instead, what is needed is a fine-grained<br />
partitioning, where related individual tuples (e.g., cliques of friends) are co-located<br />
in the same partition. Maintaining such a fine-grained partitioning requires<br />
the database to store a large amount of metadata about which partition each tuple<br />
resides in. We call such metadata a lookup table, and present the design of a data<br />
distribution layer that efficiently stores these tables and maintains them in the<br />
presence of inserts, deletes, and updates. We show that such tables can provide<br />
scalability for several difficult-to-partition database workloads, including Wikipedia,<br />
Twitter, and TPC-E. Our implementation provides 40% to 300% better performance<br />
on these workloads than either simple range or hash partitioning and shows greater<br />
potential for further scale-out.<br />
Temporal Support for Persistent Stored Modules<br />
Richard T. Snodgrass (University of Arizona)<br />
Dengfeng Gao (IBM Silicon Valley Lab)<br />
Rui Zhang (University of Arizona)<br />
Stephen W. Thomas (Queen’s University, Kingston)<br />
We show how to extend temporal support of SQL to the Turing-complete portion<br />
of SQL, that of persistent stored modules (PSM). Our approach requires minor new<br />
syntax beyond that already in SQL/Temporal to define and to invoke PSM routines,<br />
thereby extending the current, sequenced, and non-sequenced semantics of<br />
queries to PSM routines. Temporal upward compatibility (existing applications work<br />
as before when one or more tables are rendered temporal) is ensured. We provide<br />
a transformation that converts Temporal SQL/PSM to conventional SQL/PSM. To<br />
support sequenced evaluation of PSM routines, we define two different slicing approaches,<br />
maximal slicing and per-statement slicing. We compare these approaches<br />
empirically using a comprehensive benchmark and provide a heuristic for choosing<br />
between them.<br />
Energy Efficient Storage Management Cooperated with Large Data<br />
Intensive Applications<br />
Norifumi Nishikawa (The University of Tokyo)<br />
Miyuki Nakano (The University of Tokyo)<br />
Masaru Kitsuregawa (The University of Tokyo)<br />
Power, especially that consumed for storing data, and cooling costs for datacenters<br />
have increased rapidly. The main applications running at datacenters are data intensive<br />
applications such as large file servers or database systems. Recently, power<br />
management of the data intensive applications has been emphasized in the literature.<br />
Such reports discuss the importance of power savings. However, these reports<br />
lack research on power management models for the efficient use of data intensive<br />
applications’ I/O behaviors. This paper proposes a novel energy efficient storage<br />
management system that monitors both application- and device-level I/O patterns<br />
at run time, and uses not only the device-level I/O pattern but also application-level<br />
patterns. First, the design of the proposed model combined with such large data<br />
intensive applications will be shown. The key features of the model are i) classifying<br />
application-level I/O into four patterns using run-time access behaviors such as the<br />
length of idle time and read/write frequency, and ii) adopting an appropriate power-saving<br />
method based on these application-level I/O patterns. Next, the proposed<br />
method is quantitatively evaluated with typical data intensive applications such as<br />
file servers, OLTP, and DSS. It is shown that energy efficient storage management<br />
is effective in achieving large power savings compared with traditional approaches<br />
while an application is running.<br />
ISOBAR Preconditioner for Effective and High-throughput Lossless<br />
Data Compression<br />
Eric R. Schendel (North Carolina State University)<br />
Ye Jin (North Carolina State University)<br />
Neil Shah (North Carolina State University)<br />
Jackie Chen (Sandia National Laboratory)<br />
C.S. Chang (Princeton Plasma Physics Laboratory, Princeton, NJ 08543, USA)<br />
Seung-Hoe Ku (New York University)<br />
Stephane Ethier (Princeton Plasma Physics Laboratory)<br />
Scott Klasky (Oak Ridge National Laboratory)<br />
Robert Latham (Argonne National Laboratory)<br />
Robert Ross (Argonne National Laboratory)<br />
Nagiza F. Samatova (North Carolina State University & Oak Ridge National Laboratory)<br />
Efficient handling of large volumes of data is a necessity for exascale scientific applications<br />
and database systems. To address the growing imbalance between the<br />
amount of available storage and the amount of data being produced by high speed<br />
(FLOPS) processors on the system, data must be compressed to reduce the total<br />
amount of data placed on the file systems. General-purpose lossless compression<br />
frameworks, such as zlib and bzip2, are commonly used on datasets requiring lossless<br />
compression. Quite often, however, scientific datasets compress poorly<br />
(so-called hard-to-compress datasets) due to the negative impact of highly entropic<br />
content represented within the data. An important problem in better lossless<br />
data compression is to identify the hard-to-compress information and subsequently<br />
optimize the compression techniques at the byte-level. To address this challenge,<br />
we introduce the In-Situ Orthogonal Byte Aggregate Reduction Compression<br />
(ISOBAR-compress) methodology as a preconditioner of lossless compression to<br />
identify and optimize the compression efficiency and throughput of hard-to-compress<br />
datasets.<br />
SESSION 4: DATA STREAMS PROCESSING<br />
Physically Independent Stream Merging<br />
Badrish Chandramouli (Microsoft Research)<br />
David Maier (Portland State University)<br />
Jonathan Goldstein (Microsoft Corp.)<br />
A facility for merging equivalent data streams can support multiple capabilities<br />
in a data stream management system (DSMS), such as query-plan switching and<br />
high availability. One can logically view a data stream as a temporal table of events,<br />
each associated with a lifetime (time interval) over which the event contributes to<br />
output. In many applications, the “same” logical stream may present itself physically<br />
in multiple forms, for example, due to disorder arising in transmission or<br />
from combining multiple sources, and due to modifications of earlier events. Merging such<br />
streams correctly is challenging when the streams may differ physically in timing,<br />
order, and composition. This paper introduces a new stream operator called Logical<br />
Merge (LMerge) that takes multiple logically consistent streams as input and<br />
outputs a single stream that is compatible with all of them. LMerge can handle the<br />
dynamic attachment and detachment of input streams. We present a range of algorithms<br />
for LMerge that can exploit compile-time stream properties for efficiency.<br />
Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes<br />
orders-of-magnitude more efficient than enforcing determinism on inputs,<br />
and that there is benefit to using specialized algorithms when stream variability<br />
is limited. We also show that LMerge and its extensions can provide performance<br />
benefits in several real-world applications.<br />
On Computing Correlated Aggregates over a Data Stream<br />
Srikanta Tirthapura (Iowa State University)<br />
David P. Woodruff (IBM Almaden Research Center)<br />
On a stream of two-dimensional data items (x,y) where x is an item identifier and<br />
y is a numerical attribute, a correlated aggregate query requires us to first apply<br />
a selection predicate along the second (y) dimension, followed by an aggregation<br />
along the first (x) dimension. For selection predicates of the form (y < c) or (y > c),<br />
where parameter c is provided at query time, we present new streaming algorithms<br />
and lower bounds for estimating statistics of the resulting substream of elements<br />
that satisfy the predicate. We provide the first sublinear space algorithms for a large<br />
family of statistics in this model, including frequency moments. We experimentally<br />
validate our algorithms, showing that their memory requirements are significantly<br />
smaller than existing linear storage schemes for large datasets, while simultaneously<br />
achieving fast per-record processing time. We also study the problem when<br />
the items have weights. Allowing negative weights allows for analyzing values which<br />
occur in the symmetric difference of two datasets. We give a strong space lower<br />
bound which holds even if the algorithm is allowed up to a logarithmic number of<br />
passes over the data (before the query is presented). We complement this with a<br />
small space algorithm which uses a logarithmic number of passes.<br />
Accuracy-Aware Uncertain Stream Databases<br />
Tingjian Ge (University of Kentucky)<br />
Fujun Liu (University of Kentucky)<br />
Previous work has introduced probability distributions as first-class components in<br />
uncertain stream database systems. A missing element is how accurate<br />
these probability distributions are. This indeed has a profound impact on the accuracy<br />
of query results presented to end users. While there is some previous work<br />
that studies unreliable intermediate query results in the tuple uncertainty model,<br />
to the best of our knowledge, we are the first to consider an uncertain stream<br />
database in which accuracy is taken into consideration all the way from the learned<br />
distributions based on raw data samples to the query results. We perform an initial<br />
study of various components in an accuracy-aware uncertain stream database<br />
system, including the representation of accuracy information and how to obtain<br />
query results’ accuracy. In addition, we propose novel predicates based on hypothesis<br />
testing for decision-making using data with limited accuracy. We augment our<br />
study with a comprehensive set of experimental evaluations.<br />
On Discovery of Traveling Companions from Streaming Trajectories<br />
Lu-An Tang (UIUC)<br />
Yu Zheng (MSRA)<br />
Jing Yuan (MSRA)<br />
Jiawei Han (UIUC)<br />
Alice Leung (BBN)<br />
Chih-Chieh Hung (Yahoo!)<br />
Wen-Chih Peng (NCTU)<br />
The advance of object tracking technologies leads to huge volumes of spatio-temporal<br />
data collected in the form of trajectory data streams. In this study, we investigate<br />
the problem of discovering object groups that travel together (i.e., traveling<br />
companions) from trajectory streams. Such a technique has broad applications in the<br />
areas of scientific study, transportation management and military surveillance. To<br />
discover traveling companions, the monitoring system should cluster the objects<br />
of each snapshot and intersect the clustering results to retrieve moving-together<br />
objects. Since both clustering and intersection steps involve high computational<br />
overhead, the key issue of companion discovery is to improve the algorithm’s efficiency.<br />
We propose the models of closed companion candidates and smart intersection<br />
to accelerate data processing. A new data structure termed traveling buddy<br />
is designed to facilitate scalable and flexible companion discovery on trajectory<br />
streams. The traveling buddies are micro-groups of objects that are tightly bound together.<br />
By only storing the object relationships rather than their spatial coordinates,<br />
the buddies can be dynamically maintained along the trajectory stream with low cost.<br />
Based on traveling buddies, the system can discover companions without accessing<br />
the object details. The proposed methods are evaluated with extensive experiments<br />
on both real and synthetic datasets. The buddy-based method is an order of<br />
magnitude faster than existing methods. It also outperforms other competitors with<br />
higher precision and recall in companion discovery.<br />
SESSION 5: GRAPHS<br />
Iterative Graph Feature Mining for Graph Indexing<br />
Dayu Yuan (Penn State University)<br />
Prasenjit Mitra (Penn State University)<br />
Huiwen Yu (Penn State University)<br />
C. Lee Giles (Penn State University)<br />
Subgraph search is a popular query scenario on graph databases. Given a query<br />
graph q, the subgraph search algorithm returns all database graphs having q as a<br />
subgraph. In order to quickly process the subgraph search, subgraph features are<br />
mined to index the graph database. Many subgraph feature mining approaches<br />
have been proposed. They are all “mine-at-once” algorithms in which the whole<br />
feature set is mined with one run of the mining before building a stable graph index.<br />
However, due to changes in the environment (such as updates to the graph<br />
database and increases in available memory), the index needs to be updated to<br />
accommodate those changes. Since most of the “mine-at-once” algorithms involve<br />
frequent subgraph or subtree mining over the whole graph database, and constructing<br />
and deploying a new index involve expensive disk operations, it is not efficient<br />
to re-mine the features and rebuild the index from scratch. We observe that,<br />
in most cases, it is sufficient to update a small part of the graph index. In this<br />
paper, we propose an “iterative subgraph mining” algorithm, finding one feature<br />
to insert into (or remove from) the index iteratively. Since the majority of indexing<br />
features and the index structure are not changed, the algorithm can be frequently<br />
invoked. We first introduce the objective function that guides the feature mining.<br />
Then, a basic branch and bound algorithm is proposed to mine the features. Finally,<br />
we design an advanced search algorithm, which quickly finds a near-optimum<br />
subgraph feature and reduces the search space. Experiments show that our feature<br />
mining algorithm is 5 times faster than GIndex on updating the graph index, and<br />
features mined by the iterative algorithm have a high filtering rate on the subgraph<br />
search problem.<br />
An Efficient Graph Indexing Method<br />
Xiaoli Wang (National University of Singapore)<br />
Xiaofeng Ding (Huazhong University of Science and Technology)<br />
Anthony K.H. Tung (National University of Singapore)<br />
Shanshan Ying (National University of Singapore)<br />
Hai Jin (Huazhong University of Science and Technology)<br />
Graphs are popular models for representing complex structure data, and similarity<br />
search for graphs has become a fundamental research problem. Many techniques<br />
have been proposed to support similarity search based on the graph edit distance.<br />
However, they all suffer from certain drawbacks: high computational complexity,<br />
poor scalability in terms of database size, or not taking full advantage of indexes. To<br />
address these problems, in this paper, we propose SEGOS, an indexing and query<br />
processing framework for graph similarity search. First, an effective two-level index<br />
is constructed off-line based on sub-unit decomposition of graphs. Then, a novel<br />
search strategy based on the index is proposed. Two algorithms adapted from TA<br />
and CA methods are seamlessly integrated into the proposed strategy to enhance<br />
graph search. More specifically, the proposed framework is easy to pipeline to<br />
support continuous graph pruning. Extensive experiments are conducted on two<br />
real datasets to evaluate the effectiveness and scalability of our approaches.<br />
PRAGUE: Towards Blending Practical Visual Subgraph Query<br />
Formulation and Query Processing<br />
Changjiu Jin (Nanyang Technological University)<br />
Sourav S. Bhowmick (Nanyang Technological University)<br />
Byron Choi (Hong Kong Baptist University)<br />
Shuigeng Zhou (Fudan University)<br />
In a previous paper, we laid out the vision of a novel graph query processing paradigm<br />
that, instead of processing a visual query graph after its construction, interleaves<br />
visual query formulation and processing by exploiting the latency offered<br />
by the GUI to filter irrelevant matches and prefetch partial query results [8]. Our<br />
first attempt at implementing this vision, called GBLENDER [8], shows significant<br />
improvement in system response time (SRT) for subgraph containment queries.<br />
However, GBLENDER suffers from two key drawbacks, namely inability to handle<br />
visual subgraph similarity queries and inefficient support for visual query modification,<br />
limiting its usage in practical environments. In this paper, we propose a novel<br />
algorithm called PRAGUE (PRactical visuAl Graph QUery blEnder), that addresses<br />
these limitations by exploiting a novel data structure called spindle-shaped graphs<br />
(SPIG). A SPIG succinctly records various information related to the set of supergraphs<br />
of a newly added edge in the visual query fragment. Specifically, PRAGUE<br />
realizes a unified visual framework to support SPIG-based processing of modification-efficient<br />
subgraph containment and similarity queries. Extensive experiments<br />
on real-world and synthetic datasets demonstrate the effectiveness of PRAGUE.<br />
Ego-centric Graph Pattern Census<br />
Walaa Eldin Moustafa (University of Maryland, College Park)<br />
Amol Deshpande (University of Maryland, College Park)<br />
Lise Getoor (University of Maryland, College Park)<br />
There is increasing interest in analyzing networks of all types including social, biological,<br />
sensor, computer, and transportation networks. Broadly speaking, we may<br />
be interested in global network-wide analysis (e.g., centrality analysis, community<br />
detection) where the properties of the entire network are of interest, or local ego-centric<br />
analysis where the focus is on studying the properties of nodes (egos) by<br />
analyzing their neighborhood subgraphs. In this paper we propose and study ego-centric<br />
pattern census queries, a new type of graph analysis query, where a given<br />
structural pattern is searched for in every node’s neighborhood and the counts are<br />
reported or used in further analysis. This kind of analysis is useful in many domains<br />
in social network analysis including opinion leader identification, node classification,<br />
link prediction, and role identification. We propose an SQL-based declarative<br />
language to support this class of queries, and develop a series of efficient query<br />
evaluation algorithms for it. We evaluate our algorithms on a variety of synthetically<br />
generated graphs. We also show an application of our language in a real-world<br />
scenario for predicting future collaborations from DBLP data.<br />
SESSION 6: UNCERTAIN AND PROBABILISTIC DATABASES<br />
Searching Uncertain Data Represented by Non-Axis Parallel Gaussian<br />
Mixture Models<br />
Katrin Haegler (University of Munich)<br />
Frank Fiedler (University of Munich)<br />
Christian Böhm (University of Munich)<br />
Efficient similarity search in uncertain data is a central problem in many modern<br />
applications such as biometric identification, stock market analysis, sensor networks,<br />
medical imaging, etc. In such applications, the feature vector of an object<br />
is not exactly known but is rather defined by a probability density function like a<br />
Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian<br />
distributions; hence, correlations between different features are not considered in<br />
the similarity search. In this paper, we propose a novel, efficient similarity search<br />
technique for general GMMs without independence assumption for the attributes,<br />
named SUDN, which approximates the actual components of a GMM in a conservative<br />
but tight way. A filter-refinement architecture guarantees no false dismissals,<br />
due to conservativity, as well as a good filter selectivity, due to the tightness of<br />
our approximations. An extensive experimental evaluation of SUDN demonstrates<br />
a considerable speed-up of similarity queries on general GMMs and an increase in<br />
accuracy compared to existing approaches.<br />
Aggregate Query Answering on Possibilistic Data with<br />
Cardinality Constraints<br />
Graham Cormode (AT&T Labs – Research)<br />
Entong Shen (North Carolina State University)<br />
Divesh Srivastava (AT&T Labs – Research)<br />
Ting Yu (North Carolina State University)<br />
Uncertainties in data can arise for a number of reas<strong>on</strong>s: when data is incomplete,<br />
c<strong>on</strong>tains c<strong>on</strong>flicting informati<strong>on</strong> or has been deliberately perturbed or coarsened to<br />
remove sensitive details. An important case which arises in many real applicati<strong>on</strong>s<br />
is when the data describes a set of possibilities, but with cardinality c<strong>on</strong>straints.<br />
These c<strong>on</strong>straints represent correlati<strong>on</strong>s between tuples encoding, e.g. that at most<br />
two possible records are correct, or that there is an (unknown) <strong>on</strong>e-to-<strong>on</strong>e mapping<br />
between a set of tuples and attribute values. Although there has been much effort to<br />
handle uncertain data, current systems are not equipped to handle such correlati<strong>on</strong>s,<br />
bey<strong>on</strong>d simple mutual exclusi<strong>on</strong> and co-existence c<strong>on</strong>straints. Vitally, they have little<br />
support for efficiently handling aggregate queries <strong>on</strong> such data. In this paper, we aim<br />
to address some of these deficiencies, by introducing LICM (Linear Integer C<strong>on</strong>straint<br />
Model), which can succinctly represent many types of tuple correlati<strong>on</strong>s, particularly<br />
a class of cardinality c<strong>on</strong>straints. We motivate and explain the model with<br />
examples from data cleaning and masking sensitive data, to show that it enables<br />
modeling and querying such data, which was not previously possible. We develop an<br />
efficient strategy to answer c<strong>on</strong>junctive and aggregate queries <strong>on</strong> possibilistic data<br />
by describing how to implement relati<strong>on</strong>al operators over data in the model. LICM<br />
compactly integrates the encoding of correlati<strong>on</strong>s, query answering and lineage<br />
recording. In combinati<strong>on</strong> with off-the-shelf linear integer programming solvers, our<br />
approach provides exact bounds for aggregate queries. Our prototype implementati<strong>on</strong><br />
dem<strong>on</strong>strates that query answering with LICM can be effective and scalable.<br />
Discovering Threshold-based Frequent Closed Itemsets over<br />
Probabilistic Data<br />
Yongxin Tong (Hong Kong University of Science and Technology)<br />
Lei Chen (Hong Kong University of Science and Technology)<br />
Bolin Ding (University of Illinois at Urbana-Champaign)<br />
In recent years, many new applications, such as sensor network monitoring and<br />
moving object search, show the growing importance of uncertain data<br />
management and mining. In this paper, we study the problem of discovering<br />
threshold-based frequent closed itemsets over probabilistic data. Frequent itemset<br />
mining over probabilistic databases has attracted much attention recently. However,<br />
existing solutions may lead to an exponential number of results due to the downward<br />
closure property over probabilistic data. Moreover, it is hard to directly extend the<br />
successful experiences from mining exact data to a probabilistic environment due<br />
to the inherent uncertainty of data. Thus, in order to obtain a reasonable result set<br />
with small size, we study discovering frequent closed itemsets over probabilistic<br />
data. We prove that even a sub-problem of this problem, computing the frequent<br />
closed probability of an itemset, is #P-hard. Therefore, we develop an efficient<br />
mining algorithm based on a depth-first search strategy to obtain all probabilistic<br />
frequent closed itemsets. To reduce the search space and avoid redundant computation,<br />
we further design several probabilistic pruning and bounding techniques.<br />
Finally, we verify the effectiveness and efficiency of the proposed methods through<br />
extensive experiments.<br />
Ranking Query Answers in Probabilistic Databases: Complexity and<br />
Efficient Algorithms<br />
Dan Olteanu (Oxford)<br />
Hongkai Wen (Oxford)<br />
In many applications of probabilistic databases, the probabilities are mere degrees<br />
of uncertainty in the data and are not otherwise meaningful to the user. Often, users<br />
care only about the ranking of answers in decreasing order of their probabilities<br />
or about a few most likely answers. In this paper, we investigate the problem of<br />
ranking query answers in probabilistic databases. We give a dichotomy for ranking<br />
in case of conjunctive queries without repeating relation symbols: it is either<br />
in polynomial time or #P-hard. Surprisingly, our syntactic characterisation of<br />
tractable queries is not the same as for probability computation. The key observation<br />
is that there are queries for which probability computation is #P-hard, yet<br />
ranking can be computed in polynomial time. This is possible whenever probability<br />
computation for distinct answers has a common factor that is hard to compute but<br />
irrelevant for ranking. We complement this tractability analysis with an effective<br />
ranking technique for conjunctive queries. Given a query, we construct a share plan,<br />
which exposes subqueries whose probability computation can be shared or ignored<br />
across query answers. Our technique combines share plans with incremental approximate<br />
probability computation of subqueries. We implemented our technique<br />
in the SPROUT query engine and report on performance gains of orders of magnitude<br />
over Monte Carlo simulation using FPRAS and exact probability computation<br />
based on knowledge compilation.<br />
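The common-factor observation in the abstract above can be illustrated with a minimal sketch.<br />
It is not the share plan construction of the paper: the answer names, the per-answer factors<br />
and the value of the shared factor are all hypothetical, chosen only to show that a positive<br />
factor shared by every answer cannot change their relative order.<br />

```python
# Illustrative sketch only: hypothetical answers and probabilities.
# Assumption: each answer's probability factorises as common * own_factor,
# where `common` is expensive to compute and identical for all answers.

def rank(scores):
    """Return answer ids in decreasing order of score."""
    return sorted(scores, key=scores.get, reverse=True)

# Cheap, answer-specific factors (hypothetical values).
own_factor = {"a1": 0.30, "a2": 0.75, "a3": 0.50}

def full_probabilities(common):
    """Multiply every answer by the same shared factor."""
    return {answer: common * p for answer, p in own_factor.items()}

ranking_without_common = rank(own_factor)
ranking_with_common = rank(full_probabilities(0.2))
```

Both rankings coincide, so in this toy setting the shared factor never needs to be evaluated<br />
when only the order of the answers is required.<br />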
SESSION 7: DATA INTEGRATION AND EXTRACTION<br />
Joint Entity Resolution<br />
Steven Euijong Whang (Stanford University)<br />
Hector Garcia-Molina (Stanford University)<br />
Entity resolution (ER) is the problem of identifying which records in a database<br />
represent the same entity. Often, records of different types are involved (e.g.,<br />
authors, publications, institutions, venues), and resolving records of one type can<br />
impact the resolution of other types of records. In this paper we propose a flexible,<br />
modular resolution framework where existing ER algorithms developed for a given<br />
record type can be plugged in and used in concert with other ER algorithms. Our<br />
approach also makes it possible to run ER on subsets of similar records at a time,<br />
important when the full data is too large to resolve together. We study the scheduling<br />
and coordination of the individual ER algorithms in order to resolve the full data<br />
set. We then evaluate our joint ER techniques on synthetic and real data and show<br />
the scalability of our approach.<br />
A Self-Configuring Schema Matching System<br />
Eric Peukert (SAP Research Dresden)<br />
Julian Eberius (Dresden University of Technology)<br />
Erhard Rahm (University of Leipzig)<br />
Mapping complex metadata structures is crucial in a number of domains such as<br />
data integration, ontology alignment or model management. To speed up the generation<br />
of such mappings, automatic matching systems were developed to compute<br />
mapping suggestions that can be corrected by a user. However, constructing and<br />
tuning match strategies still requires a high manual effort by matching experts as<br />
well as correct mappings to evaluate generated mappings. We therefore propose<br />
a self-configuring schema matching system that is able to automatically adapt to<br />
the mapping problem at hand. Our approach is based on analyzing the input<br />
schemas as well as intermediate matching results. A variety of matching rules use<br />
the analysis results to automatically construct and adapt an underlying matching<br />
process for a given match task. We comprehensively evaluate our approach on<br />
different mapping problems from the schema, ontology and model management<br />
domains. The evaluation shows that our system is able to robustly return good quality<br />
mappings across different mapping problems and domains.<br />
Incremental Detection of Inconsistencies in Distributed Data<br />
Wenfei Fan (University of Edinburgh)<br />
Jianzhong Li (Harbin Institute of Technology)<br />
Nan Tang (University of Edinburgh & Qatar Computing Research Institute)<br />
Wenyuan Yu (University of Edinburgh)<br />
This paper investigates the problem of incremental detection of errors in distributed<br />
data. Given a distributed database D, a set Σ of conditional functional dependencies<br />
(CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem<br />
is to find, with minimum data shipment, changes ΔV to V in response to ΔD. The<br />
need for the study is evident since real-life data is often dirty, distributed and<br />
frequently updated. It is often prohibitively expensive to recompute the entire set<br />
of violations when D is updated. We show that the incremental detection problem<br />
is NP-complete for D partitioned either vertically or horizontally, even when Σ and D<br />
are fixed. Nevertheless, we show that it is bounded and, better still, actually optimal:<br />
there exist algorithms to detect errors such that their computational cost and<br />
data shipment are both linear in the size of ΔD and ΔV, independent of the size of<br />
the database D. We provide such incremental algorithms for vertically partitioned<br />
data, and show that the algorithms are optimal. We further propose optimization<br />
techniques for the incremental algorithm over vertical partitions to reduce data<br />
shipment. We verify experimentally, using real-life data on Amazon Elastic Compute<br />
Cloud (EC2), that our algorithms substantially outperform their batch counterparts<br />
even when ΔV is reasonably large.<br />
Recomputing Materialized Instances after Changes to Mappings and Data<br />
Todd J. Green (University of California, Davis)<br />
Zachary G. Ives (University of Pennsylvania)<br />
A major challenge faced by today’s information systems is that of evolution as<br />
data usage evolves or new data resources become available. Modern organizations<br />
sometimes exchange data with one another via declarative mappings among<br />
their databases, as in data exchange and collaborative data sharing systems. Such<br />
mappings are frequently revised and refined as new data becomes available, new<br />
cross-reference tables are created, and corrections are made. A fundamental question<br />
is how to handle changes to these mapping definitions, when the organizations<br />
each materialize the results of applying the mappings to the available data. We<br />
consider how to incrementally recompute these database instances in this setting,<br />
reusing (if possible) previously computed instances to speed up computation. We<br />
develop a principled solution that performs cost-based exploration of recomputation<br />
versus reuse, and simultaneously handles updates to source data and mapping<br />
definitions through a single, unified mechanism. Our solution also takes advantage<br />
of provenance information, when present, to speed up computation even further.<br />
We present an implementation that takes advantage of an off-the-shelf DBMS’s<br />
query processing system, and we show experimentally that our approach provides<br />
substantial performance benefits.<br />
SESSION 8: SPATIO-TEMPORAL DATA MANAGEMENT<br />
SWST: A Disk Based Index for Sliding Window Spatio-Temporal Data<br />
Manish Singh (University of Michigan, Ann Arbor)<br />
Qiang Zhu (University of Michigan, Dearborn)<br />
H.V. Jagadish (University of Michigan, Ann Arbor)<br />
Numerous applications such as wireless communication and telematics need to<br />
keep track of the evolution of spatio-temporal data for a limited past. Limited retention<br />
may even be required by regulations. In general, each data entry can have its own<br />
user-specified lifetime. It is desired that expired entries are automatically removed<br />
by the system through some garbage collection mechanism. This kind of limited<br />
retention can be achieved by using sliding window semantics similar to that from<br />
stream data processing. However, due to the large volume and relatively long lifetime<br />
of data in the aforementioned applications (in contrast to the real-time transient<br />
streaming data), the sliding window here needs to be maintained for data on<br />
disk rather than in memory. It is a new challenge to provide fast access to the information<br />
from the recent past and, at the same time, facilitate efficient deletion of the<br />
expired entries. In this paper, we propose a disk based, two-layered, sliding window<br />
indexing scheme for discretely moving spatio-temporal data. Our index can support<br />
efficient processing of standard timeslice and interval queries and delete expired<br />
entries with almost no overhead. In existing historical spatio-temporal indexing<br />
techniques, deletion is either infeasible or very inefficient. Our sliding window based<br />
processing model can support both current and past entries, while many existing<br />
historical spatio-temporal indexing techniques cannot keep these two types of data<br />
together in the same index. Our experimental comparison with the best known historical<br />
index (i.e., the MV3R tree) for discretely moving spatio-temporal data shows<br />
that our index is about five times faster in terms of insertion time and comparable<br />
in terms of search performance. MV3R follows a partial persistency model, whereas<br />
our index can support very efficient deletion and update.<br />
Querying Uncertain Spatio-Temporal Data<br />
Tobias Emrich (Ludwig-Maximilians-Universität München)<br />
Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)<br />
Nikos Mamoulis (University of Hong Kong)<br />
Matthias Renz (Ludwig-Maximilians-Universität München)<br />
Andreas Züfle (Ludwig-Maximilians-Universität München)<br />
The problem of modeling and managing uncertain data has received a great deal<br />
of interest, due to its manifold applications in spatial, temporal, multimedia and<br />
sensor databases. There exists a wide range of work covering spatial uncertainty in<br />
the static (snapshot) case, where only one point of time is considered. In contrast,<br />
the problem of modeling and querying uncertain spatio-temporal data has only<br />
been treated as a simple extension of the spatial case, disregarding time dependencies<br />
between consecutive timestamps. In this work, we present a framework for<br />
efficiently modeling and querying uncertain spatio-temporal data. The key idea of<br />
our approach is to model possible object trajectories by stochastic processes. This<br />
approach has three major advantages over previous work. First, it allows answering<br />
queries in accordance with the possible worlds model. Second, dependencies<br />
between object locations at consecutive points in time are taken into account.<br />
Third, it is possible to reduce all queries on this model to simple matrix multiplications.<br />
Based on these concepts we propose efficient solutions for different probabilistic<br />
spatio-temporal queries. In an experimental evaluation we show that our approaches<br />
are several orders of magnitude faster than state-of-the-art competitors.<br />
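To make the reduction to matrix multiplications in the abstract above concrete, here is a<br />
minimal sketch. It assumes, purely for illustration, a first-order Markov chain over three<br />
discrete locations; the abstract says only "stochastic processes", and the transition matrix,<br />
the start location and the function names are hypothetical.<br />

```python
# Illustrative sketch only. Assumption: object movement is a first-order
# Markov chain over discrete locations; all numbers are hypothetical.

def vec_mat(vector, matrix):
    """Multiply a row vector by a matrix (both plain Python lists)."""
    columns = range(len(matrix[0]))
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in columns]

def distribution_at(initial, transition, steps):
    """Location distribution after `steps` transitions."""
    current = list(initial)
    for _ in range(steps):
        current = vec_mat(current, transition)
    return current

# transition[i][j] = probability of moving from location i to location j.
transition = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
initial = [1.0, 0.0, 0.0]  # object is known to start at location 0

after_two_steps = distribution_at(initial, transition, 2)
prob_in_location_2 = after_two_steps[2]
```

Under these assumptions, a query such as "how likely is the object to be at location 2 at<br />
time 2" is answered by two vector-matrix products and one lookup.<br />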
The Min-dist Location Selection Query<br />
Jianzhong Qi (University of Melbourne)<br />
Rui Zhang (University of Melbourne)<br />
Lars Kulik (University of Melbourne)<br />
Dan Lin (Missouri University of Science and Technology)<br />
Yuan Xue (University of Melbourne)<br />
We propose and study a new type of location optimization problem: given a set of<br />
clients and a set of existing facilities, we select a location from a given set of potential<br />
locations for establishing a new facility so that the average distance between a<br />
client and her nearest facility is minimized. We call this problem the min-dist location<br />
selection problem, which has a wide range of applications in urban development<br />
simulation, massively multiplayer online games, and decision support systems.<br />
We explore two common approaches to location optimization problems and propose<br />
methods based on those approaches for solving this new problem. However,<br />
those methods either need to maintain an extra index or fall short in efficiency. To<br />
address their drawbacks, we propose a novel method (named MND), which has<br />
very close performance to the fastest method but does not need an extra index.<br />
We provide a detailed comparative cost analysis on the various algorithms. We also<br />
perform extensive experiments to evaluate their empirical performance and validate<br />
the efficiency of the MND method.<br />
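The problem definition above can be stated as a short brute-force baseline. This is a sketch<br />
of the problem only, not of the MND method or of any index-based approach from the paper;<br />
Euclidean distance, the function names and the sample points are assumptions made for<br />
illustration.<br />

```python
# Illustrative brute-force baseline for the min-dist location selection
# problem. Assumptions: points are 2-D tuples, distance is Euclidean.
from math import dist

def avg_nearest_distance(clients, facilities):
    """Average distance from each client to its nearest facility."""
    total = sum(min(dist(c, f) for f in facilities) for c in clients)
    return total / len(clients)

def min_dist_location(clients, facilities, candidates):
    """Candidate that minimizes the average nearest-facility distance
    once it is added to the existing facilities."""
    return min(
        candidates,
        key=lambda p: avg_nearest_distance(clients, facilities + [p]),
    )

# Hypothetical sample data.
clients = [(0, 0), (4, 0), (10, 0)]
facilities = [(0, 0)]
candidates = [(4, 0), (10, 0), (7, 0)]

best = min_dist_location(clients, facilities, candidates)
```

This baseline evaluates every candidate against every client and facility, which is the kind<br />
of cost that more efficient methods aim to avoid.<br />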
Bi-level Locality Sensitive Hashing for K-Nearest Neighbor Computation<br />
Jia Pan (UNC Chapel Hill)<br />
Dinesh Manocha (UNC Chapel Hill)<br />
We present a new Bi-level LSH algorithm to perform approximate k-nearest neighbor<br />
search in high dimensional spaces. Our formulation is based on a two-level<br />
scheme. In the first level, we use an RP-tree that divides the dataset into sub-groups<br />
with bounded aspect ratios and is used to distinguish well-separated clusters. During<br />
the second level, we compute a single LSH hash table for each sub-group along<br />
with a hierarchical structure based on space-filling curves. Given a query, we first<br />
determine the sub-group that it belongs to and perform k-nearest neighbor search<br />
within the suitable buckets in the LSH hash table corresponding to the sub-group.<br />
Our algorithm also maps well to current GPU architectures and can improve the<br />
quality of approximate KNN queries as compared to prior LSH-based algorithms.<br />
We highlight its performance on two large, high-dimensional image datasets. Given<br />
a runtime budget, Bi-level LSH can provide better accuracy in terms of recall or<br />
error ratio. Moreover, our formulation reduces the variation in runtime cost or the<br />
quality of results.<br />
SESSION 9: QUERY PROCESSING<br />
Learning-based Query Performance Modeling and Prediction<br />
Mert Akdere (Brown University)<br />
Ugur Cetintemel (Brown University)<br />
Matteo Riondato (Brown University)<br />
Eli Upfal (Brown University)<br />
Stanley B. Zdonik (Brown University)<br />
Accurate query performance prediction (QPP) is central to effective resource management,<br />
query optimization and query scheduling. Analytical cost models, used in<br />
the current generation of query optimizers, have been successful in comparing the costs<br />
of alternative query plans, but they are poor predictors of execution latency. As a<br />
more promising approach to QPP, this paper studies the practicality and utility of<br />
sophisticated learning-based models, which have recently been applied to a variety<br />
of predictive tasks with great success, in both static (i.e., fixed) and dynamic query<br />
workloads. We propose and evaluate predictive modeling techniques that learn query<br />
execution behavior at different granularities, ranging from coarse-grained plan-level<br />
models to fine-grained operator-level models. We demonstrate that these two<br />
extremes offer a tradeoff between high accuracy for static workload queries and<br />
generality to unforeseen queries in dynamic workloads, respectively, and introduce a<br />
hybrid approach that combines their respective strengths by selectively composing<br />
them in the process of QPP. We discuss how we can use a training workload to (i)<br />
pre-build and materialize such models offline, so that they are readily available for<br />
future predictions, and (ii) build new models online as new predictions are needed.<br />
All prediction models are built using only static features (available prior to query<br />
execution) and the performance values obtained from the offline execution of the<br />
training workload. We fully implemented all these techniques and extensions on top<br />
of PostgreSQL and evaluated them experimentally by quantifying their effectiveness<br />
over analytical workloads, represented by well-established TPC-H data and queries.<br />
The results provide quantitative evidence that learning-based modeling for QPP is<br />
feasible and effective for both static and dynamic workload scenarios.<br />
Parametric Plan Caching Using Density-Based Clustering<br />
Gunes Aluc (University of Waterloo)<br />
David E. DeHaan (Sybase, an SAP company)<br />
Ivan T. Bowman (Sybase, an SAP company)<br />
Query plan caching eliminates the need for repeated query optimization; hence, it<br />
has strong practical implications for relational database management systems<br />
(RDBMSs). Unfortunately, existing approaches consider only the query plan generated at<br />
the expected values of parameters that characterize the query, data and the current<br />
state of the system, while these parameters may take different values during the lifetime<br />
of a cached plan. A better alternative is to harvest the optimizer’s plan choice<br />
for different parameter values, populate the cache with promising query plans, and<br />
select a cached plan based upon current parameter values. To address this challenge,<br />
we propose a parametric plan caching (PPC) framework that uses an online plan<br />
space clustering algorithm. The clustering algorithm is density-based, and it exploits<br />
locality-sensitive hashing as a pre-processing step so that clusters in the plan spaces<br />
can be efficiently stored in database histograms and queried in constant time. We<br />
experimentally validate that our approach is precise, efficient in space and time, and<br />
adaptive, requiring no eager exploration of the plan spaces of the optimizer.<br />
Effective and Robust Pruning for Top-Down Join<br />
Enumeration Algorithms<br />
Pit Fender (Mannheim University)<br />
Guido Moerkotte (Mannheim University)<br />
Thomas Neumann (Technical University of Munich)<br />
Viktor Leis (Technical University of Munich)<br />
Finding the optimal execution order of join operations is a crucial task of today’s<br />
cost-based query optimizers. There are two approaches to identify the best plan:<br />
bottom-up and top-down join enumeration. For both optimization strategies efficient<br />
algorithms have been published. However, only the top-down approach allows<br />
for branch-and-bound pruning. Two pruning techniques can be found in the literature.<br />
We add six new ones. Combined, they improve performance by an<br />
average factor of roughly 2-5. Even more important, our techniques improve the worst case<br />
by two orders of magnitude. Additionally, we introduce a new, very efficient, and<br />
easy to implement top-down join enumeration algorithm. This algorithm, together<br />
with our improved pruning techniques, yields performance that is higher by an average<br />
factor of 6-9 than the performance of the original top-down enumeration<br />
algorithm with the original pruning methods.<br />
Towards Preference-aware Relational Databases<br />
Anastasios Arvanitis (National Technical University of Athens)<br />
Georgia Koutrika (IBM Almaden Research Center)<br />
In implementing preference-aware query processing, a straightforward option is<br />
to build a plug-in on top of the database engine. However, treating the DBMS as<br />
a black box affects both the expressivity and performance of queries with preferences.<br />
In this paper, we argue that preference-aware query processing needs to be<br />
pushed closer to the DBMS. We present a preference-aware relational data model<br />
that extends database tuples with preferences and an extended algebra that captures<br />
the essence of processing queries with preferences. A key novelty of our preference<br />
model itself is that it defines a preference in three dimensions showing the<br />
tuples affected, their preference scores and the credibility of the preference. Our<br />
query processing strategies push preference evaluation inside the query plan and<br />
leverage its algebraic properties for finer-grained query optimization. We experimentally<br />
evaluate the proposed strategies. Finally, we compare our framework to a<br />
pure plug-in implementation and we show its feasibility and advantages.<br />
SESSION 10: LOCATION AWARE DATA PROCESSING<br />
A Foundation for Efficient Indoor Distance-Aware Query Processing<br />
Hua Lu (Aalborg University)<br />
Xin Cao (Nanyang Technological University)<br />
Christian S. Jensen (Aarhus University)<br />
Indoor spaces accommodate large numbers of spatial objects, e.g., points of interest<br />
(POIs), and moving populations. A variety of services, e.g., location-based<br />
services and security control, are relevant to indoor spaces. Such services can be<br />
improved substantially if they are capable of utilizing indoor distances. However, existing<br />
indoor space models do not account well for indoor distances. To address this<br />
shortcoming, we propose a data management infrastructure that captures indoor<br />
distance and facilitates distance-aware query processing. In particular, we propose<br />
a distance-aware indoor space model that integrates indoor distance seamlessly. To<br />
enable the use of the model as a foundation for query processing, we develop accompanying,<br />
efficient algorithms that compute indoor distances for different indoor<br />
entities like doors as well as locations. We also propose an indexing framework<br />
that accommodates indoor distances that are pre-computed using the proposed<br />
algorithms. On top of this foundation, we develop efficient algorithms for typical<br />
indoor, distance-aware queries. The results of an extensive experimental evaluation<br />
demonstrate the efficacy of the proposals.<br />
LARS: A Location-Aware Recommender System<br />
Justin J. Levandoski (Microsoft Research)<br />
Mohamed Sarwat (University of Minnesota)<br />
Ahmed Eldawy (University of Minnesota)<br />
Mohamed F. Mokbel (University of Minnesota)<br />
This paper proposes LARS, a location-aware recommender system that uses location-based<br />
ratings to produce recommendations. Traditional recommender systems<br />
do not consider spatial properties of users or items; LARS, on the other hand, supports<br />
a taxonomy of three novel classes of location-based ratings, namely, spatial<br />
ratings for non-spatial items, non-spatial ratings for spatial items, and spatial ratings<br />
for spatial items. LARS exploits user rating locations through user partitioning, a<br />
technique that influences recommendations with ratings spatially close to querying<br />
users in a manner that maximizes system scalability while not sacrificing recommendation<br />
quality. LARS exploits item locations using travel penalty, a technique that favors<br />
recommendation candidates closer in travel distance to querying users in a way<br />
that avoids exhaustive access to all spatial items. LARS can apply these techniques<br />
separately, or in concert, depending on the type of location-based rating available.<br />
Experimental evidence using large-scale real-world data from both the Foursquare<br />
location-based social network and the MovieLens movie recommendation system<br />
reveals that LARS is efficient, scalable, and capable of producing recommendations<br />
twice as accurate as those of existing recommendation approaches.<br />
Approximate Shortest Distance Computing: A Query-Dependent Local
Landmark Scheme
Miao Qiao (The Chinese University of Hong Kong)
Hong Cheng (The Chinese University of Hong Kong)
Lijun Chang (The Chinese University of Hong Kong)
Jeffrey Xu Yu (The Chinese University of Hong Kong)
The shortest distance query between two nodes is a fundamental operation in large-scale<br />
networks. Most existing methods in the literature take a landmark embedding<br />
approach, which selects a set of graph nodes as landmarks and computes the<br />
shortest distances from each landmark to all nodes as an embedding. To handle a<br />
shortest distance query between two nodes, the precomputed distances from the<br />
landmarks to the query nodes are used to compute an approximate shortest distance<br />
based on the triangle inequality. In this paper, we analyze the factors that affect<br />
the accuracy of the distance estimation in the landmark embedding approach.<br />
In particular, we find that a globally selected, query-independent landmark set plus<br />
the triangulation-based distance estimation introduces a large relative error, especially<br />
for nearby query nodes. To address this issue, we propose a query-dependent<br />
local landmark scheme, which identifies a local landmark close to the specific query<br />
nodes and provides a more accurate distance estimation than the traditional global<br />
landmark approach. Specifically, a local landmark is defined as the least common<br />
ancestor of the two query nodes in the shortest path tree rooted at a global landmark.<br />
We propose efficient local landmark indexing and retrieval techniques, which<br />
are crucial to achieve low offline indexing complexity and online query complexity.<br />
Two optimization techniques on graph compression and graph online search are<br />
also proposed, with the goal to further reduce index size and improve query accuracy.<br />
Our experimental results on large-scale social networks and road networks<br />
demonstrate that the local landmark scheme reduces the shortest distance estimation<br />
error significantly when compared with global landmark embedding.<br />
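As a rough illustration of the two estimates described in this abstract, the following Python sketch compares the global landmark bound with the local landmark (least common ancestor) bound on a small unweighted graph. It is a toy under stated assumptions, not the authors' indexing scheme: the helper names `bfs_tree`, `global_estimate`, and `local_estimate` are hypothetical, the shortest path tree is built by plain breadth-first search, and the common ancestor is found by walking parent pointers rather than through an index.

```python
from collections import deque

def bfs_tree(graph, root):
    """Return (depth, parent) of a BFS shortest path tree rooted at root."""
    depth = {root: 0}
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                parent[nxt] = node
                queue.append(nxt)
    return depth, parent

def global_estimate(depth, u, v):
    """Triangle-inequality upper bound through the global landmark (the root)."""
    return depth[u] + depth[v]

def local_estimate(depth, parent, u, v):
    """Upper bound through the least common ancestor of u and v in the tree."""
    a, b = u, v
    while a != b:
        # Move the deeper node (or a, on ties) one step toward the root.
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return depth[u] + depth[v] - 2 * depth[a]

# Toy graph (hypothetical): path 0-1-2-3 with an extra branch 1-4.
graph = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
depth, parent = bfs_tree(graph, 0)
print(global_estimate(depth, 3, 4))         # 5
print(local_estimate(depth, parent, 3, 4))  # 3, the true distance in this graph
```

In this example the landmark (node 0) lies outside the short route between the two query nodes, so the global bound overshoots, while routing through their common ancestor (node 1) recovers the exact distance.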
Desks: Direction-Aware Spatial Keyword Search
Guoliang Li (Tsinghua University)
Jianhua Feng (Tsinghua University)
Jing Xu (Tsinghua University)
Location-based services (LBS) have been widely accepted by mobile users. Many<br />
LBS users have a direction-aware search requirement: answers must be in<br />
the search direction. However, to the best of our knowledge, there is not yet any<br />
research available that investigates direction-aware search. A straightforward<br />
method first finds candidates without considering the direction constraint, and then<br />
generates the answers by pruning those candidates which violate the direction<br />
constraint. However, this method is rather expensive as it involves a lot of useless<br />
computation on many unnecessary directions. To address this problem, we propose<br />
a direction-aware spatial keyword search method which inherently supports<br />
direction-aware search. We devise novel direction-aware indexing structures to<br />
prune unnecessary directions. We develop effective pruning techniques and search<br />
algorithms to efficiently answer a direction-aware query. As users may dynamically<br />
change their search directions, we propose to incrementally answer a query. Experimental<br />
results on real datasets show that our method achieves high performance<br />
and outperforms existing methods significantly.<br />
SESSION 11: MAP-REDUCE BASED DATA PROCESSING
Extending Map-Reduce for Efficient Predicate-Based Sampling
Raman Grover (University of California, Irvine)
Michael Carey (University of California, Irvine)
In this paper we address the problem of using MapReduce to sample a massive<br />
data set in order to produce a fixed-size sample whose contents satisfy a given<br />
predicate. While it is simple to express this computation using MapReduce, its<br />
default Hadoop execution is dependent on the input size and is wasteful of cluster<br />
resources. This is unfortunate, as sampling queries are fairly common (e.g., for<br />
exploratory data analysis at Facebook), and the resulting waste can significantly<br />
impact the performance of a shared cluster. To address such use cases, we present<br />
the design, implementation and evaluation of a Hadoop execution model extension<br />
that supports incremental job expansion. Under this model, a job consumes input<br />
as required and can dynamically govern its resource consumption while producing<br />
the required results. The proposed mechanism is able to support a variety of policies<br />
regarding job growth rates as they relate to cluster capacity and current load.<br />
We have implemented the mechanism in Hadoop, and we present results from an<br />
experimental performance study of different job growth policies under both single-
and multi-user workloads.<br />
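The idea of consuming input only as required can be pictured with a single-process analogy. The Python sketch below is not Hadoop code and does not reproduce the paper's mechanism or its growth policies; the function name `predicate_sample` and the representation of input splits as lists are assumptions made purely for illustration. It reads one split at a time and stops as soon as the fixed-size sample is full, so the work done depends on how soon matches are found rather than on the total input size.

```python
def predicate_sample(splits, predicate, k):
    """Read splits one at a time and stop once k matching records are found.

    Returns the sample and the number of splits that had to be read.
    """
    sample = []
    splits_read = 0
    if k <= 0:
        return sample, splits_read
    for split in splits:
        splits_read += 1
        for record in split:
            if predicate(record):
                sample.append(record)
                if len(sample) == k:
                    # Sample is full: the remaining input is never touched.
                    return sample, splits_read
    return sample, splits_read

# Hypothetical input: three splits of integers, sampling three even numbers.
splits = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
sample, used = predicate_sample(splits, lambda r: r % 2 == 0, k=3)
print(sample, used)  # [2, 4, 6] 2
```

Here the third split is never read, which is the kind of saving the abstract attributes to incremental job expansion.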
Fuzzy Joins Using MapReduce
Foto Afrati (National Technical University Athens)
Anish Das Sarma (Google, Inc. - work initiated at Yahoo! Research)
David Menestrina (Google, Inc.)
Aditya Parameswaran (Stanford University)
Jeffrey D. Ullman (Stanford University)
Fuzzy/similarity joins have been widely studied in the research community and extensively<br />
used in real-world applications. This paper proposes and evaluates several<br />
algorithms for finding all pairs of elements from an input set that meet a similarity<br />
threshold. The computation model is a single MapReduce job. Because we allow only<br />
one MapReduce round, the Reduce function must be designed so a given output pair<br />
is produced by only one task; for many algorithms, satisfying this condition is one of<br />
the biggest challenges. We break the cost of an algorithm into three components: the<br />
execution cost of the mappers, the execution cost of the reducers, and the communication<br />
cost from the mappers to reducers. The algorithms are presented first in terms<br />
of Hamming distance, but extensions to edit distance and Jaccard distance are shown<br />
as well. We find that there are many different approaches to the similarity-join problem<br />
using MapReduce, and none dominates the others when both communication<br />
and reducer costs are considered. Our cost analyses enable applications to pick the<br />
optimal algorithm based on their communication, memory, and cluster requirements.<br />
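To make the "each output pair is produced by only one task" requirement concrete, here is a small in-memory Python simulation of a one-round join under Hamming distance. It uses a generic pigeonhole segmentation (two equal-length strings within distance d must agree on at least one of d + 1 segments) and is offered as one possible approach, not as a reproduction of any specific algorithm from the paper. The names `segments` and `fuzzy_join` are hypothetical, and all strings are assumed to have the same length.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def segments(s, parts):
    """Split s into `parts` contiguous segments of near-equal length."""
    n = len(s)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(parts)]

def fuzzy_join(strings, d):
    """Return all pairs within Hamming distance d, each emitted exactly once."""
    # "Map": send every string to one bucket per (segment index, segment value).
    buckets = defaultdict(list)
    for s in set(strings):
        for i, seg in enumerate(segments(s, d + 1)):
            buckets[(i, seg)].append(s)
    # "Reduce": each bucket compares its strings pairwise.
    result = []
    for (i, _), bucket in buckets.items():
        for a, b in combinations(sorted(bucket), 2):
            if hamming(a, b) > d:
                continue
            sa, sb = segments(a, d + 1), segments(b, d + 1)
            # A pair may meet in several buckets; only the bucket for the
            # first agreeing segment is allowed to output it.
            first = next(j for j in range(d + 1) if sa[j] == sb[j])
            if first == i:
                result.append((a, b))
    return sorted(result)

print(fuzzy_join(["0000", "0001", "0011", "1111"], 1))
# [('0000', '0001'), ('0001', '0011')]
```

The tie-breaking rule on the first agreeing segment is what keeps a pair that shares several segments from being reported by more than one reducer.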
Parallel Top-K Similarity Join Algorithms Using MapReduce
Younghoon Kim (Seoul National University)
Kyuseok Shim (Seoul National University)
There is a wide range of applications that require finding the top-k most similar<br />
pairs of records in a given database. However, computing such top-k similarity joins<br />
is a challenging problem today, as there is an increasing trend of applications that<br />
expect to deal with vast amounts of data. For such data-intensive applications,<br />
parallel executions of programs on a large cluster of commodity machines using<br />
the MapReduce paradigm have recently received a lot of attention. In this paper, we<br />
investigate how the top-k similarity join algorithms can benefit from the popular<br />
MapReduce framework. We first develop the divide-and-conquer and branch-and-bound<br />
algorithms. We next propose the all pair partitioning and essential pair<br />
partitioning methods to minimize the amount of data transfers between map and<br />
reduce functions. We finally perform the experiments with not only synthetic but<br />
also real-life data sets. Our performance study confirms the effectiveness and scalability<br />
of our MapReduce algorithms.<br />
Load Balancing in MapReduce Based on Scalable Cardinality Estimates
Benjamin Gufler (Technische Universität München)
Nikolaus Augsten (Free University of Bolzano-Bozen)
Angelika Reiser (Technische Universität München)
Alfons Kemper (Technische Universität München)
MapReduce has emerged as a popular tool for distributed and scalable processing<br />
of massive data sets and is being used increasingly in e-science applications. Unfortunately,<br />
the performance of MapReduce systems strongly depends on an even<br />
data distribution while scientific data sets are often highly skewed. The resulting<br />
load imbalance, which raises the processing time, is further amplified by the high runtime<br />
complexity of the reducer tasks. An adaptive load balancing strategy is required for<br />
appropriate skew handling. In this paper, we address the problem of estimating the<br />
cost of the tasks that are distributed to the reducers based on a given cost model.<br />
An accurate cost estimation is the basis for adaptive load balancing algorithms and<br />
requires gathering statistics from the mappers. This is challenging: (a) Since the<br />
statistics from all mappers must be integrated, the mapper statistics must be small.<br />
(b) Although each mapper sees only a small fraction of the data, the integrated<br />
statistics must capture the global data distribution. (c) The mappers terminate after<br />
sending the statistics to the controller, and no second round is possible. Our solution<br />
to these challenges consists of two components. First, a monitoring component<br />
executed on every mapper captures the local data distribution and identifies<br />
its most relevant subset for cost estimation. Second, an integration component<br />
aggregates these subsets approximating the global data distribution.<br />
SESSION 12: SOCIAL MEDIA
Community Detection with Edge Content in Social Media Networks
Guo-Jun Qi (University of Illinois at Urbana-Champaign)
Charu C. Aggarwal (IBM T. J. Watson Research Center)
Thomas S. Huang (University of Illinois at Urbana-Champaign)
The problem of community detection in social media has been widely studied in<br />
the social networking community in the context of the structure of the underlying<br />
graphs. Most community detection algorithms use the links between the nodes in<br />
order to determine the dense regions in the graph. These dense regions are the<br />
communities of social media in the graph. Such methods are typically based purely<br />
on the linkage structure of the underlying social media network. However, in many<br />
recent applications, edge content is available in order to provide better supervision<br />
to the community detection process. Many natural representations of edges in social<br />
interactions such as shared images and videos, user tags and comments are naturally<br />
associated with content on the edges. While some work has been done on utilizing<br />
node content for community detection, the presence of edge content presents<br />
unprecedented opportunities and flexibility for the community detection process.<br />
We will show that such edge content can be leveraged in order to greatly improve<br />
the effectiveness of the community detection process in social media networks. We<br />
present experimental results illustrating the effectiveness of our approach.<br />
Cross Domain Search by Exploiting Wikipedia
Chen Liu (National University of Singapore)
Sai Wu (National University of Singapore)
Shouxu Jiang (Harbin Institute of Technology)
Anthony K.H. Tung (National University of Singapore)
The abundance of Web 2.0 resources in various media formats calls for better<br />
resource integration to enrich user experience. This naturally leads to a new cross-modal<br />
resource search requirement, in which a query is a resource in one modality<br />
and the results are closely related resources in other modalities. With cross-modal<br />
search, we can better exploit existing resources. Tags associated with Web 2.0<br />
resources are an intuitive medium for linking resources of different modalities.<br />
However, tagging is by nature an ad hoc activity. Tags often contain noise and are<br />
affected by the subjective inclination of the tagger. Consequently, linking resources<br />
simply by tags will not be reliable. In this paper, we propose an approach for linking<br />
tagged resources to concepts extracted from Wikipedia, which has become a fairly<br />
reliable reference over the last few years. Compared to the tags, the concepts are<br />
therefore of higher quality. We develop effective methods for cross-modal search<br />
based on the concepts associated with resources. Extensive experiments were conducted,<br />
and the results show that our solution achieves good performance.<br />
Provenance-based Indexing Support in Micro-blog Platforms
Junjie Yao (Peking University)
Bin Cui (Peking University)
Zijun Xue (Peking University)
Qingyun Liu (Peking University)
Recently, lots of micro-blog message sharing applications have emerged on the<br />
web. Users can publish short messages freely and get notified by the subscriptions<br />
instantly. Prominent examples include Twitter, Facebook’s statuses, and Sina Weibo<br />
in China. The micro-blog platform has become a useful service for real-time information<br />
creation and propagation. However, these messages’ short length and dynamic<br />
character have posed great challenges for effective content understanding. Additionally,<br />
the noise and fragments make it difficult to discover the temporal propagation<br />
trail to explore development of micro-blog messages. In this paper, we propose<br />
a provenance model to capture connections between micro-blog messages. Provenance<br />
refers to data origin identification and transformation logging, and has demonstrated<br />
great value in recent database and workflow systems. To cope with the real-time<br />
micro-message deluge, we utilize a novel message grouping approach to encode<br />
and maintain the provenance information. Furthermore, we adopt a summary index<br />
and several adaptive pruning strategies to implement efficient provenance updating.<br />
Based on the index, our provenance solution can support rich query retrieval<br />
and intuitive message tracking for effective message organization. Experiments<br />
conducted on a real dataset verify the effectiveness and efficiency of our approach.<br />
Provenance refers to data origin identification and transformation monitoring, which<br />
has been demonstrated to be of great value in database and workflow systems. In this<br />
paper, we propose a provenance model in micro-blog platforms, and design an indexing<br />
scheme to support provenance-based message discovery and maintenance,<br />
which can capture the interactions of messages for effective message organization.<br />
To cope with the real-time micro-message tornadoes, we introduce a novel virtual<br />
annotation grouping approach to encode and maintain the provenance information.<br />
Furthermore, we design a summary index and adaptive pruning strategies to facilitate<br />
efficient message update. Based on this provenance index, our approach can<br />
support query and message tracking in micro-blog systems. Experiments conducted<br />
on real datasets verify the effectiveness and efficiency of our approach.<br />
Learning Stochastic Models of Information Flow
Luke Dickens (Imperial College London)
Ian Molloy (IBM T. J. Watson Research Center)
Jorge Lobo (IBM T. J. Watson Research Center)
Pau-Chen Cheng (IBM T. J. Watson Research Center)
Alessandra Russo (Imperial College London)
An understanding of information flow has many applications, including maximizing<br />
marketing impact on social media, limiting malware propagation, and managing<br />
undesired disclosure of sensitive information. This paper presents scalable methods<br />
for both learning models of information flow in networks from data, based<br />
on the Independent Cascade Model, and predicting probabilities of unseen flow<br />
from these models. Our approach is based on a principled probabilistic construction<br />
and results compare favourably with existing methods in terms of accuracy of<br />
prediction and scalable evaluation, with the addition that we are able to evaluate a<br />
broader range of queries than previously shown, including probability of joint and/<br />
or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow<br />
probabilities is exponential in the number of edges and naive sampling can also<br />
be expensive, so we propose sampling in an efficient Markov-Chain Monte-Carlo<br />
fashion using the Metropolis-Hastings algorithm (details are described in the paper).<br />
We identify two types of data: those where the paths of past flows are known (attributed<br />
data), and those where only the endpoints are known (unattributed data).<br />
Both data types are addressed in this paper, including training methods, example<br />
real world data sets, and experimental evaluation. In particular, we investigate<br />
flow data from the Twitter micro-blogging service, exploring the flow of messages<br />
through retweets (tweet forwards) for the attributed case, and the propagation of<br />
hashtags (metadata tags) and URLs for the unattributed case.<br />
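For readers unfamiliar with the Independent Cascade Model, the Python sketch below shows the naive sampling baseline that the abstract describes as potentially expensive: simulate many cascades and count how often the target is reached. It deliberately does not implement the paper's Metropolis-Hastings sampler or its learning method. The function names, the edge-list representation, and the example probabilities are all assumptions chosen for illustration.

```python
import random

def simulate_cascade(edges, source, rng):
    """Run one Independent Cascade simulation and return the activated nodes.

    edges maps a node to a list of (neighbor, activation probability) pairs.
    Each newly activated node gets a single chance to activate each neighbor.
    """
    active = {source}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for neighbor, prob in edges.get(node, []):
            if neighbor not in active and rng.random() < prob:
                active.add(neighbor)
                frontier.append(neighbor)
    return active

def flow_probability(edges, source, target, trials, rng):
    """Estimate P(flow from source reaches target) by naive Monte Carlo."""
    hits = sum(target in simulate_cascade(edges, source, rng)
               for _ in range(trials))
    return hits / trials

# Hypothetical network: a certain edge, an impossible edge, and a fair coin.
edges = {"a": [("b", 1.0)], "b": [("c", 0.0), ("d", 0.5)]}
rng = random.Random(0)
print(flow_probability(edges, "a", "b", 100, rng))  # 1.0
print(flow_probability(edges, "a", "c", 100, rng))  # 0.0
```

The estimate for a node such as `d` converges only as the number of trials grows, which hints at why a more efficient sampling scheme is attractive on large networks.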
SESSION 13: P2P AND DISTRIBUTED PROCESSING
BestPeer++: A Peer-to-Peer based Large-scale Data Processing
Gang Chen (NetEase.com Inc. & Zhejiang University)
Tianlei Hu (NetEase.com Inc. & Zhejiang University)
Dawei Jiang (National University of Singapore)
Peng Lu (National University of Singapore)
Kian-Lee Tan (National University of Singapore)
Hoang Tam Vo (National University of Singapore)
Sai Wu (BestPeer Pte. Ltd. & National University of Singapore)
The corporate network is often used for sharing information among the participating<br />
companies and facilitating collaboration in a certain industry sector where companies<br />
share a common interest. It can effectively help the companies to reduce<br />
their operational costs and increase the revenues. However, the inter-company data<br />
sharing and processing poses unique challenges to such a data management system<br />
including scalability, performance, throughput, and security. In this paper, we<br />
present BestPeer++, a system which delivers elastic data sharing services for corporate<br />
network applications in the cloud based on BestPeer, a peer-to-peer (P2P)<br />
based data management platform. By integrating cloud computing, database, and<br />
P2P technologies into one system, BestPeer++ provides an economical, flexible and<br />
scalable platform for corporate network applications and delivers data sharing services<br />
to participants based on the widely accepted pay-as-you-go business model.<br />
We evaluate BestPeer++ on the Amazon EC2 cloud platform. The benchmarking results<br />
show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale<br />
data processing system, in performance when both systems are employed to handle<br />
typical corporate network workloads. The benchmarking results also demonstrate<br />
that BestPeer++ achieves near linear scalability for throughput with respect to the<br />
number of peer nodes.<br />
Effective Data Density Estimation in Ring-based P2P Networks
Minqi Zhou (East China Normal University)
Heng Tao Shen (The University of Queensland)
Xiaofang Zhou (The University of Queensland)
Weining Qian (East China Normal University)
Aoying Zhou (East China Normal University)
Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important<br />
issue and has yet to be well addressed. It can benefit many P2P applications,<br />
such as load balancing analysis, query processing, and data mining. Inspired by the<br />
inversion method for random variate generation, in this paper we present a novel<br />
model named distribution-free data density estimation for dynamic ring-based P2P<br />
networks to achieve high estimation accuracy with low estimation cost regardless<br />
of distribution models of the underlying data. It generates random samples for any<br />
arbitrary distribution by sampling the global cumulative distribution function and is<br />
free from sampling bias. In P2P networks, the key idea for distribution-free estimation<br />
is to sample a small subset of peers for estimating the global data distribution<br />
over the data domain. Algorithms for computing and sampling the global cumulative<br />
distribution function, from which the global data distribution is estimated, are<br />
introduced with detailed theoretical analysis. Our extensive performance study confirms<br />
the effectiveness and efficiency of our methods in ring-based P2P networks.<br />
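The inversion method mentioned in this abstract can be shown in a few lines. The Python sketch below covers only the textbook inverse-transform step on a discrete cumulative distribution function; it does not model peers, the ring overlay, or the paper's estimation algorithms. The names `build_cdf`, `invert`, and `sample`, and the idea of describing the data as per-bucket counts, are assumptions made for the example.

```python
import bisect
import random

def build_cdf(bucket_counts):
    """Turn per-bucket item counts into a cumulative distribution function."""
    total = sum(bucket_counts)
    cdf, running = [], 0
    for count in bucket_counts:
        running += count
        cdf.append(running / total)
    return cdf

def invert(cdf, u):
    """Map a uniform value u in [0, 1) to the first bucket whose CDF exceeds u."""
    return bisect.bisect_right(cdf, u)

def sample(cdf, n, rng):
    """Draw n bucket indices distributed according to the CDF."""
    return [invert(cdf, rng.random()) for _ in range(n)]

# Hypothetical data: three buckets holding 1, 1, and 2 items.
cdf = build_cdf([1, 1, 2])
print(cdf)                # [0.25, 0.5, 1.0]
print(invert(cdf, 0.1))   # 0
print(invert(cdf, 0.3))   # 1
print(invert(cdf, 0.75))  # 2
```

Because the uniform draw is pushed through the cumulative distribution itself, the same procedure works for any bucket shape without assuming a particular distribution model.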
Processing of Rank Joins in Highly Distributed Systems
Christos Doulkeridis (Norwegian University of Science and Technology (NTNU))
Akrivi Vlachou (Norwegian University of Science and Technology (NTNU))
Kjetil Nørvåg (Norwegian University of Science and Technology (NTNU))
Yannis Kotidis (Athens University of Economics and Business (AUEB))
Neoklis Polyzotis (UC Santa Cruz (UCSC))
In this paper, we study efficient processing of rank joins in highly distributed<br />
systems, where servers store fragments of relations in an autonomous manner.<br />
Existing rank-join algorithms exhibit poor performance in this setting due to excessive<br />
communication costs or high latency. We propose a novel distributed rank-join<br />
framework that employs data statistics, maintained as histograms, to determine the<br />
subset of each relational fragment that needs to be fetched to generate the top-k<br />
join results. At the heart of our framework lies a distributed score bound estimation<br />
algorithm that produces sufficient score bounds for each relation, which guarantee<br />
the correctness of the rank-join result set when the histograms are accurate. Furthermore,<br />
we propose a generalization of our framework that supports approximate<br />
statistics, in the case that the exact statistical information is not available. An extensive<br />
experimental study validates the efficiency of our framework and demonstrates<br />
its advantages over existing methods.<br />
Load Balancing for MapReduce-based Entity Resolution
Lars Kolb (University of Leipzig)
Andreas Thor (University of Leipzig)
Erhard Rahm (University of Leipzig)
The effectiveness and scalability of MapReduce-based implementations of complex data-intensive<br />
tasks depend on an even redistribution of data between map and reduce<br />
tasks. In the presence of skewed data, sophisticated redistribution approaches thus<br />
become necessary to achieve load balancing among all reduce tasks to be executed<br />
in parallel. For the complex problem of entity resolution, we propose and evaluate<br />
two approaches for such skew handling and load balancing. The approaches support<br />
blocking techniques to reduce the search space of entity resolution, utilize a preprocessing<br />
MapReduce job to analyze the data distribution, and distribute the entities of<br />
large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure<br />
shows the value and effectiveness of the proposed load balancing approaches.<br />
SESSION 14: XML AND RDF DATA MANAGEMENT
Mapping XML to a Wide Sparse Table
Liang Jeff Chen (UCSD)
Philip A. Bernstein (Microsoft Corp.)
Peter Carlin (Microsoft Corp.)
Dimitrije Filipovic (Microsoft Corp.)
Michael Rys (Microsoft Corp.)
Nikita Shamgunov (Facebook Inc.)
James F. Terwilliger (Microsoft Corp.)
Milos Todic (Microsoft Corp.)
Sasa Tomasevic (Microsoft Corp.)
Dragan Tomic (Microsoft Corp.)
XML is commonly supported by SQL database systems. However, existing mappings<br />
of XML to tables can only deliver satisfactory query performance for limited use<br />
cases. In this paper, we propose a novel mapping of XML data into one wide table<br />
whose columns are sparsely populated. This mapping provides good performance<br />
for document types and queries that are observed in enterprise applications but are<br />
not supported efficiently by existing work. XML queries are evaluated by translating<br />
them into SQL queries over the wide sparsely-populated table. We show how to<br />
translate full XPath 1.0 into SQL. Based on the characteristics of the new mapping,<br />
we present rewriting optimizations that minimize the number of joins. Experiments<br />
demonstrate that query evaluation over the new mapping delivers considerable<br />
improvements over existing techniques for the target use cases.<br />
Querying XML Data: As You Shape It
Curtis E. Dyreson (Utah State University)
Sourav S. Bhowmick (Nanyang Technological University)
A limitation of XQuery is that a programmer has to be familiar with the shape of the<br />
data to query it effectively. And if that shape changes, or if the shape is other than<br />
what the programmer expects, the query may fail. One way to avoid this limitation<br />
is to transform the data into a desired shape. A data transformation is a rearrangement<br />
of data into a new shape. In this paper, we present the semantics and implementation<br />
of XMorph 2.0, a shape-polymorphic data transformation language for<br />
XML. An XMorph program can act as a query guard. The guard both transforms<br />
data to the shape needed by the query and determines whether and how the transformation<br />
potentially loses information; a transformation that loses information<br />
may lead to a query yielding an inaccurate result. This paper describes how to use<br />
XMorph as a query guard, gives a formal semantics for shape-to-shape transformations,<br />
documents how XMorph determines how a transformation potentially loses<br />
information, and describes the XMorph implementation.<br />
Branch Code: A Labeling Scheme for Efficient Query Answering on Trees<br />
Yanghua Xiao (Fudan University)<br />
Ji Hong (Fudan University)<br />
Wanyun Cui (Fudan University)<br />
Zhenying He (Fudan University)<br />
Wei Wang (Fudan University)<br />
Guodong Feng (Fudan University)<br />
Labeling schemes lie at the core of query processing for many kinds of tree-structured data,<br />
such as the XML data that is flooding the web. A labeling scheme that can simultaneously<br />
and efficiently support various relationship queries on trees (such as parent/<br />
children, descendant/ancestor, etc.), computation of lowest common ancestors<br />
(LCA) and update of trees, is desired for effective and efficient management of<br />
tree-structured data. Although a variety of labeling schemes such as prefix-based<br />
labeling, interval-based labeling and prime-based labeling, as well as their variants,<br />
are available for encoding static and dynamic trees, these labeling<br />
schemes usually show weakness in one aspect or another. In this paper, we propose<br />
an integer-based labeling scheme, branch code, as well as its compressed version,<br />
as our major solution to simultaneously support efficient query processing on both<br />
static and dynamic ordered trees with affordable storage cost. The proposed branch<br />
code can answer common queries on ordered trees in constant time, which comes<br />
at the cost of consuming O(N log N) storage. To reduce storage cost to O(N), a compressed<br />
branch code is further developed. We also give a relationship determination<br />
algorithm that uses only the compressed branch code and has a low probability of<br />
producing false positive results, as verified by experimental results. With the support<br />
of splay trees, branch code can also support dynamic trees so that updates and<br />
queries can be implemented with O(log N) amortized cost. All the results above are<br />
either theoretically proved or verified by experimental studies.<br />
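For readers unfamiliar with labeling schemes, the following minimal Python sketch illustrates interval-based labeling, one of the existing schemes the abstract mentions. It is not the branch code scheme itself, and the function names and the toy tree are assumptions made for illustration only. Each node receives a (start, end) pair from a depth-first traversal, and an ancestor test reduces to interval containment.

```python
from typing import Dict, List, Tuple


def interval_labels(children: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Assign each node a (start, end) interval using a depth-first traversal."""
    labels: Dict[str, Tuple[int, int]] = {}
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        start = counter          # number handed out when the node is entered
        counter += 1
        for child in children.get(node, []):
            visit(child)
        labels[node] = (start, counter)  # number handed out when the node is left
        counter += 1

    visit(root)
    return labels


def is_ancestor(labels: Dict[str, Tuple[int, int]], a: str, d: str) -> bool:
    """a is a proper ancestor of d iff a's interval strictly contains d's."""
    start_a, end_a = labels[a]
    start_d, end_d = labels[d]
    return start_a < start_d and end_d < end_a


# Hypothetical toy tree: root has children a and b; a has child c.
tree = {"root": ["a", "b"], "a": ["c"]}
labels = interval_labels(tree, "root")
print(is_ancestor(labels, "root", "c"))  # True
print(is_ancestor(labels, "b", "c"))     # False
```

The containment test runs in constant time, but inserting a node may force relabeling, which is the kind of weakness on dynamic trees that the abstract alludes to.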
Scalable Multi-Query Optimization for SPARQL<br />
Wangchao Le (University of Utah)<br />
Anastasios Kementsietsidis (IBM T. J. Watson Research Center)<br />
Songyun Duan (IBM T. J. Watson Research Center)<br />
Feifei Li (University of Utah)<br />
This paper revisits the classical problem of multi-query optimization in the context<br />
of RDF/SPARQL. We show that the techniques developed for relational and<br />
semi-structured data/query languages are hard, if not impossible, to extend<br />
to account for the RDF data model and graph query patterns expressed in SPARQL. In<br />
light of the NP-hardness of multi-query optimization for SPARQL, we propose<br />
heuristic algorithms that partition the input batch of queries into groups such that<br />
each group of queries can be optimized together. An essential component of the<br />
optimization incorporates an efficient algorithm to discover the common substructures<br />
of multiple SPARQL queries and an effective cost model to compare<br />
candidate execution plans. Since our optimization techniques do not make any<br />
assumption about the underlying SPARQL query engine, they have the advantage<br />
of being portable across different RDF stores. The extensive experimental studies,<br />
performed on three popular RDF stores, show that the proposed techniques are<br />
effective, efficient and scalable.<br />
SESSION 15: PERFORMANCE<br />
GSLPI: a Cost-based Query Progress Indicator<br />
Jiexing Li (University of Wisconsin-Madison)<br />
Rimma V. Nehme (Microsoft Jim Gray Systems Lab)<br />
Jeffrey Naughton (University of Wisconsin-Madison)<br />
Progress indicators for SQL queries were first published in 2004 with the simultaneous<br />
and independent proposals from Chaudhuri et al. and Luo et al. In this paper,<br />
we implement both progress indicators in the same commercial RDBMS to investigate<br />
their performance. We summarize common cases in which they are both accurate<br />
and cases in which they fail to provide reliable estimates. Although there are<br />
differences in their performance, much more striking is the similarity in the errors<br />
they make due to a common simplifying uniform future speed assumption. While<br />
the developers of these progress indicators were aware that this assumption could<br />
cause errors, they neither explored how large the errors might be nor did they<br />
investigate the feasibility of removing the assumption. To rectify this we propose a<br />
new query progress indicator, similar to these early progress indicators but without<br />
the uniform speed assumption. Experiments show that on the TPC-H benchmark,<br />
on queries for which the original progress indicators have errors up to 30X the<br />
query running time, the new progress indicator is accurate to within 10 percent. We<br />
also discuss the sources of the errors that still remain and shed some light on what<br />
would need to be done to eliminate them.<br />
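The uniform future speed assumption criticized in this abstract can be made concrete with a small Python sketch. This is only an illustration of the assumption, not code from either of the 2004 proposals or from GSLPI; the function name, the abstract "work units" and the numbers are hypothetical.

```python
def estimate_remaining_seconds(work_done: float, total_work: float, elapsed_seconds: float) -> float:
    """Estimate remaining time assuming future work proceeds at the average past speed."""
    if work_done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need some completed work and elapsed time to estimate speed")
    speed = work_done / elapsed_seconds      # observed units of work per second so far
    return (total_work - work_done) / speed  # uniform speed assumption applied to the rest


# Hypothetical query: 25 of 100 work units finished after 10 seconds.
print(estimate_remaining_seconds(25, 100, 10))  # 30.0
```

If the remaining 75 units belong to an operator that actually processes work ten times more slowly, the true remaining time would be 300 seconds rather than 30, which is the kind of error the abstract attributes to this assumption.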
Micro-Specialization in DBMSes<br />
Rui Zhang (The University of Arizona)<br />
Richard T. Snodgrass (The University of Arizona)<br />
Saumya Debray (The University of Arizona)<br />
Relational database management systems are general in the sense that they can<br />
handle arbitrary schemas, queries, and modifications; this generality is implemented<br />
using runtime metadata lookups and tests that ensure that control is channelled<br />
to the appropriate code in all cases. Unfortunately, these lookups and tests are<br />
carried out even when information is available that renders some of these operations<br />
superfluous, leading to unnecessary runtime overheads. This paper introduces<br />
micro-specialization, an approach that uses relation- and query-specific<br />
information to specialize the DBMS code at runtime and thereby eliminate some of<br />
these overheads. We develop a taxonomy of approaches and specialization times<br />
and propose a general architecture that isolates most of the creation and execution<br />
of the specialized code sequences in a separate DBMS-independent module.<br />
Through three illustrative types of micro-specializations applied to PostgreSQL,<br />
we show that this approach requires minimal changes to a DBMS and can improve<br />
the performance simultaneously across a wide range of queries, modifications, and<br />
bulk-loading, in terms of storage, CPU usage, and I/O time of the TPC-H and TPC-C<br />
benchmarks.<br />
Towards Multi-Tenant Performance SLOs<br />
Willis Lang (University of Wisconsin-Madison)<br />
Srinath Shankar (Microsoft Jim Gray Systems Lab)<br />
Jignesh M. Patel (University of Wisconsin-Madison)<br />
Ajay Kalhan (Microsoft Corp.)<br />
As traditional and mission-critical relational database workloads migrate to the<br />
cloud in the form of Database-as-a-Service (DaaS), there is an increasing motivation<br />
to provide performance goals in Service Level Objectives (SLOs). Providing<br />
such performance goals is challenging for DaaS providers as they must balance the<br />
performance that they can deliver to tenants and the data center’s operating costs.<br />
In general, aggressively aggregating tenants on each server reduces the operating<br />
costs but degrades performance for the tenants, and vice versa. In this paper, we<br />
present a framework that takes as input the tenant workloads, their performance<br />
SLOs, and the server hardware that is available to the DaaS provider, and outputs<br />
a cost-effective recipe that specifies how much hardware to provision and how<br />
to schedule the tenants on each hardware resource. We evaluate our method and<br />
show that it produces effective solutions that can reduce the costs for the DaaS<br />
provider while meeting performance goals.<br />
Multi-Version Concurrency via Timestamp Range Conflict Management<br />
David Lomet (Microsoft Research)<br />
Alan Fekete (University of Sydney)<br />
Rui Wang (Microsoft Research)<br />
Peter Ward (University of Sydney)<br />
A database supporting multiple versions of records may use the versions to support<br />
queries of the past or to increase concurrency by enabling reads and writes to<br />
be concurrent. We introduce a new concurrency control approach that enables all<br />
SQL isolation levels including serializability to utilize multiple versions to increase<br />
concurrency while also supporting transaction time database functionality. The<br />
key insight is to manage a range of possible timestamps for each transaction that<br />
captures the impact of conflicts that have occurred. Using these ranges as constraints<br />
often permits concurrent access where lock-based concurrency control<br />
would block. This can also allow blocking instead of some aborts that are common<br />
in earlier multi-version concurrency techniques. Also, timestamp ranges can be<br />
used to conservatively find deadlocks without graph-based cycle detection. Thus,<br />
our multi-version support can enhance performance of current time data access via<br />
improved concurrency, while supporting transaction time functionality.<br />
SESSION 16: DATA EXTRACTION AND QUALITY<br />
Automatic Extraction of Structured Web Data with Domain Knowledge<br />
Nora Derouiche (Télécom ParisTech – CNRS LTCI)<br />
Bogdan Cautis (Télécom ParisTech – CNRS LTCI)<br />
Talel Abdessalem (Télécom ParisTech – CNRS LTCI)<br />
We present in this paper a novel approach for extracting structured data from the<br />
Web, whose goal is to harvest real-world items from template-based HTML pages<br />
(the structured Web). It illustrates a two-phase querying of the Web, in which an<br />
intentional description of the data that is targeted is first provided, in a flexible and<br />
widely applicable manner. The extraction process then leverages both the input<br />
description and the source structure. Our approach is domain-independent, in the<br />
sense that it applies to any relation, either flat or nested, describing real-world<br />
items. Extensive experiments on five different domains and comparison with the<br />
main state-of-the-art extraction systems from the literature illustrate its flexibility and<br />
precision. We advocate via our technique that automatic extraction and integration<br />
of complex structured data can be done fast and effectively, when the redundancy<br />
of the Web meets knowledge over the to-be-extracted data.<br />
Discovering Conservation Rules<br />
Lukasz Golab (University of Waterloo)<br />
Howard Karloff (AT&T Labs–Research)<br />
Flip Korn (AT&T Labs–Research)<br />
Barna Saha (AT&T Labs–Research)<br />
Divesh Srivastava (AT&T Labs–Research)<br />
Many applications process data in which there exists a “conservation law” between<br />
related quantities. For example, in traffic monitoring, every incoming event, such as<br />
a packet’s entering a router or a car’s entering an intersection, should ideally have<br />
an immediate outgoing counterpart. We propose a new class of constraints, Conservation<br />
Rules, that express the semantics and characterize the data quality of<br />
such applications. We give confidence metrics that quantify how strongly a conservation<br />
rule holds and present approximation algorithms (with error guarantees) for<br />
the problem of discovering a concise summary of subsets of the data that satisfy a<br />
given conservation rule. Using real data, we demonstrate the utility of conservation<br />
rules and we show order-of-magnitude performance improvements of our discovery<br />
algorithms over naive approaches.<br />
Answering Why-not Questions on Top-k Queries<br />
Zhian He (Hong Kong Polytechnic University)<br />
Eric Lo (Hong Kong Polytechnic University)<br />
After decades of effort working on database performance, the quality and the<br />
usability of database systems have received more attention in recent years. In<br />
particular, the feature of explaining missing tuples in a query result, or the so-called<br />
“why-not” questions, has recently become an active topic. In this paper, we study<br />
the problem of answering why-not questions on top-k queries. Our motivation is<br />
that many users like to use top-k queries when they are making multi-criteria<br />
decisions. However, they often feel frustrated when they are asked to quantify<br />
their preferences as a set of numeric weightings, and feel even more frustrated when they<br />
see that the query results do not include their expected answers. In this paper, we use<br />
the query-refinement method to approach the problem. Given as inputs the original<br />
top-k query and a set of missing tuples, our algorithm returns to the user a refined<br />
top-k query that includes the missing tuples. A case study and experimental results<br />
show that our approach returns high quality explanations to users efficiently.<br />
An Efficient Trie-based Method for Approximate Entity Extraction with<br />
Edit-Distance Constraints<br />
Dong Deng (Tsinghua University)<br />
Guoliang Li (Tsinghua University)<br />
Jianhua Feng (Tsinghua University)<br />
Dictionary-based entity extraction, which maps substrings of a document to predefined entities<br />
(e.g., person names or locations), has recently attracted much attention from the database<br />
community. To improve extraction recall, a recent trend is<br />
to provide approximate matching between substrings of the document and entities<br />
by tolerating minor errors. In this paper we study dictionary-based approximate<br />
entity extraction with edit-distance constraints. Existing methods have several<br />
limitations. First, they need to tune many parameters to achieve high performance.<br />
Second, they are inefficient for large edit-distance thresholds. We propose a trie-based<br />
method to address these problems. We first partition each entity into a set of<br />
segments, and then use a trie structure to index segments. To extract similar entities,<br />
we search segments from the document, and extend the matching segments<br />
in both entities and the document to find similar pairs. We develop an extension-based<br />
method to efficiently find similar string pairs by extending the matching<br />
segments. We optimize our partition scheme and select the best partition strategy<br />
to improve the extraction performance. Experimental results show that our method<br />
achieves much higher performance compared with state-of-the-art studies.<br />
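The segment idea in this abstract can be sketched in Python under one stated assumption: each entity is split into tau + 1 segments, where tau is the edit-distance threshold. The abstract does not give the segment count; tau + 1 is the usual pigeonhole choice, since tau edits can touch at most tau segments, so a substring within distance tau of the entity must contain at least one segment unchanged. The sketch replaces the paper's trie with a plain substring test, and all names are hypothetical.

```python
from typing import List


def partition(entity: str, tau: int) -> List[str]:
    """Split entity into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(entity), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)  # first `extra` segments get one more char
        segments.append(entity[pos:pos + length])
        pos += length
    return segments


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def may_match(entity: str, substring: str, tau: int) -> bool:
    """Filter step: keep the pair only if some segment occurs verbatim in substring."""
    return any(segment in substring for segment in partition(entity, tau))


print(partition("vancouver", 2))                 # ['van', 'cou', 'ver']
print(may_match("vancouver", "vancover", 2))     # True
print(edit_distance("vancouver", "vancover"))    # 1
print(may_match("vancouver", "toronto", 2))      # False
```

Pairs that pass `may_match` still need verification with `edit_distance`; the filter only rules out pairs that cannot be within the threshold.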
SESSION 17: TOP-K PROCESSING<br />
On Top-k Structural Similarity Search<br />
Pei Lee (University of British Columbia)<br />
Laks V.S. Lakshmanan (University of British Columbia)<br />
Jeffrey Xu Yu (Chinese University of Hong Kong)<br />
Searching for objects similar to a given query object in a network has numerous applications<br />
including web search and collaborative filtering. We use the notion of<br />
structural similarity to capture the commonality of two objects in a network, e.g.,<br />
if two nodes are referenced by the same node, they may be similar. Meeting-based<br />
methods including SimRank and P-Rank capture structural similarity very well.<br />
Deriving inspiration from PageRank, SimRank has gained popularity through its natural<br />
intuition and domain independence. Since it is computationally expensive, subsequent<br />
work has focused on optimizing and approximating the computation of<br />
SimRank. In this paper, we approach SimRank from a top-k querying perspective<br />
where given a query node v, we are interested in finding the top-k nodes that have<br />
the highest SimRank score w.r.t. v. The only known approaches for answering such<br />
queries are either a naive algorithm of computing the similarity matrix for all node<br />
pairs or computing the similarity vector by comparing the query node v with each<br />
other node independently, and then picking the top-k. None of these approaches<br />
can handle top-k structural similarity search efficiently by scaling to very large<br />
graphs consisting of millions of nodes. We propose an algorithmic framework called<br />
TopSim based on transforming the top-k SimRank problem on a graph G to one<br />
of finding the top-k nodes with highest authority on the product graph G × G. We<br />
further accelerate TopSim by merging similarity paths and develop a more efficient<br />
algorithm called TopSim-SM. Two heuristic algorithms, Trun-TopSim-SM and<br />
Prio-TopSim-SM, are also proposed to approximate TopSim-SM on scale-free graphs to<br />
trade accuracy for speed, based on truncated random walk and prioritizing propagation<br />
respectively. We analyze the accuracy and performance of the TopSim family of<br />
algorithms and report the results of a detailed experimental study.<br />
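As a point of reference, the naive baseline described in this abstract (compute SimRank for all node pairs, then pick the top-k for the query node) can be sketched in Python as follows. This is the standard iterative SimRank definition, not the TopSim algorithm; the decay factor of 0.8, the iteration count and the toy graph are assumptions for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple


def simrank(in_neighbors: Dict[str, List[str]], c: float = 0.8, iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Iteratively compute SimRank scores for every ordered node pair."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0                      # a node is maximally similar to itself
            elif not in_neighbors[a] or not in_neighbors[b]:
                new[(a, b)] = 0.0                      # no in-neighbors means no evidence
            else:
                total = sum(sim[(i, j)] for i in in_neighbors[a] for j in in_neighbors[b])
                new[(a, b)] = c * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
        sim = new
    return sim


def top_k_similar(in_neighbors: Dict[str, List[str]], query: str, k: int) -> List[str]:
    """Naive top-k: score all pairs, then rank the other nodes against the query."""
    sim = simrank(in_neighbors)
    others = [n for n in in_neighbors if n != query]
    others.sort(key=lambda n: sim[(query, n)], reverse=True)
    return others[:k]


# Hypothetical graph given as node -> list of in-neighbors: u references both a and b.
graph = {"u": [], "a": ["u"], "b": ["u"], "c": []}
print(top_k_similar(graph, "a", 1))  # ['b']
```

Each iteration touches every node pair, so the cost grows at least quadratically in the number of nodes, which is why this baseline does not scale to graphs with millions of nodes.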
Relevance Matters: Capitalizing on Less (Top-k Matching in<br />
Publish/Subscribe)<br />
Mohammad Sadoghi (University of Toronto)<br />
Hans-Arno Jacobsen (University of Toronto)<br />
The efficient processing of large collections of Boolean expressions plays a central<br />
role in major data intensive applications ranging from user-centric processing<br />
and personalization to real-time data analysis. Emerging applications such<br />
as computational advertising and selective information dissemination demand<br />
determining and presenting to an end-user only the most relevant content that is<br />
both user-consumable and suitable for the limited screen real estate of target devices.<br />
To retrieve the most relevant content, we present BE*-Tree, a novel indexing data<br />
structure designed for effective hierarchical top-k pattern matching, which as a<br />
by-product also reduces the operational cost of processing millions of patterns. To<br />
further reduce processing cost, BE*-Tree employs an adaptive and non-rigid space-cutting<br />
technique designed to efficiently index Boolean expressions over a high-dimensional<br />
continuous space. At the core of BE*-Tree lie two innovative ideas: (1)<br />
a bi-directional tree expansion built as a top-down (data and space clustering) and<br />
a bottom-up (space clustering) growth, which together enable indexing only non-empty<br />
continuous sub-spaces, and (2) an overlap-free splitting strategy. Finally, the<br />
performance of BE*-Tree is demonstrated through a comprehensive experimental comparison<br />
against state-of-the-art index structures for matching Boolean expressions.<br />
Efficiently Monitoring Top-k Pairs over Sliding Windows<br />
Zhitao Shen (UNSW)<br />
Muhammad Aamir Cheema (UNSW)<br />
Xuemin Lin (UNSW & ECNU)<br />
Wenjie Zhang (UNSW)<br />
Haixun Wang (Microsoft Research Asia)<br />
Top-k pairs queries have received significant attention from the research community.<br />
k-closest pairs queries, k-furthest pairs queries and their variants are among the<br />
most well studied special cases of the top-k pairs queries. In this paper, we present<br />
the first approach to answer a broad class of top-k pairs queries over sliding<br />
windows. Our framework handles multiple top-k pairs queries and each query is<br />
allowed to use a different scoring function, a different value of k and a different size<br />
of the sliding window. Although the number of possible pairs in the sliding window<br />
is quadratic in the number of objects N in the sliding window, we efficiently answer<br />
the top-k pairs query by maintaining a small subset of pairs called the K-skyband, which<br />
is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same<br />
scoring function, we need to maintain only one K-skyband. We present efficient<br />
techniques for the K-skyband maintenance and query answering. We conduct a<br />
detailed complexity analysis and show that the expected cost of our approach is<br />
reasonably close to the lower bound cost. We experimentally verify this by comparing<br />
our approach with a specially designed supreme algorithm that assumes the<br />
existence of an oracle and meets the lower bound cost.<br />
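The abstract does not define the K-skyband, so the following Python sketch rests on an assumption: it uses the common skyband definition (an item belongs to the K-skyband if fewer than K other items dominate it) and takes the two dimensions to be a pair's score (lower is better) and its expiry time in the window (later is better). It is a quadratic-time illustration, not the maintenance technique of the paper, and the data are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[float, int]  # (score, expiry time); lower score and later expiry are better


def dominates(p: Pair, q: Pair) -> bool:
    """p dominates q if it is no worse in both dimensions and strictly better in one."""
    return p[0] <= q[0] and p[1] >= q[1] and (p[0] < q[0] or p[1] > q[1])


def k_skyband(pairs: List[Pair], k: int) -> List[Pair]:
    """Keep every pair that is dominated by fewer than k other pairs."""
    result = []
    for i, q in enumerate(pairs):
        dominators = sum(1 for j, p in enumerate(pairs) if j != i and dominates(p, q))
        if dominators < k:
            result.append(q)
    return result


# Hypothetical pairs with their scores and expiry times.
pairs = [(1.0, 5), (2.0, 9), (3.0, 7), (4.0, 3)]
print(k_skyband(pairs, 1))  # [(1.0, 5), (2.0, 9)]
print(k_skyband(pairs, 2))  # [(1.0, 5), (2.0, 9), (3.0, 7)]
```

Under this assumed definition, a pair outside the K-skyband always has at least K better-scoring pairs that outlive it, so it can never enter a top-k answer for any k up to K and can be discarded.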
Processing and Notifying Range Top-k Subscriptions<br />
Albert Yu (Duke University)<br />
Pankaj K. Agarwal (Duke University)<br />
Jun Yang (Duke University)<br />
We consider how to support a large number of users over a wide-area network<br />
whose interests are characterised by range top-k continuous queries. Given an<br />
object update, we need to notify users whose top-k results are affected. Simple<br />
solutions include using a content-driven network to notify all users whose interest<br />
ranges contain the update (ignoring top-k), or using a server to compute only the<br />
affected queries and notifying them individually. The former solution generates too<br />
much network traffic, while the latter overwhelms the server. We present a geometric<br />
framework for the problem that allows us to describe the set of affected queries<br />
succinctly with messages that can be efficiently disseminated using content-driven<br />
networks. We give fast algorithms to reformulate each update into a set of messages<br />
whose number is provably optimal, with or without knowing all user interests.<br />
We also present extensions to our solution, including an approximate algorithm that<br />
trades off between the cost of server-side reformulation and that of user-side post-processing,<br />
as well as efficient techniques for batch updates.<br />
SESSION 18: SIMILARITY<br />
Efficient Exact Similarity Searches using Multiple Token Orderings<br />
Jongik Kim (Chonbuk National University)<br />
Hongrae Lee (Google Inc.)<br />
Similarity searches are essential in many applications including data cleaning and near<br />
duplicate detection. Many similarity search algorithms first generate candidate records,<br />
and then identify true matches among them. A major focus of those algorithms has<br />
been on how to reduce the number of candidate records in the early stage of similarity<br />
query processing. One of the most commonly used techniques to reduce the candidate<br />
size is the prefix filtering principle, which exploits the document frequency ordering of<br />
tokens. In this paper, we propose a novel partitioning technique that considers multiple<br />
token orderings based on token co-occurrence statistics. Experimental results show<br />
that the proposed technique is effective in reducing the number of candidate records<br />
and as a result improves the performance of existing algorithms significantly.<br />
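The prefix filtering principle named in this abstract can be illustrated with a short Python sketch. It assumes Jaccard similarity with threshold t and the standard prefix length |x| - ceil(t * |x|) + 1 over tokens sorted by ascending document frequency; the abstract fixes neither the similarity function nor these details, so they are assumptions, as are the function names and sample records. The sketch shows the single-ordering baseline, not the multiple-ordering technique proposed in the paper.

```python
import math
from collections import Counter
from typing import Callable, List, Set, Tuple


def frequency_order(records: List[List[str]]) -> Callable[[str], Tuple[int, str]]:
    """Build a sort key that puts rare tokens (low document frequency) first."""
    df = Counter(tok for rec in records for tok in set(rec))
    return lambda tok: (df[tok], tok)


def prefix(record: List[str], threshold: float, key: Callable[[str], Tuple[int, str]]) -> Set[str]:
    """Return the prefix tokens of a record under the global token ordering."""
    tokens = sorted(set(record), key=key)
    length = len(tokens) - math.ceil(threshold * len(tokens)) + 1
    return set(tokens[:length])


def candidate_pairs(records: List[List[str]], threshold: float) -> List[Tuple[int, int]]:
    """Keep only record pairs whose prefixes share at least one token."""
    key = frequency_order(records)
    prefixes = [prefix(r, threshold, key) for r in records]
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if prefixes[i] & prefixes[j]]


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two token lists, used to verify candidates."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z", "w"]]
print(candidate_pairs(records, 0.6))  # [(0, 1)]
```

Only the surviving candidate pairs need the full `jaccard` check; pairs with disjoint prefixes cannot reach the threshold.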
Efficient Graph Similarity Joins with Edit Distance Constraints<br />
Xiang Zhao (The University of New South Wales & NICTA)<br />
Chuan Xiao (The University of New South Wales)<br />
Xuemin Lin (The University of New South Wales & East China Normal University)<br />
Wei Wang (The University of New South Wales)<br />
Graphs are widely used to model complicated data semantics in many applications<br />
in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend<br />
is to tolerate noise arising from various sources, such as erroneous data entry, and<br />
find similarity matches. In this paper, we study the graph similarity join problem that<br />
returns pairs of graphs such that their edit distances are no larger than a threshold.<br />
Inspired by the q-gram idea for the string similarity problem, our solution extracts<br />
paths from graphs as features for indexing. We establish a lower bound of common<br />
features to generate candidates. An efficient algorithm is proposed to exploit both<br />
matching and mismatching features to improve the filtering and verification on candidates.<br />
We demonstrate that the proposed algorithm significantly outperforms existing<br />
approaches with extensive experiments on publicly available datasets.<br />
Parameter-Free Determination of Distance Thresholds for Metric<br />
Distance Constraints<br />
Shaoxu Song (Tsinghua University)<br />
Lei Chen (The Hong Kong University of Science and Technology)<br />
Hong Cheng (The Chinese University of Hong Kong)<br />
The importance of introducing distance constraints to data dependencies, such as<br />
differential dependencies (DDs) [28], has recently been recognized. The metric distance<br />
constraints are tolerant to small variations, which enables them to apply to a wide range of<br />
data quality checking applications, such as detecting data violations. However, the<br />
determination of distance thresholds for the metric distance constraints is non-trivial.<br />
It often relies on a truth data instance which embeds the distance constraints.<br />
To find useful distance threshold patterns from data, there are several guidelines<br />
of statistical measures to specify, e.g., support, confidence and dependent quality.<br />
Unfortunately, given a data instance, users might not have any knowledge about<br />
the data distribution, thus it is very challenging to set the right parameters. In<br />
this paper, we study the determination of distance thresholds for metric distance<br />
constraints, in a parameter-free style. Specifically, we compute an expected utility<br />
based on the statistical measures from the data. According to our analysis as well<br />
as experimental verification, distance threshold patterns with higher expected<br />
utility could offer better usage in real applications, such as violation detection. We<br />
then develop efficient algorithms to determine the distance thresholds having the<br />
maximum expected utility. Finally, our extensive experimental evaluation demonstrates<br />
the effectiveness and efficiency of the proposed methods.<br />
Random Error Reduction in Similarity Search on Time Series:<br />
A Statistical Approach<br />
Wush Chi-Hsuan Wu (Academia Sinica)<br />
Mi-Yen Yeh (Academia Sinica)<br />
Jian Pei (Simon Fraser University)<br />
Errors in measurement can be categorized into two types: systematic errors that<br />
are predictable, and random errors that are inherently unpredictable and have null<br />
expected value. Random error is always present in a measurement. More often<br />
than not, readings in time series may contain inherent random errors due to causes<br />
like dynamic error, drift, noise, hysteresis, digitalization error and limited sampling<br />
frequency. Random errors may affect the quality of time series analysis substantially.<br />
Unfortunately, most of the existing time series mining and analysis methods,<br />
such as similarity search, clustering, and classification tasks, do not address random<br />
errors, possibly because random error in a time series, which can be modeled as<br />
a random variable of unknown distribution, is hard to handle. In this paper, we<br />
tackle this challenging problem. Taking similarity search as an example, which is an<br />
essential task in time series analysis, we develop MISQ, a statistical approach for<br />
random error reduction in time series analysis. The major intuition in our method is<br />
to use only the readings at different time instants in a time series to reduce random<br />
errors. We achieve a highly desirable property in MISQ: it can ensure that the recall<br />
is above a user-specified threshold. An extensive empirical study on 20 benchmark<br />
real data sets clearly shows that our method can lead to better performance than<br />
the baseline method without random error reduction in real applications such as<br />
classification. Moreover, MISQ achieves good quality in similarity search.<br />
SESSION 19: TEXT AND STRINGS<br />
Optimizing Statistical Information Extraction Programs Over<br />
Evolving Text<br />
Fei Chen (HP Labs China)<br />
Xixuan Feng (University of Wisconsin-Madison)<br />
Christopher Re (University of Wisconsin-Madison)<br />
Min Wang (HP Labs China)<br />
Statistical information extraction (IE) programs are increasingly used to build real-world<br />
IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical<br />
IE approaches consider the text corpora underlying the extraction program to be<br />
static. However, many real-world text corpora are dynamic (documents are inserted,<br />
modified, and removed). As the corpus evolves, IE programs must be applied<br />
repeatedly to consecutive corpus snapshots to keep extracted information up to<br />
date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive<br />
snapshots may change very little, but unaware of this, the program must<br />
run again from scratch. In this paper, we present CRFlex, a system that efficiently<br />
executes such repeated statistical IE, by recycling previous IE results to enable incremental<br />
update. We focus on statistical IE programs which use a leading statistical<br />
model, Conditional Random Fields (CRFs). We show how to model properties<br />
of the CRF inference algorithms for incremental update and how to exploit them<br />
to correctly recycle previous inference results. Then we show how to efficiently<br />
capture and store intermediate results of IE programs for subsequent recycling.<br />
We find that there is a tradeoff between the I/O cost spent on reading and writing<br />
intermediate results, and the CPU cost we can save from recycling those intermediate<br />
results. Therefore we present a cost-based solution to determine the most efficient<br />
recycling approach for any given CRF-based IE program and an evolving corpus.<br />
We present extensive experiments with CRF-based IE programs for 3 IE tasks over<br />
a real-world data set to demonstrate the utility of our approach.<br />
Approximate String Membership Checking: A Multiple Filter,<br />
Optimization-Based Approach<br />
Chong Sun (University of Wisconsin-Madison)<br />
Jeffrey F. Naughton (University of Wisconsin-Madison)<br />
Siddharth Barman (University of Wisconsin-Madison)<br />
We consider the approximate string membership checking (ASMC) problem of extracting<br />
all the strings or substrings in a document that approximately match some<br />
string in a given dictionary. To solve this problem, the current state-of-the-art approach<br />
involves first applying an approximate, fast filter, then applying a more expensive<br />
exact verification algorithm to the strings that pass the filter. Correspondingly,<br />
many string filters have been proposed. We note that different filters are good at<br />
eliminating different strings, depending on the characteristics of the strings in both<br />
the documents and the dictionary. We suspect that no single filter will dominate all<br />
other filters everywhere. Given an ASMC problem instance and a set of string filters,<br />
we need to select the optimal filter to maximize the performance. Furthermore, in<br />
our experiments we found that in some cases a sequence of filters dominates any<br />
of the filters of the sequence in isolation, and that the best set of filters and their<br />
ordering depend upon the specific problem instance encountered. Accordingly, we<br />
propose that the approximate match problem be viewed as an optimization problem,<br />
and evaluate a number of techniques for solving this optimization problem.<br />
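The filter-then-verify pattern described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: it assumes that "approximately match" means edit distance at most k, and it uses a simple length filter as the cheap first stage. The names `length_filter`, `edit_distance` and `approximate_matches` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact (expensive) verification: Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def length_filter(candidate: str, entry: str, k: int) -> bool:
    """Cheap filter: strings whose lengths differ by more than k
    cannot be within edit distance k, so they are safely discarded."""
    return abs(len(candidate) - len(entry)) <= k


def approximate_matches(candidates, dictionary, k):
    """Return (candidate, entry) pairs within edit distance k (assumed similarity measure)."""
    matches = []
    for cand in candidates:
        for entry in dictionary:
            # Only pairs that pass the cheap filter reach the expensive verification.
            if length_filter(cand, entry, k) and edit_distance(cand, entry) <= k:
                matches.append((cand, entry))
    return matches
```

In this sketch the filter never discards a true match; it only reduces how often the quadratic verification runs. Choosing which filters to apply, and in what order, is the optimization question the abstract raises.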
On Text Clustering with Side Information<br />
Charu C. Aggarwal (IBM T. J. Watson Research Center)<br />
Yuchen Zhao (University of Illinois at Chicago)<br />
Philip S. Yu (University of Illinois at Chicago)<br />
Text clustering has become an increasingly important problem in recent years<br />
because of the tremendous amount of unstructured data which is available in various<br />
forms in online forums such as the web, social networks, and other information<br />
networks. In most cases, the data is not purely available in text form. A lot of side-information<br />
is available along with the text documents. Such side-information may be of<br />
different kinds, such as the links in the document, user-access behavior from web logs,<br />
or other non-textual attributes which are embedded into the text document. Such<br />
attributes may contain a tremendous amount of information for clustering purposes.<br />
However, the relative importance of this side-information may be difficult to estimate,<br />
especially when some of the information is noisy. In such cases, it can be risky to<br />
incorporate side-information into the clustering process, because it can either improve<br />
the quality of the representation for clustering, or can add noise to the process. Therefore,<br />
we need a principled way to perform the clustering process, so as to maximize<br />
the advantages from using this side information. In this paper, we design an algorithm<br />
which combines classical partitioning algorithms with probabilistic models in order to<br />
create an effective clustering approach. We present experimental results on a number<br />
of real data sets in order to illustrate the advantages of using such an approach.<br />
Fast SLCA and ELCA Computation for XML Keyword Queries based on<br />
Set Intersection<br />
Junfeng Zhou (Yanshan University)<br />
Zhifeng Bao (National University of Singapore)<br />
Wei Wang (The University of New South Wales)<br />
Tok Wang Ling (National University of Singapore)<br />
Ziyang Chen (Yanshan University)<br />
Xudong Lin (Yanshan University)<br />
Jingfeng Guo (Yanshan University)<br />
In this paper, we focus on efficient keyword query processing for XML data based<br />
on the SLCA and ELCA semantics. We propose a novel form of inverted lists for keywords<br />
which include IDs of nodes that directly or indirectly contain a given keyword.<br />
We propose a family of efficient algorithms that are based on the set intersection operation<br />
for both semantics. We show that the problem of SLCA/ELCA computation<br />
becomes finding a set of nodes that appear in all involved inverted lists and satisfy<br />
certain conditions. We also propose several optimization techniques to further improve<br />
the query processing performance. We have conducted extensive experiments<br />
with many alternative methods. The results demonstrate that our proposed methods<br />
outperform previous methods by up to two orders of magnitude in many cases.<br />
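As a rough illustration of the set-intersection view for the SLCA case only, the sketch below assumes each node is identified by a Dewey-style label (a tuple of child positions from the root) and that each inverted list already holds every node that directly or indirectly contains the keyword. The label scheme and the function names are assumptions made for this example; the paper's algorithms and optimizations are not reproduced here.

```python
from typing import Iterable, List, Tuple

Dewey = Tuple[int, ...]  # hypothetical node ID: path of child positions from the root


def is_proper_ancestor(a: Dewey, d: Dewey) -> bool:
    """True if node a lies strictly above node d in the tree."""
    return len(a) < len(d) and d[:len(a)] == a


def slca(inverted_lists: Iterable[Iterable[Dewey]]) -> List[Dewey]:
    """Nodes that appear in every inverted list and have no descendant that does."""
    lists = [set(lst) for lst in inverted_lists]
    # A node contains all keywords exactly when it appears in all lists.
    common = set.intersection(*lists)
    # Keep only the lowest such nodes.
    return sorted(n for n in common
                  if not any(is_proper_ancestor(n, m) for m in common))


# Example tree: (0,) is the root; "xml" occurs under (0,0,0) and (0,1,0),
# "query" occurs under (0,0,1).
xml_list = [(0,), (0, 0), (0, 0, 0), (0, 1), (0, 1, 0)]
query_list = [(0,), (0, 0), (0, 0, 1)]
result = slca([xml_list, query_list])
```

Here the intersection is {(0,), (0, 0)}, and the root is discarded because its descendant (0, 0) also contains both keywords.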
SESSION 20: QUERY PROCESSING II<br />
Optimization of Massive Pattern Queries by Dynamic<br />
Configuration Morphing<br />
Nikolay Laptev (University of California, Los Angeles)<br />
Carlo Zaniolo (University of California, Los Angeles)<br />
Complex pattern queries play a critical role in many applications that must efficiently<br />
search databases and data streams. Current techniques support the search<br />
for multiple patterns using deterministic or non-deterministic automata. In practice,<br />
however, the static pattern representation does not fully utilize available system<br />
resources, subsequently suffering from poor performance. Therefore a low-overhead<br />
auto-reconfigurable automaton is needed that optimizes pattern matching<br />
performance. In this paper, we propose a dynamic system that entails the efficient<br />
and reliable evaluation of a very large number of pattern queries on a resource-constrained<br />
system under changing stress-load. Our system prototype, Morpheus, precomputes<br />
several query pattern representations, named templates, which are then<br />
morphed into a required form during run-time. Morpheus uses templates to speed<br />
up dynamic automaton reconfiguration. Results from empirical studies confirm the<br />
benefits of our approach, with three orders of magnitude improvement achieved in<br />
the overall pattern matching performance with the help of dynamic reconfiguration.<br />
This is accomplished with only a modest increase in amortized memory usage.<br />
Three-level Processing of Multiple Aggregate Continuous Queries<br />
Shenoda Guirguis (University of Pittsburgh)<br />
Mohamed A. Sharaf (The University of Queensland)<br />
Panos K. Chrysanthis (University of Pittsburgh)<br />
Alexandros Labrinidis (University of Pittsburgh)<br />
Aggregate Continuous Queries (ACQs) are a very popular class of Continuous<br />
Queries (CQs) and have a potentially high execution cost. As such, optimizing<br />
the processing of ACQs is imperative for Data Stream Management Systems<br />
(DSMSs) to reach their full potential in supporting (critical) monitoring applications.<br />
For multiple ACQs that vary in window specifications and pre-aggregation filters,<br />
existing multiple ACQs optimization schemes assume a processing model where<br />
each ACQ is computed as a final-aggregation of a sub-aggregation. In this paper,<br />
we propose a novel processing model for ACQs, called TriOps, with the goal of<br />
minimizing the repetition of operator execution at the sub-aggregation level. We<br />
also propose TriWeave, a TriOps-aware multi-query optimizer. We analytically and<br />
experimentally demonstrate the performance gains of our proposed schemes, which<br />
show their superiority over alternative schemes. Finally, we generalize TriWeave to<br />
incorporate the classical subsumption-based multi-query optimization techniques.<br />
Accelerating Range Queries For Brain Simulations<br />
Farhan Tauheed (EPFL)<br />
Laurynas Biveinis (Aalborg University)<br />
Thomas Heinis (EPFL)<br />
Felix Schürmann (EPFL)<br />
Henry Markram (EPFL)<br />
Anastasia Ailamaki (EPFL)<br />
Neuroscientists increasingly use computational tools in building and simulating<br />
models of the brain. The amounts of data involved in these simulations are immense<br />
and efficiently managing this data is key. One particular problem in analyzing this<br />
data is the scalable execution of range queries on spatial models of the brain.<br />
Known indexing approaches do not perform well even on today’s small models<br />
which represent a small fraction of the brain, containing only a few million densely<br />
packed spatial elements. The problem of current approaches is that as the level of<br />
detail in the models increases, so does the overlap in the tree structure,<br />
ultimately slowing down query execution. The neuroscientists’ need to work<br />
with bigger and more detailed (denser) models thus motivates us to develop a new<br />
indexing approach. To this end we develop FLAT, a scalable indexing approach for<br />
dense data sets. We base the development of FLAT on the key observation that<br />
current approaches suffer from overlap in case of dense data sets. We hence design<br />
FLAT as an approach with two phases, each independent of density. In the first<br />
phase it uses a traditional spatial index to retrieve an initial object efficiently. In the<br />
second phase it traverses the initial object’s neighborhood to retrieve the remaining<br />
query result. Our experimental results show that FLAT not only outperforms R-Tree<br />
variants by a factor of two to eight but that it also achieves independence<br />
from data set size and density.<br />
Keyword Query Reformulation on Structured Data<br />
Junjie Yao (Peking University)<br />
Bin Cui (Peking University)<br />
Liansheng Hua (Peking University)<br />
Yuxin Huang (Peking University)<br />
Textual web pages dominate web search engines nowadays. However, there is<br />
also a striking increase of structured data on the web. Efficient keyword query<br />
processing on structured data has attracted considerable attention, but effective query<br />
understanding has yet to be investigated. In this paper, we focus on the problem of<br />
keyword query reformulation in the structured data scenario. These reformulated<br />
queries provide alternative descriptions of the original input. They could better capture<br />
users’ information needs and guide users to explore related items in the target<br />
structured data. We propose an automatic keyword query reformulation approach<br />
by exploiting structural semantics in the underlying structured data sources. The<br />
reformulation solution is decomposed into two stages, i.e., offline term relation<br />
extraction and online query generation. We first utilize a heterogeneous graph to<br />
model the words and items in structured data, and design an enhanced Random<br />
Walk approach to extract relevant terms from the graph context. In the online query<br />
reformulation stage, we introduce an efficient probabilistic generation module to<br />
suggest substitutable reformulated queries. Extensive experiments are conducted<br />
on a real-life data set, and our approach yields promising results.<br />
SESSION 21: DATA MINING<br />
Predicting Approximate Protein-DNA Binding Cores Using<br />
Association Rule Mining<br />
Po-Yuen Wong (The Chinese University of Hong Kong)<br />
Tak-Ming Chan (The Chinese University of Hong Kong)<br />
Man-Hon Wong (The Chinese University of Hong Kong)<br />
Kwong-Sak Leung (The Chinese University of Hong Kong)<br />
The studies of protein-DNA bindings between transcription factors (TFs) and transcription<br />
factor binding sites (TFBSs) are important bioinformatics topics. High-resolution<br />
(length490) are shown promising in identifying<br />
accurate binding cores without using any 3D structures. While the current association<br />
rule mining method on this problem addresses exact sequences only, the most<br />
recent ad hoc method for approximation does not establish any formal model and is<br />
limited by experimentally known patterns. As biological mutations are common, it is<br />
desirable to formally extend the exact model into an approximate one. In this paper,<br />
we formalize the problem of mining approximate protein-DNA association rules<br />
from sequence data and propose a novel efficient algorithm to predict protein-DNA<br />
binding cores. Our two-phase algorithm first constructs two compact intermediate<br />
structures called frequent sequence tree (FS-Tree) and frequent sequence class tree<br />
(FSCTree). Approximate association rules are efficiently generated from the structures<br />
and bioinformatics concepts (position weight matrix and information content)<br />
are further employed to prune meaningless rules. Experimental results on real data<br />
show the performance and applicability of the proposed algorithm.<br />
Upgrading Uncompetitive Products Economically<br />
Hua Lu (Aalborg University)<br />
Christian S. Jensen (Aarhus University)<br />
The skyline of a multidimensional point set consists of the points that are not<br />
dominated by other points. In a scenario where product features are represented by<br />
multidimensional points, the skyline points may be viewed as representing competitive<br />
products. A product provider may wish to upgrade uncompetitive products to<br />
become competitive, but wants to take into account the upgrading cost. We study<br />
the top-k product upgrading problem. Given a set P of competitor products, a set<br />
T of products that are candidates for upgrade, and an upgrading cost function f<br />
that applies to T, the problem is to return the k products in T that can be upgraded<br />
to not be dominated by any products in P at the lowest cost. This problem is nontrivial<br />
due not only to the large data set sizes, but also to the many possibilities for<br />
upgrading a product. We identify and provide solutions for the different options for<br />
upgrading an uncompetitive product, and combine the solutions into a single solution.<br />
We also propose a spatial join-based solution that assumes P and T are indexed<br />
by an R-tree. Given a set of products in the same R-tree node, we derive three lower<br />
bounds on their upgrading costs. These bounds are employed by the join approach<br />
to prune upgrade candidates with uncompetitive upgrade costs. Empirical studies<br />
with synthetic and real data show that the join approach is efficient and scalable.<br />
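The dominance and skyline notions used in this abstract can be made concrete with a small sketch. It assumes that smaller values are better in every dimension (for example price and weight), which the abstract does not specify, and it uses a quadratic scan rather than the R-tree join the authors propose. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def skyline(points: List[Point]) -> List[Point]:
    """Points not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def is_competitive(t: Point, competitors: List[Point]) -> bool:
    """A candidate product t is competitive if no product in P dominates it."""
    return not any(dominates(p, t) for p in competitors)


products = [(1, 4), (2, 2), (3, 3), (4, 1)]
sky = skyline(products)
```

In this example (3, 3) is the only uncompetitive product, because (2, 2) is better in both dimensions. The upgrading problem in the abstract asks how to move such a point cheaply until `is_competitive` would hold.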
Attribute-Based Subsequence Matching and Mining<br />
Yu Peng (The Hong Kong University of Science and Technology)<br />
Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)<br />
Liangliang Ye (The Hong Kong University of Science and Technology)<br />
Philip S. Yu (University of Illinois at Chicago)<br />
Sequence analysis is very important in our daily life. Typically, each sequence is<br />
associated with an ordered list of elements. For example, in a movie rental application,<br />
a customer’s movie rental record containing an ordered list of movies is a<br />
sequence example. Most studies about sequence analysis focus on subsequence<br />
matching which finds all sequences stored in the database such that a given query<br />
sequence is a subsequence of each of these sequences. In many applications,<br />
elements are associated with properties or attributes. For example, each movie is<br />
associated with some attributes like “Director” and “Actors”. Unfortunately, to the<br />
best of our knowledge, no existing study on sequence analysis considers<br />
the attributes of elements. In this paper, we propose two problems. The first problem<br />
is: given a query sequence and a set of sequences, considering the attributes of<br />
elements, we want to find all sequences which are matched by this query sequence.<br />
This problem is called attribute-based subsequence matching (ASM). All existing<br />
applications for the traditional subsequence matching problem can also be applied<br />
to our new problem provided that we are given the attributes of elements. We propose<br />
an efficient algorithm for problem ASM. The key to the efficiency of this<br />
algorithm is to compress each whole sequence with potentially many associated<br />
attributes into just a triplet of numbers. By dealing with these very compressed<br />
representations, we greatly speed up the attribute-based subsequence matching. The<br />
second problem is to find all frequent attribute-based subsequences. We also adapt<br />
an existing efficient algorithm for this second problem to show we can use the algorithm<br />
developed for the first problem. Empirical studies show that our algorithms<br />
are scalable in large datasets. In particular, our algorithms run at least an order of<br />
magnitude faster than a straightforward method in most cases. This work can stimulate<br />
a number of existing data mining problems which are fundamentally based on<br />
subsequence matching such as sequence classification, frequent sequence mining,<br />
motif detection and sequence matching in bioinformatics.<br />
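For orientation, the traditional subsequence matching problem that this abstract starts from (find all stored sequences of which the query is a subsequence) can be sketched as follows. This is the straightforward baseline only; it ignores element attributes and does not reproduce the triplet compression used for ASM. The function names and the movie titles are made up for the example.

```python
from typing import Hashable, List, Sequence


def is_subsequence(query: Sequence[Hashable], sequence: Sequence[Hashable]) -> bool:
    """True if the query elements occur in the sequence in the same order,
    not necessarily contiguously."""
    it = iter(sequence)
    # Each membership test advances the iterator past the matched element,
    # so later query elements can only match later positions.
    return all(item in it for item in query)


def subsequence_matches(query, database: List[Sequence[Hashable]]):
    """All stored sequences that contain the query as a subsequence."""
    return [seq for seq in database if is_subsequence(query, seq)]


rentals = [
    ["Alien", "Brazil", "Casablanca"],
    ["Brazil", "Alien"],
    ["Alien", "Casablanca"],
]
hits = subsequence_matches(["Alien", "Casablanca"], rentals)
```

The second rental record is not a match because it never rents "Casablanca". An attribute-based variant would compare attributes such as the director instead of the titles themselves.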
Integrating Frequent Pattern Mining from Multiple Data Domains<br />
for Classification<br />
Dhaval Patel (National University of Singapore)<br />
Wynne Hsu (National University of Singapore)<br />
Mong Li Lee (National University of Singapore)<br />
Many frequent pattern mining algorithms have been developed for categorical,<br />
numerical, time series, or interval data. However, little attention has been given to<br />
integrating these algorithms so as to mine frequent patterns from multiple domain<br />
datasets for classification. In this paper, we introduce the notion of a heterogeneous<br />
pattern to capture the associations among different kinds of data. We propose a<br />
unified framework for mining multiple domain datasets and design an iterative algorithm<br />
called HTMiner. HTMiner discovers essential heterogeneous patterns for classification<br />
and performs instance elimination. This instance elimination step reduces<br />
the problem size progressively by removing training instances which are correctly<br />
covered by the discovered essential heterogeneous pattern. Experiments on two real-world<br />
datasets show that HTMiner is efficient and can significantly improve the<br />
classification accuracy.<br />
SESSION 22:<br />
SCIENTIFIC DATA, ANALYSIS AND VISUALIZATION<br />
Efficient Versioning for Scientific Array Databases<br />
Adam Seering (MIT CSAIL)<br />
Philippe Cudre-Mauroux (University of Fribourg)<br />
Samuel Madden (MIT CSAIL)<br />
Michael Stonebraker (MIT CSAIL)<br />
In this paper, we describe a versioned database storage manager we are developing<br />
for the SciDB scientific database. The system is designed to efficiently store and<br />
retrieve array-oriented data, exposing a “no-overwrite” storage model in which<br />
each update creates a new “version” of an array. This makes it possible to perform<br />
comparisons of versions produced at different times or by different algorithms, and<br />
to create complex chains and trees of versions. We present algorithms to efficiently<br />
encode these versions, minimizing storage space while still providing efficient access<br />
to the data. Additionally, we present an optimal algorithm that, given a long<br />
sequence of versions, determines which versions to encode in terms of each other<br />
(using delta compression) to minimize total storage space or query execution cost.<br />
We compare the performance of these algorithms on real-world data sets from the<br />
National Oceanic and Atmospheric Administration (NOAA), OpenStreetMaps, and<br />
several other sources. We show that our algorithms provide better performance<br />
than existing version control systems not optimized for array data, both in terms of<br />
storage size and access time, and that our delta-compression algorithms are able to<br />
substantially reduce the total storage space when versions exist with a high degree<br />
of similarity.<br />
Multidimensional Analysis of Atypical Events in Cyber-Physical Data<br />
Lu-An Tang (UIUC)<br />
Xiao Yu (UIUC)<br />
Sangkyum Kim (UIUC)<br />
Jiawei Han (UIUC)<br />
Wen-Chih Peng (National Chiao Tung University)<br />
Yizhou Sun (UIUC)<br />
Hector Gonzalez (Google)<br />
Sebastian Seith (Morning Star)<br />
A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras)<br />
with cyber (or informational) components to form a situation-integrated analytical<br />
system that may respond intelligently to dynamic changes of the real-world situations.<br />
CPS claims many promising applications, such as traffic observation, battlefield<br />
surveillance and sensor-network-based monitoring. One important research<br />
topic in CPS is atypical event analysis, i.e., retrieving the events from<br />
large amounts of data and analyzing them with spatial, temporal and other multidimensional<br />
information. Many traditional approaches are not feasible for such<br />
analysis since they use numeric measures and cannot describe the complex atypical<br />
events. In this study, we propose a new model of atypical cluster to effectively<br />
represent those events and efficiently retrieve them from massive data. The micro-cluster<br />
is designed to summarize individual events, and the macro-cluster is used<br />
to integrate the information from multiple events. To facilitate scalable, flexible and<br />
online analysis, the concept of significant cluster is defined and a guided clustering<br />
algorithm is proposed to retrieve significant clusters in an efficient manner. We<br />
conduct experiments on real datasets with the size of more than 50 GB; the results<br />
show that the proposed method can provide more accurate information with only<br />
15% to 20% of the time cost of the baselines.<br />
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking<br />
Fabian Keller (Karlsruhe Institute of Technology)<br />
Emmanuel Müller (Karlsruhe Institute of Technology)<br />
Klemens Böhm (Karlsruhe Institute of Technology)<br />
Outlier mining is a major task in data analysis. Outliers are objects that highly deviate<br />
from regular objects in their local neighborhood. Density-based outlier ranking<br />
methods score each object based on its degree of deviation. In many applications,<br />
these ranking methods degenerate to random listings due to low contrast between<br />
outliers and regular objects. Outliers do not show up in the scattered full space;<br />
they are hidden in multiple high contrast subspace projections of the data. Measuring<br />
the contrast of such subspaces for outlier rankings is an open research challenge.<br />
In this work, we propose a novel subspace search method that selects high<br />
contrast subspaces for density-based outlier ranking. It is designed as a pre-processing<br />
step to outlier ranking algorithms. It searches for high contrast subspaces with<br />
a significant amount of conditional dependence among the subspace dimensions.<br />
With our approach, we propose a first measure for the contrast of subspaces. Thus,<br />
we enhance the quality of traditional outlier rankings by computing outlier scores in<br />
high contrast projections only. The evaluation on real and synthetic data shows that<br />
our approach outperforms traditional dimensionality reduction techniques, naive<br />
random projections as well as state-of-the-art subspace search techniques and<br />
provides enhanced quality for outlier ranking.<br />
Extracting Analyzing and Visualizing Triangle K-Core Motifs<br />
within Networks<br />
Yang Zhang (The Ohio State University)<br />
Srinivasan Parthasarathy (The Ohio State University)<br />
Cliques are topological structures that usually provide important information<br />
for understanding the structure of a graph or network. However, detecting and<br />
extracting cliques efficiently is known to be very hard. In this paper, we define and<br />
introduce the notion of a Triangle K-Core, a simpler topological structure and one<br />
that is more tractable and can moreover be used as a proxy for extracting clique-like<br />
structure from large graphs. Based on this definition we first develop a localized<br />
algorithm for extracting Triangle K-Cores from large graphs. Subsequently we<br />
extend the simple algorithm to accommodate dynamic graphs (where edges can<br />
be dynamically added and deleted). Finally, we extend the basic definition to support<br />
various template pattern cliques with applications to network visualization and<br />
event detection on graphs and networks. Our empirical results reveal the efficiency<br />
and efficacy of the proposed methods on many real world datasets.<br />
SESSION 23: SIMILARITY SEARCH AND DETECTION<br />
Horizontal Reduction: Instance-Level Dimensionality Reduction for<br />
Similarity Search in Large Document Databases<br />
Min Soo Kim (KAIST)<br />
Kyu-Young Whang (KAIST)<br />
Yang-Sae Moon (Kangwon National University)<br />
Dimensionality reduction is essential in text mining since the dimensionality of text<br />
documents could easily reach several tens of thousands. Most recent efforts on<br />
dimensionality reduction, however, are not adequate for large document databases<br />
due to lack of scalability. We hence propose a new type of simple but effective<br />
dimensionality reduction, called horizontal (dimensionality) reduction, for large<br />
document databases. Horizontal reduction converts each text document to a few<br />
bitmap vectors and provides tight lower bounds of inter-document distances using<br />
those bitmap vectors. Bitmap representation is very simple and extremely fast, and<br />
its instance-based nature makes it suitable for large and dynamic document databases.<br />
Using the proposed horizontal reduction, we develop an efficient k-nearest<br />
neighbor (k-NN) search algorithm for text mining such as classification and clustering,<br />
and we formally prove its correctness. The proposed algorithm decreases I/O<br />
and CPU overheads simultaneously since horizontal reduction (1) reduces the number<br />
of accesses to documents significantly by exploiting the bitmap-based lower<br />
bounds in filtering dissimilar documents at an early stage, and accordingly, (2)<br />
decreases the number of CPU-intensive computations for obtaining a real distance<br />
between high-dimensional document vectors. Extensive experimental results show<br />
that horizontal reduction improves the performance of the reduction (preprocessing)<br />
process by one to two orders of magnitude compared with existing reduction<br />
techniques, and our k-NN search algorithm significantly outperforms the existing<br />
ones by one to three orders of magnitude.<br />
Adaptive Windows for Duplicate Detection<br />
Uwe Draisbach (Hasso-Plattner-Institute)<br />
Felix Naumann (Hasso-Plattner-Institute)<br />
Sascha Szott (Zuse Institute)<br />
Oliver Wonneberg (R. Lindner GmbH & Co. KG)<br />
Duplicate detection is the task of identifying all groups of records within a data set<br />
that each represent the same real-world entity. This task is difficult, because<br />
(i) representations might differ slightly, so some similarity measure must be defined<br />
to compare pairs of records and (ii) data sets might have a high volume making a<br />
pair-wise comparison of all records infeasible. To tackle the second problem, many<br />
algorithms have been suggested that partition the data set and compare all record<br />
pairs only within each partition. One well-known such approach is the Sorted Neighborhood<br />
Method (SNM), which sorts the data according to some key and then advances<br />
a window over the data comparing only records that appear within the same<br />
window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that<br />
uses a varying window size. It is based on the intuition that there might be regions of<br />
high similarity suggesting a larger window size and regions of lower similarity suggesting<br />
a smaller window size. Next to the basic variant of DCS, we also propose and<br />
thoroughly evaluate a variant called DCS++ which is provably better than the original<br />
SNM in terms of efficiency (same results with fewer comparisons).<br />
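The fixed-window Sorted Neighborhood Method described in this abstract can be sketched roughly as follows. This is a minimal illustration of the baseline SNM only, not of DCS or DCS++, whose window-adaptation rules the abstract does not spell out. The function name `sorted_neighborhood`, the lowercase sorting key, and the `difflib`-based similarity test with a 0.8 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Any, Callable, Sequence, Set, Tuple


def sorted_neighborhood(
    records: Sequence[str],
    key: Callable[[str], Any],
    is_duplicate: Callable[[str, str], bool],
    window: int = 3,
) -> Tuple[Set[Tuple[int, int]], int]:
    """Return (pairs of duplicate record indices, number of comparisons made)."""
    # Step 1: sort record indices by the chosen key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: Set[Tuple[int, int]] = set()
    comparisons = 0
    # Step 2: slide a window of fixed size over the sorted order; each record
    # is compared only with the next (window - 1) records.
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            comparisons += 1
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs, comparisons


def similar(a: str, b: str) -> bool:
    # Hypothetical similarity measure: character-level ratio above 0.8.
    return SequenceMatcher(None, a, b).ratio() >= 0.8


names = ["Jon Smith", "Anna Lee", "John Smith", "Ana Lee", "Bob Ray"]
found, cost = sorted_neighborhood(names, key=str.lower, is_duplicate=similar, window=2)
print(sorted(found), cost)
```

With five records, a window of 2 costs 4 comparisons, against 10 for comparing every pair. The price is that duplicates which sort far apart under the chosen key are missed.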
Efficient Dual-Resolution Layer Indexing for Top-k Queries<br />
Jongwuk Lee (Pohang University of Science and Technology (POSTECH))<br />
Hyunsouk Cho (Pohang University of Science and Technology (POSTECH))<br />
Seung-won Hwang (Pohang University of Science and Technology (POSTECH))<br />
Top-k queries have gained considerable attention as an effective means for narrowing<br />
down the overwhelming amount of data. This paper studies the problem<br />
of constructing an indexing structure that efficiently supports top-k queries for<br />
varying scoring functions and retrieval sizes. The existing work can be categorized<br />
into three classes: list-, layer-, and view-based approaches. This paper focuses on<br />
the layer-based approach, pre-materializing tuples into consecutive multiple layers.<br />
The layer-based index enables us to return top-k answers efficiently by restricting<br />
access to tuples in the k layers. However, we observe that the number of tuples<br />
accessed in each layer can be reduced further. For this purpose, we propose a dual-resolution<br />
layer structure. Specifically, we iteratively build coarse-level layers using<br />
skylines, and divide each coarse-level layer into fine-level sublayers using convex<br />
skylines. The dual-resolution layer is able to leverage not only the dominance relationship<br />
between coarse-level layers, named forall-dominance, but also a relaxed<br />
dominance relationship between fine-level sublayers, named exists-dominance. Our<br />
extensive evaluation results demonstrate that our proposed method significantly<br />
reduces the number of tuples accessed compared with the state-of-the-art methods.<br />
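The coarse-level step mentioned in this abstract, iteratively building layers from skylines, can be illustrated with a small sketch. It assumes that smaller attribute values are preferred and uses a simple quadratic skyline computation. The fine-level split into convex-skyline sublayers is not shown, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: no worse in every dimension, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Points that are not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_layers(points: Sequence[Point]) -> List[List[Point]]:
    """Peel skylines repeatedly: layer 1 is the skyline, layer 2 the skyline of the rest, ..."""
    remaining = list(points)
    layers: List[List[Point]] = []
    while remaining:
        layer = skyline(remaining)
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers


data = [(1, 5), (2, 2), (5, 1), (3, 4), (4, 3), (5, 5)]
for depth, layer in enumerate(skyline_layers(data), start=1):
    print(depth, layer)
```

Every tuple in layer i+1 is dominated by at least one tuple in layer i, which is the property that lets a layer-based index stop early.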
Evaluating Probabilistic Queries over Uncertain Matching<br />
Reynold Cheng (The University of Hong Kong)<br />
Jian Gong (The University of Hong Kong)<br />
David W. Cheung (The University of Hong Kong)<br />
Jiefeng Cheng (Shenzhen Institute of Advanced Technology)<br />
A matching between two database schemas, generated by machine learning<br />
techniques (e.g., COMA++), is often uncertain. Handling the uncertainty of schema<br />
matching has recently raised a lot of research interest, because the quality of applications<br />
relies on the matching result. We study query evaluation over an inexact<br />
schema matching, which is represented as a set of “possible mappings”, as well<br />
as the probabilities that they are correct. Since the number of possible mappings<br />
can be large, evaluating queries through these mappings can be expensive. By<br />
observing the fact that the possible mappings between two schemas often exhibit<br />
a high degree of overlap, we develop two efficient solutions. We also present a fast<br />
algorithm to compute answers with the k highest probabilities. An extensive evaluation<br />
on real schemas shows that our approaches improve the query performance by<br />
almost an order of magnitude.<br />
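To make the setting concrete, the sketch below shows a naive baseline for a single-attribute projection query: evaluate the query once per possible mapping, add up the probabilities of the mappings that produce each answer, and keep the k most probable answers. This is the expensive approach the abstract sets out to improve on, not the authors' solutions. The attribute names, the data, and the function name `top_k_answers` are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]
Mapping = Dict[str, str]  # target attribute -> source attribute


def top_k_answers(
    source_rows: List[Row],
    mappings: List[Tuple[float, Mapping]],
    target_attr: str,
    k: int,
) -> List[Tuple[str, float]]:
    """Project target_attr through every possible mapping and rank answers by probability."""
    answer_prob: Dict[str, float] = defaultdict(float)
    for prob, mapping in mappings:
        source_attr = mapping[target_attr]
        # An answer counts once per mapping, however many rows produce it.
        for value in {row[source_attr] for row in source_rows}:
            answer_prob[value] += prob
    ranked = sorted(answer_prob.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


rows = [{"home_phone": "555-0100", "office_phone": "555-0199", "fax": "555-0142"}]
possible_mappings = [
    (0.5, {"phone": "home_phone", "name": "full_name"}),
    (0.3, {"phone": "office_phone", "name": "full_name"}),
    (0.2, {"phone": "home_phone", "name": "nickname"}),
]
print(top_k_answers(rows, possible_mappings, "phone", k=1))
```

The first and third mappings agree on `phone`, so the query is evaluated twice for the same result. Overlap of this kind is what the abstract says the proposed solutions exploit.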
SESSION 24: SENSORS NETWORK AND TRAJECTORY<br />
Detecting Outliers in Sensor Networks using the Geometric Approach<br />
Sabbas Burdakis (Technical University of Crete)<br />
Antonios Deligiannakis (Technical University of Crete)<br />
The topic of outlier detection in sensor networks has received significant attention<br />
in recent years. Detecting when the measurements of a node become “abnormal”<br />
is interesting, because this event may help detect either a malfunctioning node, or a<br />
node that starts observing a local interesting phenomenon (e.g., a fire). In this paper<br />
we present a new algorithm for detecting outliers in sensor networks, based on the<br />
geometric approach. Unlike prior work, our algorithms perform a distributed monitoring<br />
of outlier readings, exhibit 100% accuracy in their monitoring (assuming no<br />
message losses), and require the transmission of messages only at a fraction of the<br />
epochs, thus allowing nodes to safely refrain from transmitting in many epochs. Our<br />
approach is based on transforming common similarity metrics in a way that admits<br />
the application of the recently proposed geometric approach. We then propose<br />
a general framework and suggest multiple modes of operation, which allow each<br />
sensor node to accurately monitor its similarity to other nodes. Our experiments<br />
demonstrate that our algorithms can accurately detect outliers at a fraction of the<br />
communication cost that a centralized approach would require (even in the case<br />
where the central node lies just one hop away from all sensor nodes). Moreover, we<br />
demonstrate that these bandwidth savings become even larger as we incorporate<br />
further optimizations in our proposed modes of operation.<br />
Efficient Threshold Monitoring for Distributed Probabilistic Data<br />
Mingwang Tang (University of Utah)<br />
Feifei Li (University of Utah)<br />
Jeff M. Phillips (University of Utah)<br />
Jeffrey Jestes (University of Utah)<br />
In distributed data management, a primary concern is monitoring the distributed<br />
data and generating an alarm when a user-specified constraint is violated. A particularly<br />
useful instance is the threshold-based constraint, which is commonly known<br />
as the distributed threshold monitoring problem. This work extends this useful and<br />
fundamental study to distributed probabilistic data that emerge in a lot of applications,<br />
where uncertainty naturally exists when massive amounts of data are<br />
produced at multiple sources in distributed, networked locations. Examples include<br />
distributed observing stations, large sensor fields, geographically separate scientific<br />
institutes/units and many more. When dealing with probabilistic data, there<br />
are two thresholds involved, the score and the probability thresholds. One must<br />
monitor both simultaneously; as such, techniques developed for deterministic data<br />
are no longer directly applicable. This work presents a comprehensive study of this<br />
problem. Our algorithms have significantly outperformed the baseline method in<br />
terms of both the communication cost (number of messages and bytes) and the<br />
running time, as shown by an extensive experimental evaluation using several large<br />
real datasets.<br />
Incorporating Duration Information for Trajectory Classification<br />
Dhaval Patel (National University of Singapore)<br />
Chang Sheng (DBS Bank)<br />
Wynne Hsu (National University of Singapore)<br />
Mong Li Lee (National University of Singapore)<br />
Trajectory classification has many useful applications. Existing works on trajectory<br />
classification do not consider the duration information of trajectories. In this<br />
paper, we extract duration-aware features from trajectories to build a classifier. Our<br />
method utilizes information theory to obtain regions where the trajectories have<br />
similar speeds and directions. Further, trajectories are summarized into a network<br />
based on the MDL principle that takes into account the duration difference among<br />
trajectories of different classes. A graph traversal is performed on this trajectory<br />
network to obtain the top-k covering path rules for each trajectory. Based on the<br />
discovered regions and top-k path rules, we build a classifier to predict the class<br />
labels of new trajectories. Experimental results on real-world datasets show that the<br />
proposed duration-aware classifier can obtain higher classification accuracy than<br />
the state-of-the-art trajectory classifier.<br />
Reducing Uncertainty of Low-Sampling-Rate Trajectories<br />
Kai Zheng (The University of Queensland)<br />
Yu Zheng (Microsoft Research Asia)<br />
Xing Xie (Microsoft Research Asia)<br />
Xiaofang Zhou (The University of Queensland)<br />
The increasing availability of GPS-embedded mobile devices has given rise to a new<br />
spectrum of location-based services, which have accumulated a huge collection of<br />
location trajectories. In practice, a large portion of these trajectories have a low sampling rate.<br />
For instance, the time interval between consecutive GPS points of<br />
some trajectories can be several minutes or even hours. With such a low sampling<br />
rate, most details of their movement are lost, which makes them difficult to process<br />
effectively. In this work, we investigate how to reduce the uncertainty in such<br />
trajectories. Specifically, given a low-sampling-rate trajectory, we aim to infer its<br />
possible routes. The methodology adopted in our work is to take full advantage<br />
of the rich information extracted from the historical trajectories. We propose a<br />
systematic solution, History-based Route Inference System (HRIS), which covers a<br />
series of novel algorithms that can derive the travel pattern from historical data and<br />
incorporate it into the route inference process. To validate the effectiveness of the<br />
system, we apply our solution to the map-matching problem, which is an important<br />
application scenario of this work, and conduct extensive experiments on a real<br />
taxi trajectory dataset. The experimental results demonstrate that HRIS can achieve<br />
higher accuracy than the existing map-matching algorithms for low-sampling-rate<br />
trajectories.<br />
SESSION 25: ERROR REDUCTION AND DATA SECURITY<br />
Efficient Similarity Search over Encrypted Data<br />
Mehmet Kuzu (The University of Texas at Dallas)<br />
Mohammad Saiful Islam (The University of Texas at Dallas)<br />
Murat Kantarcioglu (The University of Texas at Dallas)<br />
In recent years, due to the appealing features of cloud computing, large amounts<br />
of data have been stored in the cloud. Although cloud-based services offer many<br />
advantages, privacy and security of the sensitive data is a big concern. To mitigate<br />
the concerns, it is desirable to outsource sensitive data in encrypted form. Encrypted<br />
storage protects the data against illegal access, but it complicates some basic,<br />
yet important functionality such as the search on the data. To achieve search over<br />
encrypted data without compromising the privacy, a considerable number of searchable<br />
encryption schemes have been proposed in the literature. However, almost all<br />
of them handle exact query matching but not similarity matching, a crucial requirement<br />
for real-world applications. Although some sophisticated secure multi-party<br />
computation based cryptographic techniques are available for similarity tests, they<br />
are computationally intensive and do not scale for large data sources. In this paper,<br />
we propose an efficient scheme for similarity search over encrypted data. To do so,<br />
we utilize a state-of-the-art algorithm for fast near neighbor search in high-dimensional<br />
spaces called locality sensitive hashing. To ensure the confidentiality of the<br />
sensitive data, we provide a rigorous security definition and prove the security of<br />
the proposed scheme under the provided definition. In addition, we provide a real<br />
world application of the proposed scheme and verify the theoretical results with<br />
empirical observations on a real dataset.<br />
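Locality sensitive hashing, the building block named in this abstract, can be sketched on plaintext data as follows. The example uses MinHash signatures split into bands, which is one common LSH family for set similarity. The abstract does not say which family the scheme uses, and the encryption layer is omitted entirely. All function names and parameters (`bands`, `rows`) are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: Set[str], num_hashes: int) -> Tuple[int, ...]:
    """One minimum per seeded hash function; tokens must be non-empty."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def build_index(items: Dict[Hashable, Set[str]], bands: int = 10, rows: int = 2):
    """Place each item into one bucket per band of its signature."""
    index = defaultdict(set)
    for item_id, tokens in items.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            index[(b, sig[b * rows : (b + 1) * rows])].add(item_id)
    return index


def candidates(index, tokens: Set[str], bands: int = 10, rows: int = 2) -> Set[Hashable]:
    """Items sharing at least one bucket with the query; these still need verification."""
    sig = minhash_signature(tokens, bands * rows)
    found: Set[Hashable] = set()
    for b in range(bands):
        found |= index.get((b, sig[b * rows : (b + 1) * rows]), set())
    return found


docs = {
    "d1": {"cloud", "storage", "encrypted", "search"},
    "d2": {"cloud", "storage", "encrypted", "index"},
    "d3": {"sensor", "trajectory", "outlier"},
}
lsh_index = build_index(docs)
print(candidates(lsh_index, {"cloud", "storage", "encrypted", "search"}))
```

Items with high Jaccard similarity to the query are likely, though not guaranteed, to share a bucket, while items with no tokens in common almost never do. Only bucket lookups are needed at query time, which is why the technique scales.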
Obfuscating the Topical Intention in Enterprise Text Search<br />
HweeHwa Pang (Singapore Management University)<br />
Xiaokui Xiao (Nanyang Technological University)<br />
Jialie Shen (Singapore Management University)<br />
The text search queries in an enterprise can reveal the users’ topic of interest, and<br />
in turn confidential staff or business information. To safeguard the enterprise from<br />
consequences arising from a disclosure of the query traces, it is desirable to obfuscate<br />
the true user intention from the search engine, without requiring it to be re-engineered.<br />
In this paper, we advocate a unique approach to profile the topics that<br />
are relevant to the user intention. Based on this approach, we introduce an<br />
$(\epsilon_1, \epsilon_2)$-privacy model that allows a user to stipulate that topics relevant to her<br />
intention at $\epsilon_1$ level should appear to any adversary to be innocuous at<br />
$\epsilon_2$ level. We then present a TopPriv algorithm to achieve the customized<br />
$(\epsilon_1, \epsilon_2)$-privacy requirement of individual users through injecting automatically<br />
formulated fake queries. The advantages of TopPriv over existing techniques are<br />
confirmed through benchmark queries on a real corpus, with experiment settings<br />
fashioned after an enterprise search application.<br />
Correlation Support for Risk Evaluation in Databases<br />
Katrin Eisenreich (SAP Research)<br />
Jochen Adamek (Technische Universität Berlin)<br />
Philipp Rösch (SAP Research)<br />
Volker Markl (Technische Universität Berlin)<br />
Gregor Hackenbroich (SAP Research)<br />
Investigating potential dependencies in data and their effect on future business<br />
developments can help experts to prevent misestimations of risks and chances. This<br />
makes correlation a highly important factor in risk analysis tasks. Previous research<br />
on correlation in uncertain data management addressed foremost the handling of<br />
dependencies between discrete rather than continuous distributions. Also, none of<br />
the existing approaches provides a clear method for extracting correlation structures<br />
from data and introducing assumptions about correlation to independently<br />
represented data. To enable risk analysis under correlation assumptions, we use<br />
an approximation technique based on copula functions. This technique enables<br />
analysts to introduce arbitrary correlation structures between arbitrary distributions<br />
and calculate relevant measures over thus correlated data. The correlation information<br />
can either be extracted at runtime from historic data or be accessed from a<br />
parametrically precomputed structure. We discuss the construction, application and<br />
querying of approximate correlation representations for different analysis tasks. Our<br />
experiments demonstrate the efficiency and accuracy of the proposed approach,<br />
and point out several possibilities for optimization.<br />
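The idea of introducing a correlation structure between arbitrary distributions through a copula can be illustrated with a short sampling sketch. It assumes a Gaussian copula with a single correlation parameter `rho`, and uses exponential and uniform marginals as examples. The abstract does not state which copula families or representations the authors use, so every name and parameter here is hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import List, Tuple

_STD_NORMAL = NormalDist()


def gaussian_copula_pairs(n: int, rho: float, rng: random.Random) -> List[Tuple[float, float]]:
    """Draw n pairs (u1, u2) in (0, 1) whose dependence follows a Gaussian copula."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map correlated normals to uniforms; clamp to keep inverse CDFs finite.
        u1 = min(max(_STD_NORMAL.cdf(z1), 1e-12), 1.0 - 1e-12)
        u2 = min(max(_STD_NORMAL.cdf(z2), 1e-12), 1.0 - 1e-12)
        pairs.append((u1, u2))
    return pairs


def exponential_inverse_cdf(u: float, rate: float) -> float:
    return -math.log(1.0 - u) / rate


def uniform_inverse_cdf(u: float, low: float, high: float) -> float:
    return low + u * (high - low)


def correlated_samples(n: int, rho: float, seed: int = 0) -> List[Tuple[float, float]]:
    """Exponential(rate=0.5) and Uniform(10, 20) marginals joined by the copula."""
    rng = random.Random(seed)
    return [
        (exponential_inverse_cdf(u1, 0.5), uniform_inverse_cdf(u2, 10.0, 20.0))
        for u1, u2 in gaussian_copula_pairs(n, rho, rng)
    ]


def pearson(xs: List[float], ys: List[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


samples = correlated_samples(5000, rho=0.8, seed=1)
print(round(pearson([s[0] for s in samples], [s[1] for s in samples]), 2))
```

The marginals are applied only after the dependence has been fixed on the uniform scale, so the same copula can join any pair of distributions with an invertible CDF.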
A Game-Theoretic Approach for High-Assurance of Data Trustworthiness<br />
in Sensor Networks<br />
Hyo-Sang Lim (Purdue University & Computer and Telecommunications Engineering<br />
Division, South Korea)<br />
Gabriel Ghinita (University of Massachusetts at Boston)<br />
Elisa Bertino (Purdue University)<br />
Murat Kantarcioglu (University of Texas at Dallas)<br />
Sensor networks are being increasingly deployed in many application domains<br />
ranging from environment monitoring to supervising critical infrastructure systems<br />
(e.g., the power grid). Due to their ability to continuously collect large amounts of<br />
data, sensor networks represent a key component in decision-making, enabling<br />
timely situation assessment and response. However, sensors deployed in hostile environments<br />
may be subject to attacks by adversaries who intend to inject false data<br />
into the system. In this context, data trustworthiness is an important concern, as<br />
false readings may result in wrong decisions with serious consequences (e.g., large-scale<br />
power outages). To defend against this threat, it is important to establish trust<br />
levels for sensor nodes and adjust node trustworthiness scores to account for malicious<br />
interferences. In this paper, we develop a game-theoretic defense strategy<br />
to protect sensor nodes from attacks and to guarantee a high level of trustworthiness<br />
for sensed data. We use a discrete time model, and we consider that there is a<br />
limited attack budget that bounds the capability of the attacker in each round. The<br />
defense strategy objective is to ensure that sufficient sensor nodes are protected in<br />
each round such that the discrepancy between the value accepted and the truthful<br />
sensed value is below a certain threshold. We model the attack-defense interaction<br />
as a Stackelberg game, and we derive the Nash equilibrium condition that is sufficient<br />
to ensure that the sensed data are truthful within a nominal error bound. We<br />
implement a prototype of the proposed strategy and we show through extensive<br />
experiments that our solution provides an effective and efficient way of protecting<br />
sensor networks from attacks.<br />
INDUSTRIAL SESSION 1:<br />
SUPPORT FOR LARGE SCALE DATA ANALYTICS<br />
Exploiting Common Subexpressions for Cloud Query Processing<br />
Yasin N. Silva (Arizona State University)<br />
Per-Ake Larson (Microsoft Research)<br />
Jingren Zhou (Microsoft Corp.)<br />
Many companies now routinely run massive data analysis jobs – expressed in some<br />
scripting language – on large clusters of low-end servers. Many analysis scripts are<br />
complex and contain common subexpressions, that is, intermediate results that are<br />
subsequently joined and aggregated in multiple different ways. Applying conventional<br />
optimization techniques to such scripts will produce plans that execute a<br />
common subexpression multiple times, once for each consumer, which is clearly<br />
wasteful. Moreover, different consumers may have different physical requirements<br />
on the result: one consumer may want it partitioned on a column A and another<br />
one partitioned on column B. To find a truly optimal plan, the optimizer must trade<br />
off such conflicting requirements in a cost-based manner. In this paper we show<br />
how to extend a Cascade-style optimizer to correctly optimize scripts containing<br />
common subexpressions. The approach has been prototyped in SCOPE, Microsoft’s<br />
system for massive data analysis. Experimental analysis of both simple and large<br />
real-world scripts shows that the extended optimizer produces plans with 21% to 57%<br />
lower estimated costs.<br />
Vectorwise: a Vectorized Analytical DBMS<br />
Marcin Zukowski (Actian Netherlands)<br />
Mark van de Wiel (Actian Corp.)<br />
Peter Boncz (CWI)<br />
Vectorwise is a new entrant in the analytical database marketplace whose technology<br />
comes straight from innovations in the database research community in the past<br />
years. The product has since made waves due to its excellent performance in analytical<br />
customer workloads as well as benchmarks. We describe the history of Vectorwise, as<br />
well as its basic architecture and the experiences in turning a technology developed in<br />
an academic context into a commercial-grade product. Finally, we turn our attention to<br />
recent performance results, most notably on the TPC-H benchmark at various sizes.<br />
Scalable and Numerically Stable Descriptive Statistics in SystemML<br />
Yuanyuan Tian (IBM Almaden Research Center)<br />
Shirish Tatikonda (IBM Almaden Research Center)<br />
Berthold Reinwald (IBM Almaden Research Center)<br />
There has been a growing need for applying machine learning (ML) algorithms on<br />
very large datasets. SystemML is a declarative approach to scalable statistical ML.<br />
In SystemML, statistical ML algorithms are expressed as simple scripts in a high-level<br />
language. SystemML then compiles and optimizes the scripts, and eventually<br />
translates them into an efficient runtime on MapReduce. As the basis of virtually<br />
every quantitative analysis, descriptive statistics provide powerful tools to explore<br />
data in SystemML. This paper describes our experience in implementing descriptive<br />
statistics in SystemML. In particular, we elaborate on how to overcome the two<br />
major challenges: (1) numerical stability while operating on large datasets in the<br />
distributed setting of MapReduce; (2) efficient implementation of order statistics in<br />
MapReduce.<br />
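As an illustration of the first challenge, the sketch below shows one well-known way to compute a mean and variance stably when the data is split across partitions: each partition is summarized with Welford's incremental update, and the partial summaries are combined with the pairwise merge formula of Chan et al. This is an assumption made for illustration only; the abstract does not say which algorithms SystemML uses, and the names `Moments`, `add_value`, `merge`, `summarize` and `variance` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Moments:
    """Mergeable summary of a partition: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def add_value(s: Moments, x: float) -> Moments:
    # Welford's update: avoids subtracting two large, nearly equal sums.
    n = s.n + 1
    delta = x - s.mean
    mean = s.mean + delta / n
    m2 = s.m2 + delta * (x - mean)
    return Moments(n, mean, m2)


def merge(a: Moments, b: Moments) -> Moments:
    # Pairwise combination of two partial summaries (the "reduce" side).
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def summarize(partition) -> Moments:
    # The "map" side: summarize one partition independently of the others.
    return reduce(add_value, partition, Moments())


def variance(s: Moments) -> float:
    # Sample variance; undefined for fewer than two values.
    return s.m2 / (s.n - 1) if s.n > 1 else float("nan")


# Values with a large common offset, where the naive sum-of-squares formula loses precision.
partitions = [[1e9 + 4.0, 1e9 + 7.0], [1e9 + 13.0, 1e9 + 16.0]]
total = reduce(merge, map(summarize, partitions), Moments())
print(total.n, total.mean, variance(total))
```

Because `merge` is associative, the partial summaries can be combined in any grouping, which is what makes this style of summary a natural fit for a map and reduce split.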
Industrial Session 2:<br />
Evolving Platforms for New Applications<br />
Earlybird: Real-Time Search at Twitter<br />
Michael Busch (Twitter)<br />
Krishna Gade (Twitter)<br />
Brian Larson (Twitter)<br />
Patrick Lok (Twitter)<br />
Samuel Luckenbill (Twitter)<br />
Jimmy Lin (Twitter)<br />
The web today is increasingly characterized by social and real-time signals, which<br />
we believe represent two frontiers in information retrieval. In this paper, we present<br />
Earlybird, the core retrieval engine that powers Twitter’s real-time search service.<br />
Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval<br />
engines, its index structures differ from those built to support traditional web<br />
search. We describe these differences and present the rationale behind our design.<br />
A key requirement of real-time search is the ability to ingest content rapidly and<br />
make it searchable immediately, while concurrently supporting low-latency, high-throughput<br />
query evaluation. These demands are met with a single-writer, multiple-reader<br />
concurrency model and the targeted use of memory barriers. Earlybird represents<br />
a point in the design space of real-time search engines that has worked well<br />
for Twitter’s needs. By sharing our experiences, we hope to spur additional interest<br />
and innovation in this exciting space.<br />
Data Infrastructure at LinkedIn<br />
LinkedIn Data Infrastructure Team<br />
LinkedIn is among the largest social networking sites in the world. As the company<br />
has grown, our core data sets and request processing requirements have grown as<br />
well. In this paper, we describe a few selected data infrastructure projects at LinkedIn<br />
that have helped us accommodate this increasing scale. Most of those projects<br />
build on existing open source projects and are themselves available as open source.<br />
The projects covered in this paper include: (1) Voldemort: a scalable and fault tolerant<br />
key-value store; (2) Databus: a framework for delivering database changes to<br />
downstream applications; (3) Espresso: a distributed data store that supports flexible<br />
schemas and secondary indexing; (4) Kafka: a scalable and efficient messaging<br />
system for collecting various user activity events and log data.<br />
The Credit Suisse Meta-data Warehouse<br />
Claudio Jossen (Credit Suisse AG)<br />
Lukas Blunschi (ETH Zurich)<br />
Magdalini Mori (Credit Suisse AG)<br />
Donald Kossmann (ETH Zurich)<br />
Kurt Stockinger (Credit Suisse AG)<br />
This paper describes the meta-data warehouse of Credit Suisse, which has been in production<br />
since 2009. Like most other large organizations, Credit Suisse has a complex<br />
application landscape and several data warehouses in order to meet the information<br />
needs of its users. The problem addressed by the meta-data warehouse is to<br />
increase the agility and flexibility of the organization with regard to changes such<br />
as the development of a new business process, a new business analytics report, or<br />
the implementation of a new regulatory requirement. The meta-data warehouse<br />
supports these changes by providing services to search for information items in<br />
the data warehouses and to extract the lineage of information items. One difficulty<br />
in the design of such a meta-data warehouse is that there is no standard or well-known<br />
meta-data model that can be used to support such search services. Instead,<br />
the meta-data structures need to be flexible themselves and evolve with the changing<br />
IT landscape. This paper describes the current data structures and implementation<br />
of the Credit Suisse meta-data warehouse and shows how its services help to<br />
increase the flexibility of the whole organization. A series of example meta-data<br />
structures, use cases, and screenshots are given in order to illustrate the concepts<br />
used and the lessons learned based on feedback from real business and IT users<br />
within Credit Suisse.<br />
Industrial Session 3:<br />
Indexing, Updates and Processing<br />
Efficient Support of XQuery Update Facility in XML Enabled RDBMS<br />
Zhen Hua Liu (Oracle)<br />
Hui J. Chang (Oracle)<br />
Balasubramanyam Sthanikam (Oracle)<br />
XQuery Update Facility (XQUF), which provides a declarative way of updating<br />
XML, has become a W3C recommendation. The SQL/XML standard, on the other<br />
hand, defines XMLType as a column data type in the RDBMS environment and defines<br />
standard SQL/XML operators, such as XMLQuery(), to embed XQuery for querying<br />
XMLType columns in an RDBMS. Based on this SQL/XML standard, XML enabled<br />
RDBMSs have become industrial strength platforms for hosting XML applications in a<br />
standard-compliant way by providing XML store and query capability. However, support<br />
for updating XML remained proprietary in RDBMSs until XQUF became<br />
a recommendation. XQUF is agnostic of how XML is stored, so the propagation<br />
of actual updates to any persistent XML store is beyond the scope of XQUF. In this<br />
paper, we show how XQUF can be incorporated into XMLQuery() to effectively<br />
update XML stored in an XMLType column in an XML enabled RDBMS,<br />
such as Oracle XMLDB. We present various compile time and run time optimisation<br />
techniques to show how XQUF can be efficiently implemented to declaratively<br />
update XML stored in an RDBMS. We present our approaches to optimising XQUF<br />
for common physical XML storage models: the native binary XML storage model and<br />
the relational decomposition of XML storage model. Although our study is done using<br />
Oracle XMLDB, all of the presented optimisation techniques are generic to XML<br />
stores that need to support update of persistent XML and are not specific to<br />
the Oracle XMLDB implementation.<br />
Making Unstructured Data SPARQL Using Semantic Indexing in Oracle<br />
Database<br />
Souripriya Das (Oracle)<br />
Seema Sundara (Oracle)<br />
Matthew Perry (Oracle)<br />
Jagannathan Srinivasan (Oracle)<br />
Jayanta Banerjee (Oracle)<br />
Aravind Yalamanchi (Oracle)<br />
This paper describes the Semantic Indexing feature introduced in Oracle Database<br />
for indexing unstructured text (document) columns. This capability enables searching<br />
for concepts (such as people, places, organizations, and events), in addition to<br />
words or phrases, with further options for sense disambiguation and term expansion<br />
by consulting knowledge captured in OWL/RDF ontologies. The distinguishing<br />
aspects of our approach are: 1) Indexing: Instead of building a traditional inverted<br />
index of (annotated) token and/or named entity occurrences, we extract the entities,<br />
associations, and events present in a text column's data and store them as RDF<br />
named graphs in the Oracle Database Semantic Store. This base content can be<br />
further augmented with knowledge bases and inferred triples (obtained by applying<br />
domain-specific ontologies and rulebases). 2) Querying: Instead of relying on<br />
proprietary extensions for specifying a search, we allow users to specify a complete<br />
SPARQL query pattern that can capture arbitrarily complex relationships between<br />
query terms. We have implemented this feature by introducing a sem_contains<br />
SQL operator and the associated sem_indextype indexing scheme. The indexing<br />
scheme employs an extensible architecture that supports indexing of unstructured<br />
text using native as well as third party text extraction tools. The paper presents a<br />
model for the semantic index and querying, describes the feature, and outlines its<br />
implementation leveraging Oracle’s native support for RDF/OWL storage, inferencing,<br />
and querying. We also report a study involving use of this feature on a TREC<br />
collection of over 130,000 news articles.<br />
A meta-language for MDX queries in eLog Business Solution<br />
Sonia Bergamaschi (University of Modena and Reggio Emilia)<br />
Matteo Interlandi (University of Modena and Reggio Emilia)<br />
Mario Longo (eBilling S.p.A.)<br />
Laura Po (University of Modena and Reggio Emilia)<br />
Maurizio Vincini (University of Modena and Reggio Emilia)<br />
The adoption of business intelligence technology in industries is growing rapidly.<br />
Business managers are not satisfied with ad hoc and static reports and they ask for<br />
more flexible and easy to use data analysis tools. Recently, application interfaces<br />
that expand the range of operations available to the user, hiding the underlying<br />
complexity, have been developed. The paper presents eLog, a business intelligence<br />
solution designed and developed in collaboration between the database group of<br />
the University of Modena and Reggio Emilia and eBilling, an Italian SME supplier of<br />
solutions for the design, production and automation of documentary processes for<br />
top Italian companies. eLog enables business managers to define OLAP reports by<br />
means of a web interface and to customize analysis indicators adopting a simple<br />
meta-language. The framework translates the user’s reports into MDX queries and<br />
is able to automatically select the data cube suitable for each query. Over 140<br />
medium and large companies have exploited the technological services of eBilling<br />
S.p.A. to manage their document flows. In particular, eLog services have been used<br />
by the major Italian media and telecommunications companies and their foreign<br />
affiliates, such as Sky, Mediaset, H3G, Tim Brazil, etc. The largest customer can provide<br />
up to 30 million mail pieces within 6 months (about 200 GB of data in the relational<br />
DBMS). In a period of 18 months, eLog could have to handle 150 million mail pieces (1<br />
TB of data).<br />
Demo Group 1:<br />
SMIX Live – A Self-Managing Index Infrastructure for Dynamic Workloads<br />
Thomas Kissinger (Dresden University of Technology)<br />
Hannes Voigt (Dresden University of Technology)<br />
Wolfgang Lehner (Dresden University of Technology)<br />
As databases accumulate growing amounts of data at an increasing rate, adaptive<br />
indexing becomes more and more important. At the same time, applications and<br />
their use get more agile and flexible, resulting in less steady and less predictable<br />
workload characteristics. Being inert and coarse-grained, state-of-the-art index tuning<br />
techniques become less useful in such environments. In particular, the full-column<br />
indexing paradigm results in a lot of indexed but never queried data and prohibitively<br />
high memory and maintenance costs. In our demonstration, we present Self-Managing<br />
Indexes, a novel, adaptive, fine-grained, autonomous indexing infrastructure.<br />
At its core, our approach builds on a novel access path that automatically collects<br />
useful index information, discards useless index information, and competes with<br />
its kind for resources to host its index information. Compared to existing technologies<br />
for adaptive indexing, we are able to dynamically grow and shrink our indexes,<br />
instead of incrementally enhancing the index granularity. In the demonstration, we<br />
visualize performance and system measures for different scenarios and allow the<br />
user to interactively change several system parameters.<br />
Multi-Query Stream Processing on FPGAs<br />
Mohammad Sadoghi (University of Toronto)<br />
Rija Javed (University of Toronto)<br />
Naif Tarafdar (University of Toronto)<br />
Harsh Singh (University of Toronto)<br />
Rohan Palaniappan (University of Toronto)<br />
Hans-Arno Jacobsen (University of Toronto)<br />
We present an efficient multi-query event stream platform to support query processing<br />
over high-frequency event streams. Our platform is built over reconfigurable<br />
hardware (FPGAs) to achieve line-rate multi-query processing by exploiting<br />
unprecedented degrees of parallelism and potential for pipelining, only available<br />
through custom-built, application-specific and low-level logic design. Moreover, a<br />
multi-query event stream processing engine is at the core of a wide range of applications<br />
including real-time data analytics, algorithmic trading, targeted advertisement,<br />
and (complex) event processing.<br />
EUDEMON: A System for Online Video Frame Copy Detection by Earth<br />
Mover’s Distance<br />
Jia Xu (Northeastern University, China)<br />
Qiushi Bai (Northeastern University, China)<br />
Yu Gu (Northeastern University, China)<br />
Anthony K.H. Tung (National University of Singapore)<br />
Guoren Wang (Northeastern University, China)<br />
Ge Yu (Northeastern University, China)<br />
Zhenjie Zhang (Advanced Digital Sciences Center, Illinois at Singapore Pte.)<br />
The Earth Mover’s Distance, or EMD for short, has been proven to be effective for<br />
content-based image retrieval. However, due to the cubic complexity of EMD computation,<br />
it remains difficult to use EMD in applications with stringent efficiency<br />
requirements. In this paper, we present our new system, called EUDEMON, which<br />
utilizes new techniques to support fast Online Video Frame Copy Detection based<br />
on the EMD. Given a group of registered frames as queries and a set of targeted<br />
detection videos, EUDEMON is capable of identifying relevant frames from the<br />
video stream in real time. The significant improvement in efficiency mainly relies<br />
on the primal-dual theory in linear programming and well-designed B+ tree filters<br />
for adaptive candidate pruning. Generally speaking, our system includes a variety<br />
of new features crucial to the deployment of EUDEMON in real applications. First,<br />
EUDEMON achieves high throughput even when a large number of queries are registered<br />
in the system. Second, EUDEMON contains a self-optimization component to<br />
automatically enhance the effectiveness of the filters based on the recent content<br />
of the video stream. Finally, EUDEMON provides a user-friendly visualization interface,<br />
named EMD Flow Chart, to help users better understand an alarm from<br />
the perspective of the EMD.<br />
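To make the distance itself concrete, the sketch below computes the EMD for a deliberately simple special case: two one-dimensional histograms with equal total mass and unit distance between adjacent bins, where the EMD reduces to the sum of absolute differences between the running (prefix) sums. This is not the general linear-programming formulation that the abstract refers to, nor EUDEMON's filtering technique; the function name `emd_1d` is hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two equal-mass 1-D histograms, with distance 1 between adjacent bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    # The mass that has to cross the boundary after bin i equals the
    # difference between the two prefix sums at i; each crossing costs 1.
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))


# Moving one unit of mass from bin 0 to bin 2 costs two steps.
print(emd_1d([1, 0, 0], [0, 0, 1]))
```

In this one-dimensional case the cost is linear in the number of bins; the general EMD over arbitrary ground distances is the expensive case described in the abstract.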
A Dataset Search Engine for the Research Document Corpus<br />
Meiyu Lu (National University of Singapore)<br />
Srinivas Bangalore (AT&T Labs–Research)<br />
Graham Cormode (AT&T Labs–Research)<br />
Marios Hadjieleftheriou (AT&T Labs–Research)<br />
Divesh Srivastava (AT&T Labs–Research)<br />
A key step in validating a proposed idea or system is to evaluate it over a suitable<br />
dataset. However, to date there have been no useful tools for researchers to<br />
understand which datasets have been used for what purpose, or in what prior work.<br />
Instead, they have to manually browse through papers to find suitable datasets<br />
and their corresponding URLs, which is laborious and inefficient. To better aid the<br />
dataset discovery process, and provide a better understanding of how and where<br />
datasets have been used, we propose a framework to effectively identify datasets<br />
within the scientific corpus. The key technical challenges are the identification of datasets,<br />
and the discovery of the association between a dataset and the URLs where it<br />
can be accessed. Based on this, we have built a user-friendly web-based search<br />
interface for users to conveniently explore the dataset-paper relationships, and find<br />
relevant datasets and their properties.<br />
AskFuzzy: Attractive Visual Fuzzy Query Builder<br />
Keivan Kianmehr (University of Western Ontario)<br />
Negar Koochakzadeh (University of Calgary)<br />
Reda Alhajj (University of Calgary)<br />
The user-centric query interface is a very common application that allows expressing<br />
both the input and the output using fuzzy terms. This is becoming a need in the<br />
evolving internet-based era where web-based applications are very common and<br />
the number of users accessing structured databases is increasing rapidly. Restricting<br />
the user group to only experts in query coding must be avoided. The AskFuzzy<br />
system has been developed to address this vital issue, which has social and industrial<br />
impact. It is an attractive and friendly visual user interface that facilitates<br />
expressing queries using both fuzziness and traditional methods. The fuzziness is<br />
not expressed explicitly inside the database; it is rather absorbed and effectively<br />
handled by an intermediate layer which is cleverly incorporated between the front-end<br />
visual user interface and the back-end database.<br />
F2DB: The Flash-Forward Database System<br />
Ulrike Fischer (Dresden University of Technology)<br />
Frank Rosenthal (Dresden University of Technology)<br />
Wolfgang Lehner (Dresden University of Technology)<br />
Forecasts are important to decision-making and risk assessment in many domains.<br />
Since current database systems do not provide integrated support for forecasting,<br />
it is usually done outside the database system by specially trained experts using<br />
forecast models. However, integrating model-based forecasting as a first-class<br />
citizen inside a DBMS speeds up the forecasting process by avoiding exporting the<br />
data and by applying database-related optimizations like reusing created forecast<br />
models. It especially allows subsequent processing of forecast results inside the database.<br />
In this demo, we present our prototype F2DB based on PostgreSQL, which<br />
allows for transparent processing of forecast queries. Our system automatically<br />
takes care of model maintenance when the underlying dataset changes. In addition,<br />
we offer optimizations to save maintenance costs and increase accuracy by using<br />
derivation schemes for multidimensional data. Our approach reduces the required<br />
expert knowledge by enabling arbitrary users to apply forecasting in a declarative<br />
way.<br />
Provenance-Based Debugging and Drill-Down in Data-Oriented Workflows<br />
Robert Ikeda (Stanford University)<br />
Junsang Cho (Stanford University)<br />
Charlie Fang (Stanford University)<br />
Semih Salihoglu (Stanford University)<br />
Satoshi Torikai (Stanford University)<br />
Jennifer Widom (Stanford University)<br />
Panda (for Provenance and Data) is a system that supports the creation and execution<br />
of data-oriented workflows, with automatic provenance generation and built-in<br />
provenance tracing operations. Workflows in Panda are arbitrary acyclic graphs<br />
containing both relational (SQL) processing nodes and opaque processing nodes<br />
programmed in Python. For both types of nodes, Panda generates logical provenance<br />
(provenance information stored at the processing-node level) and uses<br />
the generated provenance to support record-level backward tracing and forward<br />
tracing operations. In our demonstration we use Panda to integrate, process, and<br />
analyze actual education data from multiple sources. We specifically demonstrate<br />
how Panda’s provenance generation and tracing capabilities can be very useful for<br />
workflow debugging, and for drilling down on specific results of interest.<br />
Demo Group 2:<br />
M3: Stream Processing on Main-Memory MapReduce<br />
Ahmed M. Aly (Purdue University)<br />
Asmaa Sallam (Purdue University)<br />
Bala M. Gnanasekaran (Purdue University)<br />
Long-Van Nguyen-Dinh (Purdue University)<br />
Walid G. Aref (Purdue University)<br />
Mourad Ouzzani (Qatar Computing Research Institute)<br />
Arif Ghafoor (Purdue University)<br />
The continuous growth of social web applications along with the development of<br />
sensor capabilities in electronic devices is creating countless opportunities to analyze<br />
the enormous amounts of data that are continuously streaming from these applications<br />
and devices. To process large scale data on large scale computing clusters,<br />
MapReduce has been introduced as a framework for parallel computing. However,<br />
most of the current implementations of the MapReduce framework support only<br />
the execution of fixed-input jobs. This restriction makes these implementations<br />
inapplicable for most streaming applications, in which queries are continuous in<br />
nature, and input data streams are continuously received at high arrival rates. In<br />
this demonstration, we showcase M3, a prototype implementation of the MapReduce<br />
framework in which continuous queries over streams of data can be efficiently<br />
answered. M3 extends Hadoop, the open source implementation of MapReduce, bypassing<br />
the Hadoop Distributed File System (HDFS) to support main-memory-only<br />
processing. Moreover, M3 supports continuous execution of the Map and Reduce<br />
phases where individual Mappers and Reducers never terminate.<br />
A Deep Embedding of Queries into Ruby<br />
Torsten Grust (University of Tübingen)<br />
Manuel Mayr (University of Tübingen)<br />
We demonstrate SWITCH, a deep embedding of relational queries into Ruby and<br />
Ruby on Rails. With SWITCH, there is no syntactic or stylistic difference between<br />
Ruby programs that operate over in-memory array objects or database-resident<br />
tables, even if these programs rely on array order or nesting. SWITCH’s built-in<br />
compiler and SQL code generator guarantee to emit few queries, addressing long-standing<br />
performance problems that trace back to Rails’ ActiveRecord database<br />
binding. “Looks like Ruby, but performs like handcrafted SQL,” is the ideal that<br />
drives the research and development effort behind SWITCH.<br />
Asking the Right Questions in Crowd Data Sourcing<br />
Rubi Boim (Tel-Aviv University)<br />
Ohad Greenshpan (Tel-Aviv University)<br />
Tova Milo (Tel-Aviv University)<br />
Slava Novgorodov (Tel-Aviv University)<br />
Neoklis Polyzotis (University of California, Santa Cruz)<br />
Wang-Chiew Tan (University of California, Santa Cruz)<br />
Crowd-based data sourcing is a new and powerful data procurement paradigm that<br />
engages Web users to collectively contribute information. In this work, we target<br />
the problem of gathering data from the crowd in an economical and principled<br />
fashion. We present AskIt!, a system that allows interactive data sourcing applications<br />
to effectively determine which questions should be directed to which users<br />
for reducing the uncertainty about the collected data. AskIt! uses a set of novel<br />
algorithms for minimizing the number of probes (questions) required from the<br />
different users. We demonstrate the challenge and our solution in the context of a<br />
multiple-choice question game played by the ICDE’12 attendees, targeted to gather<br />
information on the conference’s publications, authors and colleagues.<br />
LotusX: A Position-Aware XML Graphical Search System with<br />
Auto-Completion<br />
Chunbin Lin (Renmin University of China)<br />
Jiaheng Lu (Renmin University of China)<br />
Tok Wang Ling (National University of Singapore)<br />
Bogdan Cautis (Télécom ParisTech)<br />
Formulating queries in the existing query languages for XML (e.g., XQuery) requires<br />
professional programming skills, and the complexity of these languages burdens<br />
query processing. In addition, when issuing an XML query, users are required to<br />
be familiar with the content (including the structural and textual information) of<br />
the hierarchical XML, which is difficult for common users. The need for designing<br />
user-friendly interfaces to reduce the burden of query formulation is fundamental to<br />
the growth of the XML community. We present a twig-based XML graphical search<br />
system, called LotusX, that provides a graphical interface to simplify query<br />
processing without the need to learn a query language or data schemas or to<br />
know the content of the XML document. The basic idea is that LotusX offers<br />
“position-aware” and “auto-completion” features to help users create tree-modeled<br />
queries (twig patterns) by providing the possible candidates on the fly.<br />
In addition, complex twig queries (including order-sensitive queries) are supported<br />
in LotusX. Furthermore, a new ranking strategy and a query rewriting solution are<br />
implemented to rank and rewrite the query effectively.<br />
Efficient Top-k Keyword Search in Graphs with Polynomial Delay<br />
Mehdi Kargar (York University)<br />
Aijun An (York University)<br />
A system for efficient keyword search in graphs is demonstrated. The system has<br />
two components, a search through only the nodes containing the input keywords<br />
for a set of nodes that are close to each other and together cover the input keywords,<br />
and an exploration for finding how these nodes are related to each other. The<br />
system generates all or top-k answers with polynomial delay. Answers are presented<br />
to the user according to a ranking criterion so that the answers with nodes closer to<br />
each other are presented before the ones with nodes farther away from each other.<br />
In addition, the set of answers produced by our system is duplication free. The<br />
system uses two methods for presenting the final answer to the user. The presentation<br />
methods reveal relationships among the nodes in an answer through a tree or<br />
a multi-center graph. We will show that each method has its own advantages and<br />
disadvantages. The system is demonstrated using two challenging datasets, the very<br />
large DBLP and the highly cyclic Mondial. Challenges and difficulties in implementing<br />
an efficient keyword search system are also demonstrated.<br />
TEDAS: a Twitter-based Event Detection and Analysis System<br />
Rui Li (University of Illinois at Urbana-Champaign)<br />
Kin Hou Lei (Brigham Young University)<br />
Ravi Khadiwala (University of Illinois at Urbana-Champaign)<br />
Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign)<br />
Witnessing the emergence of Twitter, we propose a Twitter-based Event Detection<br />
and Analysis System (TEDAS), which helps to (1) detect new events, (2) analyze<br />
the spatial and temporal pattern of an event, and (3) identify the importance of<br />
events. In this demonstration, we show the overall system architecture, explain in<br />
detail the implementation of the components that crawl, classify, and rank tweets<br />
and extract location from tweets, and present some interesting results of our system.<br />
AutoDict: Automated Dictionary Discovery<br />
Fei Chiang (University of Toronto)<br />
Periklis Andritsos (University of Toronto)<br />
Erkang Zhu (University of Toronto)<br />
Renee J. Miller (University of Toronto)<br />
An attribute dictionary is a set of attributes together with a set of common values<br />
of each attribute. Such dictionaries are valuable in understanding unstructured<br />
or loosely structured textual descriptions of entity collections, such as product<br />
catalogs. Dictionaries provide the supervised data for learning product or entity<br />
descriptions. In this demonstration, we will present AutoDict, a system that analyzes<br />
input data records, and discovers high quality dictionaries using information<br />
theoretic techniques. To the best of our knowledge, AutoDict is the first end-to-end<br />
system for building attribute dictionaries. Our demonstration will showcase the<br />
different information analysis and extraction features within AutoDict, and highlight<br />
the process of generating high quality attribute dictionaries.<br />
demo group 3:<br />
Trust & Share: Trusted Information Sharing in Online Social Networks<br />
Barbara Carminati (University of Insubria)<br />
Elena Ferrari (University of Insubria)<br />
Jacopo Girardi (University of Insubria)<br />
Trust & Share (T&S) aims at providing relationship-based access control in the<br />
Facebook realm. T&S is a third-party Facebook application, designed to support a<br />
flexible and controlled sharing of user data. It enables users to upload resources<br />
(i.e., any file) and specify for each of them which users have to be authorized by<br />
T&S to access them. To enforce this controlled information sharing, T&S relies on<br />
the OSN access control model proposed in \cite{tissec}, where social network relationships<br />
have richer semantics than the contacts in Facebook. According to<br />
\cite{tissec}, OSN users associate with each of their contacts a type, representing<br />
the nature of the relationship (e.g., friends, colleagues, parents). Moreover, the creator<br />
of the relationship can also assign to it a trust level to represent the strength<br />
of the connection. This graph enables users to specify more expressive rules for<br />
the controlled information sharing. Indeed, on top of this enhanced social graph,<br />
T&S users can specify access constraints on the type, trust level and depth of the<br />
relationship that must exist with a given Facebook contact in order to access a certain<br />
resource.<br />
Evaluation of Clusterings – Metrics and Visual Support<br />
Elke Achtert (Ludwig-Maximilians-Universität München)<br />
Sascha Goldhofer (Ludwig-Maximilians-Universität München)<br />
Hans-Peter Kriegel (Ludwig-Maximilians-Universität München)<br />
Erich Schubert (Ludwig-Maximilians-Universität München)<br />
Arthur Zimek (Ludwig-Maximilians-Universität München)<br />
When comparing clustering results, any evaluation metric breaks down the available<br />
information to a single number. However, many evaluation metrics exist,<br />
and they are not always concordant or easily interpretable in judging the agreement of<br />
a pair of clusterings. Here, we provide a tool to visually support the assessment of<br />
clustering results in comparing multiple clusterings. Along the way, the suitability of<br />
a couple of clustering comparison measures can be judged in different scenarios.<br />
Horton: Online Query Execution Engine For Large Distributed Graphs<br />
Mohamed Sarwat (University of Minnesota)<br />
Sameh Elnikety (Microsoft Research)<br />
Yuxiong He (Microsoft Research)<br />
Gabriel Kliot (Microsoft Research)<br />
Graphs are used in many large-scale applications, such as social networking. The<br />
management of these graphs poses new challenges as such graphs are too large<br />
for a single server to manage efficiently. Current distributed techniques such as<br />
map-reduce and Pregel are not well-suited to processing interactive ad-hoc queries<br />
against large graphs. In this paper we demonstrate Horton, a distributed interactive<br />
query execution engine for large graphs. Horton defines a query language that<br />
allows the expression of regular language reachability queries and provides a query<br />
execution engine with a query optimizer that allows interactive execution of queries<br />
on large distributed graphs in parallel. In the demo, we show the functionality of<br />
Horton managing a large graph for a social networking application called Codebook,<br />
whose graph represents data on software components, developers, development<br />
artifacts such as bug reports, and their interactions in large software projects.<br />
MXQuery With Hardware Acceleration<br />
Jens Teubner (ETH Zurich)<br />
Peter Fischer (University of Freiburg)<br />
We demonstrate MXQuery/H, a modified version of MXQuery that uses hardware<br />
acceleration to speed up XML processing. The main goal of this demonstration is to<br />
give an interactive example of hardware/software co-design and show how system<br />
performance and energy efficiency can be improved by off-loading tasks to FPGA<br />
hardware. To this end, we equipped MXQuery/H with various hooks to inspect the<br />
different parts of the system. Besides that, our system can finally leverage the<br />
idea of XML projection. Though the idea of projection had been around for a while,<br />
its effectiveness always remained limited because of the unavoidable and high parsing<br />
overhead. By performing the task in hardware, we relieve the software part of<br />
this overhead and achieve processing speed-ups of several factors.<br />
Data 3 – A Kinect Interface for OLAP using Complex Event Processing<br />
Steffen Hirte (Ilmenau University of Technology)<br />
Andreas Seifert (Ilmenau University of Technology)<br />
Stephan Baumann (Ilmenau University of Technology)<br />
Daniel Klan (Ilmenau University of Technology)<br />
Kai-Uwe Sattler (Ilmenau University of Technology)<br />
Motion sensing input devices like Microsoft’s Kinect offer an alternative to traditional<br />
computer input devices like keyboards and mice. New applications using<br />
this interface appear daily. Most of them implement their own gesture detection. In our<br />
demonstration we show a new approach using the data stream engine AnduIN. The<br />
gesture detection is done based on AnduIN’s complex event processing functionality.<br />
This way we build a system that allows users to define new and complex gestures on<br />
the basis of a declarative programming interface. On this basis our demonstration<br />
Data 3 provides a basic natural interaction OLAP interface for a sample star schema<br />
database using Microsoft’s Kinect.<br />
Analyzing Query Optimization Process: Portraits of Join<br />
Enumeration Algorithms<br />
Anisoara Nica (Sybase, An SAP Company)<br />
Ian Charlesworth (University of Waterloo)<br />
Maysum Panju (University of Waterloo)<br />
Search spaces generated by query optimizers during the optimization process<br />
encapsulate characteristics of the join enumeration algorithms, the cost models, as<br />
well as critical decisions made for pruning and choosing the best plan. We demonstrate<br />
the JoinEnumerationViewer which is a tool designed for visualizing, mining,<br />
and comparing plan search spaces generated by different join enumeration algorithms<br />
when optimizing the same SQL statement. We have enhanced the Sybase SQL Anywhere<br />
relational database management system to log, in a very compact format,<br />
its search space during an optimization process. Such an optimization log can then<br />
be analyzed by the JoinEnumerationViewer which internally builds the logical and<br />
physical plan graphs representing complete and partial plans considered during the<br />
optimization process. The optimization logs also contain statistics of the resource<br />
consumption during the query optimization such as optimization time breakdown,<br />
for example, for logical join enumeration versus costing physical plans, and memory<br />
allocation for different optimization structures. The SQL Anywhere Optimizer<br />
implements a highly adaptable, self-managing, search space generation algorithm<br />
by having several join enumeration algorithms to choose from, each enhanced with<br />
different ordering and pruning techniques. The emphasis of the demonstration will<br />
be on comparing and contrasting these join enumeration algorithms by analyzing<br />
their optimization logs. The demonstration scenarios will include optimizing<br />
SQL statements under various conditions which will exercise different algorithms,<br />
pruning and ordering techniques. These search spaces will then be visualized and<br />
compared using the JoinEnumerationViewer.<br />
DPCube: Releasing Differentially Private Data Cubes for Health Information<br />
Yonghui Xiao (Emory University)<br />
James Gardner (Digital Reasoning Systems Inc.)<br />
Li Xiong (Emory University)<br />
We propose to demonstrate DPCube, a component in our Health Information DE-identification<br />
(HIDE) framework, for releasing differentially private data cubes (or<br />
multidimensional histograms) for sensitive data. HIDE is a framework we developed<br />
for integrating heterogeneous structured and unstructured health information and<br />
provides methods for privacy preserving data publishing. The DPCube component<br />
provides the differentially private multidimensional data cube release. The DPCube<br />
algorithm uses the differentially private access mechanisms as provided by HIDE<br />
and guarantees differential privacy for the released data. It utilizes an innovative<br />
two-step multidimensional partitioning technique to publish a generalized data<br />
cube or multi-dimensional histogram that achieves good utility while satisfying the<br />
privacy requirement. We demonstrate that the released data cubes can serve as a<br />
sanitized synopsis of the raw database and, together with an optional synthesized<br />
dataset based on the data cubes, can support various Online Analytical Processing<br />
(OLAP) queries and learning tasks.<br />
demo group 4:<br />
Nyaya: a System Supporting the Uniform Management of Large Sets of<br />
Semantic Data<br />
Roberto De Virgilio (Universitá Roma Tre)<br />
Giorgio Orsi (University of Oxford)<br />
Letizia Tanca (Politecnico di Milano)<br />
Riccardo Torlone (Universitá Roma Tre)<br />
We present Nyaya, a flexible system for the management of large-scale semantic<br />
data which couples a general-purpose storage mechanism with efficient ontological<br />
query answering. Nyaya rapidly imports semantic data expressed in different<br />
formalisms into semantic data kiosks. Each kiosk exposes the native ontological<br />
constraints in a uniform fashion using Datalog±, a very general rule-based language<br />
for the representation of ontological constraints. A group of kiosks forms a semantic<br />
data market where the data in each kiosk can be uniformly accessed using conjunctive<br />
queries and where users can specify user-defined constraints over the data.<br />
Nyaya is easily extensible and robust to updates of both data and meta-data in the<br />
kiosk and can readily adapt to different logical organizations of the persistent storage.<br />
In the demonstration, we will show the capabilities of Nyaya over real-world<br />
case studies and demonstrate its efficiency over well-known benchmarks.<br />
R2DB: A System for Querying and Visualizing Weighted RDF Graphs<br />
Songling Liu (Arizona State University)<br />
Juan P. Cedeno (Arizona State University)<br />
K. Selcuk Candan (Arizona State University)<br />
Maria Luisa Sapino (University of Torino)<br />
Shengyu Huang (Arizona State University)<br />
Xinsheng Li (Arizona State University)<br />
Existing RDF query languages and RDF stores fail to support a large class of<br />
knowledge applications which associate utilities or costs with the available knowledge<br />
statements. A recent proposal includes (a) a ranked RDF (R2DF) specification<br />
to enhance RDF triples with application-specific weights and (b) a SPARankQL<br />
query language specification, which provides novel primitives on top of the<br />
SPARQL language to express top-k queries using traditional query patterns as well<br />
as novel flexible path predicates. We introduce and demonstrate R2DB, a database<br />
system for querying weighted RDF graphs. R2DB relies on the AR2Q query processing<br />
engine, which leverages novel index structures to support efficient ranked<br />
path search and includes query optimization strategies based on proximity and<br />
sub-result inter-arrival times. In addition to being the first data management system<br />
for the R2DF data model, R2DB also provides an innovative features-of-interest<br />
(FoI) based method for visualizing large sets of query results (i.e., subgraphs of the<br />
data graph).<br />
Project Daytona: Data Analytics as a Cloud Service<br />
Roger Barga (Microsoft)<br />
Jaliya Ekanayake (Microsoft Research)<br />
Wei Lu (Microsoft Research)<br />
Spreadsheets are established data collection and analysis tools in business, technical<br />
computing and academic research. Excel, for example, offers an attractive<br />
user interface, provides an easy to use data entry model, and offers substantial<br />
interactivity for what-if analysis. However, spreadsheets and other common client<br />
applications do not offer scalable computation for large scale data analytics and<br />
exploration. Increasingly, researchers in domains ranging from the social sciences<br />
to environmental sciences are faced with a deluge of data, often sitting in spreadsheets<br />
such as Excel or other client applications, and they lack a convenient way to<br />
explore the data, to find related data sets, or to invoke scalable analytical models<br />
over the data. To address these limitations, we have developed a cloud data analytics<br />
service based on Daytona, which is an iterative MapReduce runtime optimized<br />
for data analytics. In our model, Excel and other existing client applications provide<br />
the data entry and user interaction surfaces, Daytona provides a scalable runtime<br />
on the cloud for data analytics, and our service seamlessly bridges the gap between<br />
the client and cloud. Any analyst can use our data analytics service to discover<br />
and import data from the cloud, invoke cloud scale data analytics algorithms<br />
to extract information from large datasets, invoke data visualization, and then store<br />
the data back to the cloud, all through a spreadsheet or other client application they<br />
are already familiar with.<br />
Interactive User Feedback in Ontology Matching Using Signature Vectors<br />
Isabel F. Cruz (University of Illinois at Chicago)<br />
Cosmin Stroe (University of Illinois at Chicago)<br />
Matteo Palmonari (University of Milano-Bicocca)<br />
When compared to a gold standard, the set of mappings that are generated by an<br />
automatic ontology matching process is neither complete nor are the individual<br />
mappings always correct. However, given the explosion in the number, size, and<br />
complexity of available ontologies, domain experts no longer have the capability<br />
to create ontology mappings without considerable effort. We present a solution<br />
to this problem that consists of making the ontology matching process interactive<br />
so as to incorporate user feedback in the loop. Our approach clusters mappings to<br />
identify where user feedback will be most beneficial in reducing the number of user<br />
interactions and system iterations. This feedback process has been implemented<br />
in the AgreementMaker system and is supported by visual analytic techniques that<br />
help users to better understand the matching process. Experimental results using<br />
the OAEI benchmarks show the effectiveness of our approach. We will demonstrate<br />
how users can interact with the ontology matching process through the AgreementMaker<br />
user interface to match real-world ontologies.<br />
DObjects+: Enabling Privacy-Preserving Data Federation Services<br />
Pawel Jurczyk (Google Inc.)<br />
Li Xiong (Emory University)<br />
Slawomir Goryczka (Emory University)<br />
The emergence of cloud computing implies and facilitates managing large collections<br />
of highly distributed, autonomous, and possibly private databases. While<br />
there is an increasing need for services that allow integration and sharing of various<br />
data repositories, it remains a challenge to ensure the privacy, interoperability, and<br />
scalability for such services. In this paper we demonstrate a scalable and extensible<br />
framework that is aimed to enable privacy preserving data federations. The framework<br />
is built on top of a distributed mediator-wrapper architecture where nodes<br />
can form collaborative groups for secure anonymization and secure query processing<br />
when private data need to be accessed. New anonymization models and protocols<br />
will be demonstrated that counter potential attacks in the distributed setting.<br />
DRAGOON: An Information Accountability System for<br />
High-Performance Databases<br />
Kyriacos E. Pavlou (The University of Arizona)<br />
Richard T. Snodgrass (The University of Arizona)<br />
Regulations and societal expectations have recently emphasized the need to mediate<br />
access to valuable databases, even access by insiders. Fraud occurs when a<br />
person, often an insider, tries to hide illegal activity. Companies would like to be<br />
assured that such tampering has not occurred, or if it does, that it will be quickly<br />
discovered and used to identify the perpetrator. At one end of the compliance spectrum<br />
lies the approach of restricting access to information and on the other that of<br />
information accountability. We focus on effecting information accountability of data<br />
stored in high-performance databases. The demonstrated work ensures appropriate<br />
use and thus end-to-end accountability of database information via a continuous<br />
assurance technology based on cryptographic hashing techniques. A prototype<br />
tamper detection and forensic analysis system named DRAGOON was designed and<br />
implemented to determine when tampering(s) occurred and what data were tampered<br />
with. DRAGOON is scalable, customizable, and intuitive. This work will show<br />
that information accountability is a viable alternative to information restriction for<br />
ensuring the correct storage, use, and maintenance of databases on extant DBMSes.<br />
Intuitive Interaction With Encrypted Query Execution in DataStorm<br />
Ken Smith (MITRE)<br />
Ameet Kini (MITRE)<br />
William Wang (MITRE)<br />
Chris Wolf (MITRE)<br />
M. David Allen (MITRE)<br />
Andrew Sillers (MITRE)<br />
The encrypted execution of database queries promises powerful security protections;<br />
however, users are currently unlikely to benefit without significant expertise. In<br />
this demonstration, we illustrate a simple workflow enabling users to design secure<br />
executions of their queries. The DataStorm system demonstrated simplifies both the<br />
design and execution of encrypted execution plans, and represents progress toward<br />
the challenge of developing a general planner for encrypted query execution.<br />
Seminar 1: Data Management Issues on the Semantic Web<br />
Oktie Hassanzadeh (University of Toronto & IBM Research)<br />
Anastasios Kementsietsidis (IBM Research)<br />
Yannis Velegrakis (University of Trento)<br />
We provide an overview of the current data management research issues in the<br />
context of the Semantic Web. The objective is to introduce the audience to the<br />
area of the Semantic Web, and to highlight the fact that the area provides many<br />
interesting research opportunities for the data management community. A new<br />
model, the Resource Description Framework (RDF), coupled with a new query<br />
language, called SPARQL, leads us to revisit some classical data management problems,<br />
including efficient storage, query optimization, and data integration. These<br />
are problems that the Semantic Web community has only recently started to explore,<br />
and therefore the experience and long tradition of the database community<br />
can prove valuable. We target both experienced and novice researchers who are<br />
looking for a thorough presentation of the area and its key research topics.<br />
Seminar 2: Discovering Multiple Clustering Solutions: Grouping Objects<br />
in Different Views of the Data<br />
Emmanuel Müller (Karlsruhe Institute of Technology)<br />
Stephan Günnemann (RWTH Aachen University)<br />
Ines Färber (RWTH Aachen University)<br />
Thomas Seidl (RWTH Aachen University)<br />
Traditional clustering algorithms identify just a single clustering of the data. Today’s<br />
complex data, however, allow multiple interpretations leading to several valid<br />
groupings hidden in different views of the database. Each of these multiple clustering<br />
solutions is valuable and interesting as different perspectives on the same data<br />
and several meaningful groupings for each object are given. Especially for high<br />
dimensional data, where each object is described by multiple attributes, alternative<br />
clusters in different attribute subsets are of major interest. In this tutorial, we<br />
describe several real world application scenarios for multiple clustering solutions.<br />
We abstract from these scenarios and provide the general challenges in this emerging<br />
research area. We describe state-of-the-art paradigms, we highlight specific<br />
techniques, and we give an overview of this topic by providing a taxonomy of the<br />
existing clustering methods. By focusing on open challenges, we try to attract<br />
young researchers to participate in this emerging research field.<br />
Seminar 3: Detecting Clones, Copying and Reuse on the Web<br />
Xin Luna Dong (AT&T Labs–Research)<br />
Divesh Srivastava (AT&T Labs–Research)<br />
The Web has enabled the availability of a vast amount of useful information in<br />
recent years. However, the web technologies that have enabled sources to share<br />
their information have also made it easy for sources to copy from each other and<br />
often publish without proper attribution. Understanding the copying relationships<br />
between sources has many benefits, including helping data providers protect their<br />
own rights, improving various aspects of data integration, and facilitating<br />
in-depth analysis of information flow. The importance of copy detection has led to a<br />
substantial amount of research in many disciplines of Computer Science, based on<br />
the type of information considered, such as text, images, videos, software code, and<br />
structured data. This seminar explores the similarities and differences between the<br />
techniques proposed for copy detection across the different types of information.<br />
We also examine the computational challenges associated with large-scale copy<br />
detection, indicating how copies could be detected efficiently, and identify a range of<br />
open problems for the community.<br />
Seminar 4: Mining Knowledge from Data: An Information Network<br />
Analysis Approach<br />
Jiawei Han (University of Illinois at Urbana-Champaign)<br />
Yizhou Sun (University of Illinois at Urbana-Champaign)<br />
Xifeng Yan (University of California at Santa Barbara)<br />
Philip S. Yu (University of Illinois at Chicago)<br />
Most people consider a database to be merely a data repository that supports data<br />
storage and retrieval. Actually, a database contains rich, inter-related, multi-typed<br />
data and information, forming one or a set of gigantic, interconnected, heterogeneous<br />
information networks. Much knowledge can be derived from such information<br />
networks if we systematically develop an effective and scalable database-oriented<br />
information network analysis technology. In this tutorial, we systematically introduce<br />
database-oriented information network analysis methods and demonstrate how<br />
such a technology can be used to turn database data into useful knowledge and<br />
how such information networks can be used to enhance data quality, consistency,<br />
and the generation of interesting knowledge. This tutorial presents an organized<br />
picture of how to turn a database into one or a set of organized heterogeneous<br />
information networks, how such information networks can be used for data cleaning,<br />
data consolidation, and data quality improvement, how to perform OLAP in<br />
such information networks, how to discover various kinds of knowledge from such<br />
information networks, and how to transform database data into knowledge by<br />
information network analysis. Moreover, we present interesting case studies on real<br />
datasets, including DBLP and Flickr, and show how interesting and organized knowledge<br />
can be generated from such database-oriented information networks.<br />
Seminar 5: Emerging Graph Queries In Linked Data<br />
Arijit Khan (University of California, Santa Barbara)<br />
Yinghui Wu (University of California, Santa Barbara)<br />
Xifeng Yan (University of California, Santa Barbara)<br />
In a wide array of disciplines, data can be modeled as an interconnected network of<br />
entities, where various attributes could be associated with both the entities and the<br />
relations among them. Knowledge is often hidden in the complex structure and attributes<br />
inside these networks. While querying and mining these linked datasets are essential<br />
for various applications, traditional graph queries may not be able to capture<br />
the rich semantics in these networks. With the advent of complex information networks,<br />
new graph queries are emerging, including graph pattern matching and mining,<br />
similarity search, ranking and expert finding, graph aggregation and OLAP. These<br />
queries require both the topology and content information of the network data, and<br />
hence differ from classical graph algorithms such as shortest path, reachability<br />
and minimum cut, which depend only on the structure of the network. In this tutorial,<br />
we shall give an introduction to the emerging graph queries, their indexing and resolution<br />
techniques, the current challenges and the future research directions.<br />
Seminar 6: Boolean Matrix Decomposition Problem: Theory, Variations<br />
and Applications to Data Engineering<br />
Jaideep Vaidya (Rutgers University)<br />
With the ubiquitous nature and sheer scale of data collection, the problem of data<br />
summarization is most critical for effective data management. Classical matrix<br />
decomposition techniques have often been used for this purpose, and have been<br />
the subject of much study. In recent years, several other forms of decomposition,<br />
including Boolean Matrix Decomposition, have become of significant practical<br />
interest. Since much of the data collected is categorical in nature, it can be viewed<br />
in terms of a Boolean matrix. Boolean matrix decomposition (BMD), wherein a<br />
Boolean matrix is expressed as a product of two Boolean matrices, can be used<br />
to provide concise and interpretable representations of Boolean data sets. The<br />
decomposed matrices give the set of meaningful concepts and their combination,<br />
which can be used to reconstruct the original data. Such decompositions are useful<br />
in a number of application domains including role engineering, text mining as<br />
well as knowledge discovery from databases. In this seminar, we look at the theory<br />
underlying the BMD problem, study some of its variants and solutions, and examine<br />
different practical applications.<br />
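As a small illustration of the idea in the abstract above, the following Python sketch shows a Boolean matrix being reconstructed from two smaller Boolean matrices. It is not taken from the seminar: the function name `boolean_product`, the matrices `data`, `usage` and `concepts`, and the user/permission reading of the rows and columns are all hypothetical, chosen only to echo the role engineering application mentioned in the abstract.

```python
from typing import List

BoolMatrix = List[List[int]]


def boolean_product(a: BoolMatrix, b: BoolMatrix) -> BoolMatrix:
    """Boolean matrix product: entry (i, j) is the OR over k of a[i][k] AND b[k][j]."""
    inner = len(b)
    cols = len(b[0])
    return [
        [int(any(row[k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


# Hypothetical data: rows are users, columns are permissions.
data = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical "concepts" (for example, roles), each a set of permissions.
concepts = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# Hypothetical combination matrix: which concepts each user holds.
usage = [
    [1, 0],
    [1, 1],
    [0, 1],
]

# The Boolean product of the two factors reproduces the original matrix.
reconstructed = boolean_product(usage, concepts)
print(reconstructed == data)
```

In this toy case the 3 x 4 matrix is described exactly by a 3 x 2 and a 2 x 4 factor. Unlike the ordinary matrix product, overlapping concepts do not add up: an entry covered by two concepts is still 1.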
Co-Located Workshops<br />
ICDE Workshop on Data-Driven Decision Support<br />
and Guidance Systems (DGSS 2012)<br />
http://dgss.vse.gmu.edu/<br />
Decision support systems (DSS) are widely used to support business or organizational<br />
decision-making at the management, operations and planning levels of an organization.<br />
Decision guidance systems (DGS) are decision support systems that go beyond organizing<br />
and displaying information, providing actionable recommendations to and extracting<br />
knowledge from human decision-makers. This workshop will bring together DGSS<br />
researchers and practitioners to present novel methodologies, models, algorithms,<br />
systems, tools, applications and case studies of DGSS. Most importantly, the workshop<br />
will be a forum to discuss how to utilize advances from multiple disciplines for building<br />
DGSS that can intelligently merge human knowledge and expertise with formal<br />
mathematical models to make better decisions. The workshop will include both formal<br />
presentations and informal discussion of important research directions in DGSS, and<br />
their interactions with knowledge and data engineering.<br />
Program<br />
8:50 – 9 am Opening remarks<br />
9 – 10 am Paper session 1<br />
10 – 10:30 am Coffee break<br />
10:30 am – noon Paper session 2<br />
Noon – 2 pm Lunch<br />
A MAUT Approach for Reusing Ontologies<br />
Antonio Jiménez, Mari Carmen Suárez-Figueroa, Alfonso<br />
Mateos, Mariano Fernández-López and Asunción Gómez-Pérez<br />
Online Optimization through Preprocessing for Multi-Stage<br />
Production Decision Guidance Queries<br />
Nathan Egge, Alexander Brodsky and Igor Griva<br />
A Decision-theoretic Model of Disease Surveillance<br />
and Control and a Prototype Implementation for the<br />
Disease Influenza<br />
Michael Wagner, Gregory Cooper, Fuchiang Tsui,<br />
Jeremy Espino, Hendrik Harkema, John Levander,<br />
Ricardo Villamarin, Nicholas Millett, Shawn Brown and<br />
Anthony Gallagher<br />
Personal Health Explorer: A Semantic Health Recommendation<br />
System<br />
Thomas Morrell and Larry Kerschberg<br />
Striving for Market Dominance in UK’s Private Healthcare<br />
Sector: A Case of Cygnet Healthcare<br />
Mlungisi Masilela, Fenio Annansingh and Shaofeng Liu<br />
2 – 3:30 pm Poster session: brief overview presentations followed by<br />
parallel poster presentations<br />
Towards a DGSS Prototype for Early Warning for Ski<br />
Injuries<br />
Boris Delibašić and Zoran Obradović
3:30 – 4 pm Coffee break<br />
4 – 5 pm Paper session 3<br />
Non-Parametric Synthesis Of Private Probabilistic<br />
Predictions<br />
Phan Giang<br />
Battle Management System: An Optimization for Military<br />
Decision Makers<br />
Richard Haberlin and Alexander Brodsky<br />
An explanation of decision-making under uncertainty –<br />
a qualitative research approach<br />
Eurico Lopes<br />
Agent Negotiation Strategies for Composing Service<br />
Workflows<br />
John Mcdowall and Larry Kerschberg<br />
A Scalable Data Warehouse Model based on Complex<br />
Semantic Event Processing in Distributed Systems<br />
Dingyu Yang and Jian Cao<br />
A Stigmergic Guiding System to Facilitate the Group<br />
Decision Process<br />
Constantin-Bala Zamfirescu and Ciprian Candea<br />
A Regression Based Algorithm for Optimizing Top-K<br />
Selection in Simulation Query Language<br />
Susan Farley, Alexander Brodsky and Chun-Hung Chen<br />
Towards a Training-Oriented Adaptive Decision Guidance<br />
and Support System<br />
Farhana Zulkernine, Patrick Martin, Sima Soltani, Wendy<br />
Powley, Serge Mankovskii and Mark Addleman<br />
5 – 5:30 pm Wrap-up session: Discussion on the future and<br />
organization of DGSS<br />
3rd International Workshop on Data Engineering<br />
Meets the Semantic Web (DESWeb 2012)<br />
https://sites.google.com/site/desweb2012/<br />
DESWeb brings together researchers and practitioners from Data Management and<br />
Semantic Web. On one hand, the Semantic Web brings several new data management<br />
problems, while on the other hand, several Data Management problems can be<br />
solved with the help of Semantic Web technologies. DESWeb attracts papers on three<br />
broad areas: Semantics in Data Management, Management of Semantic Web Data, and<br />
Semantic Search and Linked Data. DESWeb 2012 features an invited talk by Prof. Tim<br />
Finin on how to make Semantic Web tools easier to use, as well as four regular contributions<br />
on the topics of improving query processing, benchmarking, schema matching,<br />
and challenges related to enabling Semantic Web tools within Dataspaces.<br />
Program<br />
9 – 10 am Session 1<br />
10 – 10:30 am Coffee break<br />
10:30 am – noon Invited talk<br />
Noon – 2 pm Lunch<br />
2 – 3 pm Session 2<br />
Scientific SPARQL<br />
Andrej Andrejev and Tore Risch<br />
A Benchmark for RDF-Based metadata<br />
Ivan Subotic, Lukas Rosenthaler and Heiko Schuldt<br />
Making the Semantic Web Easier to Use<br />
Tim Finin<br />
Opaque Attribute Alignment<br />
Jennifer Sleeman, Rafael Alonso, Hua Li, Art Pope and<br />
Antonio Badia<br />
Linked Data and Live Querying for Enabling Support<br />
Platforms for Web Dataspaces<br />
Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira,<br />
Axel Polleres and Manfred Hauswirth<br />
3 – 3:30 pm Discussion and Wrap-up
1st International Workshop on Data Management<br />
in the Cloud (DMC 2012)<br />
http://www.nec-labs.com/dm/dmc2012/<br />
Cloud computing has emerged as a promising computing and business model. By providing<br />
on-demand scaling capabilities without any large upfront investment or long-term<br />
commitment, it is attracting a wide range of users. The database community has also shown<br />
great interest in exploiting this new platform for data management services in a highly<br />
scalable and cost-efficient manner. As a result, cloud computing presents challenges<br />
and opportunities for data management. The DMC workshop aims at bringing researchers<br />
and practitioners in cloud computing and data management systems together to discuss<br />
the research issues at the intersection of those areas, and also to draw more attention from<br />
the larger data management research community to this new and highly promising field.<br />
Program<br />
8:50 – 9 am Welcome<br />
9 – 10 am Keynote<br />
10 – 10:30 am Coffee break<br />
10:30 am – noon Session 1<br />
Noon – 2:30 pm Lunch<br />
Supporting Extensible Performance SLAs for Cloud<br />
Databases<br />
Olga Papaemmanouil (Brandeis University)<br />
Application-Managed Database Replication on Virtualized<br />
Cloud Environments<br />
Liang Zhao (National ICT Australia), Sherif Sakr (National<br />
ICT Australia), Alan Fekete (University of Sydney, Australia),<br />
Hiroshi Wada (National ICT Australia), and Anna Liu<br />
(National ICT Australia)<br />
Efficient Updates for Web-scale Indexes over the Cloud<br />
Panagiotis Antonopoulos (Microsoft Corp), Ioannis<br />
Konstantinou (National Technical University of Athens),<br />
Dimitrios Tsoumakos, and Nectarios Koziris (National<br />
Technical University of Athens)<br />
2:30 – 3:30 pm Session 2<br />
3:30 – 4 pm Coffee break<br />
4 – 5 pm Session 3<br />
Secure Access for Healthcare Data in the Cloud Using<br />
Ciphertext-Policy Attribute-Based Encryption<br />
Suhair Alshehri (Rochester Inst. of Technology),<br />
Stanislaw Radziszowski, and Rajendra Raj (Rochester<br />
Inst. of Technology)<br />
Achieving Database Information Accountability in the<br />
Cloud<br />
Kyriacos Pavlou (The University of Arizona), and Richard<br />
Snodgrass (The University of Arizona)<br />
Building Large XML Stores in the Amazon Cloud<br />
Jesús Camacho-Rodríguez* (LRI, Universite Paris-Sud 11),<br />
Dario Colazzo (LRI, Universite Paris-Sud 11), and Ioana<br />
Manolescu (INRIA Saclay)<br />
Stream As You Go: The Case for Incremental Data<br />
Access and Processing in the Cloud<br />
Romeo Kienzler (ETH Zurich), Rémy Bruggmann (University<br />
of Berne), Anand Ranganathan (IBM Research), and<br />
Nesime Tatbul* (ETH Zurich)<br />
3rd International Workshop on Graph Data<br />
Management: Techniques and Applications<br />
(GDM 2012)<br />
http://www.cse.unsw.edu.au/~iwgdm/2012/<br />
Recently, there has been a lot of interest in the application of graphs in different domains.<br />
They have been widely used for data modeling of different application domains<br />
such as chemical compounds, multimedia databases, protein networks, social networks<br />
and semantic web. With the continued emergence and increase of massive and complex<br />
structural graph data, a graph database that efficiently supports elementary data<br />
management mechanisms is crucially required to effectively understand and utilize any<br />
collection of graphs. The overall goal of the workshop is to bring people from different<br />
fields together, exchange research ideas and results, and encourage discussion about<br />
how to provide efficient graph data management techniques in different application<br />
domains and to understand the research challenges of this area.
Program<br />
9 – 10 am Welcome and keynote presentation<br />
10 – 10:30 am Coffee break<br />
10:30 – noon Research session<br />
Noon – 2 pm Lunch break<br />
Keynote Speaker<br />
Prof. Jiawei Han - Univ. of Illinois at Urbana-Champaign<br />
A Comparison of Current Graph Database Models<br />
Renzo Angles<br />
Design of Declarative Graph Query Languages: On the<br />
Choice between Value, Pattern and Object-based Representations<br />
for Graphs<br />
Hasan M Jamil<br />
Benchmarking traversal operations over graph databases<br />
Marek Ciglan, Alex Averbuch, and Ladislav Hluchy<br />
Mining Associations Using Directed Hypergraphs<br />
Ramanuja Simha, Rahul Tripathi, and Mayur Thakur<br />
2 – 3:30 pm Research session (Invited papers)<br />
3:30 – 4 pm Coffee break<br />
Finding Skyline Nodes in Large Networks<br />
Arijit Khan, Vishwakarma Singh, and Jian Wu<br />
Partitioning Social Networks for Fast Retrieval of Time-dependent<br />
Queries<br />
Mindi Yuan, David Stein, Berenice Carrasco, Joana M. F.<br />
Trindade, and Yi Lu<br />
Will Graph Data Management Techniques Contribute<br />
to the Successful Large-Scale Deployment of Semantic<br />
Web Technologies?<br />
Philippe Cudre-Mauroux<br />
4 – 6 pm Industrial session<br />
Virtuoso 7 - Column Store and Adaptive Techniques for<br />
Graph<br />
Orri Erling<br />
HyperGraphDB: Model and Applications<br />
Borislav Iordanov<br />
The Bigdata(r) parallel graph database<br />
Bryan Thompson<br />
RDF Graph Stores<br />
Christopher J. Matheus<br />
ICDE Workshop on Secure Data Management on<br />
Smartphones and Mobiles (SDMSM 2012)<br />
http://dig.csail.mit.edu/2012/ICDE-SDMSM/<br />
There has been a widespread adoption of powerful mobile devices such as smartphones<br />
and tablets within the enterprise in the recent past. This widespread adoption of mobile<br />
devices raises serious data management challenges around data privacy and security of<br />
personal and enterprise data on these devices. The further adoption of mobile devices<br />
within the enterprise depends on strong guarantees that the enterprise is still in control<br />
of its sensitive data on mobile endpoints in the wild, and no data leakage or unauthorized<br />
modifications to the data can happen through these devices. Popular mobile platforms such<br />
as Android and iOS allow users to download apps from respective marketplaces, and enterprises<br />
can host their own marketplaces to distribute their own apps. However, given the<br />
personal nature of these devices, most users run both enterprise as well as personal apps<br />
on the same device simultaneously. Since most apps on the public marketplaces are not<br />
security certified, and existing platform security solutions are lacking, for example by being<br />
coarse grained or being checked only at application install time, it is possible for malicious<br />
apps to steal/modify enterprise sensitive information that is resident on these devices.<br />
Similarly, given the compact dimensions of mobile devices such as smartphones, users<br />
could potentially lose their phones, which carry sensitive data. Furthermore, most devices<br />
come packed with an array of sensors and communication capabilities such as GPS, cameras,<br />
near field communication (NFC), accelerometers, WiFi and Bluetooth. These myriad<br />
on-device sensors generate large amounts of raw sensor data and managing this data to<br />
infer high-level events about the user and the end device remains a challenge. Additionally,<br />
devices like iPads and Internet tablets are now being increasingly used in a multi-user environment<br />
where continuous and secure authentication and authorization for data access are<br />
critical. In this workshop, we focus on the data management challenges that arise from the<br />
use of enterprise and other privacy sensitive data on mobile devices such as smartphones.
Program<br />
9 – 9:15 am Opening address & speaker introduction<br />
9:15 – 10 am Invited Talk: “Privacy in Mobile, Collaborative,<br />
Context-aware Systems”<br />
Prof Tim Finin<br />
10 – 10:30 am Break<br />
10:30 – noon Research papers (3 papers, 30 mins each)<br />
Noon – 2 pm Lunch break<br />
2 – 2:45 pm Invited talk<br />
2:45 – 3:30 pm Panel “Managing data on smart phones: Enterprises and beyond”<br />
3:30 – 4 pm Break<br />
4 – 5 pm Research papers (3 papers, 30 mins each)<br />
5 pm – 5:15 pm Group Discussion<br />
5:15 – 5:30 pm Closing remarks<br />
7th International Workshop on Self-Managing<br />
Database Systems (SMDB 2012)<br />
http://smdb2012.dvs.informatik.tu-darmstadt.de/<br />
Autonomic, or self-managing, systems are a promising approach to achieve the goal of<br />
systems that are easier to use and maintain in the face of growing system complexity. A<br />
system is considered to be autonomic if it is self-configuring, self-optimizing, self-healing<br />
and/or self-protecting. The aim of the SMDB workshop is to provide a forum for<br />
researchers from both industry and academia to present and discuss ideas and experiences<br />
related to self-management and self-organization in all areas of Information Management<br />
(IM) in general. SMDB targets not only classical databases but also the new<br />
generation of storage engines such as column stores, key-value stores and in-memory<br />
databases. Beyond databases, SMDB aims to cover autonomic aspects of data intensive<br />
systems represented by large-scale map-reduce (e.g., Hadoop) and cloud environments<br />
where much work on self-management is needed. Last but not least, SMDB wants to<br />
expand its horizons to include self-management of non-traditional, new areas of IM<br />
such as social networks and peer-to-peer systems.<br />
Program<br />
9 – 10 am Session 1<br />
10 – 10:30 am Break<br />
10:30 am – 12:30 pm Session 2<br />
12:30 – 2 pm Lunch break<br />
Opening<br />
Alejandro Buchmann (TU Darmstadt), Malu Castellanos<br />
(HP Labs)<br />
Keynote: Quantitative Methods for Workload Management<br />
in Integrated Large Scale Data Platforms<br />
Nachum Shacham (eBay)<br />
Discovering Indicators for Congestion in DBMSs<br />
Mingyi Zhang (Queen’s University, Canada), Pat Martin<br />
(Queen’s University), Wendy Powley (Queen’s University),<br />
Paul Bird (IBM Toronto Lab), and Keith McDonald (IBM<br />
Toronto Lab)<br />
Online load balancing in parallel database queries with<br />
model predictive control<br />
Anastasios Gounaris (Aristotle University of Thessaloniki),<br />
and Christos Yfoulis (ATEI of Thessaloniki)<br />
Same Queries, Different Data: Can we Predict Runtime<br />
Performance?<br />
Adrian Daniel Popescu (EPFL), Vuk Ercegovac (IBM<br />
Almaden), Andrey Balmin (IBM Almaden), Miguel Branco<br />
(EPFL), and Anastasia Ailamaki (EPFL)<br />
Elastic Scale-out for Partition-Based Database Systems<br />
Umar Farooq Minhas (University of Waterloo), Rui Liu<br />
(University of Waterloo), Ashraf Aboulnaga (University of<br />
Waterloo), Ken Salem (University of Waterloo), Jonathan<br />
Ng (University of Waterloo), and Sean Robertson<br />
(University of Waterloo)
2 – 3:30 pm Session 3<br />
3:30 – 4 pm Break<br />
4 – 6 pm Session 4<br />
Adaptive class-based scheduling of continuous queries<br />
Lory Al Moakar (University of Pittsburgh), Alexandros<br />
Labrinidis (University of Pittsburgh), and Panos<br />
Chrysanthis (University of Pittsburgh)<br />
Adaptive Provisioning of Stream Processing Systems in<br />
the Cloud<br />
Javier Cerviño (Universidad Politécnica de Madrid), Evangelia<br />
Kalyvianaki (Imperial College London), Joaquín Salvachúa<br />
(Universidad Politécnica de Madrid), and Peter Pietzuch<br />
(Imperial College London)<br />
Lifting the burden of history in adaptive ordering of<br />
pipelined stream filters<br />
Efthymia Tsamoura (Aristotle University of Thessaloniki),<br />
Anastasios Gounaris (Aristotle University of Thessaloniki),<br />
and Yannis Manolopoulos (Aristotle University of<br />
Thessaloniki)<br />
Adaptive Index Buffer<br />
Hannes Voigt (TU Dresden), Tobias Jaekel (TU Dresden),<br />
Thomas Kissinger (TU Dresden), and Wolfgang Lehner<br />
(TU Dresden)<br />
Application of Micro-Specialization to Query Evaluation<br />
Operators<br />
Rui Zhang (University of Arizona), Richard Snodgrass<br />
(University of Arizona), and Saumya Debray (University of Arizona)<br />
Automatic Data Placement in MPP Databases<br />
Carlos Garcia-Alvarado (University of Houston), Venkatesh<br />
Raghavan (Greenplum EMC), Sivaramakrishnan<br />
Narayanan (Greenplum EMC), and Florian Waas (Greenplum<br />
EMC)<br />
Discussion & closing<br />
Alejandro Buchmann (TU Darmstadt), and Malu Castellanos<br />
(HP Labs)<br />
International Workshop on Spatio Temporal Data<br />
Integration and Retrieval (STIR 2012)<br />
http://research.ihost.com/stir12/<br />
The increasing world population is putting higher demands on the planet’s limited<br />
resources due to shifting life-styles. Consequently, we not only need to monitor how we<br />
consume resources but also optimize resource usage. Some examples of the planet’s<br />
limited resources are water, energy, land, food and air. Today, significant challenges exist<br />
for reducing usage of these resources, while maintaining quality of life. The challenges<br />
range from understanding regionally varied impacts of global environmental change,<br />
through tracking diffusion of avian flu and responding to natural disasters, to adapting<br />
business practice to dynamically changing resources, markets and geopolitical situations.<br />
This workshop is focused on making the research in information integration<br />
and retrieval more relevant to the challenges in systems with significant spatial and<br />
temporal components. The workshop will build upon traditional themes of interest,<br />
namely integration architectures, information extraction, record linkage, named entity<br />
extraction, source meta-data learning, query execution and optimization. However, we<br />
give special emphasis to how these can be applied to integrating information arising<br />
from systems that are (likely to be) deployed over wide geographic spaces, and that collect<br />
and use data that changes over time.<br />
Program<br />
8:30 – 10 am Session 1<br />
10 – 10:30 am Coffee break<br />
10:30 – noon Session 2<br />
Opening and Welcome<br />
Invited Talk: “On the Roles of Spatio-Temporal Data in<br />
Web Search”<br />
Prof. Christian S Jensen, ACM & IEEE Fellow (Aarhus<br />
University, Denmark)<br />
TNeT: Tensor-based Neighborhood Discovery in<br />
Traffic Networks<br />
Yanan Sun, Vandana P Janeja, Aryya Gangopadhyay<br />
(University of Maryland, Baltimore County, USA) and<br />
Michael P McGuire (Towson University, USA)
Noon – 1:30 pm Lunch<br />
1:30 – 3:30 pm Session 3<br />
3:30 – 4 pm Coffee break<br />
4 – 5:30 pm Session 4<br />
A Study of the Correlation between the Spatial Attributes<br />
on Twitter<br />
Bumsuk Lee and Byung-Yeon Hwang (The Catholic<br />
University of Korea, Korea)<br />
Multi-representation Lens for Visual Analytics<br />
Sandro Danilo Gatto and Andre Santanche<br />
(UNICAMP, Brazil)<br />
Invited Talk/Panel Discussion - TBD<br />
Who was Where, When? Spatiotemporal Analysis of<br />
Researcher Mobility in Nuclear Science<br />
Miray Kas, Kathleen M Carley, and L. Richard Carley<br />
(Carnegie Mellon University, USA)<br />
Architecting the Database Access for a IT Infrastructure<br />
and Data Center Monitoring tool<br />
Pradeep Unde, Harrick Vin, Maitreya Natu, Vaishali Kulkarni,<br />
Dilys Thomas, Sreeram Vasudevan, Amruta Dhondage,<br />
Chinmay Jog, Shivam Sahai, and Rekha Pathak (Tata<br />
Research Development and Design Center, Pune, India)<br />
Moving Objects and KML Files<br />
Karine Reis Ferreira, Lúbia Vinhas, Antônio Miguel Vieira<br />
M<strong>on</strong>teiro and Gilberto Camara (Nati<strong>on</strong>al Institute of<br />
Space Research, Brazil)<br />
Closing Remarks<br />
Local Information<br />
Washington, DC: Capital of the USA<br />
The city, which is located on the north bank of the Potomac River, is bordered by the<br />
states of Virginia to the southwest and Maryland on the other sides. The District has a<br />
resident population of 599,657; because of commuters from the surrounding suburbs, its<br />
population rises to over one million during the workweek. The Washington Metropolitan<br />
Area, of which the District is a part, has a population of 5.3 million, making it the ninth-largest<br />
metropolitan area in the country. The District has a total area of 68.3 square miles (177 km²),<br />
of which 61.4 square miles (159 km²) is land and 6.9 square miles (18 km²) (10.16%) is<br />
water. The District has three major natural flowing streams: the Potomac River and its<br />
tributaries, the Anacostia River and Rock Creek. Tiber Creek, a watercourse that once<br />
passed through the National Mall, was fully enclosed underground during the 1870s.<br />
The highest natural point in the District of Columbia is Point Reno, located in Fort<br />
Reno Park, in the Tenleytown neighborhood, at 409 feet (125 m) above sea level. The<br />
lowest point is sea level at the Potomac River. The geographic center of Washington is<br />
located near the intersection of 4th and L Streets NW.<br />
Approximately 19.4% of Washington, D.C. is parkland, which ties New York City for the<br />
largest percentage of parkland among high-density U.S. cities. The U.S. National Park<br />
Service manages most of the natural habitat in Washington, D.C., including Rock Creek<br />
Park, the Chesapeake and Ohio Canal National Historical Park, the National Mall,<br />
Theodore Roosevelt Island, the Constitution Gardens, Meridian Hill Park, and Anacostia<br />
Park. The only significant area of natural habitat not managed by the National Park<br />
Service is the U.S. National Arboretum, which is operated by the U.S. Department of<br />
Agriculture. The Great Falls of the Potomac River are located upstream (northwest) of<br />
Washington. During the 19th century, the Chesapeake and Ohio Canal, which starts in<br />
Georgetown, was used to allow barge traffic to bypass the falls.<br />
Washington is located in the humid subtropical climate zone, exhibiting four distinct<br />
seasons. Its climate is typical of Mid-Atlantic U.S. areas removed from bodies of water.<br />
Spring and fall are warm, with low humidity, while winter is cool, with annual snowfall<br />
averaging 14.7 inches (370 mm). Average winter lows tend to be around 30 °F (-1 °C) from<br />
mid-December to mid-February. Blizzards affect Washington on average once every four<br />
to six years. The most violent storms are called “nor’easters”, which typically feature high<br />
winds, heavy rains, and occasional snow. These storms often affect large sections of the<br />
U.S. East Coast. Summers are hot and humid, with highs averaging in the upper 80s °F<br />
(lower 30s °C) and lows averaging in the upper 60s °F (lower 20s °C). The combination of<br />
heat and humidity in the summer brings very frequent thunderstorms, some of which<br />
occasionally produce tornadoes in the area. While hurricanes (or their remnants) occasionally<br />
track through the area in late summer and early fall, they have often weakened<br />
by the time they reach Washington, partly due to the city’s inland location. Flooding of<br />
the Potomac River, however, caused by a combination of high tide, storm surge, and<br />
runoff, has been known to cause extensive property damage in Georgetown.<br />
History<br />
An Algonquian people known as the Nacotchtank inhabited the area around the Anacostia<br />
River where Washington now lies when the first Europeans arrived in the 17th<br />
century; however, Native American people had largely relocated from the area by the<br />
early 18th century. Georgetown was chartered by the Province of Maryland on the north<br />
bank of the Potomac River in 1751. The town would be included within the new federal<br />
territory established nearly 40 years later. The city of Alexandria, Virginia, founded in<br />
1749, was also originally included within the District.<br />
James Madison expounded the need for a federal district on January 23, 1788, in his<br />
“Federalist No. 43”, arguing that the national capital needed to be distinct from the<br />
states in order to provide for its own maintenance and safety. An attack on the Congress<br />
at Philadelphia by a mob of angry soldiers, known as the Pennsylvania Mutiny of 1783,<br />
had emphasized the need for the government to see to its own security. Therefore, the<br />
authority to establish a federal capital was provided in Article One, Section Eight, of the<br />
United States Constitution, which permits a “District (not exceeding ten miles square) as may,<br />
by cession of particular states, and the acceptance of Congress, become the seat of<br />
the government of the United States”. The Constitution does not, however, specify a<br />
location for the new capital. In what later became known as the Compromise of 1790,<br />
Madison, Alexander Hamilton, and Thomas Jefferson came to an agreement that the<br />
federal government would assume war debt carried by the states, on the condition that<br />
the new national capital would be located in the South.<br />
On July 16, 1790, the Residence Act provided for a new permanent capital to be located<br />
on the Potomac River, the exact area to be selected by President Washington. As permitted<br />
by the U.S. Constitution, the initial shape of the federal district was a square,<br />
measuring 10 miles (16 km) on each side, totaling 100 square miles (260 km²). During<br />
1791-1792, Andrew Ellicott and several assistants, including Benjamin Banneker,<br />
surveyed the border of the District with both Maryland and Virginia, placing boundary<br />
stones at every mile point; many of the stones are still standing. A new “federal city”<br />
was then constructed on the north bank of the Potomac, to the east of the established<br />
settlement at Georgetown. On September 9, 1791, the federal city was named in honor<br />
of George Washington, and the district was named the Territory of Columbia, Columbia<br />
being a poetic name for the United States in use at that time. Congress held its first session<br />
in Washington on November 17, 1800.<br />
The Organic Act of 1801 officially organized the District of Columbia and placed the<br />
entire federal territory, including the cities of Washington, Georgetown, and Alexandria,<br />
under the exclusive control of Congress. Further, the unincorporated territory within the<br />
District was organized into two counties: the County of Washington to the east of the<br />
Potomac and the County of Alexandria to the west. Following this act, citizens located<br />
in the District were no longer considered residents of Maryland or Virginia, thus ending<br />
their representation in Congress.<br />
On August 24–25, 1814, in a raid known as the Burning of Washington, British forces<br />
invaded the capital during the War of 1812, following the sacking and burning of York<br />
(modern-day Toronto). The Capitol, Treasury, and White House were burned and gutted<br />
during the attack. Most government buildings were quickly repaired, but the Capitol,<br />
which was at the time largely under construction, was not completed in its current form<br />
until 1868.<br />
Since 1800, the District’s residents have protested their lack of voting representation<br />
in Congress. To correct this, various proposals have been offered to return the land<br />
ceded to form the District back to Maryland and Virginia. This process is known as<br />
retrocession. However, such efforts failed to earn enough support until the 1830s when<br />
the District’s southern county of Alexandria went into economic decline due to neglect<br />
by Congress. Alexandria was also a major market in the American slave trade, and<br />
rumors circulated that abolitionists in Congress were attempting to end slavery in the<br />
District; such an action would have further depressed Alexandria’s economy. Unhappy<br />
with congressional authority over Alexandria, in 1840 the people began to petition for<br />
the retrocession of the District’s southern territory back to Virginia. The state legislature<br />
complied in February 1846, partly because the return of Alexandria provided two<br />
additional pro-slavery delegates to the Virginia General Assembly. On July 9, 1846,<br />
Congress agreed to return all of the District’s territory south of the Potomac River to the<br />
Commonwealth of Virginia.<br />
Confirming the fears of pro-slavery Alexandrians, the Compromise of 1850 outlawed the<br />
slave trade in the District, though not slavery itself. By 1860, approximately 80% of the<br />
city’s African American residents were free blacks. The outbreak of the American Civil<br />
War in 1861 led to notable growth in the District’s population due to the expansion of the<br />
federal government and a large influx of freed slaves. In 1862, President Abraham Lincoln<br />
signed the Compensated Emancipation Act, which ended slavery in the District of Columbia<br />
and freed about 3,100 enslaved persons, nine months prior to the Emancipation<br />
Proclamation. By 1870, the District’s population had grown to nearly 132,000. Despite the<br />
city’s growth, Washington still had dirt roads and lacked basic sanitation; the situation was<br />
so bad that some members of Congress proposed moving the capital elsewhere.<br />
With the Organic Act of 1871, Congress created a new government for the entire federal<br />
territory. This act effectively combined the City of Washington, Georgetown, and Washington<br />
County into a single municipality officially named the District of Columbia. Even<br />
though the City of Washington legally ceased to exist after 1871, the name continued<br />
in use and the whole city became commonly known as Washington, D.C. In the same<br />
Organic Act, Congress also appointed a Board of Public Works charged with modernizing<br />
the city. In 1873, President Grant appointed the board’s most influential member,<br />
Alexander Shepherd, to the new post of governor. That year, Shepherd spent $20 million<br />
on public works ($357 million in 2007 dollars), which modernized Washington but also<br />
bankrupted the city. In 1874, Congress abolished Shepherd’s office in favor of direct<br />
rule. Additional projects to renovate the city were not executed until the McMillan Plan<br />
in 1901.<br />
The District’s population remained relatively stable until the Great Depression in the<br />
1930s when President Franklin D. Roosevelt’s New Deal legislation expanded the bureaucracy<br />
in Washington. World War II further increased government activity, adding to the<br />
number of federal employees in the capital; by 1950, the District’s population had reached<br />
a peak of 802,178 residents. The Twenty-third Amendment to the United States Constitution<br />
was ratified in 1961, granting the District three votes in the Electoral College.<br />
After the assassination of civil rights leader Dr. Martin Luther King, Jr., on April 4,<br />
1968, riots broke out in the District, primarily in the U Street, 14th Street, 7th Street,<br />
and H Street corridors, centers of black residential and commercial areas. The riots<br />
raged for three days until over 13,000 federal and National Guard troops managed to<br />
quell the violence. Many stores and other buildings were burned; rebuilding was not<br />
complete until the late 1990s. In 1973, Congress enacted the District of Columbia<br />
Home Rule Act, providing for an elected mayor and city council for the District. In 1975,<br />
Walter Washington became the first elected and first black mayor of the District. However,<br />
during the later 1980s and 1990s, city administrations were criticized for mismanagement<br />
and waste. In 1995, Congress created the District of Columbia Financial<br />
Control Board to oversee all municipal spending and rehabilitate the city government.<br />
The District regained control over its finances in September 2001 and the oversight<br />
board’s operations were suspended.<br />
Attractions in Washington, D.C.<br />
White House<br />
The White House is the official residence and principal workplace of the President of<br />
the United States. Located at 1600 Pennsylvania Avenue NW in Washington, D.C., it<br />
was designed by Irish-born James Hoban and built between 1792 and 1800 in the late<br />
Georgian style. It has been the residence of every U.S. President since John Adams.<br />
In 1814, during the War of 1812, the mansion was set ablaze by the British Army in<br />
the Burning of Washington, destroying the interior and charring much of the exterior.<br />
Reconstruction began almost immediately, and President James Monroe moved into<br />
the partially reconstructed house in October 1817. Under Harry S. Truman, the interior<br />
rooms were completely dismantled and a new internal load-bearing steel frame was<br />
constructed inside the walls. Once this work was completed, the interior rooms were<br />
rebuilt. Today, the White House Complex includes the Executive Residence, West Wing,<br />
Cabinet Room, Roosevelt Room, East Wing, and the Old Executive Office Building,<br />
which houses the executive offices of the President and Vice President.<br />
Washington Monument<br />
The Washington Monument is an obelisk near the west end of the National Mall in<br />
Washington, D.C., built to commemorate the first U.S. president, General George<br />
Washington. The monument is both the world’s tallest stone structure and the<br />
world’s tallest obelisk, standing 555 feet 5⅛ inches (169.294 m). There are taller<br />
monumental columns, but they are neither all stone nor true obelisks. The cornerstone<br />
was laid on July 4, 1848, with the same trowel that George Washington used to lay<br />
the cornerstone of the Capitol in 1793.<br />
Lincoln Memorial<br />
The Lincoln Memorial commemorates the life of Abraham Lincoln, the 16th President<br />
of the United States. It is located in Potomac Park, Washington, D.C. The Memorial was<br />
designed by Henry Bacon; the style is that of a Greek Doric temple with 36 enormous<br />
columns. Inside the building is a huge statue of a sitting Lincoln. Also in the Memorial<br />
are two murals, and stone engravings of Lincoln’s second inaugural address and the<br />
Gettysburg Address.<br />
On August 28, 1963, Martin Luther King, Jr., made his “I Have a Dream” speech on the<br />
steps of the Lincoln Memorial (the speech was delivered on the landing 18 steps below<br />
Lincoln’s statue); there is now an inscription on the step where Dr. King stood, commemorating<br />
that historic event. Dr. King was speaking at the March on Washington for<br />
Jobs and Freedom.<br />
National World War II Memorial<br />
The World War II Memorial honors the 16 million who served in the armed forces of<br />
the U.S., the more than 400,000 who died, and all who supported the war effort from<br />
home. Symbolic of the defining event of the 20th century, the memorial is a monument<br />
to the spirit, sacrifice, and commitment of the American people. The Second World War<br />
is the only 20th century event commemorated on the National Mall’s central axis.<br />
Japanese Cherry Blossom Trees<br />
The National Cherry Blossom Festival is a spring celebration in Washington, D.C.<br />
commemorating the March 27, 1912, gift of Japanese cherry trees from Mayor Yukio<br />
Ozaki of Tokyo to the city of Washington. Mayor Ozaki donated the trees in an effort<br />
to enhance the growing friendship between the United States and Japan and also<br />
celebrate the continued close relationship between the two nations.<br />
In 1994 the Festival was expanded to two weeks to accommodate the many activities<br />
that happen during the trees’ blooming. Today the National Cherry Blossom Festival is<br />
coordinated by the National Cherry Blossom Festival, Inc., an umbrella organization<br />
consisting of representatives of business, civic, and governmental organizations.<br />
More than 700,000 people visit Washington each year to admire the blossoming cherry<br />
trees that herald the beginning of spring in the nation’s capital.<br />
This year’s Festival (100th Anniversary of the Gift of Trees) will be March 31 – April 15,<br />
with the parade on Saturday, April 14. (www.nationalcherryblossomfestival.org)<br />
Local Information<br />
Franklin Delano Roosevelt Memorial<br />
Located along the famous Cherry Tree Walk on the western edge of the Tidal Basin near<br />
the National Mall, this is a memorial not only to FDR, but also to the era he represents.<br />
The memorial traces twelve years of American history through a sequence of four outdoor<br />
rooms, each one devoted to one of FDR’s terms of office. Sculptures inspired by<br />
photographs depict the 32nd president: a 10-foot statue shows him in a wheeled chair;<br />
a bas-relief depicts him riding in a car during his first inaugural. At the very beginning<br />
of the memorial, in a prologue room, there is a statue of FDR seated in a wheelchair<br />
much like the one he actually used.<br />
Jefferson Memorial<br />
This presidential memorial is dedicated to Thomas Jefferson, an American Founding<br />
Father and the third president of the United States. The neoclassical building was<br />
designed by John Russell Pope. Construction began in 1939, the building was completed<br />
in 1943, and the bronze statue of Jefferson was added in 1947. When completed,<br />
the memorial occupied one of the last significant sites left in the city. In 2007, it was<br />
ranked fourth on the List of America’s Favorite Architecture by the American Institute<br />
of Architects.<br />
Smithsonian<br />
This is an educational foundation chartered by Congress in 1846 that maintains most of<br />
the nation’s official museums and galleries in Washington, D.C. The U.S. government<br />
partially funds the Smithsonian, thus making its collections open to the public free of<br />
charge. The most visited of the Smithsonian museums in 2007 was the National Museum<br />
of Natural History, located on the National Mall. Other Smithsonian Institution<br />
museums and galleries located on the Mall are: the National Air and Space Museum;<br />
the National Museum of African Art; the National Museum of American History; the<br />
National Museum of the American Indian; the Sackler and Freer galleries, which both<br />
focus on Asian art and culture; the Hirshhorn Museum and Sculpture Garden; the Arts<br />
and Industries Building; the S. Dillon Ripley Center; and the Smithsonian Institution<br />
Building (also known as “The Castle”), which serves as the institution’s headquarters.<br />
The Smithsonian American Art Museum (formerly known as the National Museum of<br />
American Art) and the National Portrait Gallery are located in the same building, the<br />
Donald W. Reynolds Center, near Washington’s Chinatown. The Reynolds Center is<br />
also known as the Old Patent Office Building. The Renwick Gallery is officially part of<br />
the Smithsonian American Art Museum but is located in a separate building near the<br />
White House. Other Smithsonian museums and galleries include: the Anacostia Community<br />
Museum in Southeast Washington; the National Postal Museum near Union<br />
Station; and the National Zoo in Woodley Park.<br />
National Gallery of Art<br />
The National Gallery is located on the National Mall near the Capitol, but is not a part<br />
of the Smithsonian Institution. It is instead wholly owned by the U.S. government;<br />
thus admission to the gallery is free. The gallery’s West Building features the nation’s<br />
collection of American and European art through the 19th century. The East Building,<br />
designed by architect I. M. Pei, features works of modern art. The Smithsonian American<br />
Art Museum and the National Portrait Gallery are often confused with the National<br />
Gallery of Art when they are in fact entirely separate institutions. The National Building<br />
Museum occupies the former Pension Building located near Judiciary Square, and was<br />
chartered by Congress as a private institution to host exhibits on architecture, urban<br />
planning, and design. There are many private art museums in the District of Columbia,<br />
which house major collections and exhibits open to the public, such as: the National Museum<br />
of Women in the Arts; the Corcoran Gallery of Art, the largest private museum in<br />
Washington; and the Phillips Collection in Dupont Circle, the first museum of modern<br />
art in the United States. Other private museums in Washington include the Newseum,<br />
the International Spy Museum, the National Geographic Society Museum, and the<br />
Marian Koshland Science Museum. The United States Holocaust Memorial Museum,<br />
located near the National Mall, maintains exhibits, documentation, and artifacts related<br />
to the Holocaust.<br />
Performing Arts and Music<br />
Washington, D.C. is a national center for the arts. The John F. Kennedy Center for the<br />
Performing Arts, which is located along the Potomac River, is home to the National<br />
Symphony Orchestra, the Washington National Opera, and the Washington Ballet. The<br />
Kennedy Center Honors are awarded each year to those in the performing arts who<br />
have contributed greatly to the cultural life of the United States. The President and First<br />
Lady typically attend the Honors ceremony, as the First Lady is the honorary chair of<br />
the Kennedy Center Board of Trustees. Washington also has a local independent theater<br />
tradition. Institutions such as Arena Stage, the Shakespeare Theatre Company, and the<br />
Studio Theatre feature classic works and new American plays.<br />
The U Street Corridor in Northwest D.C., known as “Washington’s Black Broadway”,<br />
is home to institutions like Bohemian Caverns and the Lincoln Theatre, which hosted<br />
music legends such as Washington native Duke Ellington, John Coltrane, and Miles<br />
Davis. Other jazz venues that feature modern blues include Madam’s Organ in Adams Morgan<br />
and Blues Alley in Georgetown. D.C. has its own native music genre called go-go,<br />
a post-funk, percussion-driven flavor of R&B that blends live sets with relentless dance<br />
rhythms. The most accomplished practitioner was D.C. band leader Chuck Brown, who<br />
brought go-go to the brink of national recognition with his 1979 LP Bustin’ Loose.<br />
Green Initiatives<br />
• 70 percent of land in Washington, DC is controlled by the National Park Service.<br />
There are 250,000 acres of parkland in the Greater Washington Metropolitan area.<br />
• In 2007, DC was named the most walkable city in the US in a study by the Brookings<br />
Institution.<br />
• In late 2006, the City Council passed an initiative making the nation’s capital the first<br />
major city to require developers to adhere to guidelines established by the U.S. Green<br />
Building Council.<br />
• The Washington Nationals Ballpark is striving to be the country’s first green-certified<br />
ballpark.<br />
• The Walter E. Washington Convention Center is a green meeting facility, with<br />
earth-friendly features like low emission glass that controls heat gain and loss and<br />
maximizes natural lighting; energy-conserving heating, ventilation and air conditioning<br />
systems that operate in zones; high-efficiency lighting; automatic controls on<br />
restroom fixtures; plus recycling programs and easy public transportation access.<br />
• DC’s hotels have implemented green initiatives, including wind power, renewable<br />
energy credits, recycling and adopt-a-park programs with neighborhood green spaces.<br />
International DC<br />
• 84,000 DC residents (15%) speak a language other than English at home.<br />
• 74,000 DC residents (12%) are foreign-born.<br />
• The Greater Washington region is home to 400 international associations, 700 internationally<br />
owned companies and more than 150 embassies and international cultural<br />
centers.<br />
Platinum Sponsors<br />
Gold Sponsors<br />
Silver Sponsors<br />
Bronze Sponsor<br />
Supported By