Computer Science

Q & A with Stan Matwin

What is your line of research?

I work in Machine Learning and Data Mining. Machine Learning is a research area in which a computer is given examples of something (e.g. what is and what isn’t an oil spill in a satellite image of the sea) and, from these examples, it learns how to classify or predict new examples of that ‘something’ (e.g. to recognize oil spills in new, unseen images). This is an old idea, dating back to the 1950s, and it was part of the original Artificial Intelligence manifesto. Everybody agrees that learning is an inherent part of intelligence, but I like to see it more pragmatically. I am interested in the use of learning programs to learn practical things: to predict who in the emergency room will need hospitalization, to recognize oil spills, to categorize medical articles or to catch emerging trends in a political campaign or in public opinion.

Of particular interest for me is learning from text data: papers, blogs, tweets, notes, etc. I believe that such data calls for methods that take into account its linguistic character – we will have stronger methods if they understand the lexical, syntactic and semantic character of such data. This is the main topic of my Canada Research Chair here at Dal.

Data Mining, for me, is Machine Learning in the large. First, one is dealing with large data sets in millions of records and terabytes of volume. Second – in data mining – it is recognized that one spends most of their effort not in the “model building” phase, but instead in the data cleaning and data preparation phase (e.g.
doing “attribute engineering”). In order to do this, the data miner must learn the basics of the domain from which the data is coming: they will have to create in their head a fundamental “ontology” of that domain: what are the main entities and what are the relationships between those entities.

I am also interested in data privacy. I work on methods that make it hard, or practically impossible, to identify a given person in a dataset.

How did you get interested in that?

Well, it all started many years ago when I was involved in one of the early projects in Expert Systems (ES), a joint project with Cognos. At that time, we were trying to build an ES that would process (or assist in processing) government travel claims. I got to learn more than I ever wished about that “fascinating” topic! A question which arose was, how does one acquire rules which form the knowledge base of an expert system? Somebody suggested that I look at Machine Learning – indeed, one of its early goals was to replace the classical “Knowledge Acquisition” approach with learning the rules from examples. I went to spend a sabbatical with one of the
leading centres of Machine Learning at that time, George Mason University in Virginia, and I caught the bug. I liked the fact that Machine Learning was drawing on a variety of disciplines (AI, logic, databases and statistics) to build its tools. I also liked the fact that it was directly applicable almost everywhere. I am always interested in applications – they are an opportunity to learn about something completely new, from neuro-ophthalmology to forestry to electronic components (to name a few applications I was involved in). Applications also attract students and, last but not least, research funds. Done well, they often present a general research problem that can be shared with the community and initiate a new line of research. That happened with our work on oil spill detection with R.C. Holte and M. Kubat, which opened the active field of learning from imbalanced data.

My interest in data privacy is a little different. I am concerned about the fact that modern computers may become a tool that can be used to breach and violate people’s privacy more easily and on a much larger scale than was possible, say, 30 years ago. I believe that since the computer research community invented the tools that make it possible – databases, the internet, image and voice recognition, barcodes, etc. – it is then our moral obligation to at least think about tools that would make privacy easier and that would avoid many privacy-averse incidents.

What do you hope to achieve in the next five years?

I have several goals. First and foremost, I hope to create – together with colleagues from Dal – an active, dynamic centre of excellence in our joint field of research, which we call Big Data Analytics. We have recently created the Institute for Big Data Analytics to focus research on this area.
The Institute will attract talent, ideas and applications, and will make Dalhousie a globally visible centre for this type of research. We’re getting a very powerful computer, IBM Netezza, a unique machine not only here but on campuses generally, which will provide an excellent infrastructure for Big Data applications.

At the research level, I hope to make inroads into a linguistically informed but still scalable text model (“representation”). I want to complete several real-life, deployed applications of data and text mining techniques. I also want to continue with a start-up, Devera Logic, that I founded several years ago with colleagues in Ottawa in the area of computer security, and to bring it to a fruitful completion.

Who else is involved in this research?

Here at Dal there are several excellent researchers involved in this type of research. My closest collaborators in text analytics will be Dr. Vlado Keselj, Dr. Evangelos Milios and Dr. Mike Shepherd. I will also collaborate with other faculty members at Dal in the areas of visualization, HCI, databases and data structures and privacy: Dr. Raza Abidi, Dr. Dirk Arnold, Dr. Robert Beiko, Dr. Jamie Blustein, Dr. Stephen Brooks, Dr. Qigang Gao, Dr. Kirstie Hawkey, Dr. Andrew Rau-Chaplin, Dr. Derek Reilly, Dr. Thomas Trappenberg, Dr. Carolyn Watters, Dr. Norbert Zeh and Dr. Nur Zincir-Heywood.

In Canada, I cooperate actively with several researchers across the country: Dr. Nick Cercone (former FCS Dean at Dal, now at York), Dr. Fred Popowich at Simon Fraser, Dr. Diana Inkpen, Dr.
Nathalie Japkowicz at the University of Ottawa, Dr. Chris Drummond at NRC and Dr. Guy Lapalme at Universite de Montreal.

I also plan to continue and further develop my rich international collaboration, mainly with Brazil where I already have a very active, ongoing cooperation; with France and Spain through Dal’s partnership in the DMKM Erasmus Mundus program; and with my native Poland, where I hold a Professorship with the Academy of Sciences and have many contacts with several leading academic and research centres.

What attracts your interest outside your research area?

I am interested in current affairs and politics – I believe we have to be informed to influence decision makers on matters that concern us. I spend a lot of time reading (online) newspapers in at least three languages – English, French and Polish. I am also an avid reader of literature in these three languages.

Classical music is my major hobby – I have a large CD collection, I go to concerts wherever I can, also during my frequent travel. I like hiking and swimming, but I do not do enough of that.