12.07.2015 Views

10 challenging problems in data mining research - Computer Science

10 challenging problems in data mining research - Computer Science

10 challenging problems in data mining research - Computer Science

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

International Journal of Information Technology & Decision Mak<strong>in</strong>gVol. 5, No. 4 (2006) 597–604c○ World Scientific Publish<strong>in</strong>g Company<strong>10</strong> CHALLENGING PROBLEMS IN DATA MINING RESEARCHQIANG YANGDepartment of <strong>Computer</strong> <strong>Science</strong>Hong Kong University of <strong>Science</strong> and TechnologyClearwater Bay, Kowloon, Hong Kong, Ch<strong>in</strong>aXINDONG WUDepartment of <strong>Computer</strong> <strong>Science</strong>University of Vermont33 Colchester Avenue, Burl<strong>in</strong>gton, Vermont 05405, USAxwu@cs.uvm.eduCONTRIBUTORS: PEDRO DOMINGOS, CHARLES ELKAN, JOHANNES GEHRKE,JIAWEI HAN, DAVID HECKERMAN, DANIEL KEIM, JIMING LIU,DAVID MADIGAN, GREGORY PIATETSKY-SHAPIRO, VIJAY V. RAGHAVAN,RAJEEV RASTOGI, SALVATORE J. STOLFO,ALEXANDER TUZHILIN and BENJAMIN W. WAHIn October 2005, we took an <strong>in</strong>itiative to identify <strong>10</strong> <strong>challeng<strong>in</strong>g</strong> <strong>problems</strong> <strong>in</strong> <strong>data</strong> m<strong>in</strong><strong>in</strong>g<strong>research</strong>, by consult<strong>in</strong>g some of the most active <strong>research</strong>ers <strong>in</strong> <strong>data</strong> m<strong>in</strong><strong>in</strong>g and mach<strong>in</strong>elearn<strong>in</strong>g for their op<strong>in</strong>ions on what are considered important and worthy topics for future<strong>research</strong> <strong>in</strong> <strong>data</strong> m<strong>in</strong><strong>in</strong>g. We hope their <strong>in</strong>sights will <strong>in</strong>spire new <strong>research</strong> efforts, andgive young <strong>research</strong>ers (<strong>in</strong>clud<strong>in</strong>g PhD students) a high-level guidel<strong>in</strong>e as to where thehot <strong>problems</strong> are located <strong>in</strong> <strong>data</strong> m<strong>in</strong><strong>in</strong>g.Due to the limited amount of time, we were only able to send out our survey requeststo the organizers of the IEEE ICDM and ACM KDD conferences, and we received anoverwhelm<strong>in</strong>g response. We are very grateful for the contributions provided by these<strong>research</strong>ers despite their busy schedules. This short article serves to summarize the <strong>10</strong>most <strong>challeng<strong>in</strong>g</strong> <strong>problems</strong> of the 14 responses we have received from this survey. Theorder of the list<strong>in</strong>g does not reflect their level of importance.Keywords: Data m<strong>in</strong><strong>in</strong>g; mach<strong>in</strong>e learn<strong>in</strong>g; knowledge discovery.1. Develop<strong>in</strong>g a Unify<strong>in</strong>g Theory of Data M<strong>in</strong><strong>in</strong>gSeveral respondents feel that the current state of the art of <strong>data</strong> m<strong>in</strong><strong>in</strong>g <strong>research</strong>is too “ad-hoc.” Many techniques are designed for <strong>in</strong>dividual <strong>problems</strong>, such asclassification or cluster<strong>in</strong>g, but there is no unify<strong>in</strong>g theory. However, a theoreticalframework that unifies different <strong>data</strong> m<strong>in</strong><strong>in</strong>g tasks <strong>in</strong>clud<strong>in</strong>g cluster<strong>in</strong>g, classification,association rules, etc., as well as different <strong>data</strong> m<strong>in</strong><strong>in</strong>g approaches (such asstatistics, mach<strong>in</strong>e learn<strong>in</strong>g, <strong>data</strong>base systems, etc.), would help the field and providea basis for future <strong>research</strong>.597


<strong>10</strong> Challeng<strong>in</strong>g Problems <strong>in</strong> Data M<strong>in</strong><strong>in</strong>g Research 599A particularly <strong>challeng<strong>in</strong>g</strong> problem is the noise <strong>in</strong> time series <strong>data</strong>. It is an importantopen issue to tackle. Many time series used for predictions are contam<strong>in</strong>atedby noise, mak<strong>in</strong>g it difficult to do accurate short-term and long-term predictions.Examples of these applications <strong>in</strong>clude the predictions of f<strong>in</strong>ancial time series andseismic time series. Although signal process<strong>in</strong>g techniques, such as wavelet analysisand filter<strong>in</strong>g, can be applied to remove the noise, they often <strong>in</strong>troduce lags<strong>in</strong> the filtered <strong>data</strong>. Such lags reduce the accuracy of predictions because the predictormust overcome the lags before it can predict <strong>in</strong>to the future. Exist<strong>in</strong>g <strong>data</strong>m<strong>in</strong><strong>in</strong>g methods also have difficulty <strong>in</strong> handl<strong>in</strong>g noisy <strong>data</strong> and learn<strong>in</strong>g mean<strong>in</strong>gful<strong>in</strong>formation from the <strong>data</strong>.Some of the key issues that need to be addressed <strong>in</strong> the design of a practical<strong>data</strong> m<strong>in</strong>er for noisy time series <strong>in</strong>clude:• Information/search agents to get <strong>in</strong>formation: Use of wrong, too many, or toolittle search criteria; possibly <strong>in</strong>consistent <strong>in</strong>formation from many sources; semanticanalysis of (meta-) <strong>in</strong>formation; assimilation of <strong>in</strong>formation <strong>in</strong>to <strong>in</strong>puts topredictor agents.• Learner/m<strong>in</strong>er to modify <strong>in</strong>formation selection criteria: apportion<strong>in</strong>g of biases tofeedback; develop<strong>in</strong>g rules for Search Agents to collect <strong>in</strong>formation; develop<strong>in</strong>grules for Information Agents to assimilate <strong>in</strong>formation.• Predictor agents to predict trends: Incorporation of qualitative <strong>in</strong>formation; multiobjectiveoptimization not <strong>in</strong> closed form.4. M<strong>in</strong><strong>in</strong>g Complex Knowledge from Complex DataOne important type of complex knowledge is <strong>in</strong> the form of graphs. Recent <strong>research</strong>has touched on the topic of discover<strong>in</strong>g graphs and structured patterns from large<strong>data</strong>, but clearly, more needs to be done.Another form of complexity is from <strong>data</strong> that are non-i.i.d. (<strong>in</strong>dependent andidentically distributed). This problem can occur when m<strong>in</strong><strong>in</strong>g <strong>data</strong> from multiplerelations. In most doma<strong>in</strong>s, the objects of <strong>in</strong>terest are not <strong>in</strong>dependent of each other,and are not of a s<strong>in</strong>gle type. We need <strong>data</strong> m<strong>in</strong><strong>in</strong>g systems that can soundly m<strong>in</strong>ethe rich structure of relations among objects, such as <strong>in</strong>terl<strong>in</strong>ked Web pages, socialnetworks, metabolic networks <strong>in</strong> the cell, etc.Yet another important problem is how to m<strong>in</strong>e non-relational <strong>data</strong>. A greatmajority of most organizations’ <strong>data</strong> is <strong>in</strong> text form, not <strong>data</strong>bases, and <strong>in</strong> morecomplex <strong>data</strong> formats <strong>in</strong>clud<strong>in</strong>g Image, Multimedia, and Web <strong>data</strong>. Thus, there isa need to study <strong>data</strong> m<strong>in</strong><strong>in</strong>g methods that go beyond classification and cluster<strong>in</strong>g.Some <strong>in</strong>terest<strong>in</strong>g questions <strong>in</strong>clude how to perform better automatic summarizationof text and how to recognize the movement of objects and people from Web andWireless <strong>data</strong> logs <strong>in</strong> order to discover useful spatial and temporal knowledge.There is now a strong need for <strong>in</strong>tegrat<strong>in</strong>g <strong>data</strong> m<strong>in</strong><strong>in</strong>g and knowledge <strong>in</strong>ference.It is an important future topic. In particular, one important area is to <strong>in</strong>corporatebackground knowledge <strong>in</strong>to <strong>data</strong> m<strong>in</strong><strong>in</strong>g. The biggest gap between what <strong>data</strong> m<strong>in</strong><strong>in</strong>g


600 Q. Yang & X. Wusystems can do today and what we’d like them to do is that they’re unable to relatethe results of m<strong>in</strong><strong>in</strong>g to the real-world decisions they affect — all they can do ishand the results back to the user. Do<strong>in</strong>g these <strong>in</strong>ferences, and thus automat<strong>in</strong>g thewhole <strong>data</strong> m<strong>in</strong><strong>in</strong>g loop, requires represent<strong>in</strong>g and us<strong>in</strong>g world knowledge with<strong>in</strong> thesystem. One important application of the <strong>in</strong>tegration is to <strong>in</strong>ject doma<strong>in</strong> <strong>in</strong>formationand bus<strong>in</strong>ess knowledge <strong>in</strong>to the knowledge discovery process.Related to m<strong>in</strong><strong>in</strong>g complex knowledge, the topic of m<strong>in</strong><strong>in</strong>g <strong>in</strong>terest<strong>in</strong>g knowledgerema<strong>in</strong>s important. In the past, several <strong>research</strong>ers have tackled this problem fromdifferent angles, but we still do not have a very good understand<strong>in</strong>g of what makesdiscovered patterns “<strong>in</strong>terest<strong>in</strong>g” from the end-user perspective.5. Data M<strong>in</strong><strong>in</strong>g <strong>in</strong> a Network Sett<strong>in</strong>g5.1. Community and social networksToday’s world is <strong>in</strong>terconnected through many types of l<strong>in</strong>ks. These l<strong>in</strong>ks <strong>in</strong>cludeWeb pages, blogs, and emails. Many respondents consider community m<strong>in</strong><strong>in</strong>g andthe m<strong>in</strong><strong>in</strong>g of social networks as important topics. Community structures are importantproperties of social networks. The identification problem <strong>in</strong> itself is a <strong>challeng<strong>in</strong>g</strong>one. First, it’s critical to have the right characterization of the notion of“community” that is to be detected. Second, the entities/nodes <strong>in</strong>volved are distributed<strong>in</strong> real-life applications, and hence distributed means of identification willbe desired. Third, a snapshot-based <strong>data</strong>set may not be able to capture the realpicture; what is most important lies <strong>in</strong> the local relationships (e.g. the nature andfrequency of local <strong>in</strong>teractions) between the entities/nodes. Under these circumstances,our challenge is to understand (1) the network’s static structures (e.g.topologies and clusters) and (2) dynamic behavior (such as growth factors, robustness,and functional efficiency). A similar challenge exists <strong>in</strong> bio-<strong>in</strong>formatics, as weare currently mov<strong>in</strong>g our attention to the dynamic studies of regulatory networks.A questions related to this issue is what local algorithms/protocols are necessary<strong>in</strong> order to detect (or form) communities <strong>in</strong> a bottom-up fashion (as <strong>in</strong> the realworld).A concrete question is as follows. Email exchanges with<strong>in</strong> an organization or <strong>in</strong>one’s own mailbox over a long period of time can be m<strong>in</strong>ed to show how variousnetworks of common practice or friendship start to emerge. How can we obta<strong>in</strong> andm<strong>in</strong>e useful knowledge from them?5.2. M<strong>in</strong><strong>in</strong>g <strong>in</strong> and for computer networks — high-speed m<strong>in</strong><strong>in</strong>gof high-speed streamsNetwork m<strong>in</strong><strong>in</strong>g <strong>problems</strong> pose a key challenge. Network l<strong>in</strong>ks are <strong>in</strong>creas<strong>in</strong>g <strong>in</strong>speed, and service providers are now deploy<strong>in</strong>g 1 Gig Ethernet and <strong>10</strong> Gig Ethernetl<strong>in</strong>k speeds. To be able to detect anomalies (e.g. sudden traffic spikes due to a DoS(Denial of Service) attack or catastrophic event), service providers will need to be


<strong>10</strong> Challeng<strong>in</strong>g Problems <strong>in</strong> Data M<strong>in</strong><strong>in</strong>g Research 601able to capture IP packets at high l<strong>in</strong>k speeds and also analyze massive amounts(several hundred GB) of <strong>data</strong> each day. One will need highly scalable solutions here.Good algorithms are, therefore, needed to detect whether DoS attacks do notexist. Also, once an attack has been detected, how does one discrim<strong>in</strong>ate betweenlegitimate traffic and attack traffic so that it is possible to drop attack packets? Weneed techniques to(1) detect DoS attacks,(2) trace back to f<strong>in</strong>d out who the attackers are, and(3) drop those packets that belong to attack traffic.6. Distributed Data M<strong>in</strong><strong>in</strong>g and M<strong>in</strong><strong>in</strong>g Multi-Agent DataThe problem of distributed <strong>data</strong> m<strong>in</strong><strong>in</strong>g is very important <strong>in</strong> network <strong>problems</strong>. Ina distributed environment (such as a sensor or IP network), one has distributedprobes placed at strategic locations with<strong>in</strong> the network. The problem here is tobe able to correlate the <strong>data</strong> seen at the various probes, and discover patterns <strong>in</strong>the global <strong>data</strong> seen at all the different probes. There could be different modelsof distributed <strong>data</strong> m<strong>in</strong><strong>in</strong>g here, but one could <strong>in</strong>volve a NOC that collects <strong>data</strong>from the distributed sites, and another <strong>in</strong> which all sites are treated equally. Thegoal here obviously would be to m<strong>in</strong>imize the amount of <strong>data</strong> shipped between thevarious sites — essentially, to reduce the communication overhead.In distributed m<strong>in</strong><strong>in</strong>g, one problem is how to m<strong>in</strong>e across multiple heterogeneous<strong>data</strong> sources: multi-<strong>data</strong>base and multi-relational m<strong>in</strong><strong>in</strong>g.Another important new area is adversary <strong>data</strong> m<strong>in</strong><strong>in</strong>g. Inagrow<strong>in</strong>gnumberofdoma<strong>in</strong>s — email spam, counter-terrorism, <strong>in</strong>trusion detection/computer security,click spam, search eng<strong>in</strong>e spam, surveillance, fraud detection, shopbots, file shar<strong>in</strong>g,etc. — <strong>data</strong> m<strong>in</strong><strong>in</strong>g systems face adversaries that deliberately manipulate the <strong>data</strong>to sabotage them (e.g. make them produce false negatives). We need to developsystems that explicitly take this <strong>in</strong>to account, by comb<strong>in</strong><strong>in</strong>g <strong>data</strong> m<strong>in</strong><strong>in</strong>g with gametheory.7. Data M<strong>in</strong><strong>in</strong>g for Biological and Environmental ProblemsMany <strong>research</strong>ers that we surveyed believe that m<strong>in</strong><strong>in</strong>g biological <strong>data</strong> cont<strong>in</strong>uesto be an extremely important problem, both for <strong>data</strong> m<strong>in</strong><strong>in</strong>g <strong>research</strong> and forbiomedical sciences. An example of a <strong>research</strong> issue is how to apply <strong>data</strong> m<strong>in</strong><strong>in</strong>g toHIV vacc<strong>in</strong>e design. In molecular biology, many complex <strong>data</strong> m<strong>in</strong><strong>in</strong>g tasks exist,which cannot be handled by standard <strong>data</strong> m<strong>in</strong><strong>in</strong>g algorithms. These <strong>problems</strong><strong>in</strong>volve many different aspects, such as DNA, chemical properties, 3D structures,and functional properties.There is also a need to go beyond bio-<strong>data</strong> m<strong>in</strong><strong>in</strong>g. Data m<strong>in</strong><strong>in</strong>g <strong>research</strong>ersshould consider ecological and environmental <strong>in</strong>formatics. One of the biggestconcerns today, which is go<strong>in</strong>g to require significant <strong>data</strong> m<strong>in</strong><strong>in</strong>g efforts, is the


602 Q. Yang & X. Wuquestion of how we can best understand and hence utilize our natural environmentand resources — s<strong>in</strong>ce the world today is highly “resource-driven”! Datam<strong>in</strong><strong>in</strong>g will be able to make a high impact <strong>in</strong> the area of <strong>in</strong>tegrated <strong>data</strong> fusionand m<strong>in</strong><strong>in</strong>g <strong>in</strong> ecological/environmental applications, especially when <strong>in</strong>volv<strong>in</strong>gdistributed/decentralized <strong>data</strong> sources, e.g. autonomous mobile sensor networksfor monitor<strong>in</strong>g climate and/or vegetation changes.For example, how can <strong>data</strong> m<strong>in</strong><strong>in</strong>g technologies be used to study and f<strong>in</strong>d outcontribut<strong>in</strong>g factors <strong>in</strong> the observed doubl<strong>in</strong>g of the number of hurricane occurrencesover the past decades, as recently reported <strong>in</strong> <strong>Science</strong> magaz<strong>in</strong>e? Most of the <strong>data</strong>sources that we are deal<strong>in</strong>g with today are fast evolv<strong>in</strong>g, e.g. those from stockmarkets or city traffic. There is much <strong>in</strong>terest<strong>in</strong>g knowledge yet to be discovered, asfar as the dynamic change regularities and/or their cross-<strong>in</strong>teractions are concerned.In this regard, one of the challenges today is how to deal with the problem ofdynamic temporal behavioral pattern identification and prediction <strong>in</strong>: (1) very largescalesystems (e.g. global climate changes and potential “bird flu” epidemics) and(2) human-centered systems (e.g. user-adapted human-computer <strong>in</strong>teraction or P2Ptransactions).Related to these questions about important applications, there is a need to focuson “killer applications” of <strong>data</strong> m<strong>in</strong><strong>in</strong>g. So far three important and <strong>challeng<strong>in</strong>g</strong>applications for <strong>data</strong> m<strong>in</strong><strong>in</strong>g have emerged: bio<strong>in</strong>formatics, CRM/personalizationand security applications. However, more explorations are needed to expand theseapplications and extend the list of applications.8. Data M<strong>in</strong><strong>in</strong>g Process-Related ProblemsImportant topics exist <strong>in</strong> improv<strong>in</strong>g <strong>data</strong>-m<strong>in</strong><strong>in</strong>g tools and processes throughautomation, as suggested by several <strong>research</strong>ers. Specific issues <strong>in</strong>clude how to automatethe composition of <strong>data</strong> m<strong>in</strong><strong>in</strong>g operations and build<strong>in</strong>g a methodology <strong>in</strong>to<strong>data</strong> m<strong>in</strong><strong>in</strong>g systems to help users avoid many <strong>data</strong> m<strong>in</strong><strong>in</strong>g mistakes. If we automatethe different <strong>data</strong> m<strong>in</strong><strong>in</strong>g process operations, it would be possible to reduce humanlabor as much as possible. One important issue is how to automate <strong>data</strong> clean<strong>in</strong>g.We can build models and f<strong>in</strong>d patterns very fast today, but 90 percent of the costis <strong>in</strong> pre-process<strong>in</strong>g (<strong>data</strong> <strong>in</strong>tegration, <strong>data</strong> clean<strong>in</strong>g, etc.) Reduc<strong>in</strong>g this cost willhave a much greater payoff than further reduc<strong>in</strong>g the cost of model-build<strong>in</strong>g andpattern-f<strong>in</strong>d<strong>in</strong>g. Another issue is how to perform systematic documentation of <strong>data</strong>clean<strong>in</strong>g. Another issue is how to comb<strong>in</strong>e visual <strong>in</strong>teractive and automatic <strong>data</strong>m<strong>in</strong><strong>in</strong>g techniques together. He observes that <strong>in</strong> many applications, <strong>data</strong> m<strong>in</strong><strong>in</strong>ggoals and tasks cannot be fully specified, especially <strong>in</strong> exploratory <strong>data</strong> analysis.Visualization helps to learn more about the <strong>data</strong> and def<strong>in</strong>e/ref<strong>in</strong>e the <strong>data</strong>m<strong>in</strong><strong>in</strong>g tasks.There is also a need for the development of a theory beh<strong>in</strong>d <strong>in</strong>teractive explorationof large/complex <strong>data</strong>sets. An important question to ask is: what are thecompositional approaches for multi-step m<strong>in</strong><strong>in</strong>g “queries”? What is the canonical


<strong>10</strong> Challeng<strong>in</strong>g Problems <strong>in</strong> Data M<strong>in</strong><strong>in</strong>g Research 603set of <strong>data</strong> m<strong>in</strong><strong>in</strong>g operators for the <strong>in</strong>teractive exploration approach? For example,the <strong>data</strong> m<strong>in</strong><strong>in</strong>g system Clement<strong>in</strong>e has a nice user <strong>in</strong>terface, but what is the theorybeh<strong>in</strong>d its operations?9. Security, Privacy, and Data IntegritySeveral <strong>research</strong>ers considered privacy protection <strong>in</strong> <strong>data</strong> m<strong>in</strong><strong>in</strong>g as an importanttopic. That is, how to ensure the users’ privacy while their <strong>data</strong> are be<strong>in</strong>g m<strong>in</strong>ed.Related to this topic is <strong>data</strong> m<strong>in</strong><strong>in</strong>g for protection of security and privacy. Onerespondent states that if we do not solve the privacy issue, <strong>data</strong> m<strong>in</strong><strong>in</strong>g will becomea derogatory term to the general public.Some respondents consider the problem of knowledge <strong>in</strong>tegrity assessment to beimportant. We quote their observations: “Data m<strong>in</strong><strong>in</strong>g algorithms are frequentlyapplied to <strong>data</strong> that have been <strong>in</strong>tentionally modified from their orig<strong>in</strong>al version,<strong>in</strong> order to mis<strong>in</strong>form the recipients of the <strong>data</strong> or to counter privacy and securitythreats. Such modifications can distort, to an unknown extent, the knowledgeconta<strong>in</strong>ed <strong>in</strong> the orig<strong>in</strong>al <strong>data</strong>. As a result, one of the challenges fac<strong>in</strong>g <strong>research</strong>ersis the development of measures not only to evaluate the knowledge <strong>in</strong>tegrity ofa collection of <strong>data</strong>, but also of measures to evaluate the knowledge <strong>in</strong>tegrity of<strong>in</strong>dividual patterns. Additionally, the problem of knowledge <strong>in</strong>tegrity assessmentpresents several challenges.”Related to the knowledge <strong>in</strong>tegrity assessment issue, the two most significantchallenges are: (1) develop efficient algorithms for compar<strong>in</strong>g the knowledge contentsof the two (before and after) versions of the <strong>data</strong>, and (2) develop algorithmsfor estimat<strong>in</strong>g the impact that certa<strong>in</strong> modifications of the <strong>data</strong> have on the statisticalsignificance of <strong>in</strong>dividual patterns obta<strong>in</strong>able by broad classes of <strong>data</strong> m<strong>in</strong><strong>in</strong>galgorithms. The first challenge requires the development of efficient algorithms and<strong>data</strong> structures to evaluate the knowledge <strong>in</strong>tegrity of a collection of <strong>data</strong>. Thesecond challenge is to develop algorithms to measure the impact that the modificationof <strong>data</strong> values has on a discovered pattern’s statistical significance, althoughit might be <strong>in</strong>feasible to develop a global measure for all <strong>data</strong> m<strong>in</strong><strong>in</strong>g algorithms.<strong>10</strong>. Deal<strong>in</strong>g with Non-Static, Unbalanced and Cost-Sensitive DataAn important issue is that the learned models should <strong>in</strong>corporate time because<strong>data</strong> is not static and is constantly chang<strong>in</strong>g <strong>in</strong> many doma<strong>in</strong>s. Historical actions<strong>in</strong> sampl<strong>in</strong>g and model build<strong>in</strong>g are not optimal, but they are not chosen randomlyeither. This gives the follow<strong>in</strong>g <strong>challeng<strong>in</strong>g</strong> phenomenon for the <strong>data</strong> collectionprocess. Suppose that we use the <strong>data</strong> collected <strong>in</strong> 2000 to learn a model. We thenapply this model to select <strong>in</strong>side the 2001 population. Subsequently, we use the <strong>data</strong>about the <strong>in</strong>dividuals selected <strong>in</strong> 2001 to learn a new model, and then apply thismodel <strong>in</strong> 2002. If this process cont<strong>in</strong>ues, then each time a new model is learned, itstra<strong>in</strong><strong>in</strong>g set has been created us<strong>in</strong>g a different selection bias. Thus, a <strong>challeng<strong>in</strong>g</strong>problem is how to correct the bias as much as possible.


604 Q. Yang & X. WuAnother related issue is how to deal with unbalanced and cost-sensitive <strong>data</strong>, amajor challenge <strong>in</strong> <strong>research</strong>. Charles Elkan made the observation <strong>in</strong> an <strong>in</strong>vited talkat ICML 2003 Workshop on Learn<strong>in</strong>g from Imbalanced Data Sets. First,<strong>in</strong>previousstudies, it has been observed that UCI <strong>data</strong>sets are small and not highly unbalanced.In a typical real-world <strong>data</strong>set, there are at least <strong>10</strong> 5 examples and <strong>10</strong> 2.5 features,without s<strong>in</strong>gle well-def<strong>in</strong>ed target class. Interest<strong>in</strong>g cases have a frequency of lessthan 0.01. There is much <strong>in</strong>formation on costs and benefits, but no overall model ofprofit and loss. There are different cost matrices for different examples. However,most cost matrix entries are unknown. An example of this <strong>data</strong>set is the directmarket<strong>in</strong>g DMEF <strong>data</strong> library. Furthermore, the costs of different outcomes aredependent on the examples; for example, the false negative cost of direct market<strong>in</strong>gis directly proportional to the amount of a potential donation. Traditional methodsfor obta<strong>in</strong><strong>in</strong>g these costs relied on sampl<strong>in</strong>g methods. However, sampl<strong>in</strong>g methodscan easily give biased results.11. ConclusionsS<strong>in</strong>ce its conception <strong>in</strong> the late 1980s, <strong>data</strong> m<strong>in</strong><strong>in</strong>g has achieved tremendous success.Many new <strong>problems</strong> have emerged and have been solved by <strong>data</strong> m<strong>in</strong><strong>in</strong>g <strong>research</strong>ers.However, there is still a lack of timely exchange of important topics <strong>in</strong> the communityas a whole. This article summarizes a survey that we have conducted to rank<strong>10</strong> most important <strong>problems</strong> <strong>in</strong> <strong>data</strong> m<strong>in</strong><strong>in</strong>g <strong>research</strong>. These <strong>problems</strong> are sampledfrom a small, albeit important, segment of the community. The list should obviouslybe a function of time for this dynamic field.F<strong>in</strong>ally, we summarize the <strong>10</strong> <strong>problems</strong> below:• Develop<strong>in</strong>g a unify<strong>in</strong>g theory of <strong>data</strong> m<strong>in</strong><strong>in</strong>g• Scal<strong>in</strong>g up for high dimensional <strong>data</strong> and high speed <strong>data</strong> streams• M<strong>in</strong><strong>in</strong>g sequence <strong>data</strong> and time series <strong>data</strong>• M<strong>in</strong><strong>in</strong>g complex knowledge from complex <strong>data</strong>• Data m<strong>in</strong><strong>in</strong>g <strong>in</strong> a network sett<strong>in</strong>g• Distributed <strong>data</strong> m<strong>in</strong><strong>in</strong>g and m<strong>in</strong><strong>in</strong>g multi-agent <strong>data</strong>• Data m<strong>in</strong><strong>in</strong>g for biological and environmental <strong>problems</strong>• Data M<strong>in</strong><strong>in</strong>g process-related <strong>problems</strong>• Security, privacy and <strong>data</strong> <strong>in</strong>tegrity• Deal<strong>in</strong>g with non-static, unbalanced and cost-sensitive <strong>data</strong>AcknowledgmentsWe thank all who have responded to our survey requests despite their busyschedules. We wish to thank Pedro Dom<strong>in</strong>gos, Charles Elkan, Johannes Gehrke,Jiawei Han, David Heckerman, Daniel Keim, Jim<strong>in</strong>g Liu, David Madigan, GregoryPiatetsky-Shapiro, Vijay V. Raghavan, and his associates, Rajeev Rastogi, SalvatoreJ. Stolfo, Alexander Tuzhil<strong>in</strong>, and Benjam<strong>in</strong> W. Wah for their k<strong>in</strong>d <strong>in</strong>put.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!