10.07.2015 Views

Business Intelligence and Analytics: From Big Data to Big Impact

Chen et al./Introduction: Business Intelligence Research

advances such as neural networks for classification/prediction and clustering and genetic algorithms for optimization and machine learning have all contributed to the success of data mining in different applications.

Two other data analytics approaches commonly taught in business school are also critical for BI&A. Grounded in statistical theories and models, multivariate statistical analysis covers analytical techniques such as regression, factor analysis, clustering, and discriminant analysis that have been used successfully in various business applications. Developed in the management science community, optimization techniques and heuristic search are also suitable for selected BI&A problems such as database feature selection and web crawling/spidering. Most of these techniques can be found in business school curricula.

Due to the success achieved collectively by the data mining and statistical analysis community, data analytics continues to be an active area of research. Statistical machine learning techniques, often based on well-grounded mathematical models and powerful algorithms, such as Bayesian networks, hidden Markov models, support vector machines, reinforcement learning, and ensemble models, have been applied to data, text, and web analytics applications.
Other new data analytics techniques explore and leverage unique data characteristics, from sequential/temporal mining and spatial mining, to data mining for high-speed data streams and sensor data. Increased privacy concerns in various e-commerce, e-government, and healthcare applications have caused privacy-preserving data mining to become an emerging area of research. Many of these methods are data-driven, relying on various anonymization techniques, while others are process-driven, defining how data can be accessed and used (Gelfand 2011/2012). Over the past decade, process mining has also emerged as a new research field that focuses on the analysis of processes using event data. Process mining has become possible due to the availability of event logs in various industries (e.g., healthcare, supply chains) and new process discovery and conformance checking techniques (van der Aalst 2012). Furthermore, network data and web content have helped generate exciting research in network analytics and web analytics, which are presented below.

In addition to active academic research on data analytics, industry research and development has also generated much excitement, especially with respect to big data analytics for semi-structured content.
Unlike the structured data that can be handled repeatedly through an RDBMS, semi-structured data may call for ad hoc and one-time extraction, parsing, processing, indexing, and analytics in a scalable and distributed MapReduce or Hadoop environment. MapReduce has been hailed as a revolutionary new platform for large-scale, massively parallel data access (Patterson 2008). Inspired in part by MapReduce, Hadoop provides a Java-based software framework for distributed processing of data-intensive transformation and analytics. The top three commercial database suppliers (Oracle, IBM, and Microsoft) have all adopted Hadoop, some within a cloud infrastructure. The open source Apache Hadoop has also gained significant traction for business analytics, including Chukwa for data collection, HBase for distributed data storage, Hive for data summarization and ad hoc querying, and Mahout for data mining (Henschen 2011). In their perspective paper, Stonebraker et al. (2010) compared MapReduce with the parallel DBMS. The commercial parallel DBMS showed clear advantages in efficient query processing and high-level query language and interface, whereas MapReduce excelled in ETL and analytics for “read only” semi-structured data sets.
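To make the MapReduce programming model mentioned above more concrete, the following is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical, chosen only to illustrate the map, shuffle, and reduce steps on semi-structured text.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (token, 1) pair for every token in one raw record."""
    return [(token.lower(), 1) for token in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across machines; here it is a local dict.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical semi-structured web log lines (assumed for illustration).
    logs = ["GET /home 200", "GET /cart 200", "POST /cart 500"]
    print(run_job(logs))
```

In a distributed setting, the map and reduce functions would run in parallel on partitions of the data, and the framework would handle the shuffle, scheduling, and fault tolerance; the sketch only shows the data flow.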
New Hadoop- and MapReduce-based systems have become another viable option for big data analytics in addition to the commercial systems developed for RDBMS, column-based DBMS, in-memory DBMS, and parallel DBMS (Chaudhuri et al. 2011).

Text Analytics

A significant portion of the unstructured content collected by an organization is in textual format, from e-mail communication and corporate documents to web pages and social media content. Text analytics has its academic roots in information retrieval and computational linguistics. In information retrieval, document representation and query processing are the foundations for developing the vector-space model, Boolean retrieval model, and probabilistic retrieval model, which, in turn, became the basis for the modern digital libraries, search engines, and enterprise search systems (Salton 1989). In computational linguistics, statistical natural language processing (NLP) techniques for lexical acquisition, word sense disambiguation, part-of-speech tagging (POST), and probabilistic context-free grammars have also become important for representing text (Manning and Schütze 1999). In addition to document and query representations, user models and relevance feedback are also important in enhancing search performance.

Since the early 1990s, search engines have evolved into mature commercial systems, consisting of fast, distributed crawling; efficient inverted indexing; inlink-based page ranking; and search logs analytics.
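As an illustration of the inverted indexing and vector-space model mentioned above, here is a small Python sketch. It is a toy under stated assumptions, not how any particular search engine is built: the whitespace tokenizer, the smoothed IDF formula, the function names, and the sample documents are all hypothetical choices made for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a postings dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def idf(index, term, n_docs):
    """Smoothed inverse document frequency (one common variant)."""
    return math.log(1 + n_docs / len(index[term]))


def doc_norms(index, n_docs):
    """Euclidean length of each document's TF-IDF vector."""
    squared = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, term, n_docs)
        for doc_id, tf in postings.items():
            squared[doc_id] += (tf * weight) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squared.items()}


def search(query, index, norms, n_docs):
    """Rank documents by a score proportional to cosine similarity.

    The query vector length is the same for every document, so it is
    omitted; this changes the score values but not the ranking.
    """
    scores = defaultdict(float)
    for term, q_tf in Counter(tokenize(query)).items():
        if term not in index:
            continue
        weight = idf(index, term, n_docs)
        for doc_id, tf in index[term].items():
            scores[doc_id] += (q_tf * weight) * (tf * weight)
    ranked = sorted(((s / norms[d], d) for d, s in scores.items()), reverse=True)
    return [doc_id for _, doc_id in ranked]


if __name__ == "__main__":
    # Hypothetical enterprise documents (assumed for illustration).
    docs = {
        "d1": "quarterly sales report",
        "d2": "sales forecast for retail",
        "d3": "employee travel policy",
    }
    index = build_inverted_index(docs)
    norms = doc_norms(index, len(docs))
    print(search("sales report", index, norms, len(docs)))
```

Only documents that share at least one term with the query are scored, which is the main efficiency benefit of walking the inverted index rather than comparing the query against every document.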
Many of these foundational text processing and indexing techniques have been deployed in text-based enterprise search and document management systems in BI&A 1.0.

MIS Quarterly Vol. 36 No. 4/December 2012 11
