FOCUS: Search

Why is it that searching an intranet is so much harder than searching the Web?


ENTERPRISE SEARCH

RAJAT MUKHERJEE AND JIANCHANG MAO, VERITY

The last decade has witnessed the growth of information retrieval from a boutique discipline in information and library science to an everyday experience for billions of people around the world. This revolution has been driven in large measure by the Internet, with vendors focused on search and navigation of Web resources and Web content management. Simultaneously, enterprises have invested in networking all of their information together—to the point where it is increasingly possible for employees to have a single window into the enterprise. Although these employees seek Web-like experiences in the enterprise, the Internet and enterprise domains differ fundamentally in the nature of the content, user behavior, and economic motivations.


Our principal focus here is on outlining the demands on information retrieval in enterprises and the various technologies that are employed in an enterprise content infrastructure. We define an enterprise to mean any collaborative effort involving proprietary information, whether commercial, academic, governmental, or nonprofit. The term search is usually used to mean keyword search. In this article, we use a broader definition that encompasses advanced search capabilities, navigation, and information discovery.

The overwhelming majority of information in an enterprise is unstructured—that is, it is not resident in relational databases that tabulate the data and transactions occurring throughout the enterprise. This unstructured information exists in the form of HTML pages, documents in proprietary formats, and forms (e.g., paper and media objects). Together with information in relational and proprietary databases, these documents constitute the enterprise information ecosystem.

[Figure 1: Web search experience (Google)]

Arguably, it is the structured information in an enterprise that is the most valuable; enterprises thus seek to enhance the value of their unstructured information by adding structure to it. Creating, aggregating, capturing, managing, retrieving, and delivering this information are core elements in an enterprise content infrastructure. Enterprise information delivery must clearly meet the performance that users have come to expect on the Internet. While some techniques for scaling and performance developed on the Web can be adapted to the enterprise, many techniques for searching, organizing, and mining information on the Web are less applicable to the enterprise.

ENTERPRISE VERSUS INTERNET SEARCH

Enterprise search differs from Internet search in many ways. 1,2,3 First, the notion of a “good” answer to a query is quite different. On the Internet, it is vaguely defined. Because a large number of documents are typically relevant to a query, a user is often looking for the “best” or most relevant document. On an intranet, the notion of a “good” answer is often defined as the “right” answer. Users might know or have previously seen the specific document(s) that they are looking for. A large fraction of queries tend to have a small set of correct answers (often unique, as in “I forgot my Unix password”), and the answers may not have special characteristics. The correct answer is not necessarily the most “popular” document, which largely determines the “best” answer on the Internet. Finding the right answer is often more difficult than finding the best answer.

Second, the social forces behind the creation of Internet and intranet content are quite different. 1 The Internet reflects the collective voice of many authors who are free to publish content, whereas an intranet generally reflects the view of the entity that it serves. Intranet content is created for disseminating information, rather than for attracting and holding the attention of any specific group of users. There is little incentive for content creation, and not all users may have permission to publish content.

Content from heterogeneous repositories—for example, e-mail systems and content management systems—typically does not cross-reference itself via hyperlinks. Therefore, the link structure on an intranet is very different from that on the Internet. For example, on the Internet the strongly connected component (pages that can reach each other by following links) accounts for roughly 30 percent of crawled pages. On corporate intranets, this number is much smaller (10 percent on IBM’s intranet, for example 1 ). The popular PageRank 4 and HITS 5 algorithms are thus not as effective on an intranet as on the Internet. 6 Other techniques have to be employed to improve search relevance on an intranet.
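For a given intranet, this can be tested directly: crawl the link graph and measure how much of it falls inside the largest strongly connected component. If that fraction is small, link-based ranking will contribute little. A minimal sketch (hypothetical code; the networkx library and the toy graph are illustrative choices, not tooling named by the article):

```python
# Hypothetical sketch: estimate how useful link-based ranking will be
# by measuring the largest strongly connected component (SCC) of a
# crawled link graph. Assumes an edge list harvested by the spider.
import networkx as nx

def largest_scc_fraction(edges):
    """Fraction of pages belonging to the largest SCC of the link graph."""
    g = nx.DiGraph(edges)
    largest = max(nx.strongly_connected_components(g), key=len)
    return len(largest) / g.number_of_nodes()

# Toy link graph: a, b, c link in a cycle; d is a dead end.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
print(largest_scc_fraction(edges))  # 0.75
```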


Enterprise content and processes have different characteristics that make information retrieval within the enterprise substantially different from Web search. This in itself has caused enterprise search to evolve differently from Internet search. (See the sidebar, “Enterprise Characteristics.”) The different needs result in dramatically different experiences on the Internet (figure 1) and, say, a corporate intranet (figure 2).

Deployment environments for these domains also differ: an Internet search engine, including hardware and software, is fully controlled and managed by one organization as a service. Enterprise search software is licensed to and deployed by a variety of organizations in diverse environments. This imposes varying requirements: hardware constraints, software platforms, bandwidth, firewalls, heterogeneous content repositories, security models, document formats, user communities, interfaces, and geographic distribution. Enterprise search software needs—high flexibility/configurability and ease of use with ease of deployment—are often at odds with each other.

Though a search service can incorporate new technologies in quick cycles, this is often not the case for enterprise deployments. Economic and time constraints in enterprises sometimes prevent quick upgrade cycles. Often, enterprises use old software versions, although they are fully cognizant that they are not employing certain technologies that they could. This is more pronounced in cases where search software is embedded in third-party enterprise applications with extended release cycles. This sometimes leads to end users being dissatisfied with the quality of search provided within the enterprise. A search service, however, cannot effectively be bundled into an enterprise offering, since enterprises demand flexibility, security, and custom application access.

[Figure 2: Sample user experience on an intranet]

TECHNOLOGIES

Specific techniques can improve enterprise search. Figure 3 depicts key ingredients of an enterprise search system.

[Figure 3: Key ingredients of an enterprise search system. Labels include: users; federated search, browse, alert, and recommend services; security; indexes and engines; analytics, classify, profile, index, cluster; secure gateways; repositories (WWW; PDF, Word, PPT, Excel, etc.)]

Spidering and Indexing. Data must be accumulated (spidered) and indexed before it can be searched. This requires knowledge of where the critical information exists, and access to these repositories, which can be secure. Many current spiders run on predefined schedules that do not match the rates at which information is changing. Adaptive refresh of indexes is required, which involves more sophisticated change-detection mechanisms. Most spiders use a pull model, which is harsh on network resources and on the target repositories. Future spiders will take greater advantage of triggers and targeted crawling.
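To make the adaptive-refresh idea concrete, here is a minimal sketch (hypothetical; the smoothing and interval policy are illustrative assumptions, not a described product feature): each source's observed change rate stretches or shrinks its revisit interval.

```python
# Hypothetical sketch of adaptive index refresh: revisit sources at a
# rate proportional to how often they have actually changed, instead
# of on a fixed schedule. Policy constants are illustrative.
from dataclasses import dataclass

@dataclass
class SourceStats:
    checks: int = 0    # times the spider fetched this source
    changes: int = 0   # times the content differed from the last fetch

def next_revisit_hours(stats: SourceStats,
                       base: float = 24.0,
                       floor: float = 1.0,
                       ceil: float = 7 * 24.0) -> float:
    """Shorten the revisit interval for volatile sources, lengthen it
    for static ones. The change rate is smoothed to soften cold starts."""
    rate = (stats.changes + 1) / (stats.checks + 2)  # Laplace smoothing
    interval = base * (1.0 - rate) / rate            # high rate -> short interval
    return max(floor, min(ceil, interval))

volatile = SourceStats(checks=10, changes=9)
static = SourceStats(checks=10, changes=0)
print(next_revisit_hours(volatile))  # 4.8 hours
print(next_revisit_hours(static))    # capped at 168.0 hours
```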


One issue with trigger-based and targeted crawling is that most applications do not expose information about what has changed; for example, they are not designed for exploiting external search technology, since they often build search into the application. Adoption of search standards by application vendors can help solve this problem.

Documents in multiple languages can reside in the same index, and techniques for automatic language detection can be used for language-based content routing and partitioning. Indexes are already taking into account information on hyperlinks, anchor text, and so forth. Metadata will automatically be extracted during indexing to improve retrieval quality. Application and content-management vendors will do well to flag content that has been modified, to eliminate unnecessary reindexing.

Data Filtering. One of the keys to quality is to weed out information that is dated, irrelevant, or duplicated. Clean data implies better search relevance. Moreover, automatic classification, feature extraction, and clustering technologies will be more accurate when the data presented to them is cleaned up by a pre-processor. Techniques such as link-density analysis can be used to detect the differences between content and link-rich menus on Web pages. Entity-extraction techniques can be used to add relevant information before indexing occurs. Stripping out advertisements, menus, and so forth improves the quality of the subsequent ranking algorithms that operate on the content. This is important for enterprises that are indexing external content.

Search Relevance. Certain Internet search strategies, such as hyperlink analysis, cannot be carried over directly to the enterprise. Some strategies are actually being abused on the Internet; for example, Internet search engines are constantly tuned to offset the effects of spam and the doctoring of Web pages to take advantage of search algorithms. Other characteristics of enterprise content must be exploited to achieve higher search relevance. For example, since intranets are essentially spam-free (because of the lack of incentives for spamming), anchor text and title words are reliable sources of information for ranking documents. The rank-aggregation approach proposed by Fagin et al. provides an effective way of combining ranks derived from separate sources of information 1 (see the sketch below).

Many intranet queries (60 to 80 percent) are targeted to retrieve “stuff I’ve seen.” A user may remember some attributes of the target results, such as date or author. Search engines must provide a way of specifying attributes in conjunction with the query.
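As a concrete illustration of rank aggregation, here is a minimal sketch (hypothetical code; median-rank aggregation is one of the methods studied by Fagin et al., but the article does not prescribe an implementation): each evidence source ranks the candidate documents, and documents are re-ordered by their median rank across sources.

```python
# Hypothetical sketch of median-rank aggregation: combine rankings from
# independent evidence sources (content, anchor text, title) into one
# ordering.
from statistics import median

def aggregate(rankings):
    """rankings: list of lists of doc ids, each ordered best-first.
    Returns doc ids ordered by median rank across all sources."""
    ranks = {}
    for ranking in rankings:
        for position, doc in enumerate(ranking):
            ranks.setdefault(doc, []).append(position)
    # A doc missing from a source is treated as ranked last there.
    worst = max(len(r) for r in rankings)
    for positions in ranks.values():
        positions += [worst] * (len(rankings) - len(positions))
    return sorted(ranks, key=lambda doc: median(ranks[doc]))

content_rank = ["d3", "d1", "d2"]
anchor_rank = ["d1", "d2", "d3"]
title_rank = ["d1", "d3", "d2"]
print(aggregate([content_rank, anchor_rank, title_rank]))
# ['d1', 'd3', 'd2']: d1 has median rank 0 across the three sources
```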


Sorting on an attribute also helps users locate information quickly.

User role and context can improve the relevance of search. Session-based personalization techniques are already available in enterprise search software from leading vendors. Whereas a public Web site gathering user information can raise privacy concerns, users in enterprises have far fewer concerns if their access—for example, clickstreams on the intranet—is tracked, because it is likely business related. In some enterprises—for example, in the finance industry—even IM (instant messaging) is fair game for regulatory reasons.

The perceived relevance of a result can be dramatically changed by providing better titles (using techniques to create titles automatically if none exists), dynamic summaries, category information, and so forth. Usability is tightly coupled to relevance.

ENTERPRISE CHARACTERISTICS

Diversity of Content Sources and Formats. Enterprises must ingest and extract structured and unstructured information from heterogeneous content sources—for example, Microsoft Exchange, Lotus Notes, and Documentum, as well as file systems and intranets. Furthermore, documents exist in a myriad of file formats and several languages; a single document could contain multiple languages, or attachments with multiple MIME types. At this time, less than 10 percent of enterprise content by volume is HTML.

Secure Access. An individual’s role in an enterprise dictates what documents can be accessed. Sophisticated enterprises demand a more stringent notion of security in which search result lists are filtered to display only the documents accessible to the user (see the sketch after this sidebar). Doing this in conjunction with the native security of the repositories is a particularly difficult challenge.

Combined Structured and Unstructured Search. Information that is considered to be unstructured is in fact semi-structured, with metadata such as author, title, date, size, and so forth. Conversely, much structured information in an RDBMS (relational database management system) is unstructured—for example, blobs of text and VARCHAR fields. XML is ubiquitous in content and applications. It is essential to provide high-performance parametric search that allows the user to navigate information through a flexible combination of structured and unstructured data.

Flexible Scoring and Ranking Mechanisms. No single scoring and ranking function will work for all enterprise search contexts. Many of the powerful link-based scoring and ranking algorithms that have been honed for the Web are unlikely to be germane to the enterprise. Enterprise content is fundamentally different from Web content, enterprise users are different in their goals and expectations from Web users, and enterprise search imposes layers of complexity that the Web lacks.

Federated and Peered Results. Federated search enables a single point of access to multiple sources (internal indices, Web search, and subscription sources—for example, realtime newsfeeds). The key challenge here is to merge sets of results from all sources for unified presentation. This must be done even though the sets typically have no documents in common and employ different scoring and ranking schemes.

Content Generation Processes. While the Internet tends to grow democratically, intranets are often governed by bureaucracies. Content creation on an intranet is normally centralized to a small number of people. While content published on an intranet may need to comply with specific policies (reviews, approvals), consistency is not guaranteed, since there may be multiple organizational units whose policies differ.

People/Roles/Behaviors. It is well known that some of the most valuable knowledge in an enterprise resides in the minds of its employees. Enterprises must combine digital information with the knowledge and experience of employees. An important distinction between the enterprise and the Internet is that while Internet users are anonymous for the most part, enterprise users are answerable and guided by specific, controllable processes. Privacy issues are also very different in an enterprise, since people are usually engaged in enterprise-specific behavior and are being compensated for their engagement.
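Returning to the Secure Access point above, here is a minimal sketch of result-list security trimming (hypothetical; the group-based ACL model is illustrative, and real repositories each impose their own): hits are filtered against the user's groups before presentation.

```python
# Hypothetical sketch of security trimming: drop hits the user may not
# see before the result list is presented. The ACL model is illustrative;
# real repositories (file shares, Exchange, Documentum) differ.
def trim_results(hits, user_groups):
    """hits: iterable of (doc_id, score, allowed_groups) tuples.
    Keeps only documents whose ACL intersects the user's groups."""
    return [(doc, score) for doc, score, acl in hits
            if acl & user_groups]

hits = [("budget.xls", 0.91, {"finance"}),
        ("handbook.pdf", 0.88, {"all-employees"}),
        ("offer-letter.doc", 0.75, {"hr"})]
print(trim_results(hits, {"all-employees", "finance"}))
# [('budget.xls', 0.91), ('handbook.pdf', 0.88)]
```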


Structured versus Unstructured Information. In product catalogs, each item has unstructured text as well as structured attributes. For example, an automobile typically has a description and attributes such as year, model, and price. A typical query is a conjunction of an arbitrary text query (“Leather Trim” AND “All-wheel Drive”) and a parametric query on structured fields (Manufacturer = Toyota AND price < $30,000 AND year > 2000). Whereas the text query is within the purview of classic information retrieval systems, the parametric query is traditionally handled using relational database systems. Modern search tools perform both functions for applications such as e-commerce and marketplaces where scalability and cost-performance are critical.

Using an RDBMS to solve the problem would result in unacceptably poor query responses. Text search extensions in RDBMSs do not support powerful free-text query capabilities—for example, fuzzy search—and are not cost effective for search. 7 In addition to being able to sort along attribute values, it is crucial to be able to rank the results based on the query. This enables efficient guided navigation of results, allowing the user to progressively refine (or relax) the query.

Parametric refinement 8 provides a solution to this problem by augmenting a full-text index with an auxiliary parametric index that allows for fast search, navigation, and ranking of query results (see the sketch below). The main issue with this powerful technology is that data preparation is very important, and organizations need to invest time in augmenting and normalizing the data. Classification and entity-extraction techniques will be used to augment information with attributes for improved search and navigation.

Another key characteristic of data is structure within the data itself. With the increased adoption of XML, the ability to search and retrieve specific portions of documents—for example, specific elements in XML—is mandatory. Query semantics like XQuery will be supported, but with the added ability to handle unstructured text and fuzzy constructs that databases do not handle elegantly (e.g., spelling errors). The ability to dynamically construct virtual documents consisting of relevant portions of many documents will be critical. What the end user will want in the future is not just a matching document, but something that represents an answer.
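Here is a minimal sketch of combined text and parametric search over the catalog example above (hypothetical code; the schema and the linear scan are illustrative, not Verity's parametric index): a keyword match over the description is conjoined with predicate filters over structured attributes.

```python
# Hypothetical sketch of combined text + parametric search over a
# product catalog. Schema and matching are illustrative; a real system
# would use a full-text index plus an auxiliary parametric index.
from dataclasses import dataclass

@dataclass
class Car:
    description: str
    manufacturer: str
    price: int
    year: int

def search(catalog, terms, **predicates):
    """terms: phrases that must all appear in the description.
    predicates: attribute filters, e.g. price=lambda p: p < 30000."""
    results = []
    for car in catalog:
        text = car.description.lower()
        if all(t.lower() in text for t in terms) and \
           all(pred(getattr(car, attr)) for attr, pred in predicates.items()):
            results.append(car)
    return results

catalog = [
    Car("Leather trim, all-wheel drive", "Toyota", 28500, 2003),
    Car("Leather trim, sunroof", "Toyota", 31000, 2002),
]
hits = search(catalog, ["leather trim", "all-wheel drive"],
              manufacturer=lambda m: m == "Toyota",
              price=lambda p: p < 30000,
              year=lambda y: y > 2000)
print([h.description for h in hits])  # ['Leather trim, all-wheel drive']
```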
Classification, Clustering, and Taxonomy Navigation. Search provides an efficient way of finding relevant information from huge amounts of data only if the users know what to search for. With large corpora, queries can have large result sets. It is important to help users form effective queries by letting them browse and navigate information in a relatively small, manageable space. Examples of such spaces are taxonomies, which organize documents into navigable structures to assist users in finding relevant information. Searches within a category typically produce higher-relevance results than unscoped searches. Research has shown that presenting results in categories provides better usability. 9

Most taxonomies are built and maintained by humans because domain expertise is required. Well-known examples include the directory structures of Yahoo! and the Open Directory Project. Manual taxonomy construction is time consuming and expensive, however. Further, in many enterprises, the information explosion has reached the point where information architects often lack an adequate grasp of all the themes and topics represented in the corpus. They need automated systems that first mine the corpus, extract key concepts, organize the concepts into a concept hierarchy, 10 and assign documents to it. Visualization tools are helpful here, rendering a thematic map of the concepts found and the relationships (parent–child, sibling, etc.) between them. A further desirable feature is to label these concepts with succinct human-readable labels. Finally, this idea can be extended to suggest a taxonomy that allows users to browse the corpus.

The state of the art in this area is still young and cannot be relied on to extract a taxonomy such as that produced by professional library scientists. It is, however, reasonable to expect systems to generate a strawman taxonomy for refinement by domain experts, making the domain experts dramatically more productive by automatically mining the patterns and discovering associations in the data.

Classification rules for assigning documents to categories can be either manually constructed by knowledge workers or automatically learned from pre-classified documents.
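A minimal sketch of the learned-classifier route (hypothetical; scikit-learn and the toy training data are illustrative choices, not from the article): a classifier is trained on pre-classified documents and then assigns categories to new ones.

```python
# Hypothetical sketch: learn classification rules from pre-classified
# documents (cf. the inductive-learning approach of reference 11)
# rather than writing rules by hand. Library and data are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "quarterly revenue and operating margin report",
    "invoice payment terms and accounts payable",
    "vacation policy and employee benefits enrollment",
    "new hire onboarding and payroll setup",
]
train_labels = ["finance", "finance", "hr", "hr"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["benefits enrollment deadline"]))  # ['hr']
```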


Federated Search. Even where an enterprise has access to relevant content (e.g., on the Web), there are cases where it cannot be indexed (e.g., for security) or is forbidden from being indexed because of legal constraints. Further, in large organizations, different departments commonly index silos of information via different software systems or applications. 3

In such cases, federated search is the only way to provide a single point of access to data from enterprise repositories and applications, as well as external subscription sources and realtime feeds. Sophisticated systems add further value via ranking, filtering, duplicate detection, dynamic classification, and realtime clustering of results from disparate sources that may not be under the jurisdiction of the enterprise. 12 Database vendors provide federation across disparate relational databases, but federated search for unstructured data presents different challenges (e.g., ranking across independent systems) and opportunities (e.g., classification and clustering).

Sources could include personal workstations, as in a peer network, requiring asynchronous and incremental behavior and scheduling support. For example, users can schedule an overnight search that provides a blended, filtered, and classified result set before their arrival in the morning.

Social Networks and Use-Based Relevance. An enormous proportion of an organization’s intellectual capital resides as tacit knowledge. Social networks include the human element in the information ecosystem. Information usage patterns can be analyzed to discover the patent and latent relationships between the people in an organization and the documents they create, modify, access, search, and organize. Web search engines such as Google use the structure of hyperlinks between Web pages, which reflects the relevance of Web assets to communities. This is a static approach. A system that analyzes communication patterns between people and the dynamic usage of information in the enterprise will deliver a richer personalized experience, based on a combination of content and context, to individuals and groups.

[Figure 5: Social network models can correlate different types of information assets, including people]

Figure 5 depicts a social network, including multiple entities of different types (e.g., documents, queries, categories, and users). It is possible to represent these disparate entities using a consistent model (e.g., a set of keywords or a feature vector). This enables detection of useful correlations among entities of different types (see the sketch below). The input context can be a combination of multiple entities—for example, a user’s profile, the input query, and the current document the user is browsing. Further, these representations can change based on user interactions, improving relevance over time.
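A minimal sketch of the consistent-representation idea (hypothetical; the vocabulary and weights are toy values): users, queries, and documents are all reduced to keyword feature vectors, so one similarity function correlates entities of any type.

```python
# Hypothetical sketch: represent users, queries, and documents as
# keyword feature vectors so a single similarity measure works across
# entity types. Weights here are toy values.
from math import sqrt

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse keyword->weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

user_profile = {"compiler": 0.8, "benchmark": 0.5}
document = {"compiler": 0.6, "optimization": 0.7}
query = {"benchmark": 1.0}

print(round(cosine(user_profile, document), 3))  # user-document affinity
print(round(cosine(user_profile, query), 3))     # user-query affinity
```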


Although many activities on the Internet are unreliable (e.g., spam), you can be more confident of interactions that occur within enterprises, allowing systems to take advantage of this vital input.

Traditionally, the scoring and ranking of documents is based on the content in each document matching the query. The social network can be exploited to augment content analysis with the historical behavior of users, changing result ranking. This adaptive ranking could be simplistic (boost a document’s rank if a previous user elected to view or rate it after issuing the same search) or more sophisticated (boost rank if it was selected from the second results page for a similar query). Incorporating dynamic feedback also allows for the infusion of new terms into document representations, allowing relevant information to be returned for query terms that do not even exist in the content (concept-based retrieval).

User profiles can enhance the input context to provide personalization and targeted search. Persistent profiles can exploit a user’s role in addition to historical patterns of access. A session-based profile provides realtime personalization, improving relevance to the current task. Such systems can exploit a user’s queries, clickstreams, entry in the company directory, and so forth.

Both physical assets (e.g., documents, users) and virtual assets (e.g., categories, groups) can be included in the social network, facilitating information discovery relevant to the user’s context. Such a system enables users to participate in the taxonomy-building process, enabling personal and community taxonomies. The vicinity of the input can include experts within the organization. Social-network research suggests that portal users form overlapping cliques of tightly knit communities. This drives the need for discovering the communities of other users most germane to a user’s current information context.

Analytics. Reporting and analytic modules supplied by the software can provide concrete metrics of search relevance and efficacy, and are powerful tools for improving the user experience. These metrics can help to validate improvements in the search implementation—for example, adding dynamic query-based summaries to result lists, enabling user feedback, and creating synonym lists and predefined queries to avoid the dreaded “No results found.” Without reporting tools and analytics, measurements of relevance and user satisfaction are difficult to make. Analyzing which categories are being navigated and which documents are popular can help organizations evaluate their information needs and their evolution.
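A minimal sketch of such reporting (hypothetical; the log format is invented for illustration): compute the zero-result rate and the most frequent failed queries, two metrics that directly drive synonym lists and predefined queries.

```python
# Hypothetical sketch of search-log analytics: zero-result rate plus
# the most common failed queries. The log format is invented.
from collections import Counter

def zero_result_report(log, top=3):
    """log: iterable of (query, result_count) pairs."""
    failed = Counter(q for q, n in log if n == 0)
    rate = sum(failed.values()) / len(log)
    return rate, failed.most_common(top)

log = [("vacation policy", 12), ("vpn setup", 0), ("vpn setup", 0),
       ("expense report", 7), ("badge renewal", 0)]
rate, worst = zero_result_report(log)
print(f"{rate:.0%} of queries returned nothing; worst: {worst}")
# 60% of queries returned nothing; worst: [('vpn setup', 2), ('badge renewal', 1)]
```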
THE OUTLOOK

Several opportunities exist for developing better enterprise search platforms, but certain challenges must be faced in exploiting these opportunities.

Algorithms for content-based search relevance will improve. However, exploiting user interactions to further tune system performance—for example, users rating documents or providing and updating their profiles—implies changes in user behavior. Providing incentives to employees to participate can result in different community behavior. Already, companies are being established to extract social-networking benefits, such as contact-list management. Cultural changes in organizations will facilitate the efficacy of such technologies.

High-quality automated systems will be the norm of the future. Any system that depends on explicit human intervention (e.g., human tagging, annotation) without recourse to automation is destined to fail over the long term. As enterprise content increases, manual intervention will not scale. Automation implies that only a small percentage of content is siphoned off for human oversight, based on stringent thresholds. As algorithms get more accurate, an automated system can be more reliable and consistent than the collective output of multiple humans.

As more organizations recognize that clean, organized content is crucial, content publishing processes will become more stringent. Additional tools will be employed to capture information that is currently lost. Further, these tools will use newer technologies, such as XML, thereby enhancing consistency and automated processing. While data quality is likely to improve over time, however, we will still need to deal with noisy legacy data.

Studies of enterprise data show that important metadata in documents (e.g., author) is often incorrect, as it is set to some default value (e.g., organization name or template creator). Enforcement of correct metadata, either via technology assists or via policy, will go a long way toward improving data quality. Removal of redundant or obsolete data (as much as 20 to 30 percent of a corpus) will benefit relevance. Techniques such as duplicate detection and near-duplicate detection can ensure that irrelevant data is eliminated from active corpora.
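A minimal sketch of near-duplicate detection (hypothetical; word shingling with a Jaccard threshold is a standard technique, but the article does not specify one): documents whose word-shingle sets overlap heavily are flagged as near-duplicates.

```python
# Hypothetical sketch of near-duplicate detection via word shingles
# and Jaccard similarity. The threshold is illustrative.
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_duplicates(a: str, b: str, threshold: float = 0.8) -> bool:
    sa, sb = shingles(a), shingles(b)
    jaccard = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return jaccard >= threshold

v1 = "the expense policy was updated in march for all departments"
v2 = "the expense policy was updated in april for all departments"
print(near_duplicates(v1, v2, threshold=0.4))  # True: Jaccard is 5/11
```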


Internet search engines have become popular and clearly demonstrate the power of having information at one’s fingertips. Enterprise search, while having similar desiderata, is faced with a different set of challenges. Besides having to deal with multiple heterogeneous repositories and myriad data formats, enterprises also have to deal with security, compliance, and deployment issues. Many of these challenges can be addressed very effectively by technology.

While search relevance is an important yardstick, there are other key characteristics that make for effective search, such as navigation, classification, entity extraction, recommendation, summarization, query language, and semantics. Systems that incorporate user behavior will become the norm, yielding higher relevance, better personalization, and higher utilization of human assets and tacit information. Enterprise systems typically cannot compile the large-scale statistics that Internet search engines use to weed out noisy data, so other techniques will be employed to address the special needs of the enterprise.

REFERENCES

1. Fagin, R., Kumar, R., McCurley, K., Novak, J., Sivakumar, D., Tomlin, J. A., and Williamson, D. P. Searching the workplace Web. Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary (May 2003), 366–375.
2. Raghavan, P. Structured and unstructured search in enterprises. Data Engineering 24, 4 (Dec. 2001), 15–18.
3. Hawking, D. Challenges in enterprise search. Proceedings of the Australasian Database Conference, Dunedin, New Zealand (Jan. 2004), 15–24.
4. Brin, S., and Page, L. The anatomy of a large-scale hypertextual Web search engine. Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia (1998), 107–117.
5. Kleinberg, J. M. Authoritative sources in a hyperlinked environment. Journal of the ACM 46, 5 (1999), 604–632.
6. These algorithms are often exploited on the Internet via spam and page manipulation, making them less effective.
7. Many applications derive immediate ROI (return on investment) when offloading the search component from the database engine to enterprise search software.
8. Abrol, M., Latarche, N., Mahadevan, U., Mao, J., Mukherjee, R., Raghavan, P., Tourn, M., Wang, J., and Zhang, G. Navigating large-scale semi-structured data in business portals. Proceedings of the 27th VLDB Conference, Rome, Italy (2001), 663–666.
9. Dumais, S. T., Cutrell, E., and Chen, H. Optimizing search by showing results in context. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Seattle, Washington (March 2001), 277–284.
10. Chung, C., Lieu, R., Liu, J., Luk, A., Mao, J., and Raghavan, P. Thematic mapping—from unstructured documents to taxonomies. Proceedings of the Conference on Information and Knowledge Management, McLean, Virginia (2002), 608–610.
11. Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. Inductive learning algorithms and representations for text categorization. Proceedings of the Conference on Information and Knowledge Management, Bethesda, Maryland (1998), 148–155.
12. Choo, K., Mukherjee, R., Smair, R., and Zhang, W. The Verity federated infrastructure. Proceedings of the Conference on Information and Knowledge Management, McLean, Virginia (2002), 621.

LOVE IT, HATE IT? LET US KNOW
feedback@acmqueue.com or www.acmqueue.com/forums

RAJAT MUKHERJEE is the principal software architect at Verity in Sunnyvale, California, working in the area of social networks. After completing his B.Tech. in electrical engineering at the Indian Institute of Technology, Madras, India, and his M.S. and Ph.D. in parallel computing at Rice University, Houston, Texas, he joined IBM’s Thomas J. Watson Research Center.
There he worked on clustered computing, scalability, and highly available, scalable Web servers, and helped design infrastructures used during the 1996 Atlanta Olympics and the Deep Blue–Kasparov chess match. He moved to IBM’s Almaden Research Center in 1997, where he worked on distributed digital libraries and content management. He then worked with Purpleyogi, a Silicon Valley content-discovery startup, for a year before joining Verity.

JIANCHANG MAO is the principal software architect and manager in the Emerging Technologies Group at Verity. Prior to joining Verity, he worked at IBM’s Almaden Research Center for more than six years. He earned his Ph.D. in computer science from Michigan State University in 1994. A senior member of the IEEE, Mao has served as associate editor of IEEE Transactions on Neural Networks and as an editorial board member of Pattern Analysis and Applications. He received the Outstanding Technical Achievement Award and three Research Division Awards from IBM between 1996 and 2000.

© 2004 ACM 1542-7730/04/0200 $5.00
