11.07.2015 Views

Encyclopedia of Computer Science and Technology


pared to one another). There is also more likelihood that searchers will either make syntax errors in their requests or create requests that do not have the intended effect.

While database systems can control the organization of data, the pathways for retrieval, and the command set or interface, the World Wide Web is a different matter. It amounts to the world’s largest database, or perhaps a “metabase” that includes not only text pages but file resources and links to many traditional database systems. While the flexibility of linkage is one of the Web’s strengths, it makes the construction of search engines difficult. With millions of new pages being created each week, the “webcrawler” software that automatically traverses links and records and indexes site information is hard pressed to capture more than a diminishing fraction of the available content. Even so, the number of “hits” is often unwieldy (see search engine).

A number of strategies can be used to provide more focused search results. The title or full text of a given page can be checked for synonyms or other ideas often associated with the keyword or phrase used in the search. The more such matches are found, the higher the degree of relevance assigned to the document. Results can then be presented in declining order of relevance score. The user can also be asked to indicate a result document that he or she believes to be particularly relevant. The contents of this document can then be compared to the other result documents to find the most similar ones, which are presented as likely to be of interest to the researcher.

Information retrieval from either stand-alone databases or the Web can also be improved by making it unnecessary for users to employ structured query languages (see sql) or even carefully selected keywords. Users can simply type in their request in the form of a question, using ordinary language: for example, “What country in Europe has the largest population?” The search engine can then translate the question into the structured queries most likely to elicit documents containing the answer. Ask Jeeves (retired as of 2006) and similar search services have thus far been only modestly successful with this approach.

On a large scale, systematic information retrieval and analysis (see data mining) has become increasingly sophisticated, with applications ranging from e-commerce and scientific data analysis to counterterrorism. Artificial intelligence techniques (see pattern recognition) play an important role in cutting-edge systems.

Finally, encoding more information about content and structure within the document itself can provide more accurate and useful retrieval. The use of XML and work toward a “semantic Web” offers hope in that direction (see Berners-Lee, Tim; semantic web; and xml).

[Figure caption] A number of criteria can be used by Web search engines to determine the likely relevance of search results. Perhaps the most important tool, however, is feedback from the user.

Further Reading

Bell, Suzanne S. Librarian’s Guide to Online Searching. Westport, Conn.: Libraries Unlimited, 2006.

Chakrabarti, Soumen. Mining the Web: Discovering Knowledge from Hypertext Data. San Francisco: Morgan Kaufmann, 2002.

Grossman, David A., and Ophir Frieder. Information Retrieval: Algorithms and Heuristics. 2nd ed. Norwell, Mass.: Springer, 2004.

“Information Retrieval Research.” Search Tools Consulting. Available online. URL: http://www.searchtools.com/info/inforetrieval.html. Accessed August 8, 2007.

Meadow, Charles T., et al. Text Information Retrieval Systems. 3rd ed. Burlington, Mass.: Academic Press, 2007.

information theory

Information theory is the study of the fundamental characteristics of information and its transmission and reception. As a discipline, information theory took its impetus from the ideas of Claude Shannon (see Shannon, Claude).

In his seminal paper “A Mathematical Theory of Communication,” published in the Bell System Technical Journal in 1948, Shannon analyzed the redundancy inherent in any form of communication other than a series of purely random numbers. Because of this redundancy, the amount of information (expressed in binary bits) needed to convey a message will be less than the number of bits in the original message. It is because of redundancy that data compression algorithms can be applied to text, graphics, and other types of files to be stored on disk or transmitted over a network (see data compression).

Shannon also analyzed the unpredictability or uncertainty of information as it is received, that is, the number of possibilities for the next bit or character. This is related to the number of possible symbols, but since the symbols are usually not all equally likely, the measure is actually a sum weighted by the probability of each symbol. Shannon used the physics term entropy to refer to this measure. It is important because it makes it possible to analyze the probability of error (caused by such things as “line noise”) in a communications circuit.
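The entropy measure described above can be illustrated with a short Python sketch. It uses the standard definition H = -Σ p·log2(p), which the entry does not spell out. The function names and sample inputs are illustrative assumptions, not taken from Shannon's paper.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Entropy in bits per symbol: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def empirical_entropy(message):
    """Estimate entropy per symbol from the symbol frequencies in a message."""
    counts = Counter(message)
    total = len(message)
    return shannon_entropy(count / total for count in counts.values())


# Two equally likely symbols: maximum uncertainty for a two-symbol alphabet.
fair_coin = shannon_entropy([0.5, 0.5])

# The same two symbols with unequal odds: the next symbol is easier to guess,
# so each symbol carries less information.
biased_coin = shannon_entropy([0.9, 0.1])

print(f"fair coin:   {fair_coin:.3f} bits per symbol")
print(f"biased coin: {biased_coin:.3f} bits per symbol")
```

The fair coin comes out at exactly 1 bit per symbol and the biased coin at a little under half a bit. That gap is the kind of redundancy a compression algorithm can exploit.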
Shannon’s basic formula is:

$$C = B \log_2\left(1 + \frac{P}{N}\right)$$

where the channel capacity C is in bits per second, B is the bandwidth, P the signal power, and N the Gaussian noise power.
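As a worked illustration of the capacity formula, the following Python sketch evaluates C for a hypothetical channel. It assumes B is given in hertz and that P and N share the same units. The decibel helper and the 3,000 Hz, 30 dB figures are made up for the example and do not come from the entry.

```python
import math


def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Return C = B * log2(1 + P / N) in bits per second."""
    if bandwidth_hz < 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and signal power must be >= 0, noise power > 0")
    return bandwidth_hz * math.log2(1 + signal_power / noise_power)


def db_to_ratio(decibels):
    """Convert a signal-to-noise ratio in decibels to a plain power ratio."""
    return 10 ** (decibels / 10)


# Hypothetical channel: 3,000 Hz of bandwidth and a 30 dB signal-to-noise
# ratio, which is a power ratio P / N of 1,000.
capacity = channel_capacity(3000, db_to_ratio(30), 1.0)
print(f"capacity: {capacity:,.0f} bits per second")
```

For these assumed figures the capacity works out to roughly 29,900 bits per second. Because the signal-to-noise term sits inside a logarithm, doubling the bandwidth doubles the capacity, while doubling the signal power adds comparatively little.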
