Classifying the Hungarian Web - Web word processing ...

people.mokk.bme.hu


Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc. Geographic Text Search

Origo.hu -- Axelero’s portal


Vizsla -- indexing .hu
• For many years: AltaVizsla powered by AltaVista
• 01/2001-03/2003: Vizsla powered by Northern Light


The Airport Manager has explained that with the increased traffic of Ministry of Natural Resources helicopters, Ontario Provincial Police, and Air Ambulance there has been an increase in revenues in recent years. The airport is considered to be an industrial base for the region, and is marketed for economic development purposes and to bring business into the community. The area surrounding the airport is suited for the development of a small industrial park, and can be supported by easy access through both the airport and local highways. The manager of the airport has highlighted the airport’s activity in community involvement, through the high number of fly-ins, and the successful turnout for the Young Eagles program. He has advised council that he is working on an arrangement to have Bearskin Air (regional Air Canada service) fly into the airport on the Toronto to Ottawa route. In early 2003, the airport was seeking County support for the new cross-wind runway development. It was indicated that Algonquin Highlands was in support of the new runway. In March of 2003, a proposal was prepared and submitted to the Province to fund the project (Council Minutes, 2003).

The Reeve of the Municipality has stated that the proposal for funding was rejected by the Provincial Government. The project is now on hold until funding is granted. The Municipality is currently financing a business plan to generate funds for the expansion. It is the municipality’s goal not to put the burden on rate payers. The plan is still on the table, but it may be ten or twenty years before any development may occur. No official economic, environmental, or social assessments have been conducted at this point (phone interview, 2006). Although, in 2003 a firm was hired to draw up a development plan for the airport site (see Map 2).


Data flow (build time)
Crawler, Language Module, Classification, Indexing and Loading, DB (Index)


Data flow (query time)
Web Server (Apache), Query Parser, DB (Index), Ranking, Class 1, Class 2, ..., Class N


Autoclassification
• Generic machine learning:
  • Feature extraction
  • Model training
  • Classification


Text processing
• Special character normalization: @ » AT
• Whitespace normalization
• Spell correction
• Capitalization
• Acronym expansion (BP » BRITISH PETROL)
• Undoing abbreviations (Bp » BUDAPEST)
• Filtering stopwords (A AZ EGY …)
• Stemming (language specific)
• Data enrichment (EB » KUTYA)
• Word order normalization
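A few of the text processing steps above can be sketched in Python. This is only an illustration under stated assumptions: the lookup tables hold just the examples from the slide, the order of the steps is a guess, and spell correction, stemming, data enrichment and word order normalization are left out. The names `ACRONYMS`, `ABBREVIATIONS`, `STOPWORDS` and `normalize` are hypothetical.

```python
# Illustrative tables only; the entries are the examples given on the slide.
ACRONYMS = {"BP": "BRITISH PETROL"}      # acronym expansion
ABBREVIATIONS = {"Bp": "BUDAPEST"}       # undoing abbreviations
STOPWORDS = {"A", "AZ", "EGY"}           # Hungarian articles

def normalize(text):
    """Return a list of normalized, upper-cased tokens."""
    # Special character normalization: "@" becomes the token AT.
    text = text.replace("@", " AT ")
    tokens = []
    # str.split() with no argument also collapses runs of whitespace.
    for raw in text.split():
        tok = raw.strip(".,;:!?")
        if not tok:
            continue
        # Expansion must run before capitalization, because "BP" and "Bp"
        # are only distinguishable while the original case is intact.
        if tok in ACRONYMS:
            expanded = ACRONYMS[tok]
        elif tok in ABBREVIATIONS:
            expanded = ABBREVIATIONS[tok]
        else:
            expanded = tok.upper()
        for word in expanded.split():
            if word not in STOPWORDS:
                tokens.append(word)
    return tokens

print(normalize("Az  egy Bp. hotel"))
```

The one ordering constraint the sketch does rely on is that case-sensitive lookups happen before capitalization; everything else could plausibly be arranged differently.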


Feature extraction
• Our features are extended word and phrase (word pair) counts
  • Can contain trailing wildcard BUPAP*
  • Can contain “near” directive BUDAPEST near RESTAURANT
  • Can be absolute “text must have”
• Position weighted according to whether they appear in the
  • title
  • abstract
  • body of the text
• IDF warped
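As a rough sketch of position-weighted word and word-pair counts with trailing wildcards, assuming the document has already been tokenized per field: the numeric values in `POSITION_WEIGHTS` are invented for illustration (the slides give no numbers), and the "near" directive, the "text must have" flag and IDF warping are not modeled. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical weights; the slides only say that title, abstract and body
# are weighted differently, not by how much.
POSITION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def extract_features(doc):
    """doc maps a field name to its token list; returns weighted counts
    of single words and adjacent word pairs."""
    feats = Counter()
    for field, tokens in doc.items():
        weight = POSITION_WEIGHTS[field]
        for tok in tokens:
            feats[tok] += weight
        for first, second in zip(tokens, tokens[1:]):
            feats[first + " " + second] += weight
    return feats

def matches(pattern, token):
    """A trailing '*' in the pattern matches any continuation of the prefix."""
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return pattern == token

doc = {"title": ["BUDAPEST", "RESTAURANT"], "body": ["RESTAURANT"]}
print(extract_features(doc))
```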


Numerical (linear) classification
BREATH 2.0, APNOEA 2.0, APNEIC 2.0, NEONAT* 2.0,
RESPIR* 1.8291666508, ARREST 1.7142857313, CHEYN-STOKE 1.625, PULMON 1.5,
APNEA 1.4804688692, VENTILAT FAILUR 1.0, STOP BREATH 1.0,
CHEYN-STOKE RESPIR 1.0, AGON RESPIR 1.0, PULMON ARREST 0.75000,
RESPIRAT FAILUR 0.7250000238, PULMON FAILUR 0.5,
INTENSIF 0.5, DEATH 0.5, RESPIRAT INSUFFICIEN 0.4759615362,
INSUFFICIEN 0.4759615362, TRANSIENT 0.4166666567, IMPAIR 0.4166666567,
END STAGE 0.2666666806, SEVER 0.2458333373, TYPE 0.1666666716,
OBSTRUCT SLEEP 0.1463414580, SPONTAN 0.1000000015
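A linear model of this kind can be read as a table of term weights, with a document scored by a weighted sum over the features it contains. The sketch below uses a handful of the weights listed above. How the original system combined weights with counts, and whether it applied a threshold, is not stated in the slides, so the plain sum here is an assumption, and `MODEL`, `term_weight` and `score` are hypothetical names.

```python
# A few entries copied from the model listed above.
MODEL = {
    "BREATH": 2.0,
    "NEONAT*": 2.0,
    "PULMON": 1.5,
    "STOP BREATH": 1.0,
    "DEATH": 0.5,
}

def term_weight(model, feature):
    """Exact match first, then trailing-wildcard entries; 0.0 otherwise."""
    if feature in model:
        return model[feature]
    for pattern, weight in model.items():
        if pattern.endswith("*") and feature.startswith(pattern[:-1]):
            return weight
    return 0.0

def score(model, counts):
    """Assumed scoring rule: sum of weight times count over all features."""
    return sum(term_weight(model, f) * c for f, c in counts.items())

# "WEATHER" is not in the model, so it contributes nothing.
print(score(MODEL, {"BREATH": 2, "NEONATAL": 1, "WEATHER": 5}))
```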


Initialization and training
• Initialized on few “exemplary” documents
• Need at least 5 docs
• Need at least 200 html-stripped chars/doc
• Exemplaries are editorially selected, high quality
• Iterative update
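The two thresholds above can be checked with a small helper. This sketch reads them as "a category needs at least 5 exemplars, each with at least 200 characters once HTML is stripped", which is one plausible interpretation rather than a documented rule. The tag stripping uses Python's standard `html.parser` and is deliberately crude; all function names are hypothetical.

```python
from html.parser import HTMLParser

MIN_DOCS = 5      # from the slide: at least 5 docs
MIN_CHARS = 200   # from the slide: at least 200 html-stripped chars per doc

class _TextExtractor(HTMLParser):
    """Collects text nodes; tags and attributes are dropped."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    """Crude stripping: keeps all text nodes (including script/style
    content) and collapses whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())

def usable_exemplars(html_docs):
    return [d for d in html_docs if len(strip_html(d)) >= MIN_CHARS]

def can_initialize(html_docs):
    """True if enough sufficiently long exemplars are available."""
    return len(usable_exemplars(html_docs)) >= MIN_DOCS

long_doc = "<p>" + "x" * 200 + "</p>"
print(can_initialize([long_doc] * 5))
```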


Krellenstein-Steinberg weight


Topic+language model


Split into neg, null, and pos


Relevance for alum


Simplify, simplify!


And simplify further!


Sparse models
• Very few terms (typically less than 50) used in each category
• Word pairs and NEAR phrases hardly ever selected for model
• Trailing wildcards are poor man’s stemming
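One simple way to end up with a model of fewer than 50 terms is to keep only the highest-weighted entries. The slides do not say how terms were actually selected, so the top-k rule below is an assumption made for illustration, and `sparsify` is a hypothetical name.

```python
def sparsify(model, max_terms=50):
    """Keep the max_terms highest-weighted entries of a term-weight dict."""
    ranked = sorted(model.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_terms])

model = {"BREATH": 2.0, "TYPE": 0.1666666716, "PULMON": 1.5}
print(sparsify(model, max_terms=2))
```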


Small is beautiful
[Figure: Accuracy by average model size. Vertical axis: Accuracy (%), 0 to 100. Horizontal axis: average model size (#words - log scale), 1 to 10000. Legend text as extracted: TopicCountSinglePoint.]


The nature of the feature space
• k = 10^4 topics, N = 10^6 words, kN = 10^10 features
• Reduce k by hierarchical classification (hard)
• Reduce N by feature selection
• Once a feature always a feature?
  • In fact, just because r(w,t) is large for some t, it does not follow that other values r(w,s) matter at all
• System driven by sparseness of the space
  • Conceptually (what we ignore is noise)
  • Computationally
  • Not even trying to train billions of parameters
  • Large volume classification enabled by sparseness of feature vectors (see also Kornai and Richards 2002 and US patent 6,507,829)
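To make the computational point concrete, the sketch below stores only the nonzero term weights per topic and inverts them into a word-to-topics index, so scoring a document touches only the topics that share a word with it. This is an assumed data layout for illustration, not a description of the patented system; the topic names and weights in the example are invented.

```python
from collections import defaultdict

def dense_parameter_count(k_topics, n_words):
    """Size of a full topic-by-word weight table."""
    return k_topics * n_words

def build_inverted_index(models):
    """models: topic -> {word: weight}. Returns word -> [(topic, weight)]."""
    index = defaultdict(list)
    for topic, model in models.items():
        for word, weight in model.items():
            index[word].append((topic, weight))
    return index

def classify(index, counts):
    """Score only the topics that share at least one word with the document."""
    scores = defaultdict(float)
    for word, count in counts.items():
        for topic, weight in index.get(word, ()):
            scores[topic] += weight * count
    return dict(scores)

# Invented two-topic example.
models = {
    "medicine": {"APNEA": 1.5, "PULMON": 1.5},
    "travel": {"BUDAPEST": 2.0},
}
index = build_inverted_index(models)
print(classify(index, {"APNEA": 2, "HOTEL": 1}))
print(dense_parameter_count(10**4, 10**6))
```

With roughly 50 terms per topic, the sparse table for 10^4 topics holds on the order of 5 x 10^5 weights instead of 10^10.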


Conclusions
• Only positive evidence matters
• Advantages compared to more logical (expertise-based):
  • KISS
  • Scales well
  • Implicit modeling assumptions well satisfied by data
• It was great while it lasted
