
WWW/Internet - Portal do Software Público Brasileiro



IADIS International Conference WWW/Internet 2010

3. INSTANCE RECOMMENDATION

In the practical work of adding instances, it would be difficult to register every term without any help. In particular, consumer electronics, CD/DVDs, packaged software, etc. are increasing on a daily basis, so registering them quickly by hand would be impossible. Thus, we developed a function that recommends candidate instances based on records of the user's registrations. This function uses the registered instances as seeds and estimates, by Named Entity Extraction (NEE), the instances that belong to the same class. The user, however, makes the final selection and adds some of the candidates as new correct instances. For example, if the user registers "Nissan" and "Honda" as instances and then adds "Toyota" as the third instance, the function takes the latest three instances {"Nissan", "Honda", "Toyota"} as the seeds for the NEE and recommends other candidate instances to the user after excluding the already-registered ones (Fig. 3, bottom left).

3.1 Named Entity Extraction with Bootstrapping

Free Named Entity Extraction services such as Google Sets (2007) and SEAL (2008) already exist on the web. However, our investigation found that they have maintenance difficulties when the latest information must be registered quickly. Therefore, we developed our own Named Entity Extraction engine using the bootstrapping method (Brin 1998). Bootstrapping first generates patterns from documents using a small number of seed terms, and extracts other terms from the documents using the patterns. It then generates new patterns using the extracted terms as new seeds. By repeating this process, a large number of terms can be extracted from a few seeds. Our bootstrapping process consists of the following four steps.

(1) Selection of seeds
We first need some seeds to execute the NEE.
The selection of the seeds greatly affects the extracted terms. We prepare in advance a seed list that includes at least three instances. The first time, the seeds are selected at random, the bootstrapping is executed, and the extracted terms are added to the seed list. From the second time on, we randomly take two old instances that were used before and one newly extracted instance as the seeds, because newly extracted instances are not necessarily correct. Taking two old seeds and one new seed extends the terms while keeping the accuracy.

(2) Collection of Web pages
We collect the top 100 web pages using search engines such as Google and Yahoo!, where the query is a combination of the three instances.

(3) Pattern generation
Then, we find HTML tags which surround all three seeds in the collected web pages. For example, if the seeds are {"Nissan", "Honda", "Toyota"}, the surrounding tag is as follows.

Nissan Honda Toyota Matsuda Suzuki

Here the tag surrounds all three seeds, so the tag enclosing a term is taken as a pattern. The reason we look for the tag surrounding all three seeds is to keep the accuracy. The stricter the condition is, the higher the accuracy becomes. Our preliminary experiments on several kinds of products showed that with two seeds a huge number of irrelevant terms were extracted, whereas more than four seeds greatly reduced the number of extracted terms. Therefore, this number of seeds is our experimental setting, and should be revised at least for other domains.

(4) Extraction of terms
Finally, we extract other terms from the same document using the generated patterns. Although web pages have various styles depending on their authors, it seems that the same style (pattern) is used at least within the same document. In the example of (3), we can take "Matsuda" and "Suzuki" from this document and add them to the seed list.

However, the above process may still extract irrelevant terms.
So we set a threshold on the number of patterns that can be generated in a document. In our experience, terms are extracted most effectively when the number is more than two. Therefore, if more than two kinds of patterns are generated, we add the extracted terms to the list. If not, they are discarded.
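
The four steps and the pattern threshold can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names are hypothetical, a "pattern" is assumed to be identified by the name of an HTML tag that directly wraps a term, the query format is an assumed reading of "a combination of three instances", "more than two" is read literally as a count greater than two, and the web pages are assumed to have been fetched already and passed in as HTML strings.

```python
import random
import re


def choose_seeds(seed_list, new_terms, rng):
    """Step (1): three random seeds at first, later two old seeds plus one new term."""
    if not new_terms:
        return rng.sample(seed_list, 3)
    return rng.sample(seed_list, 2) + [rng.choice(new_terms)]


def build_query(seeds):
    """Step (2): combine the seeds into one search query (assumed format)."""
    return " ".join(f'"{s}"' for s in seeds)


def generate_patterns(html, seeds):
    """Step (3): tag names that directly wrap every one of the seeds in this document."""
    tags_per_seed = []
    for seed in seeds:
        regex = r"<(\w+)\b[^>]*>\s*" + re.escape(seed) + r"\s*</\1>"
        tags_per_seed.append(set(re.findall(regex, html)))
    return set.intersection(*tags_per_seed)


def extract_terms(html, patterns, seeds):
    """Step (4): other terms wrapped by the same tags, in a stable order."""
    terms = []
    for tag in sorted(patterns):
        regex = rf"<{tag}\b[^>]*>\s*([^<>]+?)\s*</{tag}>"
        for text in re.findall(regex, html):
            if text not in seeds and text not in terms:
                terms.append(text)
    return terms


def terms_from_document(html, seeds, pattern_threshold=2):
    """Keep a document's terms only if it yields more than `pattern_threshold` patterns."""
    patterns = generate_patterns(html, seeds)
    if len(patterns) <= pattern_threshold:
        return []
    return extract_terms(html, patterns, seeds)


def recommend(registered, pages):
    """Use the latest three registered instances as seeds and drop registered terms."""
    seeds = registered[-3:]
    candidates = []
    for html in pages:
        for term in terms_from_document(html, seeds):
            if term not in registered and term not in candidates:
                candidates.append(term)
    return candidates


# Hypothetical page in which the seeds appear inside three different tags.
sample_page = (
    "<ul><li>Nissan</li><li>Honda</li><li>Toyota</li><li>Matsuda</li></ul>"
    "<table><tr><td>Nissan</td><td>Honda</td><td>Toyota</td><td>Suzuki</td></tr></table>"
    "<select><option>Nissan</option><option>Honda</option><option>Toyota</option></select>"
)
print(recommend(["Nissan", "Honda", "Toyota"], [sample_page]))  # ['Matsuda', 'Suzuki']
```

A page in which the seeds appear inside only one kind of tag produces a single pattern, so under this reading of the threshold its terms are discarded. Matching tags with regular expressions keeps the sketch short; a real crawler would more likely use an HTML parser.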
