12.07.2015 Views

file - ChaSen - 奈良先端科学技術大学院大学


Table 5.2. Domain and Type of List.

Domain      Procedures    Non-Procedures    All
Computer    558 (295)     1666 (724)        2224
Others      163 (64)      1733 (476)        1896
All         721           3399              4120

restricted the form of documentation in the list. The list can be expected to contain important information, because it is a summary produced by a human. It also has certain benefits for computer processing, as shown in Figure 5.1 (footnote 1). These are:

a) a large number of lists are available in Q&A articles and homepages on the web,
b) clues such as titles and leads appear before and after the lists,
c) extraction is relatively easy using HTML list tags.

In this study, a binary categorization was conducted, which divided a set of lists into two classes: procedures and non-procedures. The purpose is to reveal an effective set of features for extracting lists that explain a procedure, by examining the results of the categorization.

5.3 Collection of lists from web pages

To study the features of lists contained in web pages, web pages comprising lists were collected as shown in Figure 5.2. The sets of lists were made according to the following steps (see Table 5.1):

Step 1 Enter tejun (procedure) and houhou (method) into Google [14] as keywords, and obtain a list of URLs that serve as the seeds of collection for the next step (Gathered).

Step 2 Recursively search from the top page to the next lower page in the hyperlink structure and gather the HTML pages (Retrieved).

1 This example is excerpted from the readme file of the software robot Kairai [124].
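The claim in benefit c), that lists are relatively easy to extract by means of HTML list tags, can be illustrated with a minimal sketch. It is not the extraction tool used in this study. It assumes that the relevant tags are ul, ol and li (the tag examples in the original text were lost), and the names ListExtractor and extract_lists are hypothetical.

```python
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collect the text of each <li> item, grouped by its enclosing list."""

    # Assumption: only <ul> and <ol> count as lists; the source's own
    # tag examples did not survive extraction.
    LIST_TAGS = {"ul", "ol"}

    def __init__(self):
        super().__init__()
        self.lists = []    # finished lists, each a list of item strings
        self._stack = []   # lists that are currently open (handles nesting)
        self._item = None  # text fragments of the <li> being read, if any

    def _flush_item(self):
        # Store the current item in the innermost open list, normalising
        # whitespace, then mark that no item is being read.
        if self._item is not None and self._stack:
            text = " ".join("".join(self._item).split())
            if text:
                self._stack[-1].append(text)
        self._item = None

    def handle_starttag(self, tag, attrs):
        if tag in self.LIST_TAGS:
            self._flush_item()       # a nested list ends the parent item text
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._flush_item()       # tolerate a missing </li>
            self._item = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush_item()
        elif tag in self.LIST_TAGS and self._stack:
            self._flush_item()
            self.lists.append(self._stack.pop())

    def handle_data(self, data):
        if self._item is not None:
            self._item.append(data)


def extract_lists(html):
    """Return every <ul>/<ol> in the page as a list of item strings."""
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return parser.lists


if __name__ == "__main__":
    page = (
        "<h2>Install</h2>"
        "<ol><li>Download the archive.</li><li>Run <b>make</b>.</li></ol>"
        "<ul><li>fast</li></ul>"
    )
    for items in extract_lists(page):
        print(items)
```

Each extracted list could then be passed, together with the title or lead text around it, to a classifier that separates procedures from non-procedures. Nested lists are returned as separate lists here, which is a simplification.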
