Search Web Page Classification Using Form Structural Characteristics


Myungsook Klassen 

Computer Science Department 

California Lutheran University 

60 West Olsen Rd 

Thousand Oaks, CA 91360 

mklassen@clunet.edu 

Chenxiao Wang 

Computer Science Graduate Program 

California Lutheran University 

60 West Olsen Rd, 

Thousand Oaks, CA 91360 

cwang@clunet.edu 

Abstract 

Web pages with HTML forms are either search pages or non-search pages such as registration or blogging pages. Search forms provide external search, site search, or internal hidden database search. Hidden web pages provide highly relevant, high-quality content and represent a large share of online information sources, yet general-purpose search engines do not find them because they are difficult to identify. In this paper, we present a method to distinguish search forms from non-search forms using a small number of HTML input elements extracted from user-input HTML forms. Using the random forest classifier, the classification rate obtained is 89.12% with two classes, search form and non-search form. The same small set of HTML input elements and attributes proved not to be a strong discriminator among the four different HTML form types.

1. INTRODUCTION 

The surface web crawled by search engines contains only a fraction of the content available online. Most users are only aware of information found by search engines such as Google, Yahoo, or Bing. According to Wikipedia, the surface web contained an estimated 11.5 billion web pages in the publicly indexable web as of January 2005. In the earliest days of the web, there were relatively few web pages and most of them were static, so search engines found the needed information with a relatively high level of user satisfaction. When the web became the preferred medium for commerce and information transfer, static pages could no longer provide complete information. Conventional search engines are able to cover only about 20% of it, which means that about 80% of valuable information and data remains untouched by the surface web. In general, searchable forms are very sparsely distributed. Barbosa [2] points out that a crawler retrieves only 94 movie search forms out of 100,000 crawled pages. The paradigm of using conventional search engines no longer applies to the information content of the Internet. The development of technology that accesses database back-end operations for web search engines provides the path to a paradigm switch to database-driven web pages.

Web resources are constantly being added and modified, so it is important to identify hidden web pages in an efficient and automated way that requires no site-specific scripting or analysis. Interest in techniques that find hidden web pages has increased as the volume of hidden information grows. One approach is to create vertical search engines for specific domains. This is a very costly operation for a search engine, since it requires building mediator forms for each domain and it is difficult for a search engine to identify which queries are relevant to each domain. The other approach is to explore all interesting HTML forms and identify search interfaces.

On the web, there are many HTML forms and many are not search interfaces. In particular, HTML forms for login, blogs, comments, registration, subscription, and polling are not search interfaces. There are three different types of search forms: site search, external search, and hidden database search. “Site search” forms are what many web sites nowadays provide for searching the HTML text of their own pages. These forms simply scan the text outside HTML tags and return matching information. The results are not dynamically produced from a database. An external search form is an HTML form interface to popular search engines such as Google and Yahoo. A hidden web page (sometimes called a deep web page) is a web page that searches one or more backend databases through an HTML form as its query interface.

Entrance search HTML forms must be distinguished from non-search web pages in order to enable a web crawler to locate the entrance and extract further information. Furthermore, hidden web HTML forms must be distinguished from external search forms and text site search forms.

In this paper, we attempt first to distinguish search forms from non-search forms and second to distinguish hidden web search forms from the two other types of search forms. We use a small number of HTML input element attributes, since sparse matrix data are often very large and are overfitted by classifiers if care is not taken. Besides, HTML input element attributes such as names, values, labels, and URLs in the sparse matrix contain many different string values and variant spellings, which are hard to use without a daunting string and text processing task.

Another contribution of our work is that we conducted our experiments on a real data set that we collected by crawling web pages, in order to capture the newest web page trends instead of using dated archived data from repositories such as UIUC. Lastly, the performance of a random forest classifier is explored to evaluate how well it identifies different types of forms.

The rest of the paper is organized as follows: Section 2 presents related work in the field, and Section 3 describes the input values created from web pages. In Section 4, the random forest classifier is briefly discussed. Numerical results are presented in Section 5, followed by discussion and conclusions in Section 6.

2. RELATED WORK 

In what follows, we give an overview of previous work on hidden web classification and a brief review of approaches that address different aspects of hidden web data retrieval.

He et al. [6] surveyed the deep web in 2000, studying its scale, subject distribution, and search engine coverage, and reported their findings in 2007. They report that 72% of interfaces were found within depth 3 and 94% of web databases appeared within depth 3. They classified web databases into two types: unstructured databases, which provide data objects such as images, audio, and video, and structured databases, which provide data objects as structured “relational” records with attribute-value pairs. Hidden web databases are dominated by structured databases, at 77.2%. The top three subjects of hidden web pages are business & economy, computers & Internet, and education. The last question presented in their paper was “How do search engines cover the deep web?” They report that Google and Yahoo indexed 32% of hidden web pages while MSN indexed 11%.

Automatic hidden web database classification: post-query techniques

There are two methods to classify hidden web page forms. One is pre-query, which classifies web databases according to features of the query forms such as HTML input types and labels. The other is post-query, which identifies the search interfaces, submits probe queries, and analyzes the results.

Hidden web database classification has been performed based on probing [4][5]. Gong et al. [4] created a prototype for classifying hidden web databases into a predefined category hierarchy using query probing and link evaluation. Features for each class are extracted from randomly selected web pages. The hidden web is probed by analyzing the results of the class-specific query sent to the hidden database. The classification method used is the k-means nearest neighbor method. In [5], the authors did similar work but used an extensive number of rules or queries, probing multiple times, while Gong et al. used only one query for probing each category. In [11], the authors added semantic information to the feature vectors of forms and used centroid vectors to classify hidden web databases. The support vector machine classification method was used with data obtained from the UIUC databases to obtain high classification rates between 90.87% and 98.37% in five categories. In [5], the authors trained a rule-based document classifier and then used the classifier's rules to generate probing queries. The queries are sent to the databases, which are then classified based on the number of matches that they produce for each query. The UCI KDD archive database was used for their work.

Identifying hidden web pages: pre-query techniques

Pre-query techniques use visible features of forms. Form attribute labels are used as classifier inputs. In [2], the authors proposed adaptive crawling strategies to efficiently locate the entry points to hidden web databases. The strategies use both links and forms to enhance the identification of hidden databases. For link classification, features present in the anchor, URL, and text around links are extracted, and from them the terms with the highest document frequency are selected.

Hess and Kushmerick [7] used input tag labels, input names, and input types from 129 collected forms. For instance, for an HTML fragment in which the label “Enter name:” precedes a text input named “user”, the sequence [“enter”, “name”, “user”, “text”] is gathered and used as input to a Naïve Bayes classifier. They reported a classification rate of 82%.

Barbosa and Freire [1] used 216 searchable forms from the UIUC repository and gathered 259 non-searchable forms as negative examples. HTML form structure attributes were used as inputs to several classifiers: the numbers of hidden tags, radio tags, file inputs, submit tags, image inputs, buttons, resets, password tags, and textboxes, the number of items in selection lists, the sum of text sizes in textboxes, and the submission method (post versus get). Two thirds of the data were used as a training set and one third as a test set for a multilayer perceptron, Naïve Bayes, C4.5, and support vector machines. They reported error rates of 9.05% for C4.5 and 24% for the Naïve Bayes classifier.

Ye et al. [10] used a sparse matrix of HTML structure features. The features are the value and name attributes of “input”, “select”, “textarea”, and “label” elements, which create a sequence such as [input1, input1-name, input1-value] for each HTML input type. The numbers of “input”, “select”, “label”, and “textarea” elements in each form are also computed and used as features. In addition, the “name” and “method” attributes of the HTML form element and the “src” and “alt” values of “input-image” elements are used. A data set made up of forms from the MetaQuerier project [10] and web sites crawled from Search Engine Guide was used for their experiments. Bayes, C4.5 decision tree, support vector machine, and random forest classifiers produced classification rates from 79.78% at the lowest to 93.88% at the highest.

3. FORM ENTRY DATA 

3.1 Form collections 

There are many different HTML forms and 

many of them are not search interfaces. The forms are 

categorized into the following four groups: 

• External search: forms that provide access to external web search sites, such as Google, for the convenience of users. They are not considered an entry to an internal database.

• No search: forms for login, subscription, registration, polling, or blogging. They are not considered an entry to an internal database.

• Site search: forms that many web sites now provide for searching their own HTML pages. They are not considered an entry to an internal database.

• Internal database search: forms that provide entry to a site's own backend databases. These are considered true hidden web pages.

Figure 1: Different HTML forms: (a) external search form, (b) non search form, (c) site search form, (d) internal database search form

The data used for our experiments were created by crawling the web site www.searchengineguide.com with the web crawler WebLech to a depth of 3, and each form was manually categorized into one of the four types. Examples of each type are shown in Figure 1. Initially, 792 HTML forms were collected by the crawler, and only 9.8% of them were internal database forms. The other categories accounted for 12.86%, 46.36%, and 30.96% for external search, site search, and no search, respectively. To increase the number of internal database search forms, 89 forms were manually collected in 10 different subject areas and added to the sample set, which now contains 874 samples.

3.2 Form input elements 

A home-grown parser was written in Java to scan web pages and break them into HTML elements and their attributes. Inside HTML forms, the following elements (often called tags) were considered important for users to enter information for an internal database search. Based initially on the work by Ye et al. [10], with changes based on our own ideas, the following elements and their attributes were gathered.

• form element: attributes action and name. For action, only the file name from the URL was used. The “method” attribute was not used.

• input element type="text": name and value attributes. This forms a pair for each input text.

• input element type="checkbox": data are collected in the same way as for input type text.

• input element type="radio": data are collected in the same way as for input type text.

• select element: name and value attributes. This forms a set of triplets for each select element.

• textarea element: name and value attributes. This forms a set of triplets for each textarea element.

• Labels from all elements and types inside the HTML form were gathered as one attribute field containing comma-delimited text values.
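
The element gathering listed above can be illustrated with a minimal sketch. It is written in Python with the standard `html.parser` module rather than the Java parser used in this work, and the names `FormElementCollector` and `parse_forms` are hypothetical. Labels are omitted for brevity, and an input without a type attribute is treated as a text input, which is the HTML default.

```python
from html.parser import HTMLParser


class FormElementCollector(HTMLParser):
    """Collect name/value records for input, select and textarea elements in forms."""

    TRACKED = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per completed form
        self._current = None   # form being parsed, or None when outside a form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            action = attrs.get("action") or ""
            # Keep only the file name portion of the action URL.
            file_name = action.split("?")[0].rstrip("/").split("/")[-1]
            self._current = {
                "action": file_name,
                "name": attrs.get("name"),
                "elements": [],
            }
        elif self._current is not None and tag in self.TRACKED:
            input_type = None
            if tag == "input":
                input_type = (attrs.get("type") or "text").lower()
            self._current["elements"].append({
                "tag": tag,
                "type": input_type,
                "name": attrs.get("name"),
                "value": attrs.get("value"),
            })

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def parse_forms(html):
    """Return a list of form records found in an HTML string."""
    parser = FormElementCollector()
    parser.feed(html)
    parser.close()
    return parser.forms
```

Each returned record holds the action file name, the form name, and one entry per tracked element, from which name and value pairs can be laid out as columns of a feature matrix.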

The number of elements in a form varies widely. Some forms contain a large number of form elements while some contain none at all. Table 1 shows the maximum number of each HTML form element found in one HTML form. As a result, data created in this fashion form a large sparse matrix with 144 attributes and 877 samples.

HTML form element      Maximum number
Input type text        14
select                 6
Input type checkbox    16
Input type radio       18
textarea               2

Table 1: Maximum number of form input elements

There are two main problems with using this data set as features for a classification method. Not only is the data set very sparse, but the names and values gathered also have many variations (name, nam, user name, last name, lname, for instance) when meaningful names are used at all, and many of the names and values used are surprisingly meaningless, such as “tinyturing” and “lang-de”. Frequently an element name or value is missing, since they are not required in HTML. Laborious preprocessing of the text strings of names, values, and labels is required for them to be useful as classification features.

4. RANDOM FOREST CLASSIFIER 

The random forest [3] is a meta-learner that consists of many individual trees. Each tree votes on an overall classification for the given set of data, and the random forest algorithm chooses the classification with the most votes. Each decision tree is built from a random subset of the training data set, drawn by sampling with replacement. That is, some entities will be included more than once in the sample, and others will not appear at all. In building each decision tree, a random subset of the available variables is considered when choosing how best to partition the data set at each node. Each decision tree is built to its maximum size, with no pruning performed. Together, the resulting decision trees of the random forest represent the final ensemble model, in which each decision tree votes for the result and the majority wins.
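
The two mechanisms described above, sampling with replacement and majority voting, can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the WEKA implementation; the function names `bootstrap_indices` and `majority_vote` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement.

    Returns the drawn (in-bag) indices and the indices that were never
    drawn (out-of-bag).
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(20, rng)
vote = majority_vote(["search", "non search", "search"])
```

Because the draw is with replacement, `in_bag` typically contains repeated indices, and the samples in `out_of_bag` are not seen by the tree trained on that draw.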

There are two different sources of randomness in a random forest: the random training set (bootstrap) and the random selection of attributes. Using a random selection of attributes to split each node yields favorable error rates and is more robust with respect to noise. These attributes form nodes using standard tree-building methods. Diversity is obtained by randomly choosing attributes at each node of a tree and using the attributes that provide the highest level of learning. Each tree is grown to the fullest extent possible without pruning, until no more nodes can be created due to information loss. In Breiman's early work [3], each individual tree is given an equal vote; later versions of random forest allow both weighted and unweighted voting.

The random forest algorithm computes the out-of-bag error. Each sample is classified only by the trees whose bootstrap sample did not contain it, and the average misclassification over the entire forest is called the out-of-bag error. It is useful for predicting the performance of the classifier without involving a test set or cross-validation. The out-of-bag error of a random forest depends on the strength of the individual trees in the forest and the correlation between them. With a smaller number of attributes used for each split, the correlation between any two trees decreases and the strength of each tree decreases. These two have opposite effects on the error rate of the random forest: less correlation decreases the error rate while less strength increases it.
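
As an illustration of how an out-of-bag error can be computed, the following Python sketch assumes that, for every training sample, the predictions of the trees that did not see that sample have already been collected. The function name `out_of_bag_error` and the input layout are assumptions made for this example.

```python
from collections import Counter


def out_of_bag_error(labels, oob_votes):
    """Fraction of samples misclassified by their out-of-bag majority vote.

    labels:    true class of each training sample.
    oob_votes: for each sample, the predictions made by the trees whose
               bootstrap sample did NOT contain that sample.
    Samples with no out-of-bag votes are skipped.
    """
    errors = 0
    counted = 0
    for truth, votes in zip(labels, oob_votes):
        if not votes:
            continue
        predicted = Counter(votes).most_common(1)[0][0]
        counted += 1
        if predicted != truth:
            errors += 1
    return errors / counted if counted else float("nan")


labels = ["search", "non search", "search", "non search"]
oob_votes = [
    ["search", "search", "non search"],  # majority correct
    ["search"],                          # majority wrong
    [],                                  # in every bag, skipped
    ["non search", "non search"],        # majority correct
]
error = out_of_bag_error(labels, oob_votes)  # one error out of three counted samples
```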

5. EXPERIMENTS 

For our work, to eliminate the sparse matrix problem and to avoid preprocessing of strings, numeric data were gathered: the numbers of input elements of text type, checkbox type, and radio type, and the numbers of select elements and textarea elements in a form.

In addition, all strings from label elements and from the name and value attributes of input type text, checkbox, and radio, select elements, and textarea elements are scanned for the word “search” and its synonyms. The proximity of a string to a certain input element or type is not considered. Some synonyms of the word “search” in the context of our research work are “inquiry”, “examine”, “inspect”, “investigate”, “look”, “query”, and “find”. However, the word “search” is dominant, accounting for over 99% of occurrences, while “query” was found 4 times and “find” 2 times. The other words were not used at all in our data samples.
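
A minimal Python sketch of this numeric feature construction is given below. It assumes that each form has already been parsed into a list of element records and a list of label strings, and that the keyword scan is encoded as a single 0/1 attribute; the record layout, the function name `form_features`, and the binary encoding are assumptions for illustration only.

```python
# Words that were actually observed in the data samples.
SEARCH_WORDS = ("search", "query", "find")

COUNTED = ("text", "checkbox", "radio", "select", "textarea")


def form_features(elements, labels):
    """Build [n_text, n_checkbox, n_radio, n_select, n_textarea, has_search_word].

    elements: list of dicts with keys "tag", "type", "name", "value".
    labels:   list of label strings found in the form.
    """
    counts = {key: 0 for key in COUNTED}
    strings = list(labels)
    for element in elements:
        key = element["type"] if element["tag"] == "input" else element["tag"]
        if key in counts:
            counts[key] += 1
            strings.append(element.get("name") or "")
            strings.append(element.get("value") or "")
    text = " ".join(strings).lower()
    # Substring match; proximity to a particular element is not considered.
    has_search_word = int(any(word in text for word in SEARCH_WORDS))
    return [counts[key] for key in COUNTED] + [has_search_word]


elements = [
    {"tag": "input", "type": "text", "name": "q", "value": ""},
    {"tag": "select", "type": None, "name": "cat", "value": None},
    {"tag": "input", "type": "submit", "name": "go", "value": "Go"},
]
features = form_features(elements, ["Search our catalog"])
```

Here `features` has five counts followed by the keyword flag, which gives six numeric attributes per form regardless of how many elements the form contains.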

In our analysis, the random forest implementation in WEKA 3.5.6, software developed by the University of Waikato, was used for the experiments. The data file has 6 input attributes and 1 target class attribute. The target class attribute values are external search, non search, site search, and internal database search. Table 2 shows the number of samples in each class.

External search    Non search    Site search    Internal db search
104                288           344            151

Table 2: Number of samples in each class

The number of attributes used in random selection, “numFeatures” in WEKA (called Mtry in other random forest implementations), and the number of trees to be generated, “numTrees”, were the two parameters controlled in our work for performance evaluation. The maximum tree depth was set to 0 to build trees of any depth. For all data sets, numTrees was set to at most 10. For each numTrees value, the classification rates for Mtry values 3, 4, and 5 are shown in Figure 2.

5.1 Four-class classification

Two test methods were used: 10-fold cross-validation and a separate test sample set. The results with 10-fold cross-validation as the test option are shown in Figure 2 for various values of numTrees and numbers of attributes. The highest rate, 66.93%, was obtained when all 5 attributes were used with one tree. The next step was to test with a separate test sample set. The entire file was divided into two files: 70% as a training data set and 30% as a test data set. The classification rates obtained range between 62.63% and 62.89%, about 4% lower than those obtained by 10-fold cross-validation.
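
The two test options can be sketched in plain Python as index partitions. This is a simplified illustration: the function names are hypothetical, the sample count used in the example is arbitrary, and the folds here are not stratified by class.

```python
import random


def train_test_split_indices(n_samples, train_fraction, rng):
    """Shuffle sample indices and cut them into a training and a test part."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples, k, rng):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


rng = random.Random(1)
train, test = train_test_split_indices(100, 0.7, rng)   # 70% / 30% split
folds = list(k_fold_indices(100, 10, rng))              # 10-fold partition
```

With the split, one model is trained and evaluated once. With the folds, ten models are trained and the classification rate is averaged over the ten held-out parts.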

The confusion matrix is shown in Table 3. The site search result is surprisingly low, with a classification rate of zero, while internal database search shows a classification rate of 0.99. However, many samples from the external search, non search, and site search classes are falsely classified as internal database search (false positives, FP), at rates of 0.42, 0.18, and 0.50, respectively.
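
The entries of Table 3 are rates rather than counts, and each row of the table sums to 1. A confusion matrix of this kind can be obtained by dividing each row of counts by the number of samples in that actual class, as in the following Python sketch. The function name `normalized_confusion` and the toy labels are illustrative assumptions, not the data of this work.

```python
def normalized_confusion(actual, predicted, classes):
    """Return rates[a][p]: fraction of samples of actual class a predicted as p."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    rates = {}
    for a in classes:
        total = sum(counts[a].values())
        rates[a] = {
            p: (counts[a][p] / total if total else 0.0) for p in classes
        }
    return rates


classes = ["external", "non search"]
actual = ["external", "external", "non search", "non search", "non search", "non search"]
predicted = ["external", "non search", "non search", "non search", "non search", "external"]
rates = normalized_confusion(actual, predicted, classes)
```

The diagonal entry of each row is the classification rate of that class, and the off-diagonal entries of a column show how often other classes are falsely assigned to that column's class.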

Figure 2: Classification rates by 10-fold cross-validation

                      Classified as
Actual class          External search    Non search    Site search    Internal db search
External search       0.02               0.56          0.0            0.42
Non search            0.02               0.80          0.0            0.18
Site search           0.09               0.41          0.0            0.50
Internal db search    0.0                0.0           0.01           0.99

Table 3: Test sample classification confusion matrix

5.2 Two-class classification

The second experiment was to separate all search forms (external search, site search, and internal database search) from non-search forms such as password entry and registration. The classification rate with the test sample set is consistently 89.12% for numbers of attributes from 1 to 5 and numbers of trees from 1 to 10. Table 4 shows statistics from the test data set classification. With only two classes, the classification rate increases from 66.93% to 89.12%.

Statistic                   Value
Classification rate         0.8912
Out-of-bag error            0.1705
Mean absolute error         0.1667
RMSE                        0.2872
Search class TP rate        0.944
Search class FP rate        0.279
Non search class TP rate    0.721
Non search class FP rate    0.056

Table 4: Statistics of the 2-class test sample data set classification
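
The TP and FP rates in Table 4 are linked: in a two-class problem, the TP rate of one class equals one minus the FP rate of the other (0.721 = 1 - 0.279 and 0.944 = 1 - 0.056). The following Python sketch shows how such rates follow from raw counts. The counts used here are made-up numbers for illustration only, not the counts of our test set, and the function name `binary_rates` is hypothetical.

```python
def binary_rates(tp, fn, fp, tn):
    """Accuracy, TP rate and FP rate of the positive class from raw counts.

    tp: positives classified as positive    fn: positives classified as negative
    fp: negatives classified as positive    tn: negatives classified as negative
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
    }


# Hypothetical counts, with "search" taken as the positive class.
search = binary_rates(tp=90, fn=10, fp=20, tn=80)
# Swapping the roles of the classes gives the rates of the "non search" class.
non_search = binary_rates(tp=80, fn=20, fp=10, tn=90)
```

In this example the non search TP rate (0.8) is one minus the search FP rate (0.2), the same relationship that holds between the rows of Table 4.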

6. CONCLUSIONS AND FUTURE WORK

It is apparent from our results that the numbers of input type text, input type checkbox, input type radio, select, and textarea elements, along with the word “search”, are not enough to distinguish the 4 different classes: site search, external search, internal database search, and non search. Login, registration, discussion, and blogging forms are considered non search. The three search classes, external search, internal database search, and site search, are very similar in their HTML and PHP code and have very similar structure in appearance and layout. External search and site search facilities have been provided by many web pages in recent years and have almost become a standard feature of any web page.

The number of input type text elements does not differ enough statistically among the classes, as shown in the scatter diagram (Figure 3). Other HTML elements and types used in other work show similar results. When we combined the three search types into one and treated our data as a two-class problem, similar results were obtained.

Figure 3: Scatter diagram of the number of input type text elements (y-axis) for the 4 classes (x-axis)

In contrast, the word “search” is a good discriminator for distinguishing the non-search class from the three search classes. Over 99.9% of forms in the search classes contain the word “search”, while approximately 20% of forms in the non-search class contain it. The latter is due to the fact that in our research work we did not consider the proximity of the word “search”, or a similar word, to the location of an HTML form input element or type. In one case, an irrelevant URL in an input text of a non-search page contained the word “search”.

The 2-class classification rate we obtained is close to 90%, which implies that a simple set of input attributes can be used to distinguish search forms from login or registration pages. As future work, we propose an incremental learning machine architecture that consists of two stages:

Stage One: non-search forms are separated from all search forms.

Stage Two: the three types of search sites, site search sites, external search sites, and internal database search sites, are separated.
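
The control flow of such a two-stage architecture can be sketched as follows in Python. The two stage classifiers are passed in as functions. The stand-in rules used in the example (`keyword_rule` and `placeholder_stage_two`) are hypothetical placeholders, not trained models and not results of this work.

```python
def two_stage_classify(features, is_search_form, search_type_of):
    """Stage one filters out non search forms; stage two labels the rest."""
    if not is_search_form(features):
        return "non search"
    return search_type_of(features)


# Hypothetical stand-ins for the two trained classifiers.
# Feature layout assumed: [n_text, n_checkbox, n_radio, n_select, n_textarea, has_search_word]
def keyword_rule(features):
    return features[-1] == 1


def placeholder_stage_two(features):
    # Arbitrary illustrative rule; a real second stage would be a learned model.
    return "internal database search" if features[3] > 0 else "site search"


label_a = two_stage_classify([2, 0, 0, 0, 0, 0], keyword_rule, placeholder_stage_two)
label_b = two_stage_classify([1, 0, 0, 2, 0, 1], keyword_rule, placeholder_stage_two)
```

Separating the stages lets the second classifier be trained only on search forms, with attributes chosen specifically to tell the three search types apart.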

7. REFERENCES 

[1] L. Barbosa and J. Freire. “Combining classifiers to identify online databases.” WWW 2007, Banff, Canada. 2007.

[2] L. Barbosa and J. Freire. “An adaptive crawler for locating hidden web entry points.” WWW 2007, Banff, Canada.

[3] L. Breiman. “Random forests.” Machine Learning, Vol. 45, No. 1, pp. 5-32. 2001.

[4] Z. Gong, J. Zhang, and Q. Liu. “Automatic hidden web database classification.” PKDD 2007. Springer-Verlag Berlin Heidelberg, pp. 454-461.

[5] L. Gravano, P. Ipeirotis, and M. Sahami. “QProber: a system for automatic classification of hidden web databases.” ACM Transactions on Information Systems, Vol. 21, No. 1. 2003.

[6] B. He, M. Patel, Z. Zhang, and K. Chang. “Accessing the deep web.” Communications of the ACM, Vol. 50, No. 5, pp. 95-101. May 2007.

[7] A. Hess and N. Kushmerick. “Automatically attaching semantic metadata to web services.” Proceedings of IIWeb, pp. 111-116. 2003.

[8] J. Madhavan et al. “Google's deep web crawl.” VLDB '08, Auckland, New Zealand. 2008.

[9] A. Statnikov, L. Wang, and C. Aliferis. “A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification.” BMC Bioinformatics 2008, 9:319.

[10] Y. Ye, H. Li, X. Deng, and J. Huang. “Feature weighting random forest for detection of hidden web search interfaces.” Computational Linguistics and Chinese Language Processing, Vol. 13, No. 4. Feb 2009.

[11] W. Zuo, Y. Wang, X. Wang, D. Zhang, and T. Peng. “Automatic classification of deep web database based on centroid and wordnet.” Journal of Computational Information Systems, Vol. 6, No. 1, pp. 63-70. 2010.
