International Journal of Computer Science and Management Research Vol 2 Issue 2 February 2013
ISSN 2278-733X
FUZZY CLUSTERING IN WEB TEXT MINING AND ITS APPLICATION IN IEEE ABSTRACT CLASSIFICATION

Rahul R. Papalkar #1, Gajendrasingh Chandel *2
# Department of Information Technology, SSICT, Sehore, (M.P.) India
* Department of Information Technology, SSICT, Sehore, (M.P.) India

Abstract
Text Mining, a branch of computer science [1], is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Text Mining is seen as an increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage. Web text retrieval refers to text retrieval techniques applied to Web resources and literature available on the Web. The volume of published Web research, and therefore the underlying Web knowledge base, is expanding at an increasing rate. Web text retrieval is a way to aid researchers in coping with information overload. By discovering predictive relationships between different pieces of extracted data, data-mining algorithms can be used to improve the accuracy of information extraction. However, textual variation due to typos, abbreviations, and other sources can prevent the productive discovery and utilization of hard-matching rules. Recent methods of soft clustering can exploit predictive relationships in textual data. This paper presents a technique for using a soft-clustering Text Mining algorithm to increase the accuracy of Web text extraction. Experimental results demonstrate that this approach improves text extraction more effectively than hard keyword-matching rules.

Keywords—Fuzzy clustering, Text Mining, Web mining, document clustering
I. INTRODUCTION
Searching for the documents most similar to a given one is crucial in text mining because it is the basic process of many techniques such as classification and information retrieval. Two of the major issues that text mining faces are the large number of documents, millions in modest cases, and the very high dimensionality of the feature space. Text documents are usually represented as vectors, where each dimension corresponds to a term and the value reflects its importance in the document. There are many approaches to finding the exact vicinity of an object. However, they suffer from the curse of dimensionality; that is, their performance drastically decreases as the number of dimensions grows. This problem prevents their application in text mining. To avoid the curse of dimensionality, a variety of methods based on inexact searching have been proposed.
In [1, 2] a probabilistic technique with good performance was presented. This solution uses some elements of the training set as pivots or permutants. Basically, the permutants are used to predict proximity between elements and to reduce the number of real distance evaluations at query time. Although this method performs well when searching for proximities over documents, it introduces an overhead at search time, due to the need to perform a sequential search over the permutants [1] or to use an auxiliary structure to avoid it [2]. This overhead increases as the space dimension or the size of the dataset grows. In this paper we introduce improvements to our access method, presented in [3], for indexing collections of objects representing a very high-dimensional space. For indexing, this method uses a combination of a graph structure and pivots (used as entry points), and a very fast search algorithm that uses distance- or similarity-based measures in order to obtain the k-nearest neighbors (knn) of novel query objects. In this paper, we introduce a new fast way to generate the connected graph and a pruning rule to improve searches. Although the time required to generate the index structure grows with the size of the collection of objects used, this process is carried out only once (offline) and does not affect the query process.
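As a concrete baseline for the discussion above, the sketch below shows TF-IDF document vectors and brute-force k-nearest-neighbour search in Python. This is illustrative code, not the indexing method of [1-3]: the function names and the tiny corpus are invented for the example, and the point of the pivot/graph index described here is precisely to avoid the linear scan that `knn` performs.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # One sparse vector (dict term -> weight) per document:
    # term frequency times inverse document frequency.
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn(query, vecs, k):
    # Brute-force knn: rank every document by similarity to the query.
    # This linear scan is the cost that index structures try to avoid.
    order = sorted(range(len(vecs)), key=lambda i: cosine(query, vecs[i]),
                   reverse=True)
    return order[:k]
```
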
Rahul R.Papalkar et.al. www.ijcsmr.org
II. LITERATURE SURVEY
Current Text Mining tools operate on structured data, the kind of data that resides in large relational databases, whereas data in multimedia databases are semi-structured or unstructured. Compared with text mining, multimedia mining reaches much higher complexity, resulting from: a) the huge volume of data; b) the variability and heterogeneity of the multimedia data (e.g. diversity of sensors, time or conditions of acquisition, etc.); and c) the subjective meaning of multimedia content [6].
Unstructured data

Unstructured data is simply a bit stream. Examples include pixel-level representations for images, video, and audio, and character-level representations for text. Substantial processing and interpretation are required to extract semantics from unstructured data [7]. This kind of data is not broken down into smaller logical structures and is not typically interpreted by the database management system.
Architectures for Multimedia Text Mining

Various architectures are being examined to design and develop a multimedia Text Mining system. The first architecture works as follows: extract data or metadata from the unstructured database, store the extracted data in a structured database, and apply Text Mining tools on the structured database [8]. This is illustrated in Figure 2.1.
Figure 2.1 Converting unstructured data to structured data for mining
Figure 2.2 presents the multimedia mining process [18]. Data collection is the starting point of a learning system, as the quality of the raw data determines the overall achievable performance. Then, the goal of data pre-processing is to discover important features from the raw data. Data pre-processing includes data cleaning, normalization, transformation, feature selection, etc. Learning can be straightforward if informative features can be identified at the pre-processing stage. The detailed procedure depends highly on the nature of the raw data and the problem's domain. In some cases, prior knowledge can be extremely valuable.
Figure 2.2 Multimedia Mining Process
For many systems, this stage is still primarily conducted by domain experts. The product of data pre-processing is the training set. Given a training set, a learning model has to be chosen to learn from it. It must be mentioned that the steps of multimedia mining are often iterative; the analyst can also jump back and forth between major tasks in order to improve the results [6].
Figure 2.3 Architecture of applying multimedia mining to different multimedia types
Figure 2.3 presents an architecture for applying multimedia mining to different multimedia types [5]. Here the main stages of the Text Mining process are (1) domain understanding; (2) data selection; (3) cleaning and preprocessing; (4) discovering patterns; (5) interpretation; and (6) reporting and using discovered knowledge. The domain understanding stage requires learning how the results of data mining will be used, so as to gather all relevant prior knowledge before mining.
III. METHODOLOGY
This process is done in three steps: information retrieval, information extraction, and text mining. A primary reason for using Text Mining on web text is to assist in the analysis of collections of the available web text. Web data is vulnerable to collinearity because of unknown interrelations. The analysis in this paper is augmented by an experiment-based approach. Before Text Mining algorithms can be used, a target data set must be assembled. As Text Mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns. Pre-processing is essential to analyze multivariate datasets before clustering or text mining. The target set is then cleaned: cleaning removes observations with noise and missing data. The web data available to us is first put into a data warehouse. Before putting the data in the data warehouse, a keyword extraction algorithm is used to find the keywords in the full text. This keyword extraction uses a partial parser to extract entity names; the parser uses linguistic rules and statistical disambiguation to achieve greater precision. The data is then organized into clusters. Clustering is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data. The clusters are created, based on the keywords extracted from our web text, using the fuzzy c-means algorithm. Fuzzy C-Means (FCM) is one of the most widely used soft clustering algorithms; it is a variant of the standard k-means algorithm that uses a soft membership function.
FCM is based on minimization of the objective function $F_m(U, C)$:

$$F_m(U, C) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \, d^2(x_i, c_j)$$

FCM computes the memberships $u_{ij}$ and the cluster centers $c_j$ by:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d(x_i, c_j)}{d(x_i, c_k)} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}$$

where $m$, the fuzzification factor, is a weighting exponent on each fuzzy membership and is any real number greater than 1; $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$; $x_i$ is the $i$-th d-dimensional measured datum; $c_j$ is the center of cluster $j$; and $d^2(x_i, c_j)$ is a distance measure between object $x_i$ and cluster center $c_j$, with $d(x_i, c_j) = \|x_i - c_j\|$ for any norm $\|\cdot\|$ expressing the similarity between the measured data and the center.
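The two update equations above can be sketched as a minimal fuzzy c-means implementation that alternates the membership and center updates until the membership matrix stabilizes. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function name, the random initialization, and the stopping tolerance are choices made for the example.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal FCM: X is an (n, d) data matrix, c the number of clusters,
    m > 1 the fuzzification factor. Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(max_iter):
        Um = U ** m
        # Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances d(x_i, c_j) under the Euclidean norm
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # guard against divide-by-zero
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:      # stop when memberships settle
            U = U_new
            break
        U = U_new
    return centers, U
```

On two well-separated groups of points, the highest membership of each point recovers its group, while points between the groups receive intermediate memberships, which is the "soft" behavior that distinguishes FCM from k-means.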
Proposed Algorithm
1. Read the input string.
2. Read the input search path.
3. Cluster the input string as per fuzzy c-means clustering.
4. Read files from the selected path with the specified extension.
5. Convert the selected file into text-readable format.
6. Search the input-string cluster in the file and store the result in the output cluster directory.
7. Repeat steps 5 and 6 until all files are scanned; otherwise go to step 8.
8. Stop.
The proposed algorithm is responsible for extracting the keywords present in each full-text web article and storing these keywords in a relation. Then the actual work of the algorithm begins: it starts clustering the keywords. The algorithm initially picks some of the extracted keywords and groups the full-text articles based on them, so that each cluster contains only those articles which contain that keyword. It then uses fuzzy c-means clustering to combine clusters on some similarity measure: two clusters are combined if their similarity measure is greater than or equal to a specified threshold value. The proposed algorithm repeats this process until no more changes are made to the clusters. Finally, it stores all the clusters in a directory.
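The paper does not specify which similarity measure is used when combining clusters. Assuming, for illustration, Jaccard similarity over the sets of article ids in each cluster, the threshold-based merging loop with its "repeat until no more changes" stopping rule could be sketched as:

```python
def jaccard(a, b):
    # Jaccard similarity between two sets of article ids.
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_clusters(clusters, threshold=0.5):
    """Repeatedly merge the first pair of clusters whose similarity meets
    the threshold; stop when no pair qualifies (no more changes)."""
    clusters = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i], clusters[j]) >= threshold:
                    clusters[i] |= clusters.pop(j)   # absorb cluster j into i
                    changed = True
                    break
            if changed:
                break
    return clusters
```
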
Our motive here is to extract all the full-text articles which may be relevant to the user providing the search string; to this end, out of all the clusters, the cluster with the largest number of articles is our target.
IV. EXPERIMENTAL RESULTS
The experiments were performed on a test application developed in ASP.NET 3.0. The database contains all the article entries, populated manually from IEEE abstracts. The search was performed using the traditional keyword-based search algorithm and compared with the proposed algorithm. A set of search results is shown in Table 4.1. Given the same data for text extraction, the proposed algorithm retrieved approximately 89% more relevant search results than keyword-based searching.
Input String       | Keyword Based Search | Proposed Method
-------------------|----------------------|----------------
Fuzzy logic        | 46                   | 85
Neural Network     | 43                   | 89
Image mining       | 49                   | 94
Signal Processing  | 36                   | 96

Table 4.1 Comparative study of keyword-based search and the proposed method (number of matching elements found)
V. CONCLUSION
Extraction of text from the web is an essential operation. Although many text extraction methods have been developed, this paper presents a novel technique that employs keyword-based article clustering to further enhance the text extraction process. The development of the proposed algorithm is of practical significance; however, it is challenging to design a unified approach of text extraction that retrieves the relevant text articles more efficiently. The proposed algorithm, using a data mining algorithm, extracts text with contextual completeness in overall, individual, and collective forms, making it able to significantly enhance text extraction from web literature.
REFERENCES
[1] Clifton, Christopher (2010). "Encyclopedia Britannica: Definition of Data Mining". Retrieved 2010-12-09.
[2] Han, J., & Kamber, M. Data Mining: Concepts and Techniques. CA: Morgan Kaufmann, 2001.
[3] Badgett, R. G. How to search for and evaluate medical evidence. Seminars in Medical Practice 1999, 2:8-14, 28.
[4] Richardson, J. Building CAM databases: the challenges ahead. J Altern Complement Med 2002, 8:7-8.
[5] Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0471228524. OCLC 50055336.
[6] Miller, H. and Han, J. (eds.), 2001. Geographic Data Mining and Knowledge Discovery. London: Taylor & Francis.
[7] Manu Aery, Naveen Ramamurthy, and Y. Alp Aslandogan. Topic identification of textual data. Technical report, The University of Texas at Arlington, 2003.
[8] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[9] Cecil Chua, Roger H. L. Chiang, and Ee-Peng Lim. An integrated data mining system to automate discovery of measures of association. In Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000.
[10] George Forman. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res., 3:1289-1305, 2003.
[11] Rayid Ghani. Combining labeled and unlabeled data for text classification with a large number of categories. In IEEE Conference on Data Mining, 2001.
[12] George Karypis and Eui-Hong Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical report TR-00-0016, University of Minnesota, 2000.
[13] Jerome Moore, Eui-Hong Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, and Bamshad Mobasher. Web page categorization and feature selection using association rule and principal component clustering. In 7th Workshop on Information Technologies and Systems, 1997.
[14] Sam Scott and Stan Matwin. Text classification using WordNet hypernyms. In Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, 1998.
[15] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[16] Andreas Weingessel, Martin Natter, and Kurt Hornik. Using independent component analysis for feature extraction and multivariate data projection, 1998.
[17] Robert Nisbet (2006). Data Mining Tools: Which One is Best for CRM? Part 1. Information Management Special Reports, January 2006.
[18] Dominique Haughton, Joel Deichmann, Abdolreza Eshghi, Selin Sayek, Nicholas Teebagy, and Heikki Topi (2003). A Review of Software Packages for Data Mining. The American Statistician, Vol. 57, No. 4, pp. 290-309.
[19] R. Agrawal et al. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pp. 307-328. MIT Press, 1996.
[20] Kumar, V. (2011). An Empirical Study of the Applications of Data Mining Techniques in Higher Education. International Journal of Advanced Computer Science and Applications (IJACSA), 2(3), 80-84.
[21] Jadhav, R. J. (2011). Churn Prediction in Telecommunication Using Data Mining Technology. International Journal of Advanced Computer Science and Applications (IJACSA), 2(2), 17-19.
[22] Devi, S. N. (2011). A Study on Feature Selection Techniques in Bio-Informatics. International Journal of Advanced Computer Science and Applications (IJACSA), 2(1), 138-144.