
Manoj Cherukuri and Srinivas Mukkamala

C = (1/n) Σi Ci (1)

where C represents the clustering coefficient of the graph, Ci represents the clustering coefficient of node i, and n is the total number of nodes in the graph (Clustering Coefficient, 2010). The clustering coefficient of the graph in the figure above is 0.466, i.e. 1/5 × (0.33 + 1 + 1 + 0 + 0).
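The averaging above can be sketched directly. The graph below is a hypothetical 5-node example chosen only so that its per-node coefficients match the quoted values (1/3, 1, 1, 0, 0); it is not the figure from the paper.

```python
# Sketch: graph clustering coefficient as the mean of per-node coefficients.
# The adjacency sets below are an illustrative 5-node graph, not the
# paper's figure; they are chosen to reproduce Ci = 1/3, 1, 1, 0, 0.
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0                      # degree < 2: coefficient defined as 0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)

adj = {
    'A': {'B', 'C', 'D'},
    'B': {'A', 'C'},
    'C': {'A', 'B'},
    'D': {'A'},
    'E': set(),                         # isolated node
}

per_node = {n: clustering_coefficient(adj, n) for n in adj}
C = sum(per_node.values()) / len(adj)   # average over all n nodes
print(round(C, 3))                      # ≈ 0.467 (0.466 with 0.33 truncated)
```

With the exact value 1/3 rather than the truncated 0.33, the average comes out as 0.467, which is why the paper reports 0.466.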

3. Processes

Construction of our dataset is composed of three processes:

 The first process deals with the collection of malicious websites from multiple sources.
 The second process deals with the construction of the link structure for the malicious domains obtained in the previous process.
 The third process deals with computing the geographical location of the websites based on the IP address to geographical location mapping.

3.1 Collection of malicious websites

The process of collecting the malicious websites is initialized by identifying the potential sources using meta-search engines. These sources publish the websites with different sets of associated attributes; some of the attributes associated with these domains are the date, the type of attack, the executable name and the IP address. A custom parser was used for each website source to retrieve the domain name and, if available, the IP address. All the malicious websites collected from these sources are stored in the database.
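The per-source parsing and storage step might be sketched as follows. The feed format, field names and the SQLite schema here are illustrative assumptions, not the paper's actual code.

```python
# Sketch of the collection step: one custom parser per source feed,
# with the parsed (domain, ip) pairs stored in a database. The feed
# format and schema below are hypothetical.
import sqlite3

def parse_example_feed(raw_text):
    """Hypothetical parser for a feed listing one 'domain,ip' pair per line."""
    for line in raw_text.strip().splitlines():
        domain, _, ip = line.partition(',')
        yield domain.strip(), (ip.strip() or None)   # IP may be absent

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE malicious (domain TEXT PRIMARY KEY, ip TEXT)')

# Stand-in feed content; real sources (PhishTank etc.) differ per source.
raw = "bad.example.com,203.0.113.7\nevil.example.org,"
for domain, ip in parse_example_feed(raw):
    db.execute('INSERT OR IGNORE INTO malicious VALUES (?, ?)', (domain, ip))
db.commit()

print(db.execute('SELECT COUNT(*) FROM malicious').fetchone()[0])  # 2
```

A separate parser function of this shape would be written for each source, since each publishes a different set of attributes.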

3.2 Construction of link structure for malicious websites

Crawling is performed on the malicious websites obtained from the previous process using a custom program. The malicious websites are crawled up to the second hop. The flowchart for the process of building the link structure is shown in figure 6. All the malicious domains retrieved from the various sources are loaded and act as the seeds for the crawling process, with n being the total number of malicious websites. A custom crawler was built to retrieve the content of the malicious websites and parse all the anchor tags from the content. The parsing of the anchor tags from the website content is done using BeautifulSoup (Beautiful Soup, 2010), a widely used HTML parser. All the links originating from the respective malicious websites are stored in the database. The domain is parsed from each such link and is stored in the tuple corresponding to that link. If the link URL is not listed in the malicious websites, it is added to a list called LinkWebsites to avoid duplicate URLs. Once all the malicious domains are crawled and their corresponding links are stored, the LinkWebsites list is loaded into the MaliciousWebsites list so that all the new websites obtained in the first hop from the malicious websites are crawled. The process of crawling, parsing anchor tags and storing the links is then repeated. Thus, all the links within the first two hops emerging from the malicious websites are obtained.
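The two-hop crawl can be sketched as below. The paper's crawler fetches live pages and parses them with BeautifulSoup; to stay self-contained this sketch uses the stdlib HTMLParser for anchor extraction and canned pages in place of HTTP fetches, and the domain names are made up.

```python
# Sketch of the two-hop crawl with a LinkWebsites frontier list.
# Anchor tags are extracted with the stdlib HTMLParser (standing in for
# BeautifulSoup); PAGES is a stub for live HTTP fetches.
from html.parser import HTMLParser
from urllib.parse import urlparse

class AnchorParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

# Canned page bodies keyed by URL (hypothetical domains).
PAGES = {
    'http://mal.example.com':  '<a href="http://hop1.example.net">x</a>',
    'http://hop1.example.net': '<a href="http://hop2.example.org">y</a>',
    'http://hop2.example.org': '<a href="http://hop3.example.io">z</a>',
}

def fetch_links(url):
    parser = AnchorParser()
    parser.feed(PAGES.get(url, ''))
    return parser.links

frontier = ['http://mal.example.com']    # seed malicious websites
link_graph = {}                          # url -> outgoing links (the "tuples")
for hop in range(2):                     # crawl up to the second hop
    next_frontier = []                   # the LinkWebsites list
    for url in frontier:
        links = fetch_links(url)
        link_graph[url] = links
        for link in links:
            if link not in link_graph and link not in next_frontier:
                next_frontier.append(link)   # avoid duplicate URLs
    frontier = next_frontier             # becomes the next MaliciousWebsites list

domains = {urlparse(u).netloc for u in link_graph}  # domain parsed from each URL
print(sorted(link_graph))
```

Only the seeds and the first-hop pages are fetched; the links they yield cover everything within the first two hops, matching the cut-off described above.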

3.3 Computing geographical location of websites

The location of the server on which each website is hosted was determined using an IP address to geographic location mapping. All the distinct domains stored during the link analysis phase are translated to their corresponding IP addresses using the Domain Name System (DNS). A custom script performs the domain name to IP address translation using the ‘nslookup’ command. The IP address to location mapping is done using an open-source database. All the IP addresses for which the location is not identified are mapped to latitude and longitude values of 0 and 0 respectively.
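A minimal sketch of this step, under stated assumptions: `socket.gethostbyname` stands in for the `nslookup` call, and `GEO_DB` is a tiny stand-in for the open-source IP-to-location database, with illustrative coordinates that are not real data.

```python
# Sketch of domain -> IP -> (lat, lon) resolution. gethostbyname stands
# in for the paper's nslookup script; GEO_DB is a hypothetical lookup
# table standing in for the open-source geolocation database.
import socket

GEO_DB = {'127.0.0.1': (35.08, -106.65)}  # illustrative coordinates only

def locate(domain):
    try:
        ip = socket.gethostbyname(domain)
    except socket.gaierror:
        return None, (0.0, 0.0)           # unresolved: map to (0, 0)
    return ip, GEO_DB.get(ip, (0.0, 0.0)) # unknown IP: also (0, 0)

ip, (lat, lon) = locate('localhost')
print(ip, lat, lon)
```

Both failure modes collapse to latitude/longitude (0, 0), mirroring the convention stated above for unidentified locations.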

4. Dataset

For our experiments, all the potential sources for the malicious websites were identified using meta-search engines. The sources so identified and used for the study were PhishTank, Malware Domain Blocklist, abuse.ch, MalwarePatrol, joewein.de LLC. and Malware Domain List.
