if analyzed at different times (e.g., the exploit server may have been shut down, or the infected websites may have been cleaned) or if analyzed from different network locations (e.g., malicious ads may target users belonging to a specific geographical location [6]). HARMUR, the Historical ARchive of Malicious URLs, is a security dataset developed in the context of the WOMBAT project (http://www.wombat-project.eu) that tries to address these two challenges. HARMUR leverages publicly available information on the security and network state of suspicious domains to build a “big picture” instrumental to a better understanding of web-borne threats and their evolution. HARMUR addresses the two challenges introduced above by focusing on the threat dynamics and on the threat context.

Threat dynamics. HARMUR is designed to perform a set of analysis tasks for each of a set of tracked domains. Each analysis task aims at collecting information about the state of a domain by querying different information sources. From the moment a specific domain is first introduced into the dataset, HARMUR applies a scheduling policy that repeats the analysis on a regular basis, giving priority to the domains believed to be “most interesting” and allocating its resources in a “best-effort” fashion. In the long term, this allows the reconstruction of an approximate timeline of the evolution of the domain state, which is instrumental to a better understanding of the threat dynamics.

Threat context. HARMUR aggregates information from a variety of sources to gain a more complete understanding of the state of a monitored resource. By collecting information from different security and networking feeds on a regular basis, it is possible to rebuild a partial “ground truth” on the state of a website at a given point in time. For instance, it is possible to correlate a change in the security state (e.g., from malicious to benign) with a change in the DNS records, or with the fact that the server has stopped responding to HTTP requests.
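To make this kind of correlation concrete, the following minimal sketch (Python; the record layout and all names are illustrative assumptions, not HARMUR's actual schema) walks a time-ordered list of per-domain snapshots and, for every security-state transition, reports which network-level attributes changed at the same time:

from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

@dataclass
class Snapshot:
    observed_at: datetime
    security_state: str             # e.g. "malicious" or "benign", as reported by a security feed
    dns_a_records: Tuple[str, ...]  # IP addresses the domain resolved to
    http_reachable: bool            # did the web server answer HTTP requests?

def explain_transitions(timeline: List[Snapshot]) -> List[str]:
    """For every change of security state, report which contextual
    attributes (DNS records, HTTP reachability) changed at the same time."""
    findings = []
    for prev, curr in zip(timeline, timeline[1:]):
        if prev.security_state == curr.security_state:
            continue
        context = []
        if prev.dns_a_records != curr.dns_a_records:
            context.append("DNS records changed")
        if prev.http_reachable and not curr.http_reachable:
            context.append("server stopped responding to HTTP")
        findings.append(
            f"{curr.observed_at.date()}: {prev.security_state} -> {curr.security_state} "
            f"({', '.join(context) or 'no network-level change observed'})"
        )
    return findings

In HARMUR, the underlying records come from the security and networking feeds discussed in the rest of the paper; the sketch only shows how repeated, timestamped observations make such correlations possible.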
2. RELATED WORK

When analyzing web threats, previous work has often proposed ad-hoc analysis solutions that focus on the mechanics and dynamics of specific threat instances [8, 11, 18]. Various detection techniques have been proposed for the detection of web-borne malware propagations. Most of the work has focused on the analysis and detection of drive-by downloads: similarly to what previously happened for server-side honeypots, researchers have proposed techniques with varying resource costs and levels of interaction.

High-interaction client honeypots leverage a full-fledged and vulnerable browser running in a contained environment; they include Capture-HPC [21], HoneyClient [23], HoneyMonkey [26] and Shelia [25]. In most cases, a website is considered malicious if visiting it with the browser results in an unexpected system modification, such as a new running process or a new file (the exception is Shelia, which leverages memory tainting to detect an exploit condition). In all cases, a website can be flagged as malicious if and only if the threat targets the specific setup found on the honeypot. For instance, a website exploiting a vulnerability in the Adobe Flash plugin will be flagged as benign by a honeypot on which the plugin is not installed.

Low-interaction client honeypots leverage a set of heuristics for the detection of vulnerabilities within a specific web page. For instance, HoneyC [17] leverages Snort signatures for the detection of malicious scripts, while SpyBye scans the web content using the ClamAV open-source antivirus [14]. In both cases, the detection of a threat depends on the generation of some sort of signature, and therefore requires a relatively detailed knowledge of the threat vector. More sophisticated approaches have been proposed, such as PhoneyC [10], which implements vulnerability plugins to be offered to the malicious website (similarly to what Nepenthes [1] does for server-side honeypots), and Wepawet [3, 5], which employs sophisticated machine learning techniques to analyze JavaScript and Flash scripts.

Despite the variety of the proposed approaches, it is worth noticing that none of them is likely to achieve a 100% detection rate. Either because of the impossibility of detecting exploits that target a different configuration, or because of the limitations implicit in the use of heuristics, both high- and low-interaction client honeypots have a non-null failure probability that may lead them to mark a malicious website as benign. Previous work has proposed to deal with these limitations by combining low- and high-interaction techniques [6, 15]. With HARMUR, we push this attempt further by proposing a generic aggregation framework able to collect and correlate information generated by different security feeds (e.g., different client honeypots with different characteristics) with generic contextual information on the infrastructure underlying a specific domain. By correlating such information, we aim at learning more about the structure and the characteristics of web threats, as well as about the characteristics and limitations of modern web threat analysis techniques. For instance, HARMUR has been used in the past as an information source for the analysis of a specific threat type, that of Rogue AV domains [4], although it collects information on a variety of other web-borne threats.

3. HARMUR ARCHITECTURE

As previously explained, HARMUR is not, per se, a client honeypot. HARMUR is an aggregator of information generated by third parties, and uses this information to generate a historical view for a set of domains that are believed to be malicious. In order to allow HARMUR to scale to a significant number of domains while building this historical view, all the analysis operations must have low cost: we have decided to avoid by design any expensive operation, such as crawling the domains’ content, and to rely on existing crawling infrastructures for the analysis of the domain security. The HARMUR dataset and associated framework are built around the concept of domain, more exactly the Fully Qualified Domain Name (FQDN) that is normally associated with one or more URLs.
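As a minimal illustration of this domain-centric view (the snippet below is an assumption-laden sketch, not part of HARMUR), URLs coming from the feeds can be grouped under the FQDN they belong to, which then becomes the unit tracked over time:

from collections import defaultdict
from urllib.parse import urlsplit
from typing import Dict, Set

def group_by_fqdn(urls: Set[str]) -> Dict[str, Set[str]]:
    """Map each FQDN to the set of URLs observed under it."""
    domains: Dict[str, Set[str]] = defaultdict(set)
    for url in urls:
        fqdn = urlsplit(url).hostname
        if fqdn:
            domains[fqdn].add(url)
    return domains

# Example with made-up URLs: two URLs under the same domain collapse into one entry.
print(group_by_fqdn({
    "http://example.com/exploit.html",
    "http://example.com/payload.exe",
    "http://other.example.org/",
}))

All information subsequently collected for these URLs is attached to the corresponding domain and contributes to its historical view.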

Each domain is associated with a color that identifies its current status according to the following coding:

Red. At least one threat is currently known to be served by one of the hostnames belonging to the domain.

Green. No threat has ever been known to be associated with any of the hostnames belonging to the domain (the domain has never been red).

Orange. No threat is currently known to be hosted within the domain, but the domain has been red in the past.

Gray. None of the currently available security feeds is able to provide any information on the domain. This is most likely because the domain has never been analyzed.

Black. None of the hostnames associated with the domain is currently reachable. This can be due to the removal of the associated DNS records or to a failure to respond to HTTP requests.

The HARMUR framework is made of two main building block types that are in charge of populating the underlying dataset with various types of metadata.

URL feeds. URL feeds are in charge of populating the HARMUR dataset with lists of fresh URLs (and associated FQDNs) that are likely to be of interest. These can include lists of URLs detected as malicious by crawlers, but also URLs available in public lists of malicious domains or phishing sites.

Analysis modules. An analysis module wraps an action to be periodically performed on a specific domain. An analysis module is defined by an action, timing information for its repetition, a set of dependencies with respect to other analysis modules, and a list of color priorities.

Action. The specific analysis to be executed on the domain. An analysis module has full visibility over the information currently available for the assigned domain, and is in charge of updating the domain status. It should be noted that an analysis module never deletes any information from a domain state: any information in HARMUR is associated with a set of timestamps that define the time periods in which it has been seen holding true.

Timing settings. Each analysis module defines a function T(color) that specifies the frequency with which a given domain should be analyzed (given unlimited resources) as a function of its current color. For instance, the security state of a red domain is likely to change more quickly than that of a green one, and should therefore be checked more frequently (e.g., on a daily basis) than that of a green domain (which can be checked on a weekly or even monthly basis).

Module dependencies. Each analysis module is likely to depend on the output generated by other analysis modules: for instance, a module in charge of analyzing the geographical location of the web servers associated with a specific domain requires access to the DNS associations generated by another module. This information is used by HARMUR in the scheduling process to decide which domains qualify for analysis by a specific module.

Color priorities. Each module defines the priority rules for the choice of the batch of k domains to be processed at a given analysis round. For instance, checking the availability of a red domain should have priority over the execution of the same action for a domain that has been green for a long time. Still, a green domain should be checked whenever resources are available.

HARMUR’s core consists of a simple scheduler in charge of assigning tasks to a pool of threads. Each analysis task is composed of an analysis module (defining a specific analysis action) and a batch of k domains to be processed by that module (where k is a configuration parameter).
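Although the paper does not describe HARMUR at the code level, the following minimal sketch (Python; all class and field names are our own assumptions) illustrates how such an analysis module could be declared, together with the color-priority batch selection described in the next paragraph:

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List

class Color(Enum):
    RED = "red"        # at least one threat currently served by the domain
    GREEN = "green"    # never associated with a threat
    ORANGE = "orange"  # clean now, but red in the past
    GRAY = "gray"      # no security feed has information yet
    BLACK = "black"    # no associated hostname is currently reachable

@dataclass
class Domain:
    fqdn: str
    color: Color
    last_analyzed: Dict[str, float] = field(default_factory=dict)  # module name -> timestamp

@dataclass
class AnalysisModule:
    name: str
    action: Callable[[Domain], None]  # the analysis to run on each assigned domain
    period_minutes: Dict[Color, int]  # T(color): how often to re-analyze, per color
    dependencies: List[str]           # modules whose output must already be available
    color_priority: List[Color]       # highest-priority color first

def select_batch(module: AnalysisModule, pending: List[Domain],
                 k: int = 100, n: int = 10) -> List[Domain]:
    """Pick at most k domains from those already due for re-analysis
    (per T(color)): reserve n slots per priority class, then fill the
    remaining slots starting from the highest-priority color."""
    by_color = {c: [d for d in pending if d.color == c] for c in module.color_priority}
    batch: List[Domain] = []
    for c in module.color_priority:          # reserve n slots per class
        batch.extend(by_color[c][:n])
        by_color[c] = by_color[c][n:]
    for c in module.color_priority:          # fill what is left, highest priority first
        room = k - len(batch)
        if room <= 0:
            break
        batch.extend(by_color[c][:room])
    return batch[:k]

Running select_batch with k = 100, n = 10, the priority order [RED, GREEN, ORANGE] and a pending set of 50 red, 60 green and 100 orange domains reproduces the 50/40/10 split used in the example below; a period map such as one day for red domains and one month for green ones would capture the T(color) policy sketched above.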
The k domains are picked among those that have not been analyzed by the specific analysis module in the last T(color) minutes. Among all the domains requiring analysis, the choice of the k domains is based on their current color and on the color priorities specified by the analysis module, using a simple assignment algorithm. The scheduler initially assigns a total of n (with n ≪ k) domains to each priority class, and then proceeds to fill the remaining positions starting from the highest priority. For instance, if a module specifies the color priority order [red, green, orange], with k = 100 and n = 10, and a total of 50 red domains, 60 green domains, and 100 orange domains need to be analyzed, the batch will be composed of 50 red domains, 40 green domains and 10 orange domains.

3.1 Input feeds

Thanks to the definition of these basic building blocks, the HARMUR framework is easily extended with new URL feeds or analysis modules. The currently implemented components are represented in Figure 1. HARMUR receives URL feeds from a variety of sources:

• Norton Safeweb (http://safeweb.norton.com)
• Malware Domain List (http://malwaredomainlist.com)
• Malware URL (http://www.malwareurl.com)
• Hosts File (http://www.hosts-file.net)
• Phishtank (http://www.phishtank.com/)

On top of these basic URL sources, HARMUR can enrich the initial URL feed for URLs having specific characteristics. This is achieved by leveraging passive DNS information (obtained from http://www.robtex.com/)
