
April 10, 2011 Salzburg, Austria - WOMBAT project


[Figure 7: Collective Matrix Factorization of customer 1 with 15 latent variables. Panels: (a) User-Attribute, test loss for X; (b) User-Permissions, test loss for Y. The horizontal axis runs from 0 to 1; its label was lost in extraction.]

[...]sions should result in a measurable decrease in reconstructive error. Matrix factorization, unlike the entropy calculations, is impacted by the assignments of other permissions. For example, user i with permission j may be granted a right k because many users with j also have k. Thus, the conditional entropy of an attribute A on a permission j should really be measured as ∼ H(p_j | P \ {p_j}) − H(p_j | P \ {p_j}, A).

5. CONCLUSIONS

This paper examined eleven real-world datasets of access control policies typically used for research in role mining, access control analytics, and policy learning. Of these, eight datasets are public and have been used in the academic literature for research and evaluation, and three are confidential datasets from clients. Analysis indicated that the public data contains more distinct clusters of users and permissions, making it well suited to role-based access control and making models easier to learn in general. In contrast, the client datasets were sparser on average, more fragmented (as calculated by the number and ratio of formal concepts), and less compressible using roles. Because the public data lacks semantic meaning, the types of analysis we can perform on it are limited.

Next, we more closely analyzed the client data, which contains semantic information about both the users and permissions. We found the permissions to have a longer tail, and the customer 3 dataset follows a power law distribution. The sparseness of these datasets leads us to speculate that there are ad personam permissions, e.g., rights to a user’s home directory, that should be parameterized. This would reduce the complexity of such datasets, simplify their analysis, and cut the long tail.
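The conditional-entropy difference stated just before the conclusions, H(p_j | P \ {p_j}) − H(p_j | P \ {p_j}, A), can be illustrated with a small Python sketch. This is only a sketch under stated assumptions: the function names, the toy data, and the plug-in (frequency-count) estimator are ours and are not taken from the paper, which does not say how its entropies were estimated.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(targets, contexts):
    """Plug-in estimate of H(target | context) in bits (assumed estimator)."""
    groups = defaultdict(Counter)
    for t, c in zip(targets, contexts):
        groups[c][t] += 1  # count target values within each context
    n = len(targets)
    h = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        for k in counts.values():
            # p(context) * p(target | context) * log2 p(target | context)
            h -= (size / n) * (k / size) * log2(k / size)
    return h


def attribute_gain(perms, attr, j):
    """Estimate H(p_j | P without p_j) - H(p_j | P without p_j, A).

    perms: list of 0/1 rows, one row per user (hypothetical toy format).
    attr:  one attribute value per user.
    """
    target = [row[j] for row in perms]
    rest = [tuple(row[:j] + row[j + 1:]) for row in perms]
    rest_and_attr = list(zip(rest, attr))
    return (conditional_entropy(target, rest)
            - conditional_entropy(target, rest_and_attr))


# Toy data: permission 0 is held by everyone, permission 1 only by "eng".
perms = [[1, 1], [1, 0], [1, 1], [1, 0]]
attr = ["eng", "ops", "eng", "ops"]

print(attribute_gain(perms, attr, 1))  # 1.0: the attribute explains p_1
print(attribute_gain(perms, attr, 0))  # 0.0: p_0 carries no uncertainty
```

On sparse data this direct estimator degenerates: when every user has a distinct combination of the remaining permissions, each conditioning group holds one user and both entropies are zero, so a practical analysis would need a smoothed or model-based estimate.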
We have also discovered the need to develop a normalized model of an entitlement in access control policies. As illustrated in Section 2.3, entitlements may appear at different levels of granularity, such as rights to a column, table, or database, and entitlements may subsume others, e.g., sudo. To perform powerful analytics on the rights assigned to users, it is necessary to have an accurate and uniform model of such entitlements. In the database example, this requires knowledge of the objects themselves, while administrative access requires additional information about the relationship between rights.

While performing our research, we found that a missing element in access control data is feedback from administrators who understand the policies and the constraints they were authored under. This includes quality measures for mined roles, validation of discovered policy errors, and comments on automatically suggested role names. During our engagements we received feedback such as “This seems to look good,” without further input.

We hope this work helps open up new data sources that can drive future research. For example, future work may model how roles evolve over time, how users use permissions, or separation of duty policies. In general, this line of research is attempting to discover a meta-model of access control, a topic which has garnered discussion recently [2, 5]. Further, obtaining multiple data sources that can be correlated, such as network traces, e-mail, and permission usage patterns, can allow researchers to model the interaction of systems and their users, possibly leading to solutions to insider threats.

6. REFERENCES

[1] ANSI. Role-based access control. Technical Report ANSI INCITS 359-2004, 2004.
[2] S. Barker. The next 700 access control models or a unifying meta-model? SACMAT, 2009.
[3] A. Colantonio, R. Pietro, A. Ocello, and N. Verde. A formal framework to elicit roles with business meaning in RBAC systems. SACMAT, 2009.
[4] A. Ene, W. Horne, N. Milosavljevic, P. Rao, R. Schreiber, and R. E. Tarjan. Fast exact and heuristic methods for role minimization problems. SACMAT, 2008.
[5] D. Ferraiolo and V. Atluri. A meta model for access control: why is it needed and is it even possible to achieve? SACMAT, 2008.
[6] M. Frank, A. Streich, D. Basin, and J. Buhmann. A probabilistic approach to hybrid role mining. CCS, 2009.
[7] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. 1998.
[8] C. Giblin, M. Graf, G. Karjoth, A. Wespi, I. Molloy, J. Lobo, and S. Calo. Towards an integrated approach to role engineering. SafeConfig, 2010.
[9] M. Kuhlmann, D. Shohat, and G. Schimpf. Role mining - revealing business roles for security administration using data mining technology. SACMAT, 2003.
[10] R. Kumar, S. Sural, and A. Gupta. Mining RBAC roles under cardinality constraint. Information Systems Security, 2010.
[11] I. Molloy, H. Chen, T. Li, Q. Wang, N. Li, E. Bertino, S. B. Calo, and J. Lobo. Mining roles with semantic meanings. SACMAT, 2008.
[12] I. Molloy, N. Li, T. Li, Z. Mao, Q. Wang, and J. Lobo. Evaluating role mining algorithms. SACMAT, 2009.
[13] I. Molloy, N. Li, J. Lobo, Y. A. Qi, and L. Dickens. Mining roles with noisy data. SACMAT, 2010.
[14] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. KDD, 2008.
[15] S. D. Stoller, P. Yang, C. R. Ramakrishnan, and M. I. Gofman. Efficient policy analysis for administrative role based access control. CCS, 2007.
[16] A. P. Streich, M. Frank, D. Basin, and J. M. Buhmann. Multi-assignment clustering for boolean data. ICML, 2009.
[17] J. Vaidya, V. Atluri, and J. Warner. RoleMiner: mining roles using subset enumeration. CCS, 2006.

On Collection of Large-Scale Multi-Purpose Datasets on Internet Backbone Links

Farnaz Moradi, Magnus Almgren, Wolfgang John, Tomas Olovsson, Philippas Tsigas
Computer Science and Engineering
Chalmers University of Technology
Göteborg, Sweden
{moradi,almgren,johnwolf,tomasol,tsigas}@chalmers.se

ABSTRACT

We have collected several large-scale datasets in a number of passive measurement projects on an Internet backbone link belonging to a national university network. The datasets have been used in different studies, such as general classification and characterization of properties of Internet traffic, network security projects detecting and classifying malicious traffic and hosts, and studies of network-level properties of unsolicited e-mail (spam) traffic. The Antispam dataset alone contains traffic between more than 10 million e-mail addresses. In this paper we describe our datasets and the data collection methodology, including our experience of collecting and processing data on a large scale. In particular, we have selected a dataset belonging to an anti-spam project to show how highly privacy-sensitive data, in this case complete e-mail traffic, can be analyzed in practice. Not only do we show that it is possible to collect large datasets, we also show how to solve different issues regarding user privacy and share our experience of working with large datasets.

Categories and Subject Descriptors

C.2.3 [Network Operations]: Network Monitoring; C.2.2 [Network Protocols]: Applications (SMTP, FTP, etc.)

General Terms

Measurement

Keywords

Internet Measurement, Large-Scale Datasets, E-mail traffic, Spam

1. INTRODUCTION

Access to real-life large-scale datasets is in many cases crucial for understanding the true characteristics of network traffic and application behavior.
The collection of large datasets from backbone Internet traffic is therefore very important for such analysis, although the data collection projects in themselves face several challenges [13]. Not only is mere physical access to optical Internet backbone links needed, but also rather expensive equipment in order to deal with the large data volumes arriving at high speeds. Adding to the complexity, the collected data traces must be desensitized because they may contain privacy-sensitive data. This anonymization process must be done in such a way that an analysis satisfactory for answering the research question can still be performed, without leaking any sensitive user data. Packets also need to be reassembled into application-level “conversations” so that, finally and maybe the most challenging part, methods and algorithms suitable for analysis of massive data volumes can be run. Finding these scalable methods is difficult.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Badgers’11, 10-APR-2011, Salzburg, Austria. Copyright 2011 ACM /11/04...$10.00.

We have over the years performed several data collection projects where large datasets have been gathered and analyzed. Different projects have had different goals with the data collection, and for each project unique tools have been developed and used. In this paper we describe the data collection procedure and the challenges we have faced in dealing with high-speed data collection, and give examples of how data have been used in different projects.
In particular, we describe a current project, the Antispam project, which aims for spam detection mechanisms on the network level, where characteristics of SMTP traffic are collected and analyzed. Not only does this involve collection of vast amounts of e-mail traffic, but the data collected is also highly sensitive, so automated ways to handle message privacy are essential. We also describe a method that could be deployed for analyzing the large-scale Antispam dataset. This method allows us to find distinguishing characteristics of legitimate and unsolicited e-mails, which could be used to complement current anti-spam tools.

The rest of this paper is organized as follows. Section 2 presents our methodology for data collection, including challenges we encountered and the solutions we deployed. Section 3 introduces the large-scale datasets collected during different years for different projects. Section 4 describes the collection of a particular dataset, the Antispam dataset, and in Section 5 we shift focus to describe the analysis of this Antispam dataset and how we compare unsolicited with legitimate e-mails. In Section 6 we present related work by comparing other sources of data collection with our collection method and resulting datasets. Finally, Section 7 con-
