April 10, 2011 Salzburg, Austria - WOMBAT project
• The utility of attributes in determining levels of access is inconsistent across datasets.

While we cannot disclose the client data, we hope to point out how it differs from many of the theoretical assumptions made in academic works, aiding future endeavors. We tried to give as much detail and information as possible without compromising confidentiality. It is our aim that this analysis provides hints and helps guide future research into models of access control in real-world organizations. Several assumptions made in many academic works, including our own, such as the distribution of permissions, did not hold true in the client data. Such analysis may lead to more realistic data generators than those used previously [12, 17]. To this end, the remainder of this paper will focus on the analysis of the data, and not the underlying theory of role mining or access control analytics.

The presentation is roughly divided into three parts. First, we discuss the pre-processing of the data. Access control information comes in many forms and at many levels of granularity. What is considered a resource or permission can differ greatly from case to case. Even within the same dataset, the granularity of the information can vary because information is collected from different types of systems. Pre-processing therefore involves a considerable amount of manual labor and simplification. In the second part, we analyze the structure of the data after distillation into a binary access control matrix, and discuss the differences and similarities between the datasets. The third part concerns the use of semantic information during analysis. In particular, we discuss the inclusion of attribute information in data analysis. Surprisingly, having attributes does not necessarily help with the analysis. We conclude with our thoughts on the current state of the data and the types of data that we are seeking to continue our research.

2. DESCRIPTION OF THE DATA

We now describe the real-world access control datasets in our possession. We divide these into two categories: datasets from customers, which contain confidential and sensitive information and cannot be released; and datasets that are publicly available and used in the access control analytics literature.

2.1 Private Data

Access control data is particularly difficult for researchers to obtain, and only a small handful of datasets have been released to the public. Datasets have primarily been used for research on role mining, the automatic conversion of a non-role-based access control system to a role-based system. We have obtained several datasets from clients for use in research and evaluation of role mining and access control analytics. To protect the confidentiality of the data owners, we simply refer to these as customer 1, customer 2, and customer 3. When necessary, we may obfuscate attribute names and other key values, though they are known to us.

Customer 1. The first dataset is from a medium-sized organization and is used to provision IT system administrators with entitlements. The dataset contains 311 users and 1105 permissions with 7868 user-permission assignments (a density of 2.15%). A permission is an account on a system, such as a file server, and may indicate administrative (e.g., sudo) access. There are six attributes describing each user that provide insights into their duties for the organization.

Customer 2. The second dataset is an access control policy from a system that provisions administrative access to an outsourced IT system. The policy consists of 881 users and 852 permissions with 6691 authorizations (a density of 0.89%). There are 25 attributes describing each user; however, many are attributes such as telephone number, so we trim the set to the eight that are applicable for determining user entitlements based on known semantics.

Customer 3. The third dataset contains 3068 users, 3133 permissions, and 71596 user-permission assignments, giving it a density of 0.74%. Permissions may be broadly categorized as Active Directory groups or application technical roles¹, and are not permissions in the traditional sense. However, each group or technical role constitutes one or more levels of access, and will be treated as semantically identical to permissions. Application permissions are given as an application-account pair. Users are assigned a total of eight attributes.

2.2 Publicly Available Data

To contrast the confidential client datasets, we analyze eight real-world datasets that have been highly anonymized and released to the public. The bulk of these datasets were released by researchers at HP Labs for use in evaluation in . They have since been used extensively for evaluating role mining algorithms [10, 12, 13]. There are eight datasets in total (their customer dataset has not been released): the healthcare data was obtained from the US Veteran’s Administration; the domino data was from a Lotus Domino server; the americas (both large and small), emea, and apj data were from Cisco firewalls used to provide external users access to HP resources. There are also two firewall policies, firewall1 and firewall2. These datasets are all provided as binary relations, e.g., user i has permission j, and all semantics of users and permissions are speculative.

The university dataset, while not from a real-world access control policy, has been used in the literature and is believed to be representative of an access control policy in a university setting. The data was generated from a template² used in . We present it here for comparison only, and will focus our analysis on the real datasets.
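The densities quoted for the customer datasets can be checked directly from the reported sizes. The following is a minimal sketch, assuming density means the number of granted assignments divided by the number of possible user-permission pairs, |UP| / (|U| x |P|); the function name `density` is ours and is used for illustration only.

```python
def density(num_users: int, num_permissions: int, num_assignments: int) -> float:
    """Fraction of all possible user-permission pairs that are granted."""
    possible = num_users * num_permissions
    if possible == 0:
        raise ValueError("need at least one user and one permission")
    return num_assignments / possible


# Sizes quoted in the text for customer 2 and customer 3.
customer2 = density(881, 852, 6691)
customer3 = density(3068, 3133, 71596)

print(f"customer 2: {customer2:.2%}")  # customer 2: 0.89%
print(f"customer 3: {customer3:.2%}")  # customer 3: 0.74%
```

Both values agree with the densities stated above for these two datasets.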
The size of each dataset is most succinctly represented by the number of users, the number of permissions, the number of permissions assigned to users (UP), and the density of the data, that is, the number of granted assignments per possible request (|UP| / |U × P|). A summary of the datasets is given in Table 1. Indicated are the number of users (U), permissions (P), attribute types (A), granted assignments (UP), and the density. For the attributes³, we indicate those applicable for access control (by the semantics of the attributes), with the total number of attributes in parentheses.

The user-permission data can be visualized as a black-and-white image in which each row (the Y-axis, or height) is a user, each column (the X-axis) is a permission, and a black dot indicates that the permission is assigned to the user. The firewall1 policy is shown in Figure 1(a), and (a crop of) the customer 3 policy is shown in Figure 1(b). The users and permissions have been clustered using the k-means algorithm with the Hamming distance to aid in visualization only.

(a) Firewall 1    (b) Customer 3 (Cropped)
Figure 1: A black dot indicates a granted permission. Note the well-defined clusters of users and permissions in firewall1, and the difference in cluster shape, size, and completeness in customer 3.

Dataset      |U|    |A|      |P|     |UP|     Density
University   493    5        56      3955     0.143
Americas L   3485   -        10127   185294   0.005
Americas S   3477   -        1587    105205   0.019
APJ          2044   -        1146    6841     0.003
Domino       79     -        231     730      0.040
EMEA         35     -        3046    7220     0.068
Firewall 1   365    -        709     31951    0.123
Firewall 2   325    -        590     36428    0.190
Healthcare   46     -        46      1486     0.702
Customer 1   311    6 (6)    1105    7868     0.022
Customer 2   881    8 (25)   852     6691     0.009
Customer 3   3068   6 (14)   3133    71596    0.007
Table 1: Sizes of the real-world datasets presented

¹ A technical role is a group in an application or system. A technical role can be characterized as a group of permissions.
² http://www.cs.sunysb.edu/~stoller/ccs2007/university-policy.txt
³ Attributes in university are tags, and not unique keys.

2.3 Comments and Observations

Before we provide a detailed analysis of what we have learned about how access control policies are managed in medium-sized organizations, we provide some cursory analysis of the structure, format, and semantics of enforcement.

Parsing and Format. First, each customer dataset has a unique format, requiring custom parsing and post-processing. Most datasets were provided as a set of (two or more) comma-delimited files. These were typically user-attribute and user-permission relations. In most files, each column is a standard key-value pair, e.g., the user’s department, title, or job location. In other instances, values were themselves structured data, such as an LDAP distinguished name, and required special parsing.

Noise. We also encountered several problems identifying a correct set of users, permissions, and attributes for a given set of input files. For example, customer 3 contains 71943 user-permission assignments; however, only 71596 of these are unique.
If we further define a permission to be case insensitive, e.g., “SYSTEM,” “system,” and “SySTeM” are all identical, then the number of unique user-permission assignments decreases further to 63582. We encountered another problem with the customer 2 dataset: the user-attribute file contained 1400 users, but only 881 of these are assigned permissions in the user-permission relation.

Joining Accounts. The customer 2 dataset was provided as several separate relations between users and accounts, groups, roles, or the corporate hierarchy. This required a small amount of additional effort to join tables and calculate closures to obtain the final set of relations. For example, the company developed a concept of “virtual users” to simplify administration. Virtual users are assigned accounts and permissions on systems, but are not real employees. Instead, the real user who manages a virtual user is entitled to the virtual user’s accounts. In this setting, virtual users are similar to roles⁴, except that each virtual user is assigned to a single real employee. In these instances, the transitive closure of permissions assigned to virtual users (and not real users) had to be calculated.

Granularity and Semantics. Finally, we observed differences in the structure and semantics of permissions. In an academic setting, permissions in access control systems are treated as abstract rights to objects, and this is a definition used in many standards. This is consistent with the public datasets, where the data is published as binary relations between users and permissions. The semantics of these permissions are unknown, and the data has been so heavily anonymized that all that remains are abstract statements such as user i is granted permission j. From the descriptions of the data we can infer types of access; for example, the firewall policies are likely port-ip pairs. However, this is purely speculative.
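The noise handling and the virtual-user join described above can be illustrated with a minimal sketch. It assumes assignments are available as (user, permission) string pairs and that a hypothetical `manager_of` mapping links each virtual user to its manager; the function names, the mapping, and the toy data are ours and do not reflect the customers' actual formats. Following a chain of managers is also an assumption, made only so that the closure is well defined if a manager record points at another virtual user.

```python
from collections import defaultdict


def unique_assignments(pairs, case_insensitive=False):
    """Return the set of distinct (user, permission) pairs.

    With case_insensitive=True, permission names are compared after
    case folding, so "SYSTEM" and "SySTeM" count as one permission.
    """
    if case_insensitive:
        return {(user, perm.casefold()) for user, perm in pairs}
    return set(pairs)


def resolve_virtual_users(assignments, manager_of):
    """Credit permissions held by virtual users to their managers.

    `manager_of` (hypothetical format) maps a virtual user to its manager.
    The chain is followed until a user without a manager entry is reached.
    """
    resolved = defaultdict(set)
    for user, perm in assignments:
        seen = set()
        while user in manager_of and user not in seen:
            seen.add(user)  # stop if the manager records form a cycle
            user = manager_of[user]
        resolved[user].add(perm)
    return dict(resolved)


# Toy data, invented for illustration.
raw = [
    ("alice", "SYSTEM"),
    ("alice", "system"),
    ("alice", "SYSTEM"),        # exact duplicate
    ("v_dbadmin", "db01:root"),  # permission held by a virtual user
    ("bob", "web01:www"),
]
manager_of = {"v_dbadmin": "bob"}

exact = unique_assignments(raw)
folded = unique_assignments(raw, case_insensitive=True)
per_user = resolve_virtual_users(folded, manager_of)

print(len(raw), len(exact), len(folded))  # 5 4 3
```

On the toy data, exact deduplication removes one pair and case folding removes one more, mirroring the two successive reductions reported for customer 3; the permission of the virtual user ends up with its manager.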
The three customer datasets have richer permissions, which raises concerns about applying analysis based on a less rich model to many real-world datasets and problems. The customer 1 and customer 2 datasets manage entitlements at the level of accounts on systems. Because these access control policies provision administrative access to systems, two different administrative accounts on a system may provision the same set of users. For example, if both accounts have su access, they are largely equivalent. In many other

⁴ Their systems did not natively support RBAC.