higher-than-usual risk of both theft and malware infection. Specifically, we try to answer the following questions:

• Are the laptops of an organization similarly sensitive, or are some laptops more sensitive than others?
• Are the owners of more sensitive laptops aware of the sensitive information on their machines?
• Is there any correlation between the sensitivity level and the risk level of a laptop?

39 employees volunteered for the pilot test, and, in total, about 2.3 million files in various file formats were analyzed. All analyses were done locally inside each machine, and the pilot collected only statistical information from the individual machines, such as the number of confidential documents or the number of files containing credit card numbers. Thus, no files needed to be transferred to or stored in central storage, which preserved user privacy. In this pilot, the system estimated data sensitivity based on the occurrences of 11 different PII types and 11 sensitive topics in the data, which had been identified as sensitive to the company by the SMEs.

Our analysis results show that about 7% of the 2.3 million files belong to one of the 11 sensitive data categories, and 37% of the scanned files contain at least one piece of sensitive information. Among the sensitive data categories, company confidential documents and PII documents are the most common. The results also reveal that most laptops have a similar overall sensitivity level, but a few machines have exceptionally high sensitivity. By associating the sensitivity and risk levels of the laptops, we discovered that those few highly sensitive laptops were also most at risk of data loss and malware infection. In the remainder of this paper, we present more detailed characteristics of the pilot data and interesting findings obtained by analyzing a large number of documents on business laptops.
We also discuss several pragmatic issues that arise in conducting a deep content inspection of primary workstations, including how to scan and analyze files efficiently while the users are still using their laptops and how user privacy is protected.

2. SEMI-AUTOMATED SENSITIVITY ANALYSIS

In this section, we give a high-level overview of the sensitivity estimation method. The system is designed to scan the files in a computer and to estimate the sensitivity of the computer based on the contents of those files. In addition, we try to estimate the risk level of the machine using various factors, including the owner's organizational status and usage patterns. The high-level process for estimating both the laptop sensitivity and risk levels is depicted in Figure 1. In this section, we describe in more detail how sensitive data categories and their sensitivity scores are defined, and the automated process for estimating the sensitivity of a computer. The risk assessment step is discussed in Section 4.

Figure 1: Process for Estimating the Sensitivity and Risk Levels (1. Find Data, 2. Assess Data Sensitivity, 3. Assess Risk Level via a risk survey questionnaire)

2.1 Defining Sensitive Data Categories and Their Sensitivities

The first step towards measuring data sensitivity is to define what kinds of information are important or sensitive and what their respective values are to the company. Categories of sensitive data can vary across different countries and even across organizations within a country [7]. Therefore, the categories can best be defined by the people inside the organization. We gathered 104 sensitive data types for the IT company by asking 30 subject matter experts (SMEs), most of whom are executive-level employees from various parts of the company. The data types range from employee data and financial data to future business plans and intellectual property data.
Next, we asked the SMEs to rank-order the data categories by their importance to the company, considering both tangible and intangible factors. We showed each SME the full list of the data categories and asked them to rank all of the categories, or a subset that they are knowledgeable about. Ties were allowed in the ranking. We developed a GUI tool to help the SMEs with the ranking process. Figure 2 shows a snapshot of the tool. The tool allows users to select the subset of data categories they want to rank, and provides multiple ways to enter sensitivity scores.

Figure 2: SME Interview Tool

We then combined the partial rankings from all of the SMEs into a single ranked list, and data categories with very similar rankings were grouped into subgroups, resulting in 8 distinct bands, from Band A to Band H. Finally, we asked the SMEs to provide a relative sensitivity for each of the bands, where the sensitivity of the least sensitive band is 1. In this round, 6 SMEs provided relative sensitivity scores for the bands, and we use the average score for each band as its sensitivity score. The final sensitivity scores of
the 8 bands and some sample data categories in each band are shown in Table 1.

Band    Sample Data Categories                                 Sensitivity
Band A  Jewel Code                                             52
Band B  Acquisition Plans                                      45
Band C  Trade Secrets                                          37
Band D  Sensitive Personal Information, Personal Health Data   29
Band E  Design Documents, Employee Personal Data               19
Band F  Employee Incentives Data, Project Plans                11
Band G  Non-Jewel Code, Network Log Data                        5
Band H  Market Growth Data, Delivery Plans                      1

Table 1: Data categories and their relative sensitivities

2.2 Sensitivity Estimation Process

The sensitivity estimation process consists of two components: file scanning and data classification.

Find Data: This component scans the hard disk of a computer and selects files for analysis. The system currently supports the Windows, Linux and Mac OS X platforms. We allow users to easily customize the file retrieval process. Users can limit the scanning to certain directories or certain types of files (e.g., analyze only the "/projects" directory or "only PPT files"), or can exclude certain directories or files from the scanning (e.g., exclude "/MyPrivateDirectory"). Upon retrieval of a file, the component converts non-textual files (e.g., MS PowerPoint files and Adobe PDF files) into plain text for content inspection.

Assess Data Sensitivity: This component comprises a suite of text analytics and classification engines that categorize unstructured text into a set of sensitive data types. Currently, the text analytics engine supports the 11 sensitive data categories listed in Table 2. The data categories were chosen by the CIO's office at the company as the first target set for the study. As we can note from the list, the company is most concerned about satisfying regulatory compliance and protecting intellectual property. The SPI category is for the company's HR (human resources) documents with personal information.
The PHI category includes documents containing medical information (i.e., a disease name or medical treatment name) together with personally identifiable information. The PCI category is assigned to documents containing a credit card number, an expiration date and a person name together. It is worth noting that the data categories are not independent of each other. For instance, the SPI and PHI categories supersede the PII category: if a PII document contains medical information or is an HR document, its category is escalated to SPI or PHI, respectively. Similarly, most design documents and proprietary source code files are also company confidential documents.

Data Categories                              Bands
Sensitive Personal Information (SPI)         Band D
Personal Health Information (PHI)            Band D
Payment Card Industry Information (PCI)      Band E
Personally Identifiable Information (PII)    Band E
Design Document                              Band E
Patent Disclosure                            Band E
Patent Application                           Band E
Employee Record                              Band E
Salary Information                           Band F
Proprietary Source Code                      Band G
Other Confidential Documents                 Band G

Table 2: Sensitive data categories the pilot system discovers

Note that we need to recognize personal information to identify the SPI, PHI, PCI and PII categories. Currently, the system can identify the following sensitive data types:

• Social Security Number
• Passport Number
• Lotus Notes Email Address
• Internet Email Address
• Employee Name
• Non-employee Person Name
• Address
• Phone Number
• Date of Birth
• Credit Card Number
• Medical Information, including disease names, treatment names and drug names

The system applies both rule-based approaches, including signatures, regular expressions and linguistic patterns, and machine learning-based classification methods. Often, different approaches are applied together to enhance the efficiency and accuracy [8, 9].
For instance, a signature-based method is used to identify candidate medical information (i.e., known disease names), and a statistical classification method is applied to determine whether the term really has a medical sense in the given context. Similarly, a set of linguistic patterns, e.g., "do not (disclose|distribute|forward|share)", is used to identify candidate confidential documents, and then a supervised machine learning method is used to decide whether the identified confidential label indeed denotes that the document is confidential or is used in a different context. The accuracy levels of discovering PII data types, confidential documents and proprietary source code files reach 97–98%. Due to the lack of ground truth data, we were not able to objectively measure the accuracy levels of the other classifiers.

When the classification process is completed, the component maps the classification results to their sensitivity scores as defined in the data taxonomy (Table 1). Note that the sensitivity scores are based on the value of a single data item (e.g., one credit card number) or a single document (e.g., one source code file), so we count the number of occurrences of each type of sensitive information to estimate the overall sensitivity of a document and of a laptop. For instance, if a laptop contains 10 Band A files, then the overall sensitivity of the laptop is 520.
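The aggregation step above can be sketched as a weighted count over band scores. This is a simplified illustration under the assumption that a laptop's overall sensitivity is the sum, over all bands, of the band score times the number of files in that band, which is consistent with the 10-files-in-Band-A example in the text; the function name is hypothetical.

```python
# Band sensitivity scores from Table 1. The Band H score of 1 follows from
# the text's statement that the least sensitive band has sensitivity 1.
BAND_SCORES = {"A": 52, "B": 45, "C": 37, "D": 29, "E": 19, "F": 11, "G": 5, "H": 1}

def laptop_sensitivity(band_counts):
    """Overall sensitivity = sum over bands of (band score * file count).

    `band_counts` maps a band letter to the number of files on the laptop
    that fall into that band."""
    return sum(BAND_SCORES[band] * n for band, n in band_counts.items())

# Example from the text: 10 Band A files give an overall sensitivity of 520.
print(laptop_sensitivity({"A": 10}))  # -> 520
```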

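The two-stage detection strategy of Section 2.2 (cheap signature or pattern matching to propose candidates, followed by a classifier that checks whether each candidate is a true hit in context) might be sketched as follows. The patterns shown and the deliberately naive context check are illustrative stand-ins for the system's actual rules and supervised models, which are not published here.

```python
# Sketch of two-stage sensitive-data detection (illustrative only).
import re

# Stage 1: signature/pattern matching proposes candidates.
# The confidential-label pattern comes from the text; the SSN pattern is an
# assumed, simplistic stand-in covering only the dashed US SSN shape.
CONFIDENTIAL_PATTERN = re.compile(
    r"do not (disclose|distribute|forward|share)", re.IGNORECASE)
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_candidates(text):
    """Return raw pattern hits, before any contextual filtering."""
    return {
        "confidential_label": bool(CONFIDENTIAL_PATTERN.search(text)),
        "ssn": SSN_PATTERN.findall(text),
    }

# Stage 2: a naive context check standing in for the supervised classifier
# that decides whether a candidate label really marks the document as
# confidential, or merely discusses the phrase.
def looks_like_disclaimer(text):
    return "for example" not in text.lower()

def classify(text):
    cand = find_candidates(text)
    return {
        "confidential": cand["confidential_label"] and looks_like_disclaimer(text),
        "pii": bool(cand["ssn"]),
    }
```

In the real system the stage-2 decision is made by a trained statistical model rather than a keyword test, but the division of labor is the same: patterns keep the expensive classifier off most of the text, and the classifier keeps pattern false positives out of the counts.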