August 6, 2019

Identifying Insider Threat Through Analysis of Data-at-Rest

Audra Simons
Dalwinderjeet Kular Research Scientist

This blog post summarizes a university research project, carried out in partnership between Forcepoint and The University of Texas at San Antonio (UTSA), on detecting insider threat risk from data-at-rest. The attached report, “Identifying Insider Threat Through Analysis of Data-at-Rest”, details the project's purpose, the description and assessment of the data set, visualization of the results, privacy considerations and conclusions.

The purpose of the research was to experiment with, test and deliver an algorithm that quantifies the risk of user behavior based on the data each user stores. The risk is derived from the following details:

  • Files stored on the user's network share
    • Type of File (Microsoft Word, Excel and PDF)
    • Pseudonymized file storage hierarchy
    • File metadata
    • Anonymized user data
  • Data rule violations (based on DLP rules)
  • File classification labels and tagging (based on Document Authority file classification labels)

The research was divided into two parts:

  • Creation of the algorithm using graph measures[1] to understand behaviors around user file system usage
  • Validation of the risk score algorithm using a random forest[2] model, which was used to estimate an anomaly score for each user and to highlight anomalous users


This section lays out the two-part prototype that was designed to explore the feasibility of a risk scoring algorithm for identifying cyber insider threats in an organization.


During the project, which ran from May 2018 to June 2019, Forcepoint provided two test data sets to UTSA. The first was a user backup archive. The second, an evolution of the first, came from “live” users’ hard drives across several departments. Testing on the first, archived data set taught us that we needed real-world “live” user data mixed with seeded bad actor data to test the algorithm.


Both data sets went through the following process before being passed to UTSA for their research:

  • To ensure user privacy, Forcepoint Security Operations (SecOps) collected and processed the data sets on behalf of the Forcepoint Innovation Labs team, using provided scripts
  • All data was processed through:
    • Forcepoint Data Loss Prevention (DLP) classifiers, to see whether it contained information that DLP would typically protect (e.g. SSNs, driver's license numbers)
    • A trial document classification system to determine the basic content type of documents (e.g. “This file is a resume”, “This file is a contract”)
  • All results were then merged and pseudonymized, and the file names, paths and types were salted and hashed
  • Before the data set was approved for sharing with UTSA, SecOps reviewed it to ensure that it contained no Personally Identifiable Information (PII)
  • Once approved it was shared with UTSA for their two-part research
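
The salting and hashing step can be illustrated with a minimal sketch. The post does not say which hash function, salt handling or path treatment the scripts used, so SHA-256, the `SALT` value, the 12-character truncation and both function names below are assumptions made for illustration only. Hashing each path component separately is one possible way to keep a pseudonymized storage hierarchy while hiding the real names.

```python
import hashlib
import secrets

# Hypothetical secret salt: generated once and never shared with the data set.
SALT = secrets.token_bytes(16)

def pseudonymize(value: str, salt: bytes = SALT) -> str:
    """Return the hex SHA-256 digest of salt + value (a salted hash)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

def pseudonymize_path(path: str, salt: bytes = SALT) -> str:
    """Hash every path component on its own so the folder structure survives."""
    parts = [p for p in path.replace("\\", "/").split("/") if p]
    # Digests are truncated to 12 hex characters purely for readability here.
    return "/".join(pseudonymize(p, salt)[:12] for p in parts)

# Two made-up files in the same folder keep a common hashed prefix.
print(pseudonymize_path("finance/reports/q1.xlsx"))
print(pseudonymize_path("finance/reports/q2.xlsx"))
```

Because the same salt is applied to every value, identical names map to identical digests, so relationships between files remain analyzable without exposing the original names.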

During the research period, Forcepoint Innovation Labs and UTSA met monthly, and ad hoc as needed, to keep the research transparent and on course.

Prototype - Part 1


Create an algorithm to identify user and department insider threat potential risk scores.


Using graph theory, the researcher looked at several variables:

  • User
    • Files
      • Tags and tag values
      • Rule violations
      • Match locations

Using the above as inputs for a risk score, each file was then scored. The base scores were then aggregated as a summation and as an average to create the risk score for users and departments.
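
The aggregation described here can be sketched as follows. The post does not publish the actual scoring formula, so the weights, the `file_score` function and the sample records are all hypothetical; only the overall shape (score each file, then sum and average per user, then average per department as in footnote [5]) follows the text. Using the per-user sum as the input to the department average is also an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weights; the real algorithm's weighting is not published.
TAG_WEIGHTS = {"resume": 1.0, "contract": 2.0}
VIOLATION_WEIGHT = 3.0
MATCH_WEIGHT = 0.5

def file_score(tags, violations, match_locations):
    """Base score for one file from its tags, rule violations and match locations."""
    tag_part = sum(TAG_WEIGHTS.get(t, 0.0) for t in tags)
    return tag_part + VIOLATION_WEIGHT * violations + MATCH_WEIGHT * match_locations

# Made-up records: (department, user, tags, rule violations, match locations).
files = [
    ("dept1", "u1", ["contract"], 2, 4),
    ("dept1", "u1", [], 0, 0),
    ("dept1", "u2", ["resume"], 0, 0),
    ("dept2", "u3", ["contract", "resume"], 1, 2),
]

scores_by_user = defaultdict(list)
dept_of = {}
for dept, user, tags, violations, matches in files:
    scores_by_user[user].append(file_score(tags, violations, matches))
    dept_of[user] = dept

# User risk score, aggregated both ways mentioned in the text.
user_sum = {u: sum(s) for u, s in scores_by_user.items()}
user_avg = {u: mean(s) for u, s in scores_by_user.items()}

# Department score: sum of user scores divided by the number of users.
dept_users = defaultdict(list)
for user, total in user_sum.items():
    dept_users[dept_of[user]].append(total)
dept_score = {d: sum(v) / len(v) for d, v in dept_users.items()}

print(user_sum, user_avg, dept_score)
```

The sum rewards volume (a user with many mildly sensitive files scores high), while the average highlights users whose files are sensitive on a per-file basis, which is one reason to keep both.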

Prototype - Part 2


Validate the algorithm from Part 1 using the random forest model.


For validation, UTSA chose the random forest (RF) model because random forest models are capable of:

  • Evaluating non-linear effects
  • Assessing interaction effects among all the predictors[3]
  • Having an adaptive automatic feature selection process that selects the most relevant predictors among many candidate attributes and deselects all the irrelevant noisy predictors
  • Making predictions based on out-of-bag (OOB) subsets, ensuring residuals[4] are cross-validated
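
The out-of-bag idea in the last point can be made concrete with a small sketch. In a random forest, each tree is fitted on a bootstrap sample (rows drawn with replacement), and the rows that were not drawn are "out of bag" for that tree, so they can serve as held-out data for it. The code below shows only this sampling step, not the tree fitting or UTSA's model; the function name, seed and forest size are hypothetical.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob

rng = random.Random(0)  # fixed seed so the run is repeatable
n_users = 50            # the example data set in this post has 50 users
n_trees = 200           # hypothetical forest size

# Count, for every user row, how many trees would treat it as out of bag.
oob_counts = [0] * n_users
for _ in range(n_trees):
    _, oob = bootstrap_split(n_users, rng)
    for i in oob:
        oob_counts[i] += 1

avg_oob_fraction = sum(oob_counts) / (n_users * n_trees)
print(f"average out-of-bag fraction per tree: {avg_oob_fraction:.2f}")
```

For n rows, the chance that a given row is left out of one bootstrap sample is (1 - 1/n)^n, about 0.36 for n = 50, so each user is held out by roughly a third of the trees. Predicting a user's score only from those trees is what makes the resulting residuals behave like cross-validated ones.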


Results Summary

The algorithm and its random forest validation highlighted, for the most part, the same high-risk individual users.

Data Set - Users, Departments and Files

The following example walks through one test data set and its results. The data set comprises 50 users across 5 departments, with roughly 280k files in total:

Department | Number of Users per Department | User IDs | Number of Folders per Department | Number of Files per Department
Department 1
Department 2
Department 3
Department 4
Department 5

Table 1: Details of Departments, Number of Users, User IDs, Folders and Files


Top 10 Risky Users

The following table lists the 10 riskiest users by user ID, together with their risk scores:

Ranking | User ID | Risk Score

Table 2: Top 10 Risky Users


Department Risk Score Summary

This is the summary of the risk scores per department[5]:

Department | Risk Score
Department 1
Department 2
Department 3
Department 4
Department 5

Table 3: Risk Score by Department


Seeded Malicious Users and Algorithm Detected Malicious Users

The data set used for the final results combined live user data stored on users' hard drives with seeded bad actor data. The table below summarizes the seeded bad actors identified by the algorithm, along with three users flagged by the algorithm who occurred naturally in the data rather than being seeded:

User ID | Malicious User Introduced by Forcepoint | Malicious User Detected by the Algorithm
17      | *                                       | *
22      | *                                       | *
47      | *                                       | *
31      | *                                       | *
2       |                                         | *
12      | *                                       | *
45      |                                         | *
34      |                                         | *
38      | *                                       | *
21      | *                                       | *

Table 4: Seeded Malicious Users and Algorithm Detected Malicious Users
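
The overlap in Table 4 can be checked with a few set operations. The user IDs are taken directly from the table; the variable names are illustrative, and the figures describe only the ten users listed there, since the post does not state whether other seeded users existed outside the table.

```python
# User IDs marked in the "Introduced by Forcepoint" column of Table 4.
seeded = {17, 22, 47, 31, 12, 38, 21}
# User IDs marked in the "Detected by the Algorithm" column of Table 4.
detected = {17, 22, 47, 31, 2, 12, 45, 34, 38, 21}

found = seeded & detected   # seeded users the algorithm flagged
missed = seeded - detected  # seeded users the algorithm did not flag
extra = detected - seeded   # flagged users that were not seeded

print(f"seeded users detected: {len(found)}/{len(seeded)}")
print("missed:", sorted(missed))
print("flagged but not seeded:", sorted(extra))
# seeded users detected: 7/7
# missed: []
# flagged but not seeded: [2, 34, 45]
```

Within the table, all seven seeded users were detected, and the three additional users (2, 34 and 45) are the naturally occurring cases mentioned above.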

Conclusions and Future Work

In this project, Forcepoint and UTSA collaborated to detect malicious or compromised users using analysis of data-at-rest. For detection, a combination of graph theory and machine learning was used. These methods were able to detect multiple malicious or compromised users.

For this and future projects, we are proposing to use analysis of data-at-rest to shield ourselves from data misuse, data theft, and masquerading.


[1] Graph theory is a branch of mathematics dedicated to studying structures made up of vertices connected by directed or undirected edges.

[2] Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

[3] Predictors - random forest models bring together a large number of relatively uncorrelated models (trees) operating as a committee, which tends to outperform any of the individual constituent models. The low correlation between models is the key. Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions.

[4] Residuals - the differences between observed values and the values a model predicts; more generally, a quantity remaining after other things have been subtracted or allowed for.

[5] Department Risk Score is the department-level average risk score: the sum of the risk scores of all users in a department, divided by the number of users in that department.
