Identifying Insider Threat Through Analysis of Data-at-Rest
This blog provides a summary of a university research project on detecting insider threat risk from data-at-rest, done in partnership between Forcepoint and The University of Texas at San Antonio (UTSA). The attached report, “Identifying Insider Threat Through Analysis of Data-at-Rest,” presents the details of the project: its purpose, a description and assessment of the data set, visualization of the results, privacy considerations, and conclusions.

The purpose of the research was to experiment with, test, and deliver an algorithm that qualifies the risk of user behavior based on the data they store, derived from the following details:
- Files stored on the user's network share
- File type (Microsoft Word, Excel, and PDF)
- Pseudonymized file storage hierarchy
- File metadata
- Anonymized user data
- Data rule violations (based on DLP rules)
- File classification labels and tagging (based on Document Authority file classification labels)
The research was divided into two parts:
- Creation of the algorithm, using graph measures[1] to understand behaviors around user file system usage
- Validation of the risk score algorithm, using the random forest[2] model as an anomaly score estimation technique to compute an anomaly score for each user and highlight anomalous users
Background
This section lays out the two-part prototype that was designed to identify and explore the feasibility of a risk scoring algorithm for identifying cyber insider threats in an organization.
During the project, which ran from May 2018 to June 2019, Forcepoint provided two test data sets to UTSA. The first was a user backup archive. The second, an evolution of the first, was drawn from "live" users' hard drives across several departments. Testing against the first, archived data set taught us that to test the algorithm we needed real-world "live" user data mixed with seeded bad-actor data.
Both data sets went through the following process before being passed to UTSA for their research:
- To ensure user privacy, Forcepoint Security Operations (SecOps) collected and processed the data sets on behalf of the Forcepoint Innovation Labs team, using the scripts provided
- All data was processed through the:
- Forcepoint Data Loss Prevention (DLP) classifiers, to determine whether files contained information that would typically be protected by DLP (e.g. SSNs, driving license numbers)
- A trial document classification system to determine the basic content type of documents (e.g. “This file is a resume”, “This file is a contract”).
- All results were then merged and pseudonymized, and the file names, paths, and types were salt-hashed
- Before the data set was approved for sharing with UTSA, SecOps reviewed it to ensure it contained no Personally Identifiable Information (PII)
- Once approved, it was shared with UTSA for their two-part research
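The salt-hashing step above can be sketched in Python. The exact hashing scheme Forcepoint used is not described in this post, so SHA-256 with a per-dataset salt is an assumption, and the file path shown is purely illustrative:

```python
import hashlib
import os

def salt_hash(value: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest of a file name, path, or type.

    Reusing one salt across the data set maps identical inputs to
    identical tokens (preserving structure for analysis) while the
    originals stay unrecoverable without the salt.
    """
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

salt = os.urandom(16)  # per-dataset secret, held by SecOps in this sketch
token = salt_hash("share/example-user/example.docx", salt)
```

The same function would be applied to every file name, path, and type before the merged results leave SecOps.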
During the research period, Forcepoint Innovation Labs and UTSA met monthly, or ad hoc as needed, to ensure transparency and direction of the research.
Prototype - Part 1
Goal
Create an algorithm to identify user and department insider threat potential risk scores.
Method
Using graph theory, the researcher looked at several variables:
- User
- Files
- Tags and tag values
- Rule violations
- Match locations
Using the above as inputs, each file was scored. The per-file base scores were then aggregated, both as a sum and as an average, to produce risk scores for users and departments.
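The aggregation step can be illustrated with a minimal sketch. The per-file scores and user/department labels below are made up, and this stands in for, rather than reproduces, the study's graph-measure scoring:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-file base scores, keyed by (user, department).
# In the study these came from graph measures over users, files,
# tags, rule violations, and match locations.
file_scores = {
    ("user_a", "dept_1"): [3.0, 9.5, 12.0],
    ("user_b", "dept_1"): [0.5, 1.0],
    ("user_c", "dept_2"): [7.0],
}

user_score = {}
dept_users = defaultdict(list)
for (user, dept), scores in file_scores.items():
    user_score[user] = sum(scores)        # summation -> user risk score
    dept_users[dept].append(user_score[user])

# Average of its users' scores -> department risk score.
dept_score = {d: mean(s) for d, s in dept_users.items()}
```

The sum surfaces individual heavy accumulators of risky files, while the department average allows comparison between departments of different sizes.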
Prototype - Part 2
Goal
Validate the algorithm from Part 1 using the random forest model.
Method
For validation, UTSA used the random forest (RF) model, chosen because random forest models are capable of:
- Evaluating non-linear effects
- Assessing interaction effects among all the predictors[3]
- Performing adaptive, automatic feature selection that keeps the most relevant predictors among many candidate attributes and discards irrelevant, noisy ones
- Making predictions based on out-of-bag (OOB) subsets, ensuring residuals[4] are cross-validated
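The OOB idea in the last bullet can be shown with a small stand-alone sketch (not the study's implementation): each tree trains on a bootstrap sample of the rows, and the rows it never saw form that tree's out-of-bag set, so residuals can be evaluated on unseen data without a separate hold-out set:

```python
import random

def bootstrap_with_oob(n_samples: int, n_trees: int, seed: int = 0):
    """Yield (in_bag, oob) index sets for each tree in a forest.

    in_bag: indices drawn with replacement (the tree's training rows).
    oob:    indices the tree never saw; predictions on these rows give
            cross-validated residuals for free.
    """
    rng = random.Random(seed)
    for _ in range(n_trees):
        in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
        yield in_bag, set(range(n_samples)) - set(in_bag)

# On average roughly 37% (about 1/e) of rows land out-of-bag per tree.
for in_bag, oob in bootstrap_with_oob(n_samples=100, n_trees=3):
    assert oob.isdisjoint(in_bag)
```

Averaging each row's residual over the trees for which it was out-of-bag yields the cross-validated error estimate the bullet refers to.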
Results Summary
The algorithm and its random forest validation largely highlighted the same high-risk individual users.
Data Set - Users, Departments and Files
The following walks through one test data set and its results. It comprises 50 users across 5 departments, with ~280k files in total:
| Department | Number of Users | User IDs | Number of Folders | Number of Files |
|---|---|---|---|---|
| Department 1 | 10 | 41-50 | 1,420 | 16,738 |
| Department 2 | 10 | 31-40 | 1,600 | 225,779 |
| Department 3 | 10 | 21-30 | 1,291 | 17,331 |
| Department 4 | 10 | 11-20 | 1,454 | 8,007 |
| Department 5 | 10 | 1-10 | 719 | 12,154 |
| Total | 50 | | 6,484 | 280,009 |
Table 1: Details of Departments, Number Of Users, Users ID, Folders And Files
Top 10 Risky Users
Below are the top 10 riskiest users by user ID, with their risk scores:
| Ranking | User ID | Risk Score |
|---|---|---|
| 1 | 17 | 74.28 |
| 2 | 22 | 66.35 |
| 3 | 47 | 54.79 |
| 4 | 31 | 50.43 |
| 5 | 2 | 9.83 |
| 6 | 12 | 8.86 |
| 7 | 45 | 8.72 |
| 8 | 34 | 8.10 |
| 9 | 38 | 7.93 |
| 10 | 21 | 7.91 |
Table 2: Top 10 Risky Users
Department Risk Score Summary
This is the summary of the risk scores per department[5]:
| Department | Risk Score |
|---|---|
| Department 1 | 10.94 |
| Department 2 | 11.05 |
| Department 3 | 11.81 |
| Department 4 | 9.42 |
| Department 5 | 6.04 |
Table 3: Risk Score by Department
Seeded Malicious Users and Algorithm Detected Malicious Users
The data set provided for the final results combined live user data stored on hard drives with seeded bad-actor data. Below is a summary of the seeded bad actors identified by the algorithm, along with three that were identified as occurring naturally in the data:
| User ID | Malicious User Introduced by Forcepoint | Malicious User Detected by the Algorithm |
|---|---|---|
| 17 | * | * |
| 22 | * | * |
| 47 | * | * |
| 31 | * | * |
| 2 | * | |
| 12 | * | * |
| 45 | * | |
| 34 | * | |
| 38 | * | * |
| 21 | * | * |
Table 4: Seeded Malicious Users and Algorithm Detected Malicious Users
Conclusions and Future Work
In this project, Forcepoint and UTSA collaborated to detect malicious or compromised users using analysis of data-at-rest. For detection, a combination of graph theory and machine learning was used. These methods were able to detect multiple malicious or compromised users.
For this and future projects, we propose using analysis of data-at-rest to protect against data misuse, data theft, and masquerading.
[1] Graph theory is a branch of mathematics dedicated to studying structures made up of vertices connected by directed or undirected edges.
[2] Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. A random forest is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset at training time and outputs the mode of the trees' classes (classification) or their mean prediction (regression), using averaging to improve predictive accuracy and control over-fitting.
[3] Predictors - random forest models combine a large number of relatively uncorrelated trees operating as a committee, which outperforms any individual constituent model. The low correlation between trees is key: uncorrelated models produce ensemble predictions that are more accurate than any individual prediction.
[4] Residuals - are defined as a quantity remaining after other things have been subtracted or allowed for.
[5] Department Risk Score is the department-level average risk score: the sum of all users' risk scores in a department divided by the number of users in the department.