October 20, 2020

Future Insights – Inherent Bias in Machine Learning

Raffael Marty

A note from our series editor, Global CTO Nicolas Fischbach:

Welcome to the second post in our Forcepoint Future Insights series, which will offer six separate points of view on the trends and events we believe the cybersecurity industry will need to deal with in 2021. Check out the first post in the series: The Emergence of the Zoom of Cybersecurity

Update: The Future Insights 2021 eBook is now available for download for those of you who want to dig into all six insights in one place.

Here's the next post from Raffael Marty, Vice President Research and Intelligence:

Cracks in Trust and How to Mend Them

Looking at the cybersecurity landscape today, I have to say I’m glad I’m not a CISO. In an ever-evolving world of digital transformation, omni-connected devices and semi-permanent remote workforces, keeping critical data and people safe is a huge challenge. So huge, in fact, that it can’t be done without the implementation of machine learning and automation.

Understanding an organization's risk and exposure starts with understanding its critical data and how that data moves. We can only do so by collecting large quantities of metadata and telemetry about that data and the interactions with it, then applying analytics to make sense of it and translate it into a risk-based view.

However, developing automated systems is not without its challenges. In 2021, I believe machine learning and analytics will fall under tighter scrutiny, as both our trust in their unbiased nature and fairness, and their ethical boundaries will continue to be questioned.

Rage at the Machine

We saw headline-grabbing incidents this summer. For example, in the United Kingdom the government initially decided to let an algorithm determine schoolchildren's exam results. However, the bias baked into that algorithm significantly depressed grades, unfairly skewing results against lower-income areas and, worse, disregarding teachers' expertise. The result was an embarrassing U-turn, with people ending up trumping machines in grading exams.

This is not the first time that algorithms and machine learning systems trained on biased data sets have been criticized. You will have heard of Microsoft's Tay chatbot, and you may have heard of facial recognition software incorrectly identifying members of the public as criminals. Getting it wrong can have life-changing effects (e.g., for students or people applying for credit) or could be as "minor" as an inappropriate shopping coupon being sent to a customer.

A number of cybersecurity systems use machine learning to decide whether an action is appropriate (i.e., low risk) for a given user or system. These machine learning systems must be trained on large enough quantities of data, and they have to be carefully assessed for bias and accuracy. Get it wrong, apply the controls too tightly, and you will see situations such as a business-critical document being incorrectly stopped mid-transit, a sales leader unable to share proposals with a prospect, or other blocks to effective and efficient work. Conversely, if the controls are too loose, data can leak out of the organization, causing damaging and costly data breaches.

Finding the Balance in 2021

To build cyber systems that identify risky users and prevent damaging actions, the data we analyze comes for the most part from monitoring a user's activities. It's worth saying upfront that user activity monitoring must be done appropriately, with people's privacy protected and the appropriate ethical guidelines in place.

In order to create a virtual picture of users, we can track log-on and log-off actions. We monitor which files people open, modify, and share. Data is pulled from security systems such as web proxies, network firewalls, endpoint protection, and data leak prevention solutions. From this data, risk scores are computed, and the security systems in turn flag inappropriate behavior and enforce security policies appropriately.
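One simple way to think about such a risk score is as a weighted, time-decayed sum of telemetry events. The event names, weights, and decay factor below are invented for illustration; a real deployment would calibrate them against historical data:

```python
# Hypothetical sketch: combining telemetry from several security systems
# (proxy, DLP, endpoint) into a per-user risk score. All event names and
# weights are illustrative assumptions, not a real product's values.

EVENT_WEIGHTS = {
    "offhours_logon": 5,
    "bulk_file_copy": 20,
    "dlp_policy_hit": 40,
    "proxy_blocked_upload": 15,
}

def risk_score(events, decay=0.9):
    """Sum weighted events, ordered newest-first, with exponential decay
    so that older activity contributes less to the current score."""
    score = 0.0
    for age, event in enumerate(events):
        score += EVENT_WEIGHTS.get(event, 0) * (decay ** age)
    return score

recent = ["dlp_policy_hit", "bulk_file_copy", "offhours_logon"]
print(risk_score(recent))  # 40 + 20*0.9 + 5*0.81 = 62.05
```

A policy engine could then compare this score against a threshold to decide whether to flag or block the user's next action.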

When undertaking this analysis, or in fact any analysis that uses machine learning or algorithms to make automated decisions affecting people's lives, we must use a combination of algorithms and human intelligence. Without bringing in human intuition, insight, context, and an understanding of psychology, you risk creating algorithms that are themselves biased or that make decisions based on flawed or biased data, as discussed above.

In addition to involving human expertise in the algorithms, in other words modelling expert knowledge, the right training data and the right data feeding the live analytics are just as important. What constitutes "the right" data? It is often determined by the problem itself, by how the algorithm is constructed, and by whether reinforcement loops or even explicit expert involvement are possible. The right data means the right amount, the right training set, the right sampling locations, the right trust in the data, the right timeliness, and so on. The biggest problem with the "right data" is that it is almost impossible to define what bias could be present until a false result is observed. At that point, it is potentially too late: harm has already been caused.
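While bias can rarely be ruled out in advance, some warning signs can be checked before training. One such check, sketched below with invented data, compares how often a sensitive attribute (here, a hypothetical site or region) co-occurs with the positive label; a large gap suggests the training set may teach the model a biased shortcut:

```python
# Illustrative pre-training check (invented data): compare positive-label
# rates across groups in the training set. A large gap between groups is a
# warning sign of sampling bias, though it is not proof of it.

from collections import Counter

def positive_rate_by_group(rows):
    """rows: (group, label) pairs; returns the positive-label rate per group."""
    totals, positives = Counter(), Counter()
    for group, label in rows:
        totals[group] += 1
        positives[group] += int(label)
    return {g: positives[g] / totals[g] for g in totals}

training = [("site_a", 1), ("site_a", 1), ("site_a", 0),
            ("site_b", 0), ("site_b", 0), ("site_b", 1)]
print(positive_rate_by_group(training))
```

Here "site_a" examples are labeled positive twice as often as "site_b" ones, which a reviewer would want to explain before trusting a model trained on this set.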

Using machine learning and algorithms in everyday life is still in its infancy, but the number of applications is growing at a stunning pace. In 2021, I expect further applications to fail due to inherent bias and a lack of expert oversight and control of the algorithms. Not the least of the problems is that the majority of supervised machine learning algorithms act as a black box, making verification either impossible or incredibly hard.

This doesn’t mean that all machine learning algorithms are doomed to failure. The good news is that bias is now being discussed and considered in open groups, alongside the efficacy of algorithms. I hope we will continue to develop explainable algorithms that model expert input. The future of machine learning is bright; the application of algorithms in smart ways is bounded only by our imagination.
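To illustrate one simple form of explainability: for a linear risk model, each feature's contribution to the score can be reported directly, unlike with a black-box model. The weights and feature names below are invented for illustration:

```python
# Hedged sketch: a linear risk model is explainable because the score
# decomposes into per-feature contributions. Weights and feature names
# are illustrative assumptions, not a real system's parameters.

WEIGHTS = {"files_copied": 0.5, "offhours_ratio": 30.0, "usb_inserts": 2.0}

def explain(features):
    """Return (feature, contribution) pairs, largest contribution first,
    so an analyst can see exactly why a user scored the way they did."""
    contributions = {f: WEIGHTS[f] * v for f, v in features.items()}
    return sorted(contributions.items(), key=lambda kv: -kv[1])

user = {"files_copied": 120, "offhours_ratio": 0.4, "usb_inserts": 3}
for feature, contribution in explain(user):
    print(f"{feature}: {contribution:+.1f}")
```

An analyst reviewing this output can see that the score is driven mostly by bulk file copying, and can override the model when that context turns out to be benign, which is exactly the kind of human-plus-algorithm loop argued for above.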

Additional Resources

For more detail on Forcepoint’s commitment to privacy, please see the Forcepoint Privacy Hub.

Future Insights Takeaways:

  • In 2021, machine learning and analytics will fall under tightened scrutiny as trust in their unbiased nature and fairness, as well as their ethical boundaries, is questioned.
  • Machine learning systems must be trained on large enough quantities of data, and they have to be carefully assessed for bias and accuracy.
  • 2021 is all about finding this balance, which can only be done through a combination of algorithms and human intelligence.
  • Without bringing in human intuition, insights, context and an understanding of psychology, you risk creating biased algorithms, which can have life-changing impact.
  • We must continue to develop explainable algorithms that model expert input.
  • The future is bright: the application of algorithms in smart ways is bounded only by our imagination.

Raffael Marty

Raffael Marty brings more than 20 years of cybersecurity industry experience across engineering, analytics, research and strategy to Forcepoint. Prior to joining the company, Marty ran security analytics for Sophos, a leading endpoint and network security company, launched pixlcloud, a visual...


About Forcepoint

Forcepoint is the leading user and data protection cybersecurity company, entrusted to safeguard organizations while driving digital transformation and growth. Our solutions adapt in real-time to how people interact with data, providing secure access while enabling employees to create value.