Rest In Peace Big Data Security Analytics
Despite having spent more than 15 years as one of the biggest advocates for building ever-larger repositories of security events and network telemetry for analytics, I must sadly acknowledge that, however successful this endeavor has been, it has now passed away.
The History of Big Data Security Collection
Imagine where the security industry has evolved from. Back in 2003 we attempted to bring Network IDS logs and Host Antivirus logs into a Security Event Manager (SEM). SEMs were so young then that they didn’t even have an “I” for information in them (that would come later). However, the SEM started to fall over when we put about 40 million events/day into it.
To address the problem, we regrouped and scaled every time we integrated new data sets. Despite this challenge, the SIEM market grew up nicely in support of digital forensics and incident response needs. By 2012 we had an environment that was ingesting and processing 4-5 billion events/day! This was astounding, yet still well short of the volume of logs and telemetry being generated on a daily basis.
With the big data revolution we became convinced that we could drive scale at levels unheard of previously (apart from some large government agencies). Now we had a platform that could ingest 100 billion logs and events per day! If that rate of growth continued, then by 2016 we’d have systems (and the data) to support up to a trillion events/day. Figure 1 shows what we built.
Figure 1: The evolution of Big Data Security Collection.
Challenges of Evolution and Lessons Learned
The cost to build out and manage these large-scale systems was significant. There are a lot of moving parts to keep that much data flowing, landing appropriately and on time. Even to get to the part where analytics could be applied took dozens of steps.
We believed that unsupervised machine learning would be the answer. It wasn’t. There were often clues inside the results, but they were buried, and the actual yield of real findings was <1% (and I’m being generous with that percentage point too!). The additional complexity of managing machine learning pipelines across these large datasets contributed its own set of costs to the problem.
Between 2016 and 2019, we got better. Supervised techniques started to produce results. It appeared as though we were on the verge of a breakthrough, yet there were still many problems hindering the overall value of these big data platforms.
Context about users, devices, networks and locations was often missing. Without the ability to do entity resolution to help frame the problem, we were left with significant false positives and false negatives. Additionally, most sources of logs, events and telemetry weren’t adequately configured, so while we got billions of logs, they were often incomplete or just the wrong ones. Once the configurations were improved (I won’t say optimized), things calmed down. With context and adequate logging levels, we started seeing real results out of these systems!
But then we started really studying the situation from a yield perspective. If you took a large enterprise, improved their logging configurations and had some context, the yield was about 5 findings for every 1 billion log lines: parts per billion. That means that while value was being generated, it was coming at a very high cost for the yield. For smaller companies the yield was generally in the parts/million category.
One can argue that the value of the findings outweighed the cost of the solution, but to evolve, sometimes we just have to look at things differently.
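To make the yield figure above concrete, here is a minimal arithmetic sketch. It uses only the numbers quoted in this post (about 5 findings per 1 billion log lines, and a platform ingesting 100 billion events/day). The function names are my own, purely for illustration, and applying the large-enterprise yield to the 100 billion/day volume is an assumption rather than a measured result.

```python
BILLION = 1_000_000_000


def findings_per_billion(findings: int, log_lines: int) -> float:
    """Normalise a count of findings to 'parts per billion' of log lines."""
    if log_lines <= 0:
        raise ValueError("log_lines must be positive")
    return findings * BILLION / log_lines


def lines_per_finding(findings: int, log_lines: int) -> float:
    """How many log lines had to be collected for each finding."""
    if findings <= 0:
        raise ValueError("findings must be positive")
    return log_lines / findings


def expected_findings(yield_ppb: float, daily_lines: int) -> float:
    """Findings per day if the same yield held at a given daily volume."""
    return yield_ppb * daily_lines / BILLION


# Figures quoted in the post: ~5 findings per 1 billion log lines.
yield_ppb = findings_per_billion(5, BILLION)
print(f"yield: {yield_ppb} findings per billion lines")
print(f"lines collected per finding: {lines_per_finding(5, BILLION):,.0f}")

# Assumption: the same yield applied to a 100 billion events/day platform.
print(f"findings/day at 100B events: {expected_findings(yield_ppb, 100 * BILLION)}")
```

Put another way, at that yield roughly 200 million log lines are hauled, stored and processed for every finding, which is the cost side of the argument.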
The Future – More Brains Are Better Than One
We need analytics as atomically close to the data as possible, so that we ship only the important data that has been extracted from the raw logs. We can still benefit from a big central “brain”, but we can gain efficiency by distributing a bunch of “little brains” across our products, eliminating the heavy lifting of hauling billions of records across miles of cyberspace. Figure 2 shows how, ideally, this would work.
Figure 2: Moving analytics closer to the data.
This will not happen immediately. We will still have to backhaul logs and telemetry from sources where we cannot run analytics locally. BUT it will reduce the amount of data we have to store (often redundantly) by 99% or greater. It also means that the centralized analytics engine can be based on different architectures and platforms, providing much greater flexibility than we’ve had.
Another benefit of this decentralized architecture is that we can push down new analytical approaches to the edge as they’re developed and optimized in the analytics engine.
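The “little brain” idea described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any product: the record format, the `little_brain` function, the `EdgeResult` type and the `failed_admin_login` rule are all hypothetical. The point is only that analytics run where the data lives, and just the extracted findings plus a compact summary travel to the central brain.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class EdgeResult:
    """What a hypothetical edge node ships upstream instead of raw logs."""
    forwarded: list[str] = field(default_factory=list)  # records worth central analysis
    summary: Counter = field(default_factory=Counter)   # per-event-type counts
    seen: int = 0                                       # raw records examined locally

    @property
    def reduction(self) -> float:
        """Fraction of raw records that were NOT shipped upstream."""
        if self.seen == 0:
            return 0.0
        return 1 - len(self.forwarded) / self.seen


def little_brain(records: Iterable[str],
                 is_interesting: Callable[[str], bool]) -> EdgeResult:
    """Run a detection rule locally; keep only findings and a summary."""
    result = EdgeResult()
    for record in records:
        result.seen += 1
        # Assumed format: the first space-separated token is the event type.
        result.summary[record.split(" ", 1)[0]] += 1
        if is_interesting(record):
            result.forwarded.append(record)
    return result


def failed_admin_login(record: str) -> bool:
    """Hypothetical rule that the central brain could push down to the edge."""
    return record.startswith("LOGIN_FAIL") and "user=admin" in record


# Made-up sample data: 99 routine events and one event worth forwarding.
raw = ["LOGIN_OK user=alice"] * 99 + ["LOGIN_FAIL user=admin"]
result = little_brain(raw, failed_admin_login)
print(result.forwarded, dict(result.summary), f"{result.reduction:.0%} not shipped")
```

Because the rule is passed in as a plain function, a newer rule developed centrally could be swapped in at the edge without changing the collection code, which is the push-down benefit described above.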
Godspeed Big Data Security Collection – Long Live Distributed Analytics
Security analytics professionals should not forget to connect across their enterprises and their industries, seeking advice and feedback from peers in similar or connected roles. Back in 2010 I remember beaming with pride because we were able to process a billion events/day. This was an accomplishment that just a few years prior seemed unattainable. One of my peers, upon hearing of this, said: “So what?” I replied that we could do so much with the data. She was running Security Operations at the time and said: “How does this make my life better?”
In 2013, when we started looking at the intersection of Big Data, Machine Learning and Cyber Security, one of my other trusted peers looked at me and said: “If the (redacted government agency) can’t make this work, what makes you think you can?”
Both were right but we learned a lot along the way as an industry. Expensive lessons. It’s time to say “Godspeed” to Centralized Big Data Security Analytics and look toward distributed analytics as a path forward.