How does AI predict cyberattacks?
Cyberattacks are designed to be hidden, but AI is shining new light on these threats earlier than ever before. Leidos is a prime contractor on an IARPA research program that uses AI-ML to observe the early stages of cyberattacks through unconventional signals, including tweets and other open-source data. The research, which was featured earlier this month in WIRED, demonstrates how AI-ML is accelerating threat intelligence and changing the cybersecurity landscape. To learn more, we welcome Dr. Graham Mueller, a senior research scientist on the project.
Q: The CAUSE program, which you worked on, revolves around AI-ML in cyber threat intelligence. What was the goal of the program?
Graham: Cyberattacks generally develop in the phases outlined in the cyber kill chain. In this chain of events, one of the early stages is the surveillance stage, where a hacker scans their target’s infrastructure or gathers other open-source intelligence about the target’s key personnel for targeted phishing emails. The attacks evolve from there. Our goal was to observe the early stages of an attack using unconventional data sources that serve as indicators or signals of the attack. Tweets are one of these unconventional signals, but we looked at many other data sources, including content on the dark web and open-source software repositories. Much of our research was devoted to developing useful sensors from these unconventional data sources, which were used as input to our ML-driven prediction models.
Q: Why has detecting cyberattacks in the early phases traditionally been so difficult, and why is this a problem uniquely suited for AI-ML?
Graham: First, cyberattacks are designed to be hidden. If a hacker is setting up infrastructure to deliver malicious email campaigns, for example, they don’t want you to know about it. They don’t want to be identified, and they will actively take measures to be hidden.
The second problem is called a “web scale” information problem. Even if you set up the ability to monitor these things in real time, the amount of information produced on a daily basis is huge. This is one of the main reasons ML solutions are so useful in this area. Traditional cyber defense systems are often signature-based, which means they protect against things they’ve seen before based on certain rules. This approach is very brittle, because a hacker could just change the name of a malicious domain so that it would no longer be blacklisted. ML approaches can be so effective because they’re probabilistic. They can take something you’ve never seen before and make some prediction about it based on how the system is trained, so you don’t have to rely on a strict, rule-based system to block attacks.
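To make the contrast concrete, here is a minimal sketch of the two approaches applied to domain names. It is an illustration only, not the program's actual models: the blacklist, the training domains, and the names `signature_check` and `NgramNB` are all hypothetical, and the classifier is a small character-trigram Naive Bayes chosen for brevity.

```python
import math
from collections import Counter

# Hypothetical blacklist: the signature-based defense only knows exact matches.
BLACKLIST = {"evil-login.example"}

def signature_check(domain):
    """Return True only if the domain was seen (and listed) before."""
    return domain in BLACKLIST

def trigrams(text):
    """Split a string into overlapping character trigrams with boundary marks."""
    padded = f"^{text}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class NgramNB:
    """Tiny Naive Bayes over character trigrams (labels: 1 malicious, 0 benign)."""

    def __init__(self):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter()
        self.vocab = set()

    def fit(self, domains, labels):
        for domain, label in zip(domains, labels):
            self.counts[label].update(trigrams(domain))
            self.docs[label] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def _log_score(self, domain, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in trigrams(domain):
            # Add-one smoothing lets unseen trigrams still receive a probability.
            score += math.log((self.counts[label][gram] + 1) / (total + len(self.vocab)))
        return score

    def prob_malicious(self, domain):
        benign = self._log_score(domain, 0)
        malicious = self._log_score(domain, 1)
        top = max(benign, malicious)
        e_benign, e_malicious = math.exp(benign - top), math.exp(malicious - top)
        return e_malicious / (e_benign + e_malicious)

# Made-up training data, for illustration only.
train_domains = [
    "evil-login.example", "secure-login-update.example", "account-verify-login.example",
    "weather.example", "news.example", "recipes.example",
]
train_labels = [1, 1, 1, 0, 0, 0]
model = NgramNB().fit(train_domains, train_labels)

renamed = "evil-login2.example"  # the attacker changed one character
print("blacklist hit:", signature_check(renamed))
print("estimated probability of malicious:", round(model.prob_malicious(renamed), 3))
```

The renamed domain slips past the exact-match blacklist, while the probabilistic model still scores it as likely malicious because it shares character patterns with the malicious examples it was trained on. A real system would use far more data and richer features than this toy example.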
Q: What else do you see at the intersection of AI and cybersecurity? What other problems are you and your team of researchers looking into?
Graham: The big thing I think about is managing information. With the cybersecurity and threat intelligence landscape, part of the future with AI-ML is to bubble up the most pressing threats to organizations. There’s so much information coming at you that it’s very difficult to know which attacks are the most threatening. AI-ML tools have the potential to help us identify these threats both as they occur and at the scale at which they occur, allowing us to respond.
Q: In your experience, what are the critical success factors in developing AI-ML solutions that solve real problems?
Graham: It all comes down to the data: identifying data sources that could give you additional time to take preventative measures against an attack. There’s always a need to provide actionable information to decision makers, rather than just a high-level overview of what’s happening. You need the right data that tells you specific things relevant to your networks and systems in order to adjust your defenses accordingly.
One of the bottlenecks in developing machine learning applications is the need for large, labeled data sets that are used to train the ML models. Significant effort is needed to first collect the data and then manually annotate it, which is hugely time-consuming. The approach we took was to use “weak supervision” to label data very quickly so that we could put it to use. This allowed us to develop custom cyber-focused event extraction tools, which we used as input to our ML models.
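As a rough illustration of the weak supervision idea, the sketch below labels text with a few cheap heuristic "labeling functions" and combines their votes, instead of annotating each example by hand. The interview does not describe the team's actual tooling, so everything here is an assumption: the heuristics, the keyword lists, the function names, and the simple majority vote are hypothetical stand-ins.

```python
import re
from collections import Counter

ATTACK, BENIGN, ABSTAIN = 1, 0, -1

# Each labeling function is a cheap, noisy heuristic (all hypothetical).
def lf_mentions_cve(text):
    """Posts citing a CVE identifier often discuss an exploit."""
    return ATTACK if re.search(r"CVE-\d{4}-\d{4,}", text) else ABSTAIN

def lf_attack_words(text):
    """Posts using attack vocabulary are likely about an attack."""
    words = ("ransomware", "ddos", "data breach", "phishing")
    return ATTACK if any(w in text.lower() for w in words) else ABSTAIN

def lf_marketing(text):
    """Promotional posts are usually not reports of an attack."""
    words = ("webinar", "discount", "hiring")
    return BENIGN if any(w in text.lower() for w in words) else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_cve, lf_attack_words, lf_marketing]

def weak_label(text):
    """Combine the noisy votes by majority; abstain on ties or when no function fires."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    if counts[ATTACK] == counts[BENIGN]:
        return ABSTAIN
    return ATTACK if counts[ATTACK] > counts[BENIGN] else BENIGN

posts = [
    "Ransomware campaign exploiting CVE-2017-0144 is spreading",
    "Join our webinar, we are hiring",
    "Nice weather today",
]
for post in posts:
    print(weak_label(post), post)
```

The labels produced this way are noisy, but they can be generated for large volumes of text in seconds and then used to train a downstream model. Weak supervision frameworks typically replace the majority vote with a model that estimates how reliable each labeling function is.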
Q: What’s next? What do you hope to achieve in future AI-ML projects?
Graham: Our overall goal is to provide real value to the cybersecurity world by providing actionable information and decreasing detection time. We’re focused on providing indicators of malicious activity in real time. Moving forward, we want to quickly develop algorithms that extract important information from huge streams of data and summarize it. That’s what is so exciting about our research.