Machine Learning Can Use Tweets to Spot Critical Security Flaws

Researchers built an AI engine that uses tweets to predict the severity of software vulnerabilities with 86 percent accuracy.

At the endless booths of this week's RSA security trade show in San Francisco, an overflowing industry of vendors will offer any visitor an ad nauseam array of "threat intelligence" and "vulnerability management" systems. But it turns out that there's already a decent, free feed of vulnerability information that can tell systems administrators what bugs they really need to patch, updated 24/7: Twitter. And one group of researchers has not only measured the value of Twitter's stream of bug data but is also building a piece of free software that automatically tracks it to pull out hackable software flaws and rate their severity.

Researchers at Ohio State University, the security company FireEye, and research firm Leidos last week published a paper describing a new system that reads millions of tweets for mentions of software security vulnerabilities and then, using their machine-learning-trained algorithm, assesses how much of a threat they represent based on how they're described. They found that Twitter can not only predict the majority of security flaws that will show up days later on the National Vulnerability Database—the official register of security vulnerabilities tracked by the National Institute of Standards and Technology—but also that natural language processing can roughly predict which of those vulnerabilities will be given a "high" or "critical" severity rating, with better than 80 percent accuracy.
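The paper's model is more sophisticated, but the core idea of scoring severity from a tweet's language can be sketched with an off-the-shelf text classifier. The snippet below is a minimal illustration, not the researchers' actual pipeline; the example tweets, labels, and the specific model choices are invented for demonstration.

```python
# Minimal sketch: score tweet text for vulnerability severity with a
# bag-of-words classifier. Illustrative only; the paper's model and
# training data are more sophisticated. Example tweets are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled tweets (1 = severe, 0 = routine).
tweets = [
    "Critical RCE in Foo Server 2.3, unauthenticated, exploit in the wild",
    "Minor info leak in Bar CLI when verbose logging is enabled",
    "Heap overflow in libbaz lets attackers execute arbitrary code",
    "Typo in error message fixed in qux 1.0.1",
]
labels = [1, 0, 1, 0]

# TF-IDF over word n-grams feeding a logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(tweets, labels)

# predict_proba yields a severity score usable for ranking new tweets.
new_tweet = "Unauthenticated remote code execution reported in Foo Server"
print(model.predict_proba([new_tweet])[0][1])  # probability of "severe"
```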

"We think of it almost like Twitter trending topics," says Alan Ritter, an Ohio State professor who worked on the research and will be presenting it at the North American Chapter of the Association for Computational Linguistics in June. "These are trending vulnerabilities."

A work-in-progress prototype they've put online, for instance, surfaces tweets from the last week about a fresh vulnerability in MacOS known as "BuggyCow," as well as an attack known as SPOILER that could allow webpages to exploit deep-seated vulnerabilities in Intel chips. Neither of the attacks, which the researchers' Twitter scanner labeled "probably severe," has shown up yet in the National Vulnerability Database.

The prototype, they admit, isn't perfect. It updates only once daily, includes some duplicates, and in WIRED's checks missed some vulnerabilities that later showed up in the NVD. But Ritter argues that the research's real advance is in accurately ranking the severity of vulnerabilities through automated analysis of human language. That means it could someday serve as a powerful aggregator of fresh vulnerability information for systems administrators trying to keep their systems protected, whether as a component in commercial vulnerability data feeds or as an extra, free feed, weighted for importance, for those admins to consider. "We want to build computer programs that can read the web and extract early reports of new software vulnerabilities and also analyze users’ opinions of how severe they might be," he says. "Is this a routine bug that developers might need to fix, or a major flaw that could really leave people exposed to attack?"

The general idea of extracting software vulnerability data from text on the web, and even Twitter specifically, has been around for years. Ranking the severity of tweeted vulnerabilities via natural language processing is an "added twist," says Anupam Joshi, a professor at the University of Maryland, Baltimore County who has focused on the same problem. "There's a growing interest in finding vulnerability descriptions when they’re talked about" online, Joshi says. "People are recognizing that you can get early warning signs from things like Twitter, but also Reddit posts, the dark web, and discussions on blogs."

In their experiment, the Ohio State, FireEye, and Leidos researchers began by taking a subset of 6,000 tweets they'd identified as discussing security vulnerabilities. They showed them to a group of Amazon Mechanical Turk workers, who labeled them with human-generated severity rankings, filtering out results from any outliers who drastically disagreed with other readers. Then the researchers used those labeled tweets as training data for a machine learning engine and tested its predictions. Looking five days ahead of a vulnerability's inclusion in the National Vulnerability Database, they could predict the severity of the 100 most severe vulnerabilities, based on the NVD's own severity ranking, with 78 percent accuracy. For the top 50, they could predict the bugs' severity with 86 percent accuracy, and for the NVD's 10 most severe vulnerabilities, with 100 percent accuracy.
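That top-k evaluation can be made concrete with a short sketch: take the k vulnerabilities the NVD rates most severe and measure how often the model also flagged them as severe. The function and data below are hypothetical stand-ins for illustration, not the paper's code or results.

```python
# Sketch of a top-k severity evaluation: among the k entries the NVD
# rates most severe, what fraction did the model also flag as severe?
# All identifiers and scores here are made up for demonstration.

def accuracy_at_top_k(nvd_scores, predicted_severe, k):
    """nvd_scores: dict mapping CVE id -> NVD severity score.
    predicted_severe: set of CVE ids the model flagged as high/critical.
    Returns the fraction of the NVD's top-k that the model also flagged."""
    top_k = sorted(nvd_scores, key=nvd_scores.get, reverse=True)[:k]
    hits = sum(1 for cve in top_k if cve in predicted_severe)
    return hits / k

# Toy example with invented CVE ids and scores.
nvd = {"CVE-A": 9.8, "CVE-B": 9.1, "CVE-C": 5.0, "CVE-D": 3.1}
flagged = {"CVE-A", "CVE-C"}
print(accuracy_at_top_k(nvd, flagged, k=2))  # 0.5
```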

Ohio State's Ritter cautions that despite those promising results, their automated tool probably shouldn't be used as anyone's sole source of vulnerability data—and that at the very least, a human should click through to the underlying tweet and its linked information to confirm its findings. "It still requires people to be in the loop," he says. He suggests that it might be best used, in fact, as a component in a broader feed of vulnerability data curated by a human being.

But given the accelerating pace of vulnerability discovery and the growing sea of social media chatter about them, Ritter suggests it might be an increasingly important tool to find the signal in the noise. "Security has gotten to the point where there's too much information out there," he says. "This is about creating algorithms that help you sort through it all to find what’s actually important."

