It’s the wild, wild west out there in cyberspace, except the feral camels[1] that once roamed Texas are the hackers, and they’re roaming beyond borders and through firewalls daily.

At present, cyber threat intelligence gathering is a mish-mash of intrusion detection system logs, port scans, IP addresses, information sharing platforms, Twitter feeds and traditional write-ups. There is no one consistent language used across these platforms to refer to attacks, techniques or procedures, and there is no single source of data. Much like post-truth America, you’ve got to look in all the right places to piece together the whole story, and even then it’s hard to know if you’ve put the puzzle together the way it was intended. This makes it massively complex to understand the path an attacker has taken, but it also means there’s huge potential in leveraging the data, or bits (pun intended) of evidence, a hacker leaves behind.

Information Gathering and the Penetration Tester

Penetration testers, who are my focus here, do much of their work of figuring out attack paths and new ways to penetrate based on historical data or tried and true ways to compromise a system or application. They might listen to a few podcasts, keep an eye on social media, follow a hacking news website and sign up to a mailing list, but all of this is hugely labour intensive and no one person has the hours in the day to keep on top of, let alone be well versed in, all the latest attacks. The dream, of course, is to have a program or Artificial Intelligence learn the tactics, techniques and procedures of hackers out in the wild, bring it all back into a nice table where all the data is the same data type, turn it into a visualisation with a gorgeous dashboard and then teach the team new attacks on the fly as they happen in real time. This dream, as wondrous as it sounds, is hanging above the Magic Faraway Tree and is yet to be written down and sold as a four-set, gold-embossed collection. What we do have, and I’m focusing here on open source data and software, are many tools and data sets that can bring us just that little bit closer to a rousing monologue that could change the history of how we prevent cyber-attacks in the future.

Big Data Big Complexity

For data analysts, one of the problems with data on the internet is that it comes in many forms, with many definitions and no one universal dictionary to look up in order to know for sure what a word or a phrase means. Structured Threat Information Expression[2], or STIX, which was created under the sponsorship of the United States Department of Homeland Security and is used here in Australia by our own Cyber Security Centre, was designed to address this issue. It helps to standardise the way we talk about cyber threat intelligence so that we are all, in fact, having the same conversation in the same language. Some platforms, like MISP[3], a Malware Information Sharing Platform created by Christophe Vandeplas while he was working for the Belgian Defence Department, allow users to export the Indicators of Compromise (IOC) that they and others share on the platform in the STIX format. This actively aids the development of a threat intelligence language that we can use to talk back to one another and share with the various systems we all use. MISP itself is an interesting platform, with the public instance of it boasting more than 1000 organisational users from across the globe, including big players like Google, Apple and our own Federal Police. It’s great at gathering threat feeds that are readily usable for other machines to digest but, like every feed I’ve found to date, it tells only one part of the story of an attack or attempted attack. To tell the whole story, human research, interpretation and reasoning are needed, along with further data and frameworks, in order to map, or make sense of, what actually happened blow by blow. Mapping attacks is where MITRE’s ATT&CK Framework comes in. ATT&CK describes why an action was performed and the technique used to do it, which is often missing in publicly released reports or write-ups that gloss over the specifics of an attack. MITRE have even produced a STIX version of ATT&CK so you can output the data in a standardised format.
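To make the idea of ATT&CK expressed in STIX a little more concrete, here is a minimal Python sketch that reads a STIX-style bundle and pulls out each technique with its ATT&CK ID and tactic. The bundle below is a hand-written, heavily trimmed sample shaped like STIX 2.x JSON, with placeholder object IDs; it is not real ATT&CK output, and the function name techniques_from_bundle is my own invention for illustration.

```python
import json

# Hand-written sample shaped like a STIX 2.x bundle (assumption: trimmed to
# the few fields used below, with placeholder IDs; not real ATT&CK output).
SAMPLE_BUNDLE = """
{
  "type": "bundle",
  "id": "bundle--00000000-0000-4000-8000-000000000001",
  "objects": [
    {
      "type": "attack-pattern",
      "id": "attack-pattern--00000000-0000-4000-8000-000000000002",
      "name": "Phishing",
      "kill_chain_phases": [
        {"kill_chain_name": "mitre-attack", "phase_name": "initial-access"}
      ],
      "external_references": [
        {"source_name": "mitre-attack", "external_id": "T1566"}
      ]
    },
    {
      "type": "attack-pattern",
      "id": "attack-pattern--00000000-0000-4000-8000-000000000003",
      "name": "Command and Scripting Interpreter",
      "kill_chain_phases": [
        {"kill_chain_name": "mitre-attack", "phase_name": "execution"}
      ],
      "external_references": [
        {"source_name": "mitre-attack", "external_id": "T1059"}
      ]
    },
    {
      "type": "indicator",
      "id": "indicator--00000000-0000-4000-8000-000000000004",
      "pattern": "[ipv4-addr:value = '198.51.100.7']"
    }
  ]
}
"""


def techniques_from_bundle(bundle_text):
    """Return one row per attack-pattern object found in a STIX-style bundle."""
    bundle = json.loads(bundle_text)
    rows = []
    for obj in bundle.get("objects", []):
        # Only attack-pattern objects describe techniques; skip everything else.
        if obj.get("type") != "attack-pattern":
            continue
        # The ATT&CK ID sits in the external reference sourced from mitre-attack.
        technique_id = next(
            (
                ref.get("external_id")
                for ref in obj.get("external_references", [])
                if ref.get("source_name") == "mitre-attack"
            ),
            None,
        )
        # The tactic (the "why") is recorded as a kill chain phase.
        tactics = [
            phase["phase_name"]
            for phase in obj.get("kill_chain_phases", [])
            if phase.get("kill_chain_name") == "mitre-attack"
        ]
        rows.append(
            {"technique_id": technique_id, "name": obj.get("name"), "tactics": tactics}
        )
    return rows


if __name__ == "__main__":
    for row in techniques_from_bundle(SAMPLE_BUNDLE):
        print(row)
```

Because every object declares its type, the same loop can skip indicators and keep techniques without guessing at field names, which is the practical benefit of everyone speaking the same language.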

So Many Data Types So Little Time

Using a common language is not the only challenge when it comes to data mining threat intel, because when you’re out in the wild looking for feeds that deliver indicators of compromise or information, not all data is created equal. You’ll find XML, JSON, JavaScript, images and, if you’re lucky, APIs to query data in a more programmatic way. At this point you’ll need a good grasp of either Python or R to make HTTP requests for the data, much as you would when looking up a regular web address, and you’ll sometimes find purpose-built libraries, which are often written in Python. So whatever your language preference, R for beauty and simplicity or Python for a more smash and grab approach, both are good to have in your tool belt. Once you’ve pulled the data from various feeds and platforms, you’ll notice that you have to transform it into something much easier to work with than JSON key-value pairs, which is where data frames come in. Each data set will have particular information that doesn’t always match information in other data sets, so cleaning the data is a crucial activity too. After this, you’ll need to push it to an unstructured database of your choice. Then, and only then, can the magic happen: a genius yet simple way to collate masses of data and turn it into easy-to-digest threat intel, served with a side of sweet visualisation and predictive analytics in the making.
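The pull, transform and clean steps described above can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption made up for illustration: the two sample feeds, their field names, the COLUMN_ALIASES mapping and the helper names fetch_json, flatten and normalise. Real feeds will have their own shapes, and fetch_json is shown but not called because no real feed URL is assumed.

```python
import json
import urllib.request

# Output columns every row is mapped onto (hypothetical schema).
COLUMNS = ("indicator", "first_seen", "source")

# Maps each feed's own flattened field name to a shared column name.
# These field names are invented for the two sample feeds below.
COLUMN_ALIASES = {
    "ip": "indicator",
    "indicator.value": "indicator",
    "seen": "first_seen",
    "indicator.first_seen": "first_seen",
    "feed": "source",
    "meta.source": "source",
}

# Two made-up feeds that describe the same kind of data in different shapes.
FEED_A = [
    {"ip": " 198.51.100.7 ", "seen": "2019-03-01", "feed": "feed-a"},
    {"ip": "", "seen": "2019-03-02", "feed": "feed-a"},
]
FEED_B = [
    {"indicator": {"value": "EVIL.example.com", "first_seen": "2019-03-03"},
     "meta": {"source": "feed-b"}},
    {"indicator": {"value": "198.51.100.7", "first_seen": "2019-03-04"},
     "meta": {"source": "feed-b"}},
]


def fetch_json(url):
    """Make an HTTP GET request and decode the JSON body (not called here)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def flatten(record, parent_key="", sep="."):
    """Turn nested key-value pairs into a single level, e.g. 'meta.source'."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def normalise(records):
    """Flatten, rename, clean and de-duplicate records into uniform rows."""
    rows, seen = [], set()
    for record in records:
        row = {column: None for column in COLUMNS}
        for key, value in flatten(record).items():
            column = COLUMN_ALIASES.get(key)
            if column and value is not None:
                row[column] = str(value).strip()
        if not row["indicator"]:
            continue  # drop rows with no usable indicator
        row["indicator"] = row["indicator"].lower()
        if row["indicator"] in seen:
            continue  # keep only the first sighting of each indicator
        seen.add(row["indicator"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in normalise(FEED_A + FEED_B):
        print(row)
```

The result is a list of rows that all share the same columns, which is the shape a data frame wants; if pandas is installed, passing the list to pandas.DataFrame should give you a table ready for analysis or for loading into a database.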

The future of cyber analytics is now and I am excitedly working towards making the internet a more hospitable place. I would love to hear from you if you are too.

[1] https://www.history.com/news/10-things-you-didnt-know-about-the-old-west

[2] https://oasis-open.github.io/cti-documentation/

[3] https://www.misp-project.org/index.html

Originally published by the Australian Cyber Security Magazine.
