Security Research

HP Security Briefing, episode 14 - malicious file visualization and clustering

‎06-26-2014 04:20 PM - edited ‎09-11-2014 11:36 AM

In this month’s Security Briefing, we conduct a number of experiments with file geometry visualization and clustering algorithms on malicious and clean files using the R language. You can listen to this episode of the HP Security Briefing podcast on the Web or via iTunes, and you can read or download the detailed companion report here.

 

The huge and increasing number of malware samples that AV companies have to deal with on a daily basis poses a problem - how do you efficiently and accurately process and label such a large number of incoming files? With the advent of cloud-based services, Big Data and increasing computational power, many put their hopes in machine-learning classification algorithms. In this security briefing we attempt to describe a hands-on approach and show the basic principles used in the automated classification of files.

 

Once you start analyzing malware files, you find that different variants of a malware family are generally derived from a similar codebase and hence might exhibit similar file structures. Taking this into consideration, we used freely accessible tools such as PEStudio (1) and the R language (2) to create various visual representations of a file’s properties. We then studied how such properties and their variations could help us to group and label similar malware families.

 

For instance, processing a set of files with PEStudio allowed us to parse malware files and extract their file attributes into XML. We then used R to generate data frames from that XML. R is a language and environment for statistical computing and graphics and, as such, it provides a wide variety of statistical techniques such as linear and nonlinear modelling, time-series analysis, classification, clustering and more. R also has powerful graphical visualization capabilities and is highly extensible. Once the attributes were in data frames, we could apply various visualization techniques to gauge how useful such file attributes might be for grouping. For instance, a parallel graph (a parallel coordinates plot) could give us an idea of how tightly coupled files are based on the selected attributes.
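The XML-to-table step can be sketched as follows. This is a minimal illustration, not the code from the report: the experiments used R data frames, while Python is used here only for illustration. The XML layout, the element names (`file`, `attribute`) and the attribute names (`size`, `sections`, `imports`) are all assumptions, since the actual PEStudio report schema is not shown in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: element and attribute names are assumptions,
# not the real PEStudio schema, and the values are made up.
SAMPLE_XML = """
<reports>
  <file name="sample_a.exe">
    <attribute name="size">45056</attribute>
    <attribute name="sections">4</attribute>
    <attribute name="imports">12</attribute>
  </file>
  <file name="sample_b.exe">
    <attribute name="size">47104</attribute>
    <attribute name="sections">4</attribute>
    <attribute name="imports">15</attribute>
  </file>
</reports>
"""

def xml_to_rows(xml_text):
    """Return one dict per <file> element, with attribute values as floats."""
    root = ET.fromstring(xml_text)
    rows = []
    for file_node in root.iter("file"):
        row = {"name": file_node.get("name")}
        for attr in file_node.iter("attribute"):
            row[attr.get("name")] = float(attr.text)
        rows.append(row)
    return rows

rows = xml_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Each dict plays the role of one data frame row: one file per row, one column per extracted attribute.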

 

 

Parallel graph based on a number of file attributes - Gamarue worm family
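Parallel coordinate plots are commonly drawn with every axis rescaled to its own range, so that attributes with very different magnitudes (file size versus section count, say) can share one vertical scale. The sketch below shows only that rescaling step, in Python for illustration rather than the R used in the report. The column names and values are made up and are not data from the Gamarue samples.

```python
def min_max_scale(rows, columns):
    """Rescale each listed column to [0, 1] so all axes share one range."""
    scaled = [dict(row) for row in rows]
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for row in scaled:
            # A constant column has no spread; pin it to the middle of the axis.
            row[col] = 0.5 if span == 0 else (row[col] - lo) / span
    return scaled

# Made-up values for illustration only.
rows = [
    {"name": "a", "size": 45056, "sections": 4, "imports": 12},
    {"name": "b", "size": 47104, "sections": 4, "imports": 15},
    {"name": "c", "size": 49152, "sections": 4, "imports": 18},
]
columns = ["size", "sections", "imports"]
scaled = min_max_scale(rows, columns)
for row in scaled:
    # Each printed list is one polyline: its height on each parallel axis.
    print(row["name"], [round(row[c], 2) for c in columns])
```

Files from a tightly coupled family would trace nearly overlapping polylines across the axes, while unrelated files would cross and diverge.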

  

Going further, we investigated how a simple clustering algorithm such as k-means clustering would group malware files based on selected attributes. In this study we used data from two malware families. We generated various clustering plots, such as the one below, which allowed us to visualize the quality of our grouping of files based on selected attributes.


 

 Clustering plot based on the extended set of attributes - Ursnif malware family
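For readers unfamiliar with k-means, the following is a minimal sketch of the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is not the code used in the report, which relied on R. Two things here are assumptions made for the sake of a reproducible example: the deterministic farthest-first choice of starting centroids, and the two-attribute sample points, which are made up and already scaled to [0, 1].

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on a list of equal-length tuples: (centroids, labels)."""
    # Deterministic farthest-first start so the sketch is reproducible;
    # library implementations typically use random starting centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Made-up, pre-scaled attribute pairs standing in for two families.
points = [
    (0.10, 0.20), (0.15, 0.25), (0.12, 0.18),
    (0.80, 0.90), (0.85, 0.88), (0.78, 0.92),
]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Comparing the resulting labels with the known family of each file gives a rough, numeric counterpart to judging grouping quality from a clustering plot.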

 

We concluded that file geometry alone might not be sufficient for accurate grouping and that other sets of attributes need to be explored and considered. Such attributes could be the results of static and behavioral analysis of code, section entropies, imported and exported APIs and combinations of the above. However, the methodology we provided here might allow for the selection, testing and assessment of possible attributes for the clustering of malware families.
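As an illustration of one such candidate attribute, section entropy is usually computed as the Shannon entropy of a section's raw bytes. The sketch below, in Python for illustration, assumes the section bytes have already been extracted by some other tool. The function name and the sample inputs are hypothetical and do not come from the report.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Made-up inputs standing in for section contents.
padding = b"\x00" * 1024            # constant bytes, as in zero padding
uniform = bytes(range(256)) * 4     # every byte value equally frequent
print(shannon_entropy(padding))
print(shannon_entropy(uniform))
```

Values close to 8 bits per byte are typical of compressed or encrypted data, which is one reason entropy is often considered as a feature when looking at packed samples.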

 

You can download the detailed companion report here and, if you’re interested in our experiments, we’d really like to hear from you - particularly if you’ve managed to replicate our methods and produced your own visualizations and meaningful clusters. We might even be able to publish your graphs in a follow-up blog post.

 

References:

  1. PEStudio - http://www.winitor.com/
  2. The R Project for Statistical Computing - http://www.r-project.org/

About the Author

Oleg_Petrovsky
