
HP Security Briefing, episode 14 - malicious file visualization and clustering

Oleg_Petrovsky ‎06-26-2014 04:20 PM - edited ‎09-11-2014 11:36 AM

In this month’s Security Briefing, we conduct a number of experiments with file geometry visualization and clustering algorithms on malicious and clean files using the R language. You can listen to this episode of the HP Security Briefing podcast on the Web or via iTunes, and you can read or download the detailed companion report here.


The huge and increasing number of malware samples that AV companies have to deal with on a daily basis poses a problem: how do you efficiently and accurately process and label such a large number of incoming files? With the advent of cloud-based services, Big Data and increasing computational power, many put their hopes in machine-learning classification algorithms. In this security briefing we describe a hands-on approach and show the basic principles used in the automated classification of files.


Once you start analyzing malware files, you find that different variants of a malware family are generally derived from a similar codebase and hence might exhibit similar file structures. Taking this into consideration, we applied freely accessible tools such as PEStudio (1) and the R language (2) to create various visual representations of a file’s properties. We then studied how such properties and their variations could help us to group and label similar malware families.


For instance, processing a set of files with PEStudio allowed us to parse malware files and extract their file attributes into an XML file. We then used R to generate data frames from that XML. R is a language and environment for statistical computing and graphics and, as such, it provides a wide variety of statistical techniques such as linear and nonlinear modelling, time-series analysis, classification, clustering and more. R also offers powerful graphical visualization capabilities and is highly extensible. Once the attributes were extracted into data frames, we could apply various visualization techniques to gauge how useful such file attributes might be for grouping. For instance, a parallel graph could give us an idea of how tightly coupled files are based on the selected attributes.
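The extraction step can be sketched roughly as follows. The experiments used R data frames; this sketch uses Python only for illustration. The XML layout (`reports`, `file`, `attribute`) and the attribute names are hypothetical and are not PEStudio's actual output schema, and the values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout: the element and attribute names below are
# invented for illustration and are NOT PEStudio's real schema.
SAMPLE_XML = """
<reports>
  <file name="sample_a.exe">
    <attribute name="size">45056</attribute>
    <attribute name="sections">4</attribute>
    <attribute name="imports">37</attribute>
  </file>
  <file name="sample_b.exe">
    <attribute name="size">47104</attribute>
    <attribute name="sections">4</attribute>
    <attribute name="imports">41</attribute>
  </file>
</reports>
"""


def xml_to_rows(xml_text):
    """Return one dict per <file>, mapping attribute names to float values."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for file_node in root.iter("file"):
        row = {"name": file_node.get("name")}
        for attr in file_node.iter("attribute"):
            row[attr.get("name")] = float(attr.text)
        rows.append(row)
    return rows


# Each row plays the role of one record in a data frame.
rows = xml_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

A list of per-file dictionaries is the simplest stand-in for a data frame: one row per file, one key per attribute.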



Parallel graph based on a number of file attributes - Gamarue worm family
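Assuming the parallel graph is a parallel coordinates plot, attributes with very different ranges (file size versus section count, say) are usually rescaled to a common range before plotting, so that each vertical axis is comparable. Below is a minimal Python sketch of that rescaling step. It is not taken from the report, and the attribute names and values are made up.

```python
def min_max_scale(rows, columns):
    """Rescale each listed column to [0, 1] so all axes share one range."""
    scaled = [dict(row) for row in rows]  # copy so the input is not modified
    for col in columns:
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in scaled:
            # A constant column carries no information; pin it to mid-axis.
            row[col] = 0.5 if span == 0 else (row[col] - low) / span
    return scaled


# Made-up attribute values, for illustration only.
files = [
    {"name": "a", "size": 40000.0, "sections": 4.0, "imports": 30.0},
    {"name": "b", "size": 50000.0, "sections": 4.0, "imports": 40.0},
    {"name": "c", "size": 45000.0, "sections": 4.0, "imports": 50.0},
]

columns = ["size", "sections", "imports"]
for row in min_max_scale(files, columns):
    print(row["name"], [row[c] for c in columns])
```

Files whose scaled lines run close together across all axes are the ones a reader would judge to be tightly coupled.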


Going further, we investigated how a simple clustering algorithm such as k-means clustering would group malware files based on selected attributes. In this study we used data from two malware families. We generated various clustering plots, such as the one below, which allowed us to visualize the quality of our grouping of files based on selected attributes.



 Clustering plot based on the extended set of attributes - Ursnif malware family
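For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the standard Lloyd's iteration. It is not the code behind the plot above, which was produced in R. The two-dimensional points are made-up, already-scaled attribute values standing in for files from two families.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: return (centroids, labels) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Made-up scaled attribute pairs: three files per hypothetical family.
points = [
    (0.10, 0.20), (0.15, 0.25), (0.12, 0.18),
    (0.90, 0.80), (0.85, 0.90), (0.88, 0.82),
]
centroids, labels = kmeans(points, k=2)
print(labels)
```

With well-separated groups like these, the labels split cleanly into two sets. Real file attributes are rarely this tidy, which is why the quality of the grouping needs to be inspected visually.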


We concluded that file geometry alone might not be sufficient for accurate grouping and that other sets of attributes need to be explored and considered. Such attributes could be the results of static and behavioral analysis of code, section entropies, imported and exported APIs and combinations of the above. However, the methodology we provided here might allow for the selection, testing and assessment of possible attributes for the clustering of malware families.
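Of the candidate attributes mentioned, section entropy is the easiest to illustrate. The sketch below computes Shannon entropy in bits per byte over a byte string. It is an illustration in Python, not code from the report, and it assumes the raw bytes of a section have already been extracted by some other tool.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; ranges from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Stand-in byte strings; real input would be the bytes of a PE section.
padding_like = b"\x00" * 1024          # a single repeated byte value
uniform_mix = bytes(range(256)) * 4    # every byte value equally often

print(shannon_entropy(padding_like))
print(shannon_entropy(uniform_mix))
```

Packed or encrypted sections tend to score near the top of the range, so a per-section entropy column can be added to the attribute table alongside the geometry attributes.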


You can download the detailed companion report here. If you’re interested in our experiments, we’d really like to hear from you - particularly if you’ve managed to replicate our methods and produced your own visualizations and meaningful clusters. We might even be able to publish your graphs in a follow-up blog post.



  1. PEStudio
  2. The R Project for Statistical Computing
