Security Research
Showing results for 
Search instead for 
Do you mean 

Real-time C2 blocking via Random Domain name identification

spovolny ‎12-09-2013 01:13 PM - edited ‎02-24-2014 10:03 AM

With credit to Brandon Niemczyk
HP TippingPoint DVLabs


Many pieces of malware use automated domain generation to increase the robustness of their network and avoid being blacklisted. It puts much more burden onto the party trying to block by DNS blacklisting.


Why can AV vendors not just reverse the algorithms and sink-hole all the domains that will be generated?

The traditional method of reversing the domain generation algorithm and pre-determining the domains to block has 2 major pitfalls:


  1. The time it takes to reverse engineer is longer than the time it takes to write a new generation algorithm, leading to an arms race the good guys can never win.
  2. Domain algorithms may include seed information unknown at the time of writing and reversing, such as the 4th most popular YouTube video on the day of generation. So that even if algorithm is fully understood and reversed, there is still no way to know ahead of time what the generated domains will be.

0-day C2 coverage

Using a statistical approach allows domains to be discovered and blocked real-time, without any prior knowledge of the malware, protecting patient zero.


Method Overview

We want to look at the domain name and automatically decide if it is meant for humans to read, or if it is likely generated by an algorithm. Intuitively, this is easy for a human to do, but requires a bit more thought to automate. So we will leverage old-school, cipher-analysis techniques but instead of posing the question, “Does this key applied to this text produce valid output?” we will ask “Is this string likely to be human readable?”


We are going to need to select some data points to analyze. The simplest solution is to associate a probability of showing up in English with every character (called 1-grams) and character pair (called 2-grams). Then by analyzing the probability of a certain string being generated, we can decide if it is likely to be English. The upside to this is it is easy to generate the probabilities and does not require any understanding of the semantics in the language we are targeting. In fact, given a string, we just take all the probabilities for 1-grams and 2-grams and take the mean of each type. Those 2 numbers can be used to determine how likely a string is to be our target language.


To understand why these will be useful data points to look at, let’s take a look at histograms of the means for datasets of both English and randomly-generated domains.


In the following histograms, the x-axis is the mean probability of occurrence in English and the y-axis is the number of occurrences in the training data. As an illustration, a random 2nd level domain may be "jfkdslvjiewnfi0e2jlfjksl" and an English 2nd level domain may be "facebook."


2-Grams (example: AA - ZZ)

  • English domain 2-gram mean distribution


  • Random domain 2-gram mean distribution

1-Grams (example: A - Z)

  • English domain 1-gram mean distribution
  • Random domain 1-gram mean distribution

By looking at these histograms, we can see 2 immediately useful things:


  1. They seem to follow a normal distribution
  2. Overlap between English domains and Random domains does exist, but is minimal. So by using both inputs we should be able to classify correctly a large percentage of the time.

The algorithm

So the idea is simple, when given a domain do the following:


  1. Parse out the second level domain [It is important we use only the second level domain because there are legit uses of random subdomains (Akamai, AEWS, etc.), and we do not want these to be flagged as malicious.]
  2. Calculate the mean 1-gram and 2-gram probabilities for 2nd level domain.
  3. Map the means for the domain to a probability that the domain is random. (In the following section we will define how to do this.)

A function to map means to probability of being random

The simplest solution is to view mean 1-gram, mean 2-gram a

s a point on a 2-D plane and find a separating line for all our training data. This could be done with logistic regression. But this generates more false positives and false negatives than is acceptable, so we need to find a non-linear solution.


Finding the mapping function

We wrote a fuzzy logic module for Mathematica which takes a set of fuzzy rules and translates them into equations, here are the fuzzy logic rules we used:


$\text{very}(mf \neq englishmf) \vee \text{very}(mtp \neq englishmtp) \vee (mf \neq englishmf \wedge mtp \neq englishmtp) \mapsto \text{RANDOM}$


$1<13 \vee \neg \text{RANDOM} \mapsto \text{ENGLISH}$


$englishmf \text{ and } englishmtp$ are parameterized gaussian functions scaled to peak at 1.0. Equality with them is by taking their value; inequality is 1 - taking the value. So equality and inequality return a value between 0 and 1, the very() function squares that value, making it so it must be higher to pass.


The fuzzy logic module then takes these rules and generates a large parameterized equation with parameters $\Theta$ for the distributions that represent englishmf and englishmtp. We can then use whatever method we want to find the best $\Theta$ such that RANDOM values map to 0 and ENGLISH values map to 1 as often as possible. After fitting using a squared error cost function, we ended up with the equation in figure 2. Which, when ignoring the $1$ dimension, can be rendered (see Figure 1).


Figure 1 (Graph of final mapping function)




Figure 2 (Final mapping function)


$ \max[ \\ \text{ } 0.5 + 0.5 \text{ erf}(14142.1 - 707.107 l), \\ \text{ } 1 - \max[ \text{ } (-1 + 2.71828^{-1371.59(-0.0759477+mf)^2})^2, \\ \text{ } (-1 + 2.71828^{-445.707(-0.114675+mtp)^2})^2, \\ \text{ } \min[1 - 2.71828^{-1371.59(-0.0759477+mf)^2}, 1 - 2.71828^{-445.707 (-0.114675+mtp)^2}]]] \\ $

Where erf = error function, l= length of 2nd level domain, mf= mean 1-gram probability, and mtp=mean 2-gram probability



After training, we created a testing set with 5000 random domains and 5000 English domains, then ran this algorithm against them, the results are below with blue dots for the English domains and red dots for random ones.



There are a total of 114 random domains that have a value > 0.5 and only 8 English domains that have a value < 0.5, giving a 1.22% error rate and only 0.16% false positive rate.



Statistical analysis of domain names can be a very useful way to find C2 traffic for many malware families currently in the wild, including unknown malware. There are many ways to do this. It is easy to get around (just smash English words together for your domain generation algorithm), and is certainly not a silver bullet, but it can be useful. We used it for labeling DNS data to do further research on finding infected hosts using Markov models which we presented at VirusBtn 2013.


Much research in this area has been done. Damballa has done work in attributing a particular domain name with a specific malware family.

About the Author


Steve Povolny is a Senior Manager for DVLabs Security Research and Development teams at HP TippingPoint.

on ‎12-20-2013 06:02 AM

Did You also test with some non-english, for example German or some East European language? What About Polish, Czech, Slovak, will it still work?


Anyhow, I like this a lot.




on ‎02-05-2014 02:48 AM

Lovely post and correlates exactly with the trends we see of domain names containing english words smashed together instead of jibberish.

June 6 - 8, 2017
Las Vegas, Nevada
Discover 2017 Las Vegas
Join us for HPE Discover 2017 in Las Vegas. The event will be held at the Venetian | Palazzo from June 6-8, 2017.
Read more
Each Month in 2017
Software Expert Days - 2017
Join us online to talk directly with our Software experts during online Expert Days. Find information here about past, current, and upcoming Expert Da...
Read more
View all