Crypto-Analysis in Shellcode Detection
Probably the biggest computer threats nowadays are the Exploits. Exploits seek out the vulnerabilities of an application to launch their malicious activities. Cyber criminals target popular applications such as Shockwave Flash and Adobe Acrobat PDF to keep the chances high that a user's computer is vulnerable. In this blog we will examine a Flash exploit using a very simple crypto-analysis technique we call X-ray.
Crypto-analysis of malicious code is not a new technology or invention. It has been used in fighting MS-DOS viruses since the '90s. This article provides an in-depth, detailed discussion on this subject, explaining how it works and how it can be used for malicious content detection in shell code.
First we need to understand the X-ray technique and how it works, and then we can see how it helps us to analyze and detect malicious content in shell code. X-ray is basically a differential crypto-analysis method which is a very easy way to attack simple encrypted data. What we assume is that when a simple block encryption algorithm is used, the difference between the consecutive data blocks remains the same.
One very good way to explain this is to encrypt a picture and then try decrypting it. Take a look at this picture:
The picture does not tell us much, except that we can see that it is encrypted. It looks random enough, even though we can spot some repetition. In fact the algorithm used is very simple stream ciphering with some avalanche effect. The result is a picture that suggests very little about itself. However, when we generate the difference in between the consecutive bytes, we get this:
Ah-ha! Now we see that this is the logo of our secret weapon against Internet threats. :-) (See the original graphic below)
Now, no wonder it is called X-ray! We may not see the 'skin', but we clearly see the 'bones'. The resulting picture is far from the original one, but is good enough to see what it was. Nice, but how does it work?
To understand, we need to get into the math behind cryptography. Take a look at this very simple block ciphering algorithm. We have a message of:
Where M is the n length of plaintext message and m is the block of the message (typically a character).
In order to get the ciphered message of:
Where C is the n length of ciphertext (encrypted) message and c is the block of the message (typically a character), we need to apply an encryption to each one of the message blocks using the same key:
Where E is the encryption algorithm using k key.
When E encryption algorithm is a simple XOR using the same k key on each block, then
the (above) formula gives us the encrypted stream. Usually we see this simple method in shell code with byte size blocks. In other words, each one of the characters of clear text is simply XORed with a constant (see the pseudo code). The reason this kind of encryption is so popular is that it is easy to understand and it is also easy to obfuscate the data and the code sections enough to avoid detection by a simple string detection engine.
Now we can note that a simple differential analysis will easily decypher this kind of encryption:
Because XOR is commutative, we can remove the brackets and reorganize the equation to:
We know that:
As we can see, a simple block ciphering does not provide strong encryption. And because simple block ciphering is widely used in exploits, we can easily break those by decyphering known text or binary content in them. To put the theory in practice, let's take a look at this simple decryption loop taken out of shell code used in an SWF exploit:
(MD5 of the sample: 32398CBF94CA9B42E0B415BB93B00CFE)
As we can see, the code uses byte size blocks and a simple XOR ciphering with a constant 0x3D. Inside the code we can also see a pattern starts with some 0x3D following by a text "UIIM":
We might suspect that is an encrypted URL starting with "http://". Now that we know the algorithm and the encryption key, it is easy to double check if our suspicion is correct. The question is how do we find this string without knowing the key?
Do you remember the differential attack? All we need to do is to take a known text, which is "http://", and create a stream of differences:
Similarly, we create a difference on the encrypted stream:
And then, if we can find ΔM in ΔC, then we have what we are looking for. Obviously, the longer the known text, the less prone it is to falsely detecting the string.
The next step is to determine the key, and decipher the entire URL, which is very simple by just doing an XOR on the first detected block and the first block of the known text:
Knowing all of this, we can now write a simple analysis tool that can find and extract 'interesting text' from a binary file. As an example, here is the output of a small Perl hack I wrote earlier (see the script attached below):
The good thing about this technique is that we have to generate the differential set from the known text set only once in the lifetime of the application. Also, we need to generate the differences of the scanned shell code only once to check all the known text from our dictionary. Our dictionary therefore can be huge, including not only known and unknown URL patterns, but binary sequences that can identify each type of shell code we already know.
Although in reality we see mostly simple block ciphers in exploits, there are many examples in viruses and trojans of much more sophisticated encryptions. These use a variety of block and stream ciphers with different length encryption keys, even applying more than one algorithm on top of each other to harden the encryption. Breaking such ciphertext requires more complex method, however, due to constantly increasing computing power, it is even possible to attack the DES algorithm. The good news is that even when stronger encryption has been applied, we have better techniques to detect malicious content.