April 16, 2010

Multi-layer Obfuscated JavaScript Using Twitter API

Tamas Rudnai

Nowadays, infected Web pages are probably the biggest threat to the IT sector. Most compromised HTML documents contain JavaScript that generates the malicious content dynamically to make it less obvious what it is doing. To avoid detection, attackers are using more and more complex obfuscation techniques. In this blog we will analyze a sample with 5 different obfuscation layers that uses a few tricks to fool automated de-obfuscation engines.

Our sample today is a 6KB obfuscated JavaScript that ultimately resolves into a single iframe pointing to a malicious site. The threat uses a mixture of codebook, XOR, and substitution ciphering, as well as the traditional character-representation tricks, to hide the malicious content. Some of these techniques have already been discussed in this blog.

To decrypt it, we need to tweak the code a little so that the evil script reveals its true nature instead of silently executing the payload. As you can see, the injected code looks strange, but other than that it does not tell us whether the code is malicious or not:


What you can see here is not much, except that the script is clearly obfuscated. For a security expert this kind of code is always highly suspicious, as it reveals that the author wanted to hide something for a reason. Indenting the code properly, however, shows something more to the human eye: the code can then be divided into two parts. The first part contains only a very short function and the definition of a variable:


Note that we truncated the value of the variable, as it was too long and is not needed to understand the algorithm. The second part is another function and a call to the first function mentioned above:


As you can see, it calls function t(), which is only a wrapper around function z(), most probably used as a light anti-de-obfuscation technique. Therefore we need to analyze only the second function. It is very easy to spot that it uses simple substitution ciphering, this time only for the letter 'Z'. It also uses character-representation encoding, for which decoding requires only the unescape() function. At the end of the script you can see the eval() call. This needs to be replaced with a print() in order to display the de-obfuscated code instead of executing it:
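The decoding scheme just described can be sketched as follows. This is a minimal reconstruction of the technique, not the sample's actual code; the function name and payload below are invented:

```javascript
// Sketch of the first layer's scheme: 'Z' is substituted for '%', then
// unescape() resolves the resulting %XX character codes.
function decodeLayer1(payload) {
  var escaped = payload.replace(/Z/g, "%"); // substitution step: Z -> %
  return unescape(escaped);                 // character-representation step
}

// The malware would run eval(decodeLayer1(data));
// for analysis we print the result instead of executing it.
var data = "Z61Z6CZ65Z72Z74Z28Z31Z29Z3B";   // invented payload
console.log(decodeLayer1(data));            // prints: alert(1);
```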


The result is quite different from the first layer; however, it still looks cryptic. Many malicious JavaScripts today use multi-level obfuscation. As described in an earlier blog, we have to decrypt such code layer by layer. Each layer shows new details of the code: some of them are valuable during the analysis, some of them are not.

Have you noticed the clear URLs at the bottom? There is definitely something there to investigate. Again, we can use indentation tools (or, to use their more fashionable name, code beautifiers) to see what is behind the scenes:


We can see the Twitter links, which are of course clean. We could easily come to the wrong conclusion that the script contains only clean URLs and is therefore not malicious. However, as the bigger part of the script is still cryptic, we have to suspect that there is much more behind the scenes.

To understand why these clean links are there, we should study the Twitter API set a little. We will not dive too deep into this subject in this blog. It is enough to know for now that we can include JavaScript APIs from Twitter to gather information dynamically from the popular microblogging site. Some of these APIs accept the name of a callback function, which is then invoked by the API. This is a great trick against even the more sophisticated de-obfuscation engines, and also against black-box analysis machines not connected to the Internet: no Internet, no Twitter, and therefore no callback.

The sample on our plate uses the 'trends.daily' API, which returns the daily trending topics on Twitter. We can easily see the 'callback=' parameter, which tells us the name of the callback function. Later on, that function is called automatically with the list of trending topics as a parameter. We can also see that this list is in JSON format, which JavaScript can process very easily.
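The JSONP mechanism the sample leans on can be illustrated like this. This is a hedged sketch: the callback name and trend data below are invented, and the 'trends.daily' response format is only approximated:

```javascript
// The page includes something along the lines of:
//   <script src="http://api.twitter.com/1/trends/daily.json?callback=cb"></script>
// Twitter's JSONP response wraps the JSON data in a call to cb(...).
// Here we simulate that wrapping locally with invented trend data.
function cb(data) {
  var day = Object.keys(data.trends)[0];     // first (only) date key
  return data.trends[day].map(function (t) { // collect the trend titles,
    return t.name;                           // the script's dynamic "codebook"
  });
}

// What effectively executes when the remote <script> loads:
var trendNames = cb({
  trends: { "2010-04-16": [{ name: "#example" }, { name: "#sample" }] }
});
```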

If we look at that code snippet even more closely, we can see that it uses two of these callbacks. The first one only determines the date, and it also fires up the second callback function, which uses the eval() function in the middle. Note also the following document.write() call, which we will discuss later.

To decipher this layer, we need to reach the stage where that eval() function is called with the proper parameters. For this we either need to emulate the browser environment completely, including the Twitter APIs, or modify the code a little to get past those callback tricks. For the sake of simplicity, I just commented out the parts of the code that are not needed for now, leaving only the eval() call (replacing it with a print(), of course). Let's see what we have:
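An alternative way to capture what eval() receives, without commenting code out by hand, is to stub eval out in the analysis harness. This is a sketch under Node.js, not part of the sample; the payload string is invented:

```javascript
// Replace the global eval with a logger: any layer the script tries to
// execute is captured and printed instead of run.
var captured = [];
globalThis.eval = function (code) {
  captured.push(String(code));   // record the de-obfuscated layer
  return undefined;              // ...and do NOT execute it
};

eval("alert('payload')");        // the sample's call now only logs
console.log(captured[0]);
```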


At this stage we are a bit closer to the finish, but the fun part only starts here. This bit uses global variables defined in the upper layers, so we need those as well. Once we copy over the variable set, we realize that the function named 'cz' interferes with the variable of the same name. Looking back quickly, we can see that the eval() that generates this code snippet is embedded in the callback2() function. This means it runs in a function context, not a global one. That is why the script can override the original definition of 'cz' and redefine it as a function without any problem. This trick, however, makes it harder to emulate the code.
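The 'cz' collision works because a function declaration inside callback2() shadows the global variable of the same name. A minimal illustration, with invented bodies and values:

```javascript
var cz = "global codebook string";        // an upper layer's variable

function callback2() {
  // Function declarations are hoisted within their enclosing function,
  // so inside callback2 the name 'cz' refers to this local function
  // instead of the global string. Outside, the global is untouched.
  function cz(x) { return x.toUpperCase(); }
  return cz("hidden");
}

var inner = callback2();   // uses the local function: "HIDDEN"
var outer = cz;            // still the global string
```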

In addition, it plays with the variable names inside the dw() function to fool simple de-obfuscation engines that rely only on context-free grammars. Furthermore, it keeps reusing the variable $a in a nested eval() chain, just for total confusion. Once we solve this level, we get yet another one that looks very similar to the previous one; however, there are differences:


There is nothing special about this one: it looks like just another layer, except that it also uses an XOR algorithm to increase the level of encryption. Solving this part is very similar to the previous one, so we easily get to the final level, which looks something like this:
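A repeating-key XOR layer of the kind described works roughly like this (a sketch only; the real sample's key derivation and payload are different):

```javascript
// XOR each byte with a repeating key; since XOR is its own inverse,
// the same routine both encrypts and decrypts.
function xorLayer(bytes, key) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] ^ key.charCodeAt(i % key.length));
  }
  return out;
}

// Round-trip demonstration with an invented key and plaintext
var key = "k3y";
var cipher = [];
for (var j = 0; j < "eval".length; j++) {
  cipher.push("eval".charCodeAt(j) ^ key.charCodeAt(j % key.length));
}
var plain = xorLayer(cipher, key);   // back to "eval"
```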


Now this part could be a bit confusing, as there is a lot happening here. The good news is that we can already see some valuable information: a decrypted domain name. However, we can also see a dynamic URL generation algorithm at the end. To decrypt it fully, we have to tweak the code once more to make it run.

Basically, we have to remove all the browser-specific code snippets, replacing their return values and variable initializations with static data. This way the script can decipher itself in a non-browser environment. Please note that the function returns different URLs depending on the computed values of the variables. Therefore, to get the entire list of URLs, we would need to determine and feed in all the possible combinations of those values.

However, for this we would need to fully understand the code. The basic idea is that it uses the date, as well as the character codes and lengths of the Twitter trend titles, to generate the URL. The sample can create two URLs each day, giving 730 combinations over a whole year. The algorithm is a true mixture of Caesar and codebook ciphering: it uses the live Twitter data as a codebook, then calculates a shift value from that data. The resulting encrypted text is then used as the domain name of the URL.
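To make the idea concrete, here is a toy version of such a Caesar-plus-codebook domain generator. This is emphatically not the sample's algorithm: every name, constant, and formula below is invented, and it only illustrates how trend titles and the date can be folded into a shift value:

```javascript
// Toy domain generator: fold the trend titles (codebook) and the date
// into a Caesar shift, then rotate a lowercase seed string by it.
function makeDomain(trendTitles, date, seed) {
  var shift = 0;
  for (var i = 0; i < trendTitles.length; i++) {
    // invented folding: title length plus its first character code
    shift += trendTitles[i].length + trendTitles[i].charCodeAt(0);
  }
  shift = (shift + date.getDate()) % 26;      // date-dependent shift

  var out = "";
  for (var j = 0; j < seed.length; j++) {
    var c = seed.charCodeAt(j) - 97;          // 'a' is char code 97
    out += String.fromCharCode(((c + shift) % 26) + 97);
  }
  return out + ".com";
}

// Different trend data or a different day yields a different domain
var domain = makeDomain(["#abc"], new Date(2010, 3, 16), "abc");
```

Because the codebook is live Twitter data, the output changes from day to day, which is exactly what makes the real sample's URLs hard to predict.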

Finally, you may notice that there is no eval(), document.write(), or any other method in this piece of code that writes the data back to the document or otherwise executes the decrypted code. Remember that this code is still running inside the callback2() function? The document.write($a) right after the eval() is what makes it happen.

After all this, we get the result: a hidden iframe pointing to a possible malicious site:


Wait a minute, why did we just say "possible"? Can't we tell for sure? Of course we can, once the URL has been correctly generated. However, as the original algorithm generates these URLs from live Twitter data, the URL list cannot be guessed in advance, not even by the author of this threat. We can, however, mine the existing trends and calculate all the URLs that were used in the past. Another possibility is to predict the URLs by trying all possible combinations of the shift value, which gives us a huge list of URLs to block.

This sample is a clear example of how cyber criminals are trying increasingly advanced tricks to fool automated analysis. Using multiple levels of obfuscation and live data from the Internet is problematic for traditional static detection algorithms and requires more advanced methods.

About Forcepoint

Forcepoint is the leading user and data protection cybersecurity company, entrusted to safeguard organizations while driving digital transformation and growth. Our solutions adapt in real-time to how people interact with data, providing secure access while enabling employees to create value.