Malicious PDF Challenges
Cyber criminals constantly develop new tricks to slip their malicious creations past security scanners. In this blog we review one of them: a challenging PDF obfuscation that we recently saw in a mass injection.
Our ThreatSeeker Network detected several hundred Web sites injected with an iframe whose payload Web site was located in Ukraine. When a user visits a legitimate Web site that has been compromised by this threat, the browser automatically follows the injected iframe to a malicious site and downloads the PDF file. The PDF then exploits several vulnerabilities in the PDF viewer and silently infects the computer, after which further malware is downloaded to the machine. Although Websense customers were already protected from this threat by our existing real-time analytics, we found it interesting to analyze the malicious PDF file in detail.
The first thing we can see in the PDF file is that it has several small JavaScript streams inside:
In fact these scripts belong together, as the main JavaScript object suggests:
/Names [(42a0e) 12 0 R (140) 14 0 R (57) 16 0 R]
It is easy to see here that the array references objects 12, 14 and 16. That tells the PDF reader that the JavaScript has been split across three objects, so we need to concatenate them to see the entire script at once.
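As a rough illustration of that concatenation step, the Python sketch below pulls the object numbers out of the name array and joins the matching script fragments. The `streams` dictionary and its placeholder contents are hypothetical; in a real analysis the fragments would be the already-decompressed JavaScript taken from objects 12, 14 and 16. Joining them in the order they are listed is also an assumption.

```python
import re

# The name array as it appears in the sample.
NAMES_ENTRY = "/Names [(42a0e) 12 0 R (140) 14 0 R (57) 16 0 R]"


def referenced_objects(names_entry):
    """Return the object numbers referenced in a /Names array, in listed order."""
    # Each pair has the form "(name) <object number> <generation> R".
    pattern = r"\([^)]*\)\s+(\d+)\s+\d+\s+R"
    return [int(num) for num in re.findall(pattern, names_entry)]


def join_script(names_entry, streams):
    """Concatenate the script fragments of the referenced objects."""
    return "".join(streams[num] for num in referenced_objects(names_entry))


# Hypothetical placeholder fragments, not the real sample content.
streams = {12: "var a = 1;", 14: "var b = 2;", 16: "var c = a + b;"}
script = join_script(NAMES_ENTRY, streams)
```

The result in `script` is a single string that can then be beautified and read as one program.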
After this, all we need to do is tweak the code a bit so that a human can analyze it. We made only a few easy modifications, starting with indenting (also known as beautifying) the code. That simple step made the code generally readable, but it was still hard to understand how it worked, partly because the script was padded with junk code fragments that were irrelevant to the algorithm and were there only to make the code difficult to read. After we removed these, the code shrank to half its size and was much easier to understand. Only one small step was left: to analyze each of the procedures and rename the obfuscated variable and function names to meaningful ones. It took only a few minutes, and the result was this:
At this point we can see two API calls, getPageNumWords and getPageNthWord. The first returns the number of words on a PDF page, while the second retrieves a given word from the page. Clearly, then, this JavaScript code operates on the words in the PDF document, but why would it do that?
To understand this better we need to examine another PDF object stream, the page content:
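To follow such a script outside the PDF viewer, an analyst can emulate the two calls. The Python sketch below is one simplified way to do so: the `FakeDoc` class is a hypothetical helper, it splits words on whitespace only, and it ignores the optional strip argument of the real getPageNthWord, so it should not be read as a faithful copy of the viewer's behavior.

```python
class FakeDoc:
    """Hypothetical stand-in for the viewer's document object."""

    def __init__(self, pages):
        # pages: one text string per page; words are split on whitespace only,
        # which is a simplification of how a real viewer finds words.
        self._words = [text.split() for text in pages]

    def getPageNumWords(self, page=0):
        """Return the number of words on the given page."""
        return len(self._words[page])

    def getPageNthWord(self, page=0, word=0):
        """Return the word at the given index on the given page."""
        return self._words[page][word]


# Made-up page text consisting of short hexadecimal words.
doc = FakeDoc(["14b f248 c49"])
words = [doc.getPageNthWord(0, i) for i in range(doc.getPageNumWords(0))]
```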
It might look a bit cryptic at first glance, but for simplicity it is enough to know that the stream first passes some rendering information to the PDF viewer; the text content itself follows the 'Td' operator, inside the brackets. We are interested only in the text, not in how it would be displayed. As we can see, it contains a large number of short words, each of which is a three- or four-digit hexadecimal number. The algorithm shown above uses only the last two digits of each number, so those are all we need for a successful de-obfuscation. It is also easy to spot in the decoder that a lightweight decryption routine is used to recover the clear content of the encrypted stream.
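The decoding idea can be sketched in Python as follows. Several things here are assumptions made for illustration only: the sample content stream is invented, the text is assumed to be a literal string in parentheses shown with the Tj operator, and the single-byte XOR with a made-up key merely stands in for the sample's actual decryption routine.

```python
import re


def page_words(content_stream):
    """Collect the words from every literal string shown with Tj."""
    words = []
    for text in re.findall(r"\(([^()]*)\)\s*Tj", content_stream):
        words.extend(text.split())
    return words


def low_bytes(words):
    """Keep only the last two hex digits of each word, one byte per word."""
    return bytes(int(word[-2:], 16) for word in words)


def xor_decrypt(data, key):
    """ASSUMPTION: single-byte XOR as a placeholder for the real routine."""
    return bytes(value ^ key for value in data)


# Invented content stream and key, not taken from the real sample.
SAMPLE_STREAM = "BT /F1 8 Tf 10 700 Td (14b f248 c49) Tj ET"
KEY = 0x2A

encoded = low_bytes(page_words(SAMPLE_STREAM))
decoded = xor_decrypt(encoded, KEY)
```

With the invented data above, the three words reduce to the bytes 4b, 48 and 49, which the placeholder XOR turns back into readable text. For the real sample, the decryption function would have to be replaced with the routine recovered from the cleaned-up script.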
Now that we know enough, let's decode it! We ended up revealing this code:
At this point the malicious content is obvious. Our PDF sample attempts to exploit several different vulnerabilities, with shellcode sprayed all over the heap.
At the time of writing, only four antivirus products detected the malicious PDF, according to VirusTotal:
The shellcode downloads a trojan which also had a poor detection rate:
As this example shows, obfuscation techniques are always changing and the bad guys are always challenging our engines. Unlike with traditional obfuscation methods, it is not enough here to load a single JavaScript stream from the PDF file: you need all three of them, and even the text content has to be processed for a full de-obfuscation. That is why we at Websense Security Labs constantly monitor the latest tricks and improve our technologies accordingly to provide the best protection to our customers.