Malware Analysis & Forensics: Analyze Malicious Documents



This is my growing collection of analysis approaches to deal with a potentially malicious office document.

You can often find such documents in Phishing mails, or behind malicious web links. People, you know … normal users, are more likely to open a document than an executable. The thinking behind that is naive, but understandable.

If you deal with Malware incidents, and the infection chain contains malicious documents, you may also want to check out my other articles to get a better overall understanding. - One inch deep: there is much more you need to know. But knowledge grows in practice.

Please research the validity of this information independently. If you spot an error, please let me know. Most of this material is from my personal recollection, experience at work, also with clients, and years of background research (the link goes to my old Evernote notebook)… and last but not least ongoing interest. The latter is the most important thing, actually.

Personally I do not provide any professional Incident Response support or any training outside of my current employment for contractual reasons. Do not send me your samples. :slight_smile: Unless you are an attacker.

I maintain these wiki articles as a personal grid for growth. Not as a professional or relevant reference.

Malicious document files: Microsoft Office, Adobe Reader PDFs...


A malicious document is a kind of Malware.

It comes in a form of a manipulated file. Like a PDF, DOC, DOCX, XLS… You might be inclined to classify these as harmless, as just reading material, because they are not EXE files. However they can EXEcute code, and that is the problem we have here. It’s not just reading material.

  • Microsoft Office has Visual Basic for Applications (VBA), a runtime environment to execute code (Macros).
  • PDFs, for Adobe Reader for example, can execute JavaScript or Flash. That means these are executable files, actually.

On top of that, bugs in this office software can be abused to run Shellcode. That means there can be exploits in malicious documents. In that case usually no interaction is necessary besides opening the document, and the system gets compromised.

And where there are documents, there is confidential information. We all know that. Malicious documents do not need vertical privilege escalation to steal information and data.

Microsoft Office - ILOVEYOU Melissa click click whoops

For security reasons, please put all your metallic items on the tray and click this button:


In a similar way the computer viruses Melissa and ILOVEYOU spread.

OfficeMalScanner - for static analysis

For static analysis of Microsoft Office files you can use OfficeMalScanner by Frank Boldewin.

OfficeMalScanner v0.5 is a Ms Office forensic tool to scan for malicious traces, like shellcode heuristics, PE-files or embedded OLE streams.

The info command can dump out the VBA macro so that you can read it in VScode, or something alike. MS Office doesn’t allow you to read Macros without allowing the malicious VBA code to execute.

For docm and other newer XML-based Microsoft Office formats, you can run inflate first to decompress the file. You will find the files in %TEMP%\DecompressedMSOfficeDocument . There will be a .bin file.
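The newer XML-based formats are ZIP containers, so the .bin part can also be located without any special tool. A minimal Python sketch, assuming the document is already loaded as bytes; the function name list_ole_parts is made up for this illustration and is not part of OfficeMalScanner:

```python
import io
import zipfile


def list_ole_parts(document_bytes: bytes) -> list:
    """List the .bin members (for Word typically word/vbaProject.bin) of an OOXML file."""
    with zipfile.ZipFile(io.BytesIO(document_bytes)) as archive:
        # Only the names are read here; nothing is extracted or executed.
        return [name for name in archive.namelist() if name.lower().endswith(".bin")]
```

The listed member is itself an OLE file, so the OLE tools further down apply to it after extraction.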

Read VBA code for Malware Analysis

In the VBA code check for Sub AutoOpen(), which will automatically be executed by MS Word when the document is opened. For Excel it’s Workbook_Open().

You will often find that these scripts download exe files. That means it is likely that there will be a URL string, either directly readable, or it will be computed. The document will attempt to run the .exe in a Shell or Scripting Host style environment. That really depends.
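For a first pass over dumped VBA code, the checks above can be sketched in a few lines of Python. This is only a rough triage helper under my own assumptions: the keyword list is illustrative and far from complete, and computed (obfuscated) URLs will not be found this way.

```python
import re

# Entry points that Office runs automatically (the two named in the text above).
AUTO_EXEC = ("AutoOpen", "Workbook_Open")
# Keywords that often accompany download-and-run behaviour (illustrative only).
SUSPICIOUS = ("Shell", "WScript.Shell", "CreateObject", "URLDownloadToFile")
URL_RE = re.compile(r"https?://[^\s\"']+", re.IGNORECASE)


def triage_vba(source: str) -> dict:
    """Return auto-exec entry points, suspicious keywords and plain URLs in VBA text."""
    lowered = source.lower()  # VBA is case-insensitive
    return {
        "auto_exec": [name for name in AUTO_EXEC if name.lower() in lowered],
        "suspicious": [kw for kw in SUSPICIOUS if kw.lower() in lowered],
        "urls": URL_RE.findall(source),
    }
```

An empty result does not mean the macro is clean. It only means the strings are not directly readable.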

There are also Droppers. They will use the document to compose the malicious Drop from the XML or OLE stream. One good indicator of a Dropper is, that the malicious document has a lot of blank pages (unreadable content, transparent or white). This unreadable content is used to puzzle the Drop together from the text-encoded data. You will see something like:

For Each Foo In ActiveDocument.Paragraphs

So the loop decodes the document paragraphs into an executable. That will be saved somewhere, usually AppData. Malware authors often throw their Drops there, because this path is always write-able and the user doesn’t check it.
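To check whether paragraph text hides an executable, you can try to decode it yourself. The sketch below is an assumption-laden example: real Droppers use whatever encoding the author liked, and hex and Base64 are only the two guesses tried here. The function name is hypothetical.

```python
import base64
import binascii


def try_decode_paragraphs(paragraphs):
    """Join paragraph text, try hex and Base64, return bytes that start with 'MZ'."""
    blob = "".join(p.strip() for p in paragraphs)
    candidates = []
    try:
        candidates.append(bytes.fromhex(blob))
    except ValueError:
        pass  # not valid hex
    try:
        candidates.append(base64.b64decode(blob, validate=True))
    except (binascii.Error, ValueError):
        pass  # not valid Base64
    for data in candidates:
        if data[:2] == b"MZ":  # DOS/PE header magic
            return data
    return None
```

If this returns data, write it to disk in the lab only and treat it like any other malware sample.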

python-oletools - Python tools for malicious document analysis

python-oletools is a package of python tools to analyze Microsoft OLE2 files (also called Structured Storage, Compound File Binary Format or Compound Document File Format), such as Microsoft Office documents or Outlook messages, mainly for malware analysis, forensics and debugging. It is based on my olefile parser.

You can get the tools here, on Linux / REMnux. They are written by Philippe Lagadec, afaik.

They can get invoked like this:

python2 foo.docm | vim -

You will get a nice Analysis summary opened up in your favorite text editor. And you don’t need to decompress the Microsoft Office file. If the syntax file is installed, you can set Vim’s syntax highlighting with :set syntax=vbnet.

This script will also flag IOCs and suspicious uses of VB Macros here. That is useful for a quick initial report.

Dissect OLE .bin files with SSView

Not every malicious task needs to be implemented as a Macro.

In order to discover tricks, which make use of OLE, you can use SSView.

This tool allows to completely manage any MS OLE Structured Storage based file. You can save and load streams, add, delete, rename and edit items and property sets.
Embedded streams can be viewed as hexadecimal listing, text, bitmap, icon or RTF.

OLE is like a file-system actually.

You can find Macro cache artifacts in the “SRP” streams. It’s possible to find Macro drafts from the Malware author dumped into the OLE SRP. Most attackers do not know that.

Open the decompressed XML with a real XML editor

Instead of scrolling through a junk-load of XML, use an XML editor. XML Notepad for example is free and does the trick.

In order to be able to open binary OLE file like DOC, XLS etc. you can use Office Binary Translator or the Microsoft conversion utilities. If you want, you can also open the documents in a lab VM, and save them using the XML format. That however might strip some of the information.

Exploits in malicious MSO documents : direct compromise through double-click on an office document. Double-click whoops

Sophisticated attackers will have exploits, and do not need to rely on a user to enable Macros. These can also rely on OLE (the .doc, .xls, .ppt format without the x at the end).

Dissect a malicious OLE document with OffVis

This is a free tool from Microsoft Security Research & Defense.

We’ve gotten questions from security researchers and malware protection vendors about the binary file format used by Microsoft Word, PowerPoint, and Excel. The format specification is open and we have spoken at several conferences (1, 2, 3) about detecting malicious docs but we wanted to do more to help defenders. So earlier this year we started working on an Office Visualization Tool called “OffVis”.

OffVis website at Microsoft.

It has got an anomaly detection for the OLE fields, and signatures for known patched old vulns. MS could do a better job in supporting it though.

Scan a malicious OLE document with OfficeMalScanner

OfficeMalScanner also has a scan command.

Personally for me this is too much automation for this particular analysis. Sometimes OfficeMalScanner does not work. You can try scan brute to brute-test for some deobfuscation techniques.

Now to be clear: if there is Shellcode in an Office document, the chances that Sally from Finance has written this document (for legit purposes) are very slim… But finding out what exactly an attacker intended to do might not be that easy to determine.

Shellcode characteristics in exploitation of malicious MS Office documents

OfficeMalScanner has a utility called MalHost-Setup. You need to find the beginning of the Shellcode, which usually is something like POP POP or PUSH. You need to poke around to find this, and use your experience.

  • A good way to search for the beginning of the Shellcode is to use the scan results, and / or…
  • FileInsight. You’d look for a start point, which doesn’t result in junk assembly instructions with this tool. FileInsight has got an OLE parser, but it never works for me.
  • If this doesn’t yield results, try xorsearch from Didier Stevens

Let’s say we think the start offset is 0x42424:

MalHost-Setup LookAtMe.ppt code.exe 0x42424 wait

The optional wait adds an infinite loop before the extracted Shellcode. You can instrument the code.exe with a Debugger, like OllyDBG or better IDA Pro. The debugger will be stuck in the loop. Then you can pause, and read the Shellcode. Afterwards you’d binary-patch the loop and single-step through the code.

Before you enter the shellcode, open up some of the Behavioral Analysis tools. This for extra visibility, because some Shellcode is nasty.
If you see calls to GetFileSize, that usually is a self-check from a Dropper or a Downloader. It may mean that you are on the right track. But unless you run this in a lab with MS Word, the needed Handle might not be there.

jmp2it can fake the Handle

There is a utility by Adam Kramer to replace MalHost-Setup for scenarios, where you have to deal with Shellcode in malicious documents, which needs Handles. The tool is called jmp2it and it has got the parameter addhandle. You’d instrument jmp2it with IDA Pro or BinNavi…

An example invocation is:

jmp2it SV_funny_birds__1.bin 0x42424 addhandle funny_birds.doc

0x42424 is the start offset of the Shellcode. funny_birds.doc is the Handle we need. The .bin gets extracted by OfficeMalScanner. Just make sure your lab is set up properly.

The RTF format is not secure per se - it's a modern attack vector for MS Office. RTFscan can help analysts.

RTF files can be embedded in .doc files. OfficeMalScanner can detect an embedded RTF. Then you can continue with RTFscan.

Here is a sample invocation:

RTFscan funny_birds.doc scan
xorsearch -w -d 3 SV_funny_birds__1.bin
brutexor SV_funny_birds__1.bin > /tmp/out.txt
MalHost-Setup SV_funny_birds__1.bin code.exe 0x42424 wait

This assumes that RTFscan dumps out a SV_funny_birds...bin file, that you want to xorsearch through. The workflow is very similar. You should get to a GetEIP method. This should also reveal the XOR key. If you find multiple GetEIP methods, it’s possible that you are dealing with multi-staged Shellcode.

xorsearch is usually better at searching through the objects than OfficeMalScanner and you can add custom signatures.
brutexor can be found here. It can brute-force all possible 1-byte XOR key values and examine the file for strings that might have been encoded with these keys.
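The idea behind a 1-byte XOR brute-force is simple enough to sketch in Python. This is not how xorsearch or brutexor are implemented, just an illustration of the principle; the default search string http is my own choice.

```python
def brute_xor(data: bytes, needle: bytes = b"http"):
    """Return (key, offset) pairs for every 1-byte XOR key that reveals `needle`."""
    hits = []
    for key in range(1, 256):
        # XOR the short needle instead of the whole file: same result, less work.
        encoded_needle = bytes(b ^ key for b in needle)
        offset = data.find(encoded_needle)
        if offset != -1:
            hits.append((key, offset))
    return hits
```

A hit gives you the key candidate and where to start decoding. Multi-byte keys and other encodings (ROL, ADD) need more than this.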

Using Behavioral Analysis tools again is useful to observe the Malware sample in the lab.

Long story short: RTF can be abused as well. It’s not a safe file extension.

Other tools I had good experiences with

Didier Stevens also wrote a tool which can come in handy to extract OLE file streams. The 010 hex editor has got a template for OLE, and might be more useful. For XML as well.

For more advanced analysis there is Hachoir3, which has got an OLE parser.

Summary - Malicious MSO document analysis workflow

To analyze a Malicious MSO document you need to decide whether you deal with the binary OLE or the XML based standard. Then you need to dissect the format and look for interactive content, such as Macros. It’s also possible that you find exploits in OLE streams, or even manipulated image documents which try to trigger vulnerabilities. There are loads of options, because MSO documents are like containers for all kinds of objects.
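The first decision (binary OLE or XML based, or something else entirely) can be made from the first bytes of the file, regardless of the file extension. A small Python sketch with the well-known magic values; the labels and the function name are my own:

```python
MAGICS = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole"),  # binary .doc/.xls/.ppt
    (b"PK\x03\x04", "zip/ooxml"),                  # .docx/.docm/.xlsx are ZIP containers
    (b"{\\rtf", "rtf"),
    (b"%PDF-", "pdf"),
]


def sniff_container(data: bytes) -> str:
    """Guess the container type from the leading bytes of a file."""
    for magic, label in MAGICS:
        if data.startswith(magic):
            return label
    return "unknown"
```

Attackers like to rename files, for example an RTF with a .doc extension, so the magic is more trustworthy than the name. Some readers also accept junk before the magic, which this simple check ignores.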

Adobe Payload Delivery Format (PDF) - (click) whoops

I’d go so far as to say: PDFs are more dangerous than MSO files. For Macro-based Malware documents it’s “click click” - pwned. For Exploit-based Malware documents it’s “double-click” - pwned. For PDFs clicks might not be necessary to get pwned, because many readers enable interactive content by default.

PDF parsing is very complex to do. PDFs can be active documents, with JavaScript. On top of that you can embed objects, like ActionScript / Flash, into PDFs. File preview functions in Adobe iFilter have triggered vulnerabilities in the past, which led to mass compromises.

JS can be in benign PDFs, which contain forms. That isn’t unusual in certain bureaucracies, federal institutions, military etc. Even schools use this a lot. And that’s ok.

Attackers however can use JS in PDFs to prepare the Heap for an exploit (Heap Spraying). They allocate very big arrays to lay out the Chunks, and with that they define the Heap memory layout.
In contrast to a Stack buffer overflow you don’t find return address overwrites. You will see that an attacker sprays many copies of the Shellcode all over the Heap, not only one. When the exploited program then jumps to a corrupted address, it is likely to land in one of the injected Heap Chunks, and the Shellcode gets executed. That is not 100% certain, but likely, because the sprayed copies fill most of the Heap.

Felipe Andres Manzano has blogged about this some years ago, when it was fancy. Peter has an excellent tutorial here. I wrote one, a long time ago… it’s fun.

What do you mean there is an ISO standard for PDFs?! I thought that's just text.

A PDF, just like an OLE or XML MSO document, is a collection of elements. In PDF you can find streams, but also strings, arrays… There is the ISO 32000 standard, which is a large comprehensive standard. I really wonder why no one implemented mobile apps as PDFs. :wink:

Sad thing is, that the only people who invest time into knowing what a PDF really is, are security professionals. Think about that.

(picture is from Yuuhei Ootsubo)

The structure can be summed up like this:

  • header: %PDF-1.7 or something
  • the PDF body contains the objects. Each one starts with X Y obj and ends with endobj. X is the object number, Y the generation (version) number.
  • there is our xref table, which has got offsets to the objects and their logical order
  • the PDF trailer lists the number of objects and the offset of the xref table. It ends with %%EOF
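The four parts can be spotted with a few regular expressions. The following Python sketch is deliberately naive: it scans the raw bytes and does not follow the xref table, so binary stream content can produce false matches. The function name pdf_outline is made up for this example.

```python
import re


def pdf_outline(data: bytes) -> dict:
    """Pull header version, object ids, last startxref offset and EOF marker from a PDF."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    objects = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)
    startxref = re.findall(rb"startxref\s+(\d+)", data)
    return {
        "version": header.group(1).decode() if header else None,
        # (object number, generation number) pairs in file order
        "objects": [(int(num), int(gen)) for num, gen in objects],
        # incremental updates append further trailers, so take the last one
        "startxref": int(startxref[-1]) if startxref else None,
        "has_eof": data.rstrip().endswith(b"%%EOF"),
    }
```

If the object count here differs a lot from what the trailer claims, or there are several %%EOF markers, the file has probably been updated incrementally and deserves a closer look.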

In a PDF you can compress encoded streams into the objects to save space, and the PDF reader has to use one or multiple filters for decompression. That means there can be encoded JS inside a stream. So our tools need:

  • to be able to decode and decompress streams out of dissected PDF objects
  • detect the type of the stream
  • support Flash, JS and HTML
  • highlight suspicious actions, like when a PDF connects to a website or wants to launch a local executable (which is supported)
  • highlight interesting keywords such as:
      • /AA, /OpenAction - for execution behavior
      • /JS, /JavaScript - for embedded JS
      • /RichMedia - for embedded Flash
      • /ObjStm - to wrap objects in streams
      • /Names, /AcroForm, /Action - launch scripts and actions
      • /GoTo - an anchor inside the document
      • /Launch - runs a program or opens a document
      • /URI - URL or file:// access
      • /SubmitForm and /GoToR - to send data to a URL
  • normalize different encodings from 7bit ASCII
  • detect and possibly de-obfuscate Shellcode
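Keyword counting and name normalization from this list can be sketched in Python as follows. It is a simplified illustration and not a replacement for a real PDF tool: it works on the raw bytes, so keywords inside compressed streams stay invisible, and the #xx un-escaping is applied to the whole file instead of names only.

```python
import re

KEYWORDS = ["/AA", "/OpenAction", "/JS", "/JavaScript", "/RichMedia", "/ObjStm",
            "/AcroForm", "/Launch", "/URI", "/SubmitForm", "/GoToR"]


def normalize_names(data: bytes) -> bytes:
    """Undo #xx hex escapes in PDF names, e.g. /J#61vaScript becomes /JavaScript."""
    return re.sub(rb"#([0-9A-Fa-f]{2})", lambda m: bytes([int(m.group(1), 16)]), data)


def count_keywords(data: bytes) -> dict:
    """Count how often each interesting keyword occurs in the (normalized) PDF bytes."""
    clean = normalize_names(data)
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead keeps /JS from matching a longer name such as /JSFoo.
        pattern = re.escape(keyword.encode()) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, clean))
    return counts
```

A PDF with /OpenAction plus /JS or /JavaScript is worth a deeper look. A document with neither is not automatically clean.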

Didier Stevens' tools: pdfid, pdf-parser

Didier’s tools go through the PDF and scan for the keywords. A sample invocation is:

python2 pdfid foo.pdf

That’ll give you a count for these keywords, and you can make estimations and perform a basic triage. With pdf-parser you can go one step deeper, if you think it’s worth it. This will for example find a /JS (--search /JS) or pass objects through a filter. You might want to add the Vim pipe. I use pdf.vim:

python2 pdf-parser --search /JS foo.pdf | vim -c 'set syntax=pdf' -
python2 pdf-parser --object 42 foo.pdf
python2 pdf-parser --object 42 --filter --raw foo.pdf > out.js
vim out.js

Here pdf-parser will search for /JS. We found our /JS in object 42, and we continue. Then we filter and decode that object 42. Then we open the JavaScript in vim.
I often see Unicode (%u) encoded Shellcode in JavaScript vars, which looks like it’s generated with msfvenom and default arguments. Few attackers write their own Shellcode. Or read the msfvenom documentation.

Shellcode characteristics in exploitation of malicious PDF documents

In order to dump out the Unicode object Shellcode, we can use Mozilla SpiderMonkey. Didier created a fork, which is very useful, because inherently a PDF reader, just like a browser, has some extra definitions.

js_didi -f /tmp/remnux/def.js -f out.js > js_output.txt

That should generate some traces at least, if that runtime is suitable. We could now print the Shellcode out with JS, or debug it. I have debugged such bad JS code in IntelliJ WebStorm, because it was a lot of code.

But there is an easier way from Lenny Zeltser which I prefer instead of JS hacking. In REMnux he adds a utility called unicode2hex-escaped. I remember that I spent hours on fiddling with this kind of stuff. You may not need to do that.

Save the long Shellcode var to a file. It’s usually a multi-line string-like variable with a lot of %u patterns.

unicode2hex-escaped < shellcode.txt > shellcode_dump.txt
head shellcode_dump.txt
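If you want to understand what such a conversion does, it can be sketched in Python. This is my own illustration of the principle and not the REMnux utility itself: every %uXXXX escape is one 16-bit value, and on x86 it ends up in memory in little-endian byte order.

```python
import re
import struct


def decode_percent_u(text: str) -> bytes:
    """Turn %uXXXX escapes into raw bytes (each 16-bit unit stored little-endian)."""
    units = re.findall(r"%u([0-9A-Fa-f]{4})", text)
    return b"".join(struct.pack("<H", int(unit, 16)) for unit in units)


def hex_escape(raw: bytes) -> str:
    """Render raw bytes as \\xNN text, which is easier to eyeball and grep."""
    return "".join("\\x%02x" % byte for byte in raw)
```

Anything in the variable that is not a %u escape is silently skipped here, so compare the lengths if the result looks too short.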

There should be something familiar in the dump, like some \x90 NOP sled. Now we generate an executable from the dumped Shellcode.

python2 -s shellcode_dump.txt
xorsearch shellcode_dump.exe http:
medusa shellcode_dump.exe

Medusa is an OpenSource disassembler I use sometimes for clean files like this. But IDA Pro has got Linux support as well.
Most attackers do not know better obfuscation techniques. I get results with xorsearch.

pdfwalker and pdfextract - walking on Origami

pdfwalker is a tool written by Sebastien Damaye. He writes many useful tools.

It’s a dissector like pdf-parser, but with a GUI. You’d jump to an object and dump the encoded stream. It has always worked for me, and I prefer this one for reports because I can make a screenshot :wink:

pdfextract is from the same toolkit. With pdfextract -j foo.pdf I can dump out JavaScript, and continue with my analysis to get the Shellcode next, if there is any. pdfextract is useful if the JavaScript is cut into pieces and spread across multiple objects. You might also find funny fonts. I think Malicious Font Analysis needs to be added to my guides. Among other things.

Shellcode emulation with libemu

libemu is an x86 emulation library by Angelo Dell’Aera.

There is a utility called sctest, which is part of it. This comes in handy if attackers use something like a polymorphic encoder to obfuscate their auto-generated Shellcode. I have very little patience for stuff like this.
On REMnux there also is unicode2raw, which you may need to use to prepare the dumped Shellcode. The result looks best in Vim with this plugin.

unicode2raw < shellcode_dump.txt > shellcode.raw
sctest -Svs 1000000000 < shellcode.raw | vim -c 'set syntax=asmx86' -

That’ll look very familiar. If your attackers write x86-64 shellcode, this won’t help.

Shellcode cannot just read EIP, so you’ll see something like:


The CALL inst will push EIP on the stack, and POP saves it into EAX here. Classic GetEIP. You could add this directly to your IDS rules, and use Yara to find it.
But of course you can write more sophisticated Shellcode.
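The simplest form of this GetEIP trick is a CALL to the very next instruction (bytes E8 00 00 00 00) directly followed by a POP into a register (bytes 58 to 5F). A Python sketch to search a dump for it; it only covers this one variant, and other GetEIP techniques (a CALL backwards, FPU based tricks) will be missed:

```python
def find_call_pop(data: bytes):
    """Return offsets of CALL $+5 (E8 00 00 00 00) directly followed by POP r32 (58..5F)."""
    call_next = b"\xe8\x00\x00\x00\x00"
    hits = []
    start = 0
    while True:
        pos = data.find(call_next, start)
        if pos == -1:
            break
        after = pos + len(call_next)
        # 0x58 is POP EAX, 0x59 POP ECX, ... 0x5F POP EDI
        if after < len(data) and 0x58 <= data[after] <= 0x5F:
            hits.append(pos)
        start = pos + 1
    return hits
```

The same byte sequence is what you would put into a Yara or IDS rule. Expect false positives in large binaries, because position-independent legit code uses the trick too.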

There is also the EggHunter pattern where the Shellcode is split. One part is the Hunter. This gets started first. The second part is the Egg. The Hunter looks for the Egg through memory.
So you will see a Hunter, doing a loop in assembly. There will be a CMP in the loop for the byte sequence, which marks the Egg.
That’s good for an analyst, because I can just step into the loop. There is no dangerous code in there. I can see what CMP needs to succeed. And then look for that pattern in memory myself.
But even today most automated detection routines only try to detect the loop. That’s because most people don’t know much about Shellcode, and they only look for NOPsleds.
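Once you know the byte sequence the CMP checks for, looking for the Egg in a memory dump is a plain search. The Python sketch below assumes the common convention that the tag is placed twice in front of the Egg, so that the Hunter does not find its own copy of the tag. Your sample may do it differently; the tag value in the usage example is only a placeholder.

```python
def find_egg(memory: bytes, tag: bytes) -> int:
    """Return the offset of the payload that follows the doubled tag, or -1."""
    marker = tag * 2  # assumption: the Egg is prefixed by the tag, repeated twice
    pos = memory.find(marker)
    return -1 if pos == -1 else pos + len(marker)
```

For example find_egg(dump, b"w00t") returns the position of the first payload byte in dump, and from there you can carve the Egg out for disassembly.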

As a side note: Shellcode doesn’t need to use a NOPsled technique with NOPs. It can use meaningless ADD insts, all kinds of fillers. Sophisticated Malware will avoid NOPs, because it doesn’t want to be caught by some generic Shellcode detection rule. If you find something like that, chances are good that it’s not just an Egg. :slight_smile:

Shellcode usually wants to access Windows API functions from kernel32.dll, which is loaded by most programs anyway. It will often use LoadLibraryA or GetProcAddress. With these it can load any DLL and resolve any function on Windows. In order to find these functions the Shellcode needs a pointer to kernel32.dll.

The location of this pointer is in the PEB, the Process Environment Block; among other things. The pointer to the PEB is at FS:[0x30]. How would you find the PEB?



PUSH 30h

Then you can parse the PEB to find kernel32.dll. FS:[0] is the beginning of the SEH chain. And you can ride the SEH chain in Shellcode as well.
You will also see that Shellcode doesn’t lookup the function names directly. Instead it uses hashes or lookup structures.
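One widespread hashing scheme rotates a 32-bit value right by 13 bits and adds the next character of the function name. Whether your sample uses exactly this scheme is an assumption you have to verify in the disassembly, since many variants exist. A Python sketch to map a hash constant back to a name:

```python
def ror32(value: int, bits: int) -> int:
    """Rotate a 32-bit value right by `bits`."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def ror13_hash(name: str) -> int:
    """ROR-13 additive hash over the ASCII function name (no terminating zero)."""
    result = 0
    for char in name.encode("ascii"):
        result = ror32(result, 13)
        result = (result + char) & 0xFFFFFFFF
    return result


def resolve(hash_value: int, export_names):
    """Return the export name whose hash equals `hash_value`, or None."""
    for name in export_names:
        if ror13_hash(name) == hash_value:
            return name
    return None
```

Feed resolve() with the export names of kernel32.dll (or whatever DLL the Shellcode walks) and the constants you see being pushed or compared in the code.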

In short (on 32-bit Windows): FS:[0x30] holds the pointer to the PEB. I can load that into ECX. [ECX+0x0C] is the pointer to PEB_LDR_DATA. At offset 0x1C within PEB_LDR_DATA is the InInitializationOrderModuleList. And somewhere within that list is kernel32.dll. That’s why you might see all these FSs in the code.

Other tools I had good experiences with

There are some hosted sandboxes, but I recommend none of them unless you know the authors and maintainers.


You can have Shellcode within an SWF. Which might have been in a rich-media PDF.

swfdump -Ddu foo.swf > /tmp/foo.txt
vim /tmp/foo.txt

Now foo.txt will most likely not contain text. Might be Shellcode, might be manipulated images, might be payload data… I like to use the 010 hex editor, or FileInsight. With FileInsight you can disassemble a selected section, past the cursor. That’s very useful, if you think that there is Shellcode.

Summary - Malicious PDF analysis workflow

To analyze a malicious PDF the workflow is: Extract and de-obfuscate JavaScript, get the Shellcode, dump the Shellcode into an executable and analyze the Shellcode.

As a side note: some Malware PDFs do not need JavaScript. But that is a special topic. 99% will use JS. If you find funky fonts being embedded for example, you need some other analysis techniques.

Summary : Malware Analysis & Forensics: Analyze Malicious Documents

Thanks to Didier Stevens, Lenny Zeltser, Frank Boldewin, Philippe Lagadec, Sebastien Damaye, Adam Kramer, Yuuhei Ootsubo and last but not least all other tool authors and contributors.

It makes a difference if we can analyze malware documents, or not. Doing this is not as easy as it could be. Even today with a lot of tools.

As analysts we need to know a ton of stuff, which is rarely taught in a structured or reasonably complete way. Time to catch up.

Version history

07.05.2017 - matured brain-dump a little, added brutexor, added Shellcode part, added swfdump
25.08.2017 - some editing, re-published it