MindshaRE: Statically Extracting Malware C2s Using Capstone Engine

October 14, 2014 Jason Jones

It’s been far too long since the last MindshaRE post, so I decided to share a technique I’ve been playing around with to pull C2 and other configuration information out of malware that does not store all of its configuration in a set structure or in the resource section (for a nice set of publicly available decoders, check out KevTheHermit’s RATDecoders repository on GitHub). Being able to statically extract this information becomes important when the malware does not run properly in your sandbox, the C2s are down, or you don’t have the time or sandbox bandwidth to manually run the sample and extract the information from network indicators.

Intro

To find C2 info, one could always just extract all hostname-, IP-, URI- and URL-like elements via regex string matching, but it’s entirely possible to end up with false positives, or in some cases with multiple hostname and URI combinations that get mismatched. In addition, there are known malware families that include benign or junk hostnames in their binaries that are never referenced, or are referenced only to make fake phone-homes. Locating the references and then disassembling around them (in my case with Capstone Engine) helps verify that you have found the correct information and avoid any junk inserted to throw your analysis off.
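
As a rough illustration of the naive approach described above, the sketch below pulls printable strings out of a buffer and classifies them with three regexes. The patterns and the `find_candidates` name are illustrative assumptions, not the ones used in the PoCs, and they are deliberately loose.

```python
import re

# Illustrative patterns only (assumptions); real ones would be stricter.
IP_REGEX = re.compile(rb"(?:\d{1,3}\.){3}\d{1,3}")
DOMAIN_REGEX = re.compile(rb"(?:[a-z0-9-]+\.)+[a-z]{2,6}", re.I)
URI_REGEX = re.compile(rb"/[\w./%-]+")

def find_candidates(data):
    """Return (offset, kind, string) for each printable ASCII run of
    4+ bytes that fully matches one of the patterns above."""
    hits = []
    for m in re.finditer(rb"[\x20-\x7e]{4,}", data):
        s = m.group()
        for kind, rx in (("ip", IP_REGEX),
                         ("domain", DOMAIN_REGEX),
                         ("uri", URI_REGEX)):
            if rx.fullmatch(s):
                hits.append((m.start(), kind, s.decode("ascii")))
                break
    return hits
```

A matcher this loose will happily report an import name such as `kernel32.dll` as a domain, which is the kind of false positive that checking references in the disassembly is meant to weed out.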

For those not familiar, Capstone Engine is a disassembler written by Nguyen Anh Quynh that was first released in 2013. The engine has seen a significant amount of development in that short amount of time and has a good track record of handling some tricky disassembly. Most importantly, it has bindings for most popular programming languages, including Python, my current programming language of choice. One complaint I have with using an on-the-fly disassembler is the lack of symbols, but that can be worked around by taking the list of imports and their addresses from pefile and then checking any memory references against it. All of the PoCs presented expect an image base of 0x400000, but for any production use the actual image base should be parsed out of the binary and used instead.
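
The image base mentioned above lives in the PE optional header. pefile exposes it as `pe.OPTIONAL_HEADER.ImageBase`; the sketch below reads the same field with only the standard library, which is handy for seeing where the value comes from. The `image_base` function name is an assumption for illustration and is not part of the PoCs.

```python
import struct

def image_base(data):
    """Read ImageBase from the optional header of a PE file in `data`."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at 0x3C holds the file offset of the "PE\0\0" signature
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    opt = e_lfanew + 4 + 20  # skip the signature and the 20-byte COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:       # PE32: 4-byte ImageBase at offset 28
        return struct.unpack_from("<I", data, opt + 28)[0]
    if magic == 0x20B:       # PE32+: 8-byte ImageBase at offset 24
        return struct.unpack_from("<Q", data, opt + 24)[0]
    raise ValueError("unknown optional header magic")
```

The returned value would replace the hard-coded 0x400000 wherever the PoCs convert between addresses and file offsets.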

Example: Backoff PoS Malware

Backoff is a recently discovered PoS malware family. I noticed that in many sandbox runs the malware would not communicate with a C2, or the C2 was down, even though I could see the C2 info in plain text in the binary.

Backoff C2 Plain-Text

In an attempt to “correctly” locate the C2 information and utilize some Capstone-fu, I crafted a function that first locates hostname- or IP-like strings in the binary, looks for a “mov [register+offset]/<addr>, addr” pattern that references the string, and then uses Capstone to disassemble from there and obtain the other configuration elements.

Backoff ASM Code to load C2

Disassembling ends up being useful, since the argument order is not necessarily the same from sample to sample. This doesn’t work for all versions, but it does work for most; I have encountered a number that use a Visual Basic injector or store the config in an array structure, and the code below will not work for those. It can be coupled with another piece of code that searches for version-like strings and then disassembles to find the campaign name attached to the binary.

After the loop, the code should check a) that host, port and URI are all defined and b) that the number of mov instructions encountered before the call was 3. The number of movs matters because my code starts at the hostname and the arguments are not always encountered in the same order. If fewer than 3 movs were seen, I jump back the appropriate number of movs via regex search and walk the disassembly again to see whether I encounter the expected configuration data. This also helps find the backup domains and URLs embedded in the malware, which may not be seen during a sandbox run even when communication with the C2 succeeds. The code is quick and dirty and can easily be improved by validating some of the common instructions seen in between, but it is presented as-is for this example:


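    from capstone import Cs, CS_ARCH_X86, CS_MODE_32
    from capstone.x86 import X86_OP_IMM

    md = Cs(CS_ARCH_X86, CS_MODE_32)
    md.detail = True  # needed to access insn.operands
    movs = 0
    host = None
    uri = None
    port = None
    # code holds the bytes starting at the mov that references the hostname
    for insn in md.disasm(code, 0x1000):
        if insn.mnemonic == 'mov':
            movs += 1
            if insn.operands[1].type == X86_OP_IMM:
                v = insn.operands[1].imm
                if 0 < v < 65536:
                    # small immediates are treated as the port
                    port = v
                elif v >= 0x400000:
                    # otherwise treat the immediate as the address of a string
                    x = self.get_string(file, v - 0x400000)
                    if URI_REGEX.match(x): uri = x
                    elif DOMAIN_REGEX.match(x): host = x
                    elif IP_REGEX.match(x): host = x
        elif insn.mnemonic == 'call':
            break
        if movs == 3:
            break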
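
The snippet above calls a `get_string` helper that is not shown in the post. A minimal sketch of what such a helper might look like follows; the `max_len` cap and the ASCII decoding are assumptions. Like the PoCs, it expects a file offset, so the caller is responsible for subtracting the image base, and that conversion only holds when a section's file offset equals its virtual offset.

```python
def get_string(data, offset, max_len=256):
    """Return the NUL-terminated ASCII string at file offset `offset`.

    Hypothetical helper: an out-of-range offset yields an empty string,
    and strings longer than `max_len` bytes are truncated.
    """
    if offset < 0 or offset >= len(data):
        return ""
    end = data.find(b"\x00", offset, offset + max_len)
    if end == -1:
        end = min(offset + max_len, len(data))
    return data[offset:end].decode("ascii", errors="replace")
```

Returning an empty string for bad offsets means the regex checks in the loop simply fail to match instead of raising.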

Example: Alina PoS Malware

Alina is a PoS malware family that has been around for a while. Similar to Backoff, I noticed that in many sandbox runs the malware did not successfully communicate with its C2 even though the configuration was viewable in the binary.

Alina C2 Strings

I used a process similar to the one for Backoff: first locate potential C2 candidates, then search for XREFs and disassemble with Capstone. Many times the C2 is pushed onto the stack, followed by instructions setting local variables and then a subroutine call. Prior to the push of the C2 and of the URI, there is another push that represents the length of the string, which can also be used to validate the sequence. Once again, this is a great place to use Capstone to make sure that anything extracted matches what is expected.

Alina ASM to load C2

This sequence of pushes and calls always seems to be preceded by a call to InitializeCriticalSection, so I first look for that, using a dict built from loading the binary into pefile to get at the import table. The order in which the hostname and the URI occur in the binary can be flip-flopped, so I allow for that. I do make sure that the next push after the strlen is a string. The code can be extended further to validate that the strlen matches the string I extract from the binary, but this is just a PoC :)

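    from capstone.x86 import X86_OP_IMM, X86_OP_MEM

    instr_cnt = 0
    strlen = None
    str_instr = None
    for i in md.disasm(CODE, push_len_addr):
        if instr_cnt == 0:
            # check for InitializeCriticalSection; impts maps import
            # addresses to names and is built from pefile
            if i.mnemonic == 'call' and i.operands[0].type == X86_OP_MEM and \
              impts.get(i.operands[0].mem.disp, '') == 'InitializeCriticalSection':
                print("On the right track...")
            else:
                break
        elif i.mnemonic == 'push' and i.operands[0].type == X86_OP_IMM \
          and 0 < i.operands[0].imm < 0x100:
            # a small immediate is taken to be the string length
            strlen = i.operands[0].imm
            str_instr = instr_cnt + 1
            print("Found the strlen push", i.mnemonic, i.op_str)
        elif strlen and str_instr == instr_cnt and i.mnemonic == 'push' \
          and i.operands[0].type == X86_OP_IMM:
            # the push right after the strlen should be the string address
            addr = i.operands[0].imm
            if addr == 0x400000 + file.find(s):
                print('found hostname push')
                hostname = get_string(file, addr - 0x400000)
                print(hostname)
            else:
                uri = get_string(file, addr - 0x400000)
                if URI_REGEX.match(uri): print(uri)
        instr_cnt += 1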
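
The strlen validation mentioned above could look something like the sketch below. The function name is hypothetical, and it assumes the pushed length does not count the NUL terminator; that has not been checked against Alina samples here, so adjust the comparison if the length turns out to include it.

```python
def push_matches_string(data, strlen, addr, image_base=0x400000):
    """Return True if the NUL-terminated string at virtual address `addr`
    is exactly `strlen` bytes long (terminator not counted)."""
    off = addr - image_base
    if off < 0 or off >= len(data):
        return False
    end = data.find(b"\x00", off)
    return end != -1 and end - off == strlen
```

In the loop above, a check like this would sit just before the hostname or URI is accepted, using the `strlen` captured from the preceding push.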

Example: DirtJumper Drive

My last example is more complex. Drive stores its most interesting strings in an encrypted format and does not decrypt them all in the same function, instead scattering the calls throughout the binary (for more information see my previous blog post here). In this example, I use the encrypted install name, which always starts with the same characters, to help locate the decryption function. The decryption function is the function called right after the call that XREFs the encrypted install name.

Drive Install Name XRef

With the address of the decryption function known, I use the “k=” string used in the phone-home to help locate the network communication function. This function is where the C2 information is first decrypted, and the C2 and the URI are the first two things decrypted in it. The code can then be walked further down to locate the C2 port, but that code is not shown here.

Drive C2 decryption

Here’s the first piece of code used to locate the decryption function:


        # s is the constant prefix of the encrypted install name;
        # look for "mov eax, <addr of s>" (opcode 0xb8)
        mov_addr = b'\xb8' + struct.pack("<I", 0x400000 + file.find(s))
        instr_addr = 0x400000 + file.find(mov_addr)
        if instr_addr < 0x400000:
            # find() returned -1; try "mov edx, <addr of s>" (opcode 0xba)
            mov_addr = b'\xba' + struct.pack("<I", 0x400000 + file.find(s))
            instr_addr = 0x400000 + file.find(mov_addr)

        # looks for PUSH EBP; MOV EBP, ESP
        func_start = file[:instr_addr-0x400000].rfind(b'\x55\x8b\xec')
        code = file[func_start:func_start+0x200]
        md = Cs(CS_ARCH_X86, CS_MODE_32)
        md.detail = True
        decrypt_func_next = False
        calls = 0
        for i in md.disasm(code, func_start+0x400000):
            # looking for mov eax, <addr>
            if i.mnemonic == 'mov' and len(i.operands) == 2 \
              and i.operands[0].type == X86_OP_REG and i.operands[0].reg == X86_REG_EAX \
              and i.operands[1].type == X86_OP_IMM and i.operands[1].imm >= 0x400000 \
              and i.operands[1].imm <= 0x500000:
                d = decrypt_drive(get_string(file, i.operands[1].imm-0x400000))
                # validate that this is indeed the install name
                if d.endswith('.exe'):
                    config['install_name'] = d
                    decrypt_func_next = True
            # the second call after the install name is the decryption function
            elif decrypt_func_next and 'install_name' in config \
              and i.mnemonic == 'call' and calls == 1:
                config['decrypt_func'] = i.operands[0].imm
                break
            # the first call after the install name is the one that XREFs it
            elif 'install_name' in config and i.mnemonic == 'call':
                calls += 1
Now that the decryption function has been located, the desired C2 information can be extracted.


        # look for "mov edx, <addr of 'k='>" (opcode 0xba)
        mov_inst = b'\xba' + struct.pack("<I", 0x400000 + file.find(b'k='))
        mov_k_addr = 0x400000 + file.find(mov_inst)
        # look for PUSH EBP; MOV EBP, ESP before the "k=" reference
        func_start = file[:mov_k_addr-0x400000].rfind(b'\x55\x8b\xec')
        code = file[func_start:func_start+0x200]
        md = Cs(CS_ARCH_X86, CS_MODE_32)
        md.detail = True
        calls = 0
        d = None
        for i in md.disasm(code, func_start + 0x400000):
            # look for mov edx, <addr>
            if i.mnemonic == 'mov' and len(i.operands) == 2 \
              and i.operands[0].type == X86_OP_REG and i.operands[0].reg == X86_REG_EDX \
              and i.operands[1].type == X86_OP_IMM and i.operands[1].imm >= 0x400000 \
              and i.operands[1].imm <= 0x500000:
                d = get_string(file, i.operands[1].imm-0x400000)
            # if call decrypt_func, then decrypt(d)
            elif i.mnemonic == 'call' and i.operands[0].imm == config['decrypt_func'] and d:
                # first call is the c2 host/ip
                if calls == 0:
                    config['host'] = decrypt_drive(d)
                    d = None
                    calls += 1
                # 2nd call is the URI
                elif calls == 1:
                    config['uri'] = decrypt_drive(d)
                    d = None
                    break
Future Work

Capstone is a useful tool to have in your toolbox and hopefully the PoC code presented in this post will aid others in the future. For my own future work, I plan to tighten up the code presented and work on getting code for other interesting malware families into something that will be suitable to push out for public release.
