Rosetta: Extracting Protocol Semantics using Binary Analysis with ...

checked when a message is received and need to be constructed when a message is sent. We identify consistency fields in the following way: if we find that some bytes of a message being sent have dependencies on all other bytes in the same message, or on certain fields, then we flag those bytes as a consistency field.

Keyboard input: Messages often include data provided by the user via the keyboard, such as the filename in an FTP download, the domain name in a DNS query, or the username and password in an ICQ login session. We identify keyboard input by analyzing whether any part of a sent message has been derived from data obtained from the keyboard.

File names: We identify file names present in received messages, rather than in sent messages as presented in the above paragraph, by analyzing whether the parameters of system calls used to open files (e.g., open) or to get file properties (e.g., NtQueryInformationFile) have been derived from data previously received over the network.

Configuration data: Configuration data is data provided by the user in a non-interactive way, for example, protocol configuration parameters such as the number of times to retry a connection. We identify configuration data by analyzing whether some parts of a sent message have been derived from data read from a file.
As a special case, we also consider whether data has been derived from values stored in the Windows registry.

To summarize, in this section we have shown how we can identify dynamic fields using one of two techniques: analyzing how the data in a received message is used to derive the parameters of selected system calls, or analyzing how the data in a sent message has been derived using the return values of selected system calls, where analyzing means extracting the dependencies between those values. In Section 5 we show the techniques we use to extract the dependencies.

5 Formula Generation

In this section we describe the formula generation phase. The formula generation phase generates the input-output formulas, which precisely capture how a value was generated. We can use the formulas to capture two types of dependencies: 1) how the different bytes in a message being sent have been derived from the output of selected system calls, and 2) how the input parameters to selected system calls have been derived from data previously received over the network.

The second case is only used to identify dynamic fields in received messages, as explained in Section 4, but not to understand how the data in the dynamic field was encoded. Since both cases are instances of the same technique, in this section we will only describe the first case. Thus, the dependencies we describe in this section capture how the bytes in a message being sent have been derived from the output of selected system calls.

The formula generation comprises three steps. First, we convert the execution trace into our intermediate representation (IR) and apply additional SSA and memory conversions to prepare the IR in the format expected by the following steps.
Then, we apply a Dynamic Program Slicing algorithm, which extracts, for each variable, all statements that affected the value of the variable, where the variables are in this case the bytes sent over the network. These statements form a slice, which is different for each variable and contains all the needed dependencies. Note that we consider each byte in a message sent over the network independently, and each slice contains all statements that affected how a single byte was derived. Finally, we create a formula by combining all the statements in a slice together and simplify that formula using different techniques. The output of this simplification is the input-output formula, which captures how a byte in a sent message was constructed.

5.1 Conversions

5.1.1 IR generation

We convert the instructions logged in the execution trace to an intermediate representation (IR). The advantage of using an intermediate representation is that it allows us to perform subsequent steps over the simpler IR statements, instead of the hundreds of cumbersome x86 instructions. The translation from an x86 instruction to our IR is designed to correctly model the semantics of the original x86 instruction, including making otherwise implicit side effects explicit. For example, we correctly model instructions that set the eflags register, prefixes that allow single-instruction loops (e.g., rep), instructions with hidden operands (e.g., imul %ebx, which also operates on eax and edx), and instructions that behave differently depending on the operands (e.g., shifts).

Our IR is shown in Table 1.
It has assignments (r := v), binary operations (r := u □_b v), unary operations (r := □_u v), loading a value from memory into a register (r1 := *(r2)), storing a value (*(r1) := r2), direct jumps to a known target label (jmp l), indirect jumps to a computed value stored in a register (ijmp r), and conditional jumps (if r then jmp l1 else jmp l2).

When converting the execution trace to the IR, we need to mark the output of the selected system calls so we can trace how the sent messages are derived from these values. For each occurrence of one of the selected system calls, we mark the return values of the system call as symbolic. A symbolic variable is a variable with a special name that is never assigned a concrete value throughout the IR. Then, for each instruction that operates on a symbolic variable, we propagate this symbolic marking to
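
As a rough illustration of the three steps of formula generation described in this section (an IR whose system-call outputs are symbolic, a backward dynamic slice per output byte, and a formula built from the slice), consider the following Python sketch. It is not Rosetta's implementation. The `Stmt` layout, the function names `backward_slice`, `symbolic_inputs`, and `formula`, and the `sym_` prefix used to mark symbolic variables are all assumptions made for this example, and the toy IR covers only straight-line SSA assignments with unary and binary operations, not memory accesses or jumps.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Stmt:
    """One SSA statement of a toy IR: dst := op(args). Hypothetical layout."""
    dst: str
    op: str                # "mov", a binary op such as "+" or "^", or a unary op such as "~"
    args: Tuple[str, ...]  # variable names, or integer literals written as strings


def is_literal(name: str) -> bool:
    return name.lstrip("-").isdigit()


def backward_slice(trace: List[Stmt], target: str) -> List[Stmt]:
    """Return the statements of the trace that affected `target`, in trace order."""
    needed: Set[str] = {target}
    kept: List[Stmt] = []
    for stmt in reversed(trace):
        if stmt.dst in needed:
            kept.append(stmt)
            needed.discard(stmt.dst)  # SSA: each variable is defined exactly once
            needed.update(a for a in stmt.args if not is_literal(a))
    kept.reverse()
    return kept


def symbolic_inputs(slice_: List[Stmt]) -> List[str]:
    """Symbolic variables (assumed to carry the 'sym_' prefix) that the slice reads."""
    return sorted({a for s in slice_ for a in s.args if a.startswith("sym_")})


def formula(slice_: List[Stmt], target: str) -> str:
    """Combine a slice into one expression over symbolic variables and constants."""
    defs: Dict[str, Stmt] = {s.dst: s for s in slice_}

    def expand(name: str) -> str:
        if name not in defs:  # symbolic input or literal: nothing to substitute
            return name
        s = defs[name]
        parts = [expand(a) for a in s.args]
        if s.op == "mov":
            return parts[0]
        if len(parts) == 1:
            return f"({s.op}{parts[0]})"
        return f"({parts[0]} {s.op} {parts[1]})"

    return expand(target)


# A four-statement trace: out0 is one byte of a sent message, and sym_key0
# stands for one byte returned by a selected system call (e.g., keyboard input).
trace = [
    Stmt("t1", "mov", ("sym_key0",)),
    Stmt("t2", "+", ("t1", "1")),
    Stmt("t3", "mov", ("7",)),  # unrelated to out0, so it is sliced away
    Stmt("out0", "^", ("t2", "32")),
]
sl = backward_slice(trace, "out0")
print(formula(sl, "out0"))  # ((sym_key0 + 1) ^ 32)
```

In this sketch the slice for `out0` drops the unrelated statement `t3`, and the resulting expression names only the symbolic input and constants, which is the role the input-output formula plays in the text. No simplification is performed here.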
