Quick introduction to reverse engineering for beginners
Quick introduction to reverse engineering for beginners
Quick introduction to reverse engineering for beginners
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>Quick</strong> <strong>introduction</strong> <strong>to</strong> <strong>reverse</strong> <strong>engineering</strong> <strong>for</strong> <strong>beginners</strong><br />
Dennis Yurichev<br />
<br />
cbnd<br />
c○2013, Dennis Yurichev.<br />
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0<br />
Unported License. To view a copy of this license, visit<br />
http://creativecommons.org/licenses/by-nc-nd/3.0/.<br />
Text version 0.6 (March 15, 2013).
Contents<br />
Preface iv<br />
1 Compiler’s patterns 1<br />
1.1 Hello, world! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2<br />
1.2 Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4<br />
1.2.1 Save return address where function should return control after execution . . . . . . . . 4<br />
1.2.2 Function arguments passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
1.2.3 Local variable s<strong>to</strong>rage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5<br />
1.2.4 (Windows) SEH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7<br />
1.2.5 Buffer overflow protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7<br />
1.3 printf() with several arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8<br />
1.4 scanf() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10<br />
1.4.1 Global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12<br />
1.4.2 scanf() result checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13<br />
1.5 Passing arguments via stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16<br />
1.6 One more word about results returning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18<br />
1.7 Conditional jumps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19<br />
1.8 switch()/case/default . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
1.8.1 Few number of cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22<br />
1.8.2 A lot of cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24<br />
1.9 Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27<br />
1.10 strlen() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31<br />
1.11 Division by 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35<br />
1.12 Work with FPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
1.12.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37<br />
1.12.2 Passing floating point number via arguments . . . . . . . . . . . . . . . . . . . . . . . 39<br />
1.12.3 Comparison example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40<br />
1.13 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46<br />
1.13.1 Buffer overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48<br />
1.13.2 One more word about arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52<br />
1.13.3 Multidimensional arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52<br />
1.14 Bit fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
1.14.1 Specific bit checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54<br />
1.14.2 Specific bit setting/clearing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56<br />
1.14.3 Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58<br />
1.14.4 CRC32 calculation example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61<br />
1.15 Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
1.15.1 SYSTEMTIME example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65<br />
1.15.2 Let’s allocate place <strong>for</strong> structure using malloc() . . . . . . . . . . . . . . . . . . . . . . 66<br />
1.15.3 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67<br />
1.15.4 Fields packing in structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69<br />
1.15.5 Nested structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70<br />
1.15.6 Bit fields in structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71<br />
i
1.16 C++ classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79<br />
1.16.1 GCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82<br />
1.17 Unions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85<br />
1.17.1 Pseudo-random number genera<strong>to</strong>r example . . . . . . . . . . . . . . . . . . . . . . . . 85<br />
1.18 Pointers <strong>to</strong> functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88<br />
1.18.1 GCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90<br />
1.19 SIMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />
1.19.1 Vec<strong>to</strong>rization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92<br />
1.19.2 SIMD strlen() implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99<br />
1.20 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103<br />
2 Couple things <strong>to</strong> add 112<br />
2.1 LEA instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113<br />
2.2 Function prologue and epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114<br />
2.3 npad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115<br />
2.4 Signed number representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
2.4.1 Integer overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117<br />
2.5 Arguments passing methods (calling conventions) . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
2.5.1 cdecl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
2.5.2 stdcall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
2.5.3 fastcall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118<br />
2.5.4 thiscall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
2.5.5 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
2.5.6 Returning values of float and double type . . . . . . . . . . . . . . . . . . . . . . . . . 119<br />
3 Finding important/interesting stuff in the code 120<br />
3.1 Communication with the outer world . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120<br />
3.2 String . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
3.3 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
3.3.1 Magic numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121<br />
3.4 Finding the right instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122<br />
3.5 Suspicious code patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123<br />
4 Tasks 124<br />
4.1 Easy level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />
4.1.1 Task 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124<br />
4.1.2 Task 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125<br />
4.1.3 Task 1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128<br />
4.1.4 Task 1.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129<br />
4.1.5 Task 1.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131<br />
4.1.6 Task 1.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132<br />
4.1.7 Task 1.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133<br />
4.1.8 Task 1.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135<br />
4.1.9 Task 1.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135<br />
4.1.10 Task 1.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136<br />
4.2 Middle level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137<br />
4.2.1 Task 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137<br />
4.3 crackme / keygenme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144<br />
5 Tools 145<br />
5.0.1 Debugger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145<br />
ii
6 Books/blogs worth reading 146<br />
6.1 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146<br />
6.1.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146<br />
6.1.2 C/C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146<br />
6.1.3 x86 / x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146<br />
6.2 Blogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146<br />
6.2.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146<br />
7 More examples 147<br />
7.1 “QR9”: Rubik’s cube inspired amateur cryp<strong>to</strong>-algorithm . . . . . . . . . . . . . . . . . . . . . 148<br />
7.2 SAP client network traffic compression case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183<br />
8 Other things 197<br />
8.1 Compiler’s anomalies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197<br />
9 Tasks solutions 199<br />
9.1 Easy level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199<br />
9.1.1 Task 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199<br />
9.1.2 Task 1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199<br />
9.1.3 Task 1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200<br />
9.1.4 Task 1.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200<br />
9.1.5 Task 1.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201<br />
9.1.6 Task 1.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201<br />
9.1.7 Task 1.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201<br />
9.1.8 Task 1.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203<br />
9.1.9 Task 1.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203<br />
9.2 Middle level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203<br />
9.2.1 Task 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203<br />
iii
Preface<br />
Here (will be) some of my notes about <strong>reverse</strong> <strong>engineering</strong> in English language <strong>for</strong> those <strong>beginners</strong> who like<br />
<strong>to</strong> learn <strong>to</strong> understand x86 code created by C/C++ compilers (which is a most large mass of all executable<br />
software in the world).<br />
There are two most used compilers: MSVC and GCC, these we will use <strong>for</strong> experiments.<br />
There are two most used x86 assembler syntax: Intel (most used in DOS/Windows) and AT&T (used in<br />
*NIX) 1 . Here we use Intel syntax. IDA 5 produce Intel syntax listings <strong>to</strong>o.<br />
1 http://en.wikipedia.org/wiki/X86_assembly_language#Syntax<br />
iv
Chapter 1<br />
Compiler’s patterns<br />
When I first learn C and then C++, I was just writing small pieces of code, compiling it and see what is<br />
producing in assembler, that was an easy way <strong>to</strong> me. I did it a lot times and relation between C/C++ code<br />
and what compiler produce was imprinted in my mind so deep so that I can quickly understand what was<br />
in C code when I look at produced x86 code. Perhaps, this method may be helpful <strong>for</strong> anyone else, so I’ll<br />
try <strong>to</strong> describe here some examples.<br />
1
1.1 Hello, world!<br />
Let’s start with famous example from the book “The C programming Language” 1 :<br />
#include <br />
int main()<br />
{<br />
printf("hello, world");<br />
return 0;<br />
};<br />
Let’s compile it in MSVC 2010: cl 1.cpp /Fa1.asm<br />
(/Fa option mean generate assembly listing file)<br />
CONST SEGMENT<br />
$SG3830 DB ’hello, world’, 00H<br />
CONST ENDS<br />
PUBLIC _main<br />
EXTRN _printf:PROC<br />
; Function compile flags: /Odtp<br />
_TEXT SEGMENT<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
push OFFSET $SG3830<br />
call _printf<br />
add esp, 4<br />
xor eax, eax<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
_TEXT ENDS<br />
Compiler generated 1.obj file which will be linked in<strong>to</strong> 1.exe.<br />
In our case, the file contain two segments: CONST (<strong>for</strong> data constants) and _TEXT (<strong>for</strong> code).<br />
The string “hello, world” in C/C++ has type const char*, however hasn’t its own name.<br />
But compiler need <strong>to</strong> work with this string somehow, so it define internal name $SG3830 <strong>for</strong> it.<br />
As we can see, the string is terminated by zero byte — this is C/C++ standard of strings.<br />
In the code segment _TEXT there are only one function so far — _main.<br />
Function _main starting with prologue code and ending with epilogue code, like almost any function.<br />
Read more about it in section about function prolog and epilog 2.2.<br />
After function prologue we see a function printf() call: CALL _printf.<br />
Be<strong>for</strong>e the call, string address (or pointer <strong>to</strong> it) containing our greeting is placed in<strong>to</strong> stack with help of<br />
PUSH instruction.<br />
When printf() function returning control flow <strong>to</strong> main() function, string address (or pointer <strong>to</strong> it) is<br />
still in stack.<br />
Because we do not need it anymore, stack pointer (ESP register) is <strong>to</strong> be corrected.<br />
ADD ESP, 4 mean add 4 <strong>to</strong> the value in ESP register.<br />
Why 4? Since it is 32-bit code, we need exactly 4 bytes <strong>for</strong> address passing through the stack. It’s 8<br />
bytes in x64-code<br />
“ADD ESP, 4” is equivalent <strong>to</strong> “POP register” but without any register usage 2 .<br />
1 http://en.wikipedia.org/wiki/The_C_Programming_Language<br />
2 CPU flags, however, modified<br />
2
Some compilers like Intel C++ Compiler, at the same point, could emit POP ECX instead of ADD (<strong>for</strong><br />
example, this can be observed in Oracle RDBMS code, compiled by Intel C++ compiler), and this instruction<br />
has almost the same effect, but ECX register contents will be rewritten.<br />
Probably, Intel C++ compiler using POP ECX because this instruction’s opcode is shorter then ADD ESP,<br />
x (1 byte against 3).<br />
Read more about stack in relevant section 1.2.<br />
After printf() call, in original C/C++ code was return 0 — return zero as a main() function result.<br />
In the generated code this is implemented by instruction XOR EAX, EAX<br />
XOR, in fact, just “eXclusive OR” 3 , but compilers using it often instead of MOV EAX, 0 — slightly shorter<br />
opcode again (2 bytes against 5).<br />
Some compilers emitting SUB EAX, EAX, which mean SUBtract EAX value from EAX, which is in any case<br />
will result zero.<br />
Last instruction RET returning control flow <strong>to</strong> calling function. Usually, it’s C/C++ CRT 4 code, which,<br />
in turn, return control <strong>to</strong> operation system.<br />
Now let’s try <strong>to</strong> compile the same C/C++ code in GCC 4.4.1 compiler in Linux: gcc 1.c -o 1<br />
After, with the IDA 5 disassembler assistance, let’s see how main() function was created.<br />
Note: we could also see GCC assembler result applying option -S -masm=intel<br />
main proc near<br />
var_10 = dword ptr -10h<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 10h<br />
mov eax, offset aHelloWorld ; "hello, world"<br />
mov [esp+10h+var_10], eax<br />
call _printf<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
Almost the same. Address of “hello world” string (s<strong>to</strong>red in data segment) is saved in EAX register first,<br />
then it s<strong>to</strong>red in<strong>to</strong> stack. Also, in function prologue we see AND ESP, 0FFFFFFF0h — this instruction aligning<br />
ESP value on 16-byte border, resulting all values in stack aligned <strong>to</strong>o (CPU per<strong>for</strong>ming better if values it<br />
working with are located in memory at addresses aligned by 4 or 16 byte border) 5 .<br />
SUB ESP, 10h allocate 16 bytes in stack, although, as we could see below, only 4 need here.<br />
This is because the size of allocated stack is also aligned on 16-byte border.<br />
String address (or pointer <strong>to</strong> string) is then writing directly in<strong>to</strong> stack space without PUSH instruction<br />
use. var_10 — is local variable, but also argument <strong>for</strong> printf(). Read below about it.<br />
Then printf() function is called.<br />
Unlike MSVC, GCC while compiling without optimization turned on, emitting MOV EAX, 0 instead of<br />
shorter opcode.<br />
The last instruction LEAVE — is MOV ESP, EBP and POP EBP instructions pair equivalent — in other<br />
words, this instruction setting back stack pointer (ESP) and EBP register <strong>to</strong> its initial state.<br />
This is necessary because we modified these register values (ESP and EBP) at the function start (executing<br />
MOV EBP, ESP / AND ESP, ...).<br />
3 http://en.wikipedia.org/wiki/Exclusive_or<br />
4 C Run-Time Code<br />
5 Wikipedia: Data structure alignment<br />
3
1.2 Stack<br />
Stack — is one of the most fundamental things in computer science. 6 .<br />
Technically, this is just memory block in process memory + ESP register as a pointer within this block.<br />
Most frequently used stack access instructions are PUSH and POP. PUSH subtracting ESP by 4 and then<br />
writing contents of its sole operand <strong>to</strong> the memory address pointing by ESP.<br />
POP is <strong>reverse</strong> operation: get a data from memory pointing by ESP and then add 4 <strong>to</strong> ESP. Of course, this<br />
is <strong>for</strong> 32-bit environment. 8 will be here instead of 4 in x64 environment.<br />
After stack allocation, ESP pointing <strong>to</strong> the end of stack. PUSH increasing ESP, and POP decreasing. The<br />
end of stack is actually at the beginning of allocated <strong>for</strong> stack memory block. It seems strange, but it is so.<br />
What stack is used <strong>for</strong>?<br />
1.2.1 Save return address where function should return control after execution<br />
While calling another function by CALL instruction, the address of point exactly after CALL instruction is<br />
saved <strong>to</strong> stack, and then unconditional jump <strong>to</strong> the address from CALL operand is executed.<br />
CALL is PUSH address_after_call / JMP operand instructions pair equivalent.<br />
RET is fetching value from stack and jump <strong>to</strong> it — it is POP tmp / JMP tmp instructions pair equivalent.<br />
Stack overflow is simple, just run eternal recursion:<br />
void f()<br />
{<br />
f();<br />
};<br />
MSVC 2008 reporting about problem:<br />
c:\tmp6>cl ss.cpp /Fass.asm<br />
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 <strong>for</strong> 80x86<br />
Copyright (C) Microsoft Corporation. All rights reserved.<br />
ss.cpp<br />
c:\tmp6\ss.cpp(4) : warning C4717: ’f’ : recursive on all control paths, function will<br />
cause runtime stack overflow<br />
... but generates right code anyway:<br />
?f@@YAXXZ PROC ; f<br />
; File c:\tmp6\ss.cpp<br />
; Line 2<br />
push ebp<br />
mov ebp, esp<br />
; Line 3<br />
call ?f@@YAXXZ ; f<br />
; Line 4<br />
pop ebp<br />
ret 0<br />
?f@@YAXXZ ENDP ; f<br />
... Also, if we turn on optimization (/Ox option), the optimized code will not overflow stack, but will<br />
work correctly 7 :<br />
?f@@YAXXZ PROC ; f<br />
; File c:\tmp6\ss.cpp<br />
; Line 2<br />
6 http://en.wikipedia.org/wiki/Call_stack<br />
7 irony here<br />
4
$LL3@f:<br />
; Line 3<br />
jmp SHORT $LL3@f<br />
?f@@YAXXZ ENDP ; f<br />
GCC 4.4.1 generating the same code in both cases, although not warning about problem.<br />
1.2.2 Function arguments passing<br />
push arg3<br />
push arg2<br />
push arg1<br />
call f<br />
add esp, 4*3<br />
Callee 8 function get its arguments via ESP ponter.<br />
See also section about calling conventions 2.5.<br />
It is important <strong>to</strong> note that no one oblige programmers <strong>to</strong> pass arguments through stack, it is not<br />
prerequisite.<br />
One could implement any other method not using stack.<br />
For example, it is possible <strong>to</strong> allocate a place <strong>for</strong> arguments in heap, fill it and pass <strong>to</strong> a function via<br />
pointer <strong>to</strong> this pack in EAX register. And this will work<br />
However, it is convenient tradition <strong>to</strong> use stack <strong>for</strong> this.<br />
1.2.3 Local variable s<strong>to</strong>rage<br />
A function could allocate some space in stack <strong>for</strong> its local variables just shifting ESP pointer deeply enough<br />
<strong>to</strong> stack bot<strong>to</strong>m.<br />
It is also not prerequisite. You could s<strong>to</strong>re local variables wherever you like. But traditionally it is so.<br />
alloca() function<br />
It is worth noting alloca() function. 9 .<br />
This function works like malloc(), but allocate memory just in stack.<br />
Allocated memory chunk is not needed <strong>to</strong> be freed via free() function call because function epilogue 2.2<br />
will return ESP back <strong>to</strong> initial state and allocated memory will be just annuled.<br />
It is worth noting how alloca() implemented.<br />
This function, if <strong>to</strong> simplify, just shifting ESP deeply <strong>to</strong> stack bot<strong>to</strong>m so much bytes you need and set<br />
ESP as a pointer <strong>to</strong> that allocated block. Let’s try:<br />
#include <br />
#include <br />
void f()<br />
{<br />
char *buf=(char*)alloca (600);<br />
_snprintf (buf, 600, "hi! %d, %d, %d\n", 1, 2, 3);<br />
};<br />
puts (buf);<br />
8 Function being called<br />
9 As of MSVC, function implementation can be found in alloca16.asm and chkstk.asm in C:\Program Files<br />
(x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel<br />
5
(_snprintf() function works just like printf(), but instead dumping result in<strong>to</strong> stdout (e.g., <strong>to</strong> terminal<br />
or console), write it <strong>to</strong> buf buffer. puts() copies buf contents <strong>to</strong> stdout. Of course, these two function calls<br />
might be replaced by one printf() call, but I would like <strong>to</strong> illustrate small buffer usage.)<br />
Let’s compile (MSVC 2010):<br />
...<br />
...<br />
mov eax, 600 ; 00000258H<br />
call __alloca_probe_16<br />
mov esi, esp<br />
push 3<br />
push 2<br />
push 1<br />
push OFFSET $SG2672<br />
push 600 ; 00000258H<br />
push esi<br />
call __snprintf<br />
push esi<br />
call _puts<br />
add esp, 28 ; 0000001cH<br />
The sole alloca() argument passed via EAX (instead of pushing in<strong>to</strong> stack).<br />
After alloca() call, ESP is now pointing <strong>to</strong> the block of 600 bytes and we can use it as memory <strong>for</strong> buf<br />
array.<br />
GCC 4.4.1 can do the same without calling external functions:<br />
public f<br />
f proc near ; CODE XREF: main+6<br />
s = dword ptr -10h<br />
var_C = dword ptr -0Ch<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 38h<br />
mov eax, large gs:14h<br />
mov [ebp+var_C], eax<br />
xor eax, eax<br />
sub esp, 624<br />
lea eax, [esp+18h]<br />
add eax, 0Fh<br />
shr eax, 4 ; align pointer<br />
shl eax, 4 ; by 16-byte border<br />
mov [ebp+s], eax<br />
mov eax, offset <strong>for</strong>mat ; "hi! %d, %d, %d\n"<br />
mov dword ptr [esp+14h], 3<br />
mov dword ptr [esp+10h], 2<br />
mov dword ptr [esp+0Ch], 1<br />
mov [esp+8], eax ; <strong>for</strong>mat<br />
mov dword ptr [esp+4], 600 ; maxlen<br />
mov eax, [ebp+s]<br />
6
mov [esp], eax ; s<br />
call _snprintf<br />
mov eax, [ebp+s]<br />
mov [esp], eax ; s<br />
call _puts<br />
mov eax, [ebp+var_C]<br />
xor eax, large gs:14h<br />
jz short locret_80484EB<br />
call ___stack_chk_fail<br />
locret_80484EB: ; CODE XREF: f+70<br />
leave<br />
retn<br />
f endp<br />
1.2.4 (Windows) SEH<br />
SEH (Structured Exception Handling) records are also s<strong>to</strong>red in stack (if needed). 10 .<br />
1.2.5 Buffer overflow protection<br />
More about it here 1.13.1.<br />
10 Classic Matt Pietrek article about SEH: http://www.microsoft.com/msj/0197/Exception/Exception.aspx<br />
7
1.3 printf() with several arguments<br />
Now let’s extend Hello, world! 1.1 example, replacing printf() in main() function body by this:<br />
printf("a=%d; b=%d; c=%d", 1, 2, 3);<br />
Let’s compile it by MSVC 2010 and we got:<br />
$SG3830 DB ’a=%d; b=%d; c=%d’, 00H<br />
...<br />
push 3<br />
push 2<br />
push 1<br />
push OFFSET $SG3830<br />
call _printf<br />
add esp, 16 ; 00000010H<br />
Almost the same, but now we can see that printf() arguments are pushing in<strong>to</strong> stack in <strong>reverse</strong> order:<br />
and the first argument is pushing in as the last one.<br />
By the way, variables of int type in 32-bit environment has 32-bit width, that’s 4 bytes.<br />
So, we got here 4 arguments. 4*4 = 16 — they occupy in stack exactly 16 bytes: 32-bit pointer <strong>to</strong> string<br />
and 3 number of int type.<br />
When stack pointer (ESP register) is corrected by ADD ESP, X instruction after some function call, often,<br />
the number of function arguments could be deduced here: just divide X by 4.<br />
Of course, this is related only <strong>to</strong> cdecl calling convention.<br />
See also section about calling conventions 2.5.<br />
It is also possible <strong>for</strong> compiler <strong>to</strong> merge several ADD ESP, X instructions in<strong>to</strong> one, after last call:<br />
push a1<br />
push a2<br />
call ...<br />
...<br />
push a1<br />
call ...<br />
...<br />
push a1<br />
push a2<br />
push a3<br />
call ...<br />
add esp, 24<br />
Now let’s compile the same in Linux by GCC 4.4.1 and take a look in IDA 5 what we got:<br />
main proc near<br />
var_10 = dword ptr -10h<br />
var_C = dword ptr -0Ch<br />
var_8 = dword ptr -8<br />
var_4 = dword ptr -4<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 10h<br />
mov eax, offset aADBDCD ; "a=%d; b=%d; c=%d"<br />
8
mov [esp+10h+var_4], 3<br />
mov [esp+10h+var_8], 2<br />
mov [esp+10h+var_C], 1<br />
mov [esp+10h+var_10], eax<br />
call _printf<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
It can be said, the difference between code by MSVC and GCC is only in method of placing arguments<br />
in<strong>to</strong> stack. Here GCC working directly with stack without PUSH/POP.<br />
By the way, this difference is a good illustration that CPU is not aware of how arguments is passed <strong>to</strong><br />
functions. It is also possible <strong>to</strong> create hypothetical compiler which is able <strong>to</strong> pass arguments via some special<br />
structure not using stack at all.<br />
9
1.4 scanf()<br />
Now let’s use scanf().<br />
int main()<br />
{<br />
int x;<br />
printf ("Enter X:\n");<br />
};<br />
scanf ("%d", &x);<br />
printf ("You entered %d...\n", x);<br />
return 0;<br />
OK, I agree, it is not clever <strong>to</strong> use scanf() <strong>to</strong>day. But I wanted <strong>to</strong> illustrate passing pointer <strong>to</strong> int.<br />
What we got after compiling in MSVC 2010:<br />
CONST SEGMENT<br />
$SG3831 DB ’Enter X:’, 0aH, 00H<br />
ORG $+2<br />
$SG3832 DB ’%d’, 00H<br />
ORG $+1<br />
$SG3833 DB ’You entered %d...’, 0aH, 00H<br />
CONST ENDS<br />
PUBLIC _main<br />
EXTRN _scanf:PROC<br />
EXTRN _printf:PROC<br />
; Function compile flags: /Odtp<br />
_TEXT SEGMENT<br />
_x$ = -4 ; size = 4<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
push OFFSET $SG3831<br />
call _printf<br />
add esp, 4<br />
lea eax, DWORD PTR _x$[ebp]<br />
push eax<br />
push OFFSET $SG3832<br />
call _scanf<br />
add esp, 8<br />
mov ecx, DWORD PTR _x$[ebp]<br />
push ecx<br />
push OFFSET $SG3833<br />
call _printf<br />
add esp, 8<br />
xor eax, eax<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
_TEXT ENDS<br />
10
Variable x is local.<br />
C/C++ standard tell us it must be visible only in this function and not from any other place. Traditionally,<br />
local variables are placed in the stack. Probably, there could be other ways, but in x86 it is<br />
so.<br />
Next after function prologue instruction PUSH ECX is not <strong>for</strong> saving ECX state (notice absence of corresponding<br />
POP ECX at the function end).<br />
In fact, this instruction just allocate 4 bytes in stack <strong>for</strong> x variable s<strong>to</strong>rage.<br />
x will be accessed with the assistance of _x$ macro (it equals <strong>to</strong> -4) and EBP register pointing <strong>to</strong> current<br />
frame.<br />
Over a span of function execution, EBP is pointing <strong>to</strong> current stack frame and it is possible <strong>to</strong> have an<br />
access <strong>to</strong> local variables and function arguments via EBP+offset.<br />
It is also possible <strong>to</strong> use ESP, but it’s often changing and not very handy. So it can be said, EBP is frozen<br />
state of ESP at the moment of function execution start.<br />
Function scanf() in our example has two arguments.<br />
First is pointer <strong>to</strong> the string containing “%d” and second — address of variable x.<br />
First of all, address of x is placed in<strong>to</strong> EAX register by lea eax, DWORD PTR _x$[ebp] instruction<br />
LEA meaning load effective address, but over a time it changed its primary application 2.1.<br />
It can be said, LEA here just placing <strong>to</strong> EAX sum of EBP value and _x$ macro.<br />
It is the same as lea eax, [ebp-4].<br />
So, 4 subtracting from EBP value and result is placed <strong>to</strong> EAX. And then value in EAX is pushing in<strong>to</strong> stack<br />
and scanf() is called.<br />
After that, printf() is called. First argument is pointer <strong>to</strong> string: “You entered %d...\n”.<br />
Second argument is prepared as: mov ecx, [ebp-4], this instruction placing <strong>to</strong> ECX not address of x<br />
variable but its contents.<br />
After, ECX value is placing in<strong>to</strong> stack and last printf() called.<br />
Let’s try <strong>to</strong> compile this code in GCC 4.4.1 under Linux:<br />
main proc near<br />
var_20 = dword ptr -20h<br />
var_1C = dword ptr -1Ch<br />
var_4 = dword ptr -4<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 20h<br />
mov [esp+20h+var_20], offset aEnterX ; "Enter X:"<br />
call _puts<br />
mov eax, offset aD ; "%d"<br />
lea edx, [esp+20h+var_4]<br />
mov [esp+20h+var_1C], edx<br />
mov [esp+20h+var_20], eax<br />
call ___isoc99_scanf<br />
mov edx, [esp+20h+var_4]<br />
mov eax, offset aYouEnteredD___ ; "You entered %d...\n"<br />
mov [esp+20h+var_1C], edx<br />
mov [esp+20h+var_20], eax<br />
call _printf<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
11
GCC replaced first printf() call <strong>to</strong> puts(). Indeed: printf() with only sole argument is almost<br />
analogous <strong>to</strong> puts().<br />
Almost, because we need <strong>to</strong> be sure that this string will not contain printf-control statements starting<br />
with % : then effect of these two functions will be different.<br />
Why GCC replaced printf() <strong>to</strong> puts? Because puts() work faster 11 .<br />
It working faster because just passes characters <strong>to</strong> stdout not comparing each with % symbol.<br />
As be<strong>for</strong>e — arguments are placed in<strong>to</strong> stack by MOV instruction.<br />
1.4.1 Global variables<br />
What if x variable from previous example will not be local but global variable? Then it will be accessible<br />
from any place but not only from function body. It is not very good programming practice, but <strong>for</strong> the sake<br />
of experiment we could do this.<br />
_DATA SEGMENT<br />
COMM _x:DWORD<br />
$SG2456 DB ’Enter X:’, 0aH, 00H<br />
ORG $+2<br />
$SG2457 DB ’%d’, 00H<br />
ORG $+1<br />
$SG2458 DB ’You entered %d...’, 0aH, 00H<br />
_DATA ENDS<br />
PUBLIC _main<br />
EXTRN _scanf:PROC<br />
EXTRN _printf:PROC<br />
; Function compile flags: /Odtp<br />
_TEXT SEGMENT<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
push OFFSET $SG2456<br />
call _printf<br />
add esp, 4<br />
push OFFSET _x<br />
push OFFSET $SG2457<br />
call _scanf<br />
add esp, 8<br />
mov eax, DWORD PTR _x<br />
push eax<br />
push OFFSET $SG2458<br />
call _printf<br />
add esp, 8<br />
xor eax, eax<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
_TEXT ENDS<br />
Now x variable is defined in _DATA segment. Memory in local stack is not allocated anymore. All accesses<br />
<strong>to</strong> it are not via stack but directly <strong>to</strong> process memory. Its value is not defined. This mean that memory<br />
will be allocated by operation system, but not compiler, neither operation system will not take care about<br />
its initial value at the moment of main() function start. As experiment, try <strong>to</strong> declare large array and see<br />
what will it contain after program loading.<br />
Now let’s assign value <strong>to</strong> variable explicitly:<br />
11 http://www.ciselant.de/projects/gcc_printf/gcc_printf.html<br />
12
int x=10; // default value<br />
We got:<br />
_DATA SEGMENT<br />
_x DD 0aH<br />
...<br />
Here we see value 0xA of DWORD type (DD meaning DWORD = 32 bit).<br />
If you will open compiled .exe in IDA 5, you will see x placed at the beginning of _DATA segment, and<br />
after you’ll see text strings.<br />
If you will open compiled .exe in IDA 5 from previous example where x value is not defined, you’ll see<br />
something like this:<br />
.data:0040FA80 _x dd ? ; DATA XREF: _main+10<br />
.data:0040FA80 ; _main+22<br />
.data:0040FA84 dword_40FA84 dd ? ; DATA XREF: _memset+1E<br />
.data:0040FA84 ; unknown_libname_1+28<br />
.data:0040FA88 dword_40FA88 dd ? ; DATA XREF: ___sbh_find_block+5<br />
.data:0040FA88 ; ___sbh_free_block+2BC<br />
.data:0040FA8C ; LPVOID lpMem<br />
.data:0040FA8C lpMem dd ? ; DATA XREF: ___sbh_find_block+B<br />
.data:0040FA8C ; ___sbh_free_block+2CA<br />
.data:0040FA90 dword_40FA90 dd ? ; DATA XREF: _V6_HeapAlloc+13<br />
.data:0040FA90 ; __calloc_impl+72<br />
.data:0040FA94 dword_40FA94 dd ? ; DATA XREF: ___sbh_free_block+2FE<br />
_x marked as ? among another variables not required <strong>to</strong> be initialized. This mean that after loading .exe<br />
<strong>to</strong> memory, place <strong>for</strong> all these variables will be allocated and some random garbage will be here. But in .exe<br />
file these not initialized variables are not occupy anything. It is suitable <strong>for</strong> large arrays, <strong>for</strong> example.<br />
It is almost the same in Linux, except segment names and properties: not initialized variables are located<br />
in _bss segment. In ELF 12 file <strong>for</strong>mat this segment has such attributes:<br />
; Segment type: Uninitialized<br />
; Segment permissions: Read/Write<br />
If <strong>to</strong> assign some value <strong>to</strong> variable, it will be placed in _data segment, this is segment with such attributes:<br />
; Segment type: Pure data<br />
; Segment permissions: Read/Write<br />
1.4.2 scanf() result checking<br />
As I wrote be<strong>for</strong>e, it is not very clever <strong>to</strong> use scanf() <strong>to</strong>day. But if we have <strong>to</strong>, we need at least check if<br />
scanf() finished correctly without error.<br />
int main()<br />
{<br />
int x;<br />
printf ("Enter X:\n");<br />
if (scanf ("%d", &x)==1)<br />
printf ("You entered %d...\n", x);<br />
else<br />
12 Executable file <strong>for</strong>mat widely used in *NIX system including Linux<br />
13
};<br />
return 0;<br />
printf ("What you entered? Huh?\n");<br />
By standard, scanf() 13 function returning number of fields it successfully read.<br />
In our case, if everything went fine and user entered some number, scanf() will return 1 or 0 or EOF in<br />
case of error.<br />
I added C code <strong>for</strong> scanf() result checking and printing error message in case of error.<br />
What we got in assembler (MSVC 2010):<br />
; Line 8<br />
lea eax, DWORD PTR _x$[ebp]<br />
push eax<br />
push OFFSET $SG3833 ; ’%d’, 00H<br />
call _scanf<br />
add esp, 8<br />
cmp eax, 1<br />
jne SHORT $LN2@main<br />
; Line 9<br />
mov ecx, DWORD PTR _x$[ebp]<br />
push ecx<br />
push OFFSET $SG3834 ; ’You entered %d...’, 0aH, 00H<br />
call _printf<br />
add esp, 8<br />
; Line 10<br />
jmp SHORT $LN1@main<br />
$LN2@main:<br />
; Line 11<br />
push OFFSET $SG3836 ; ’What you entered? Huh?’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
$LN1@main:<br />
; Line 13<br />
xor eax, eax<br />
Caller function (main()) must have access <strong>to</strong> result of callee function (scanf()), so callee leave this value<br />
in EAX register.<br />
After, we check it using instruction CMP EAX, 1 (CoMPare), in other words, we compare value in EAX<br />
with 1.<br />
JNE conditional jump follows CMP instruction. JNE mean Jump if Not Equal.<br />
So, if EAX value not equals <strong>to</strong> 1, then the processor will pass execution <strong>to</strong> the address mentioned in<br />
operand of JNE, in our case it is $LN2@main. Passing control <strong>to</strong> this address, microprocesor will execute<br />
function printf() with argument “What you entered? Huh?”. But if everything is fine, conditional jump<br />
will not be taken, and another printf() call will be executed, with two arguments: ’You entered %d...’<br />
and value of variable x.<br />
Because second subsequent printf() not needed <strong>to</strong> be executed, there are JMP after (unconditional jump)<br />
it will pass control <strong>to</strong> the place after second printf() and be<strong>for</strong>e XOR EAX, EAX instruction, which implement<br />
return 0.<br />
So, it can be said that most often, comparing some value with another is implemented by CMP/Jcc<br />
instructions pair, where cc is condition code. CMP comparing two values and set processor flags 14 . Jcc check<br />
flags needed <strong>to</strong> be checked and pass control <strong>to</strong> mentioned address (or not pass).<br />
13 MSDN: scanf, wscanf<br />
14 About x86 flags, see also: http://en.wikipedia.org/wiki/FLAGS_register_(computing).<br />
14
But in fact, this could be perceived paradoxial, but CMP instruction is in fact SUB (subtract). All arithmetic<br />
instructions set processor flags <strong>to</strong>o, not only CMP. If we compare 1 and 1, 1-1 will result zero, ZF flag will<br />
be set (meaning that last result was zero). There are no any other circumstances when it’s possible except<br />
when operands are equal. JNE checks only ZF flag and jumping only if it is not set. JNE is in fact a synonym<br />
of JNZ (Jump if Not Zero) instruction. Assembler translating both JNE and JNZ instructions in<strong>to</strong> one single<br />
opcode. So, CMP can be replaced <strong>to</strong> SUB and almost everything will be fine, but the difference is in that SUB<br />
alter value at first operand. CMP is “SUB without saving result”.<br />
Code generated by GCC 4.4.1 in Linux is almost the same, except differences we already considered.<br />
15
1.5 Passing arguments via stack<br />
Now we figured out that caller function passing arguments <strong>to</strong> callee via stack. But how callee 15 access them?<br />
#include <br />
int f (int a, int b, int c)<br />
{<br />
return a*b+c;<br />
};<br />
int main()<br />
{<br />
printf ("%d\n", f(1, 2, 3));<br />
return 0;<br />
};<br />
What we have after compilation (MSVC 2010 Express):<br />
_TEXT SEGMENT<br />
_a$ = 8 ; size = 4<br />
_b$ = 12 ; size = 4<br />
_c$ = 16 ; size = 4<br />
_f PROC<br />
; File c:\...\1.c<br />
; Line 4<br />
push ebp<br />
mov ebp, esp<br />
; Line 5<br />
mov eax, DWORD PTR _a$[ebp]<br />
imul eax, DWORD PTR _b$[ebp]<br />
add eax, DWORD PTR _c$[ebp]<br />
; Line 6<br />
pop ebp<br />
ret 0<br />
_f ENDP<br />
_main PROC<br />
; Line 9<br />
push ebp<br />
mov ebp, esp<br />
; Line 10<br />
push 3<br />
push 2<br />
push 1<br />
call _f<br />
add esp, 12 ; 0000000cH<br />
push eax<br />
push OFFSET $SG2463 ; ’%d’, 0aH, 00H<br />
call _printf<br />
add esp, 8<br />
; Line 11<br />
xor eax, eax<br />
; Line 12<br />
15 function being called<br />
16
pop ebp<br />
ret 0<br />
_main ENDP<br />
What we see is that 3 numbers are pushing <strong>to</strong> stack in function _main and f(int,int,int) is called<br />
then. Argument access inside f() is organized with help of macros like: _a$ = 8, in the same way as local<br />
variables accessed, but difference in that these offsets are positive (addressed with plus sign). So, adding _a$<br />
macro <strong>to</strong> EBP register value, outer side of stack frame is addressed.<br />
Then a value is s<strong>to</strong>red in<strong>to</strong> EAX. After IMUL instruction execution, EAX value is product 16 of EAX and what<br />
is s<strong>to</strong>red in _b. After IMUL execution, ADD is summing EAX and what is s<strong>to</strong>red in _c. Value in EAX is not<br />
needed <strong>to</strong> be moved: it is already in place it need. Now return <strong>to</strong> caller — it will take value from EAX and<br />
used it as printf() argument. Let’s compile the same in GCC 4.4.1:<br />
public f<br />
f proc near ; CODE XREF: main+20<br />
arg_0 = dword ptr 8<br />
arg_4 = dword ptr 0Ch<br />
arg_8 = dword ptr 10h<br />
push ebp<br />
mov ebp, esp<br />
mov eax, [ebp+arg_0]<br />
imul eax, [ebp+arg_4]<br />
add eax, [ebp+arg_8]<br />
pop ebp<br />
retn<br />
f endp<br />
public main<br />
main proc near ; DATA XREF: _start+17<br />
var_10 = dword ptr -10h<br />
var_C = dword ptr -0Ch<br />
var_8 = dword ptr -8<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 10h ; char *<br />
mov [esp+10h+var_8], 3<br />
mov [esp+10h+var_C], 2<br />
mov [esp+10h+var_10], 1<br />
call f<br />
mov edx, offset aD ; "%d\n"<br />
mov [esp+10h+var_C], eax<br />
mov [esp+10h+var_10], edx<br />
call _printf<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
Almost the same result.<br />
16 result of multiplication<br />
17
1.6 One more word about results returning.<br />
Function execution result is often returned 17 in EAX register. If it’s byte type — then in lowest register EAX<br />
part — AL. If function returning float number, FPU register ST(0) will be used instead.<br />
That is why old C compilers can’t create functions capable of returning something not fitting in one<br />
register (usually type int), but if one need it, one should return in<strong>for</strong>mation via pointers passed in function<br />
arguments. Now it is possible, <strong>to</strong> return, let’s say, whole structure, but its still not very popular. If function<br />
should return a large structure, caller must allocate it and pass pointer <strong>to</strong> it via first argument, hiddenly<br />
and transparently <strong>for</strong> programmer. That is almost the same as <strong>to</strong> pass pointer in first argument manually,<br />
but compiler hide this.<br />
Small example:<br />
struct s<br />
{<br />
int a;<br />
int b;<br />
int c;<br />
};<br />
struct s get_some_values (int a)<br />
{<br />
struct s rt;<br />
};<br />
rt.a=a+1;<br />
rt.b=a+2;<br />
rt.c=a+3;<br />
return rt;<br />
... what we got (MSVC 2010 /Ox):<br />
$T3853 = 8 ; size = 4<br />
_a$ = 12 ; size = 4<br />
?get_some_values@@YA?AUs@@H@Z PROC ; get_some_values<br />
mov ecx, DWORD PTR _a$[esp-4]<br />
mov eax, DWORD PTR $T3853[esp-4]<br />
lea edx, DWORD PTR [ecx+1]<br />
mov DWORD PTR [eax], edx<br />
lea edx, DWORD PTR [ecx+2]<br />
add ecx, 3<br />
mov DWORD PTR [eax+4], edx<br />
mov DWORD PTR [eax+8], ecx<br />
ret 0<br />
?get_some_values@@YA?AUs@@H@Z ENDP ; get_some_values<br />
Macro name <strong>for</strong> internal variable passing pointer <strong>to</strong> structure is $T3853 here.<br />
17 See also: MSDN: Return Values (C++)<br />
18
1.7 Conditional jumps<br />
Now about conditional jumps.<br />
void f_signed (int a, int b)<br />
{<br />
if (a>b)<br />
printf ("a>b\n");<br />
if (a==b)<br />
printf ("a==b\n");<br />
if (ab)<br />
printf ("a>b\n");<br />
if (a==b)<br />
printf ("a==b\n");<br />
if (ab’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
$LN3@f_signed:<br />
mov ecx, DWORD PTR _a$[ebp]<br />
cmp ecx, DWORD PTR _b$[ebp]<br />
jne SHORT $LN2@f_signed<br />
push OFFSET $SG739 ; ’a==b’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
$LN2@f_signed:<br />
mov edx, DWORD PTR _a$[ebp]<br />
cmp edx, DWORD PTR _b$[ebp]<br />
jge SHORT $LN4@f_signed<br />
19
push OFFSET $SG741 ; ’a
and esp, -16<br />
sub esp, 16<br />
mov DWORD PTR [esp+4], 2<br />
mov DWORD PTR [esp], 1<br />
call f_signed<br />
mov DWORD PTR [esp+4], 2<br />
mov DWORD PTR [esp], 1<br />
call f_unsigned<br />
mov eax, 0<br />
leave<br />
ret<br />
21
1.8 switch()/case/default<br />
1.8.1 Few number of cases<br />
void f (int a)<br />
{<br />
switch (a)<br />
{<br />
case 0: printf ("zero\n"); break;<br />
case 1: printf ("one\n"); break;<br />
case 2: printf ("two\n"); break;<br />
default: printf ("something unknown\n"); break;<br />
};<br />
};<br />
Result (MSVC 2010):<br />
tv64 = -4 ; size = 4<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
mov eax, DWORD PTR _a$[ebp]<br />
mov DWORD PTR tv64[ebp], eax<br />
cmp DWORD PTR tv64[ebp], 0<br />
je SHORT $LN4@f<br />
cmp DWORD PTR tv64[ebp], 1<br />
je SHORT $LN3@f<br />
cmp DWORD PTR tv64[ebp], 2<br />
je SHORT $LN2@f<br />
jmp SHORT $LN1@f<br />
$LN4@f:<br />
push OFFSET $SG739 ; ’zero’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN7@f<br />
$LN3@f:<br />
push OFFSET $SG741 ; ’one’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN7@f<br />
$LN2@f:<br />
push OFFSET $SG743 ; ’two’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN7@f<br />
$LN1@f:<br />
push OFFSET $SG745 ; ’something unknown’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
$LN7@f:<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
22
_f ENDP<br />
Nothing specially new <strong>to</strong> us, with the exception that compiler moving input variable a <strong>to</strong> temporary local<br />
variable tv64.<br />
If <strong>to</strong> compile the same in GCC 4.4.1, we’ll get alsmost the same, even with maximal optimization turned<br />
on (-O3 option).<br />
Now let’s turn on optimization in MSVC (/Ox): cl 1.c /Fa1.asm /Ox<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
mov eax, DWORD PTR _a$[esp-4]<br />
sub eax, 0<br />
je SHORT $LN4@f<br />
sub eax, 1<br />
je SHORT $LN3@f<br />
sub eax, 1<br />
je SHORT $LN2@f<br />
mov DWORD PTR _a$[esp-4], OFFSET $SG791 ; ’something unknown’, 0aH, 00H<br />
jmp _printf<br />
$LN2@f:<br />
mov DWORD PTR _a$[esp-4], OFFSET $SG789 ; ’two’, 0aH, 00H<br />
jmp _printf<br />
$LN3@f:<br />
mov DWORD PTR _a$[esp-4], OFFSET $SG787 ; ’one’, 0aH, 00H<br />
jmp _printf<br />
$LN4@f:<br />
mov DWORD PTR _a$[esp-4], OFFSET $SG785 ; ’zero’, 0aH, 00H<br />
jmp _printf<br />
_f ENDP<br />
Here we can see even dirty hacks.<br />
First: a is placed in<strong>to</strong> EAX and 0 subtracted from it. Sounds absurdly, but it may need <strong>to</strong> check if 0 was in<br />
EAX be<strong>for</strong>e? If yes, flag ZF will be set (this also mean that subtracting from zero is zero) and first conditional<br />
jump JE (Jump if Equal or synonym JZ — Jump if Zero) will be triggered and control flow passed <strong>to</strong> $LN4@f<br />
label, where ’zero’ message is begin printed. If first jump was not triggered, 1 subtracted from input value<br />
and if at some stage 0 will be resulted, corresponding jump will be triggered.<br />
And if no jump triggered at all, control flow passed <strong>to</strong> printf() with argument ’something unknown’.<br />
Second: we see unusual thing <strong>for</strong> us: string pointer is placed in<strong>to</strong> a variable, and then printf() is called<br />
not via CALL, but via JMP. This could be explained simply. Caller pushing <strong>to</strong> stack some value and via<br />
CALL calling our function. CALL itself pushing returning address <strong>to</strong> stack and do unconditional jump <strong>to</strong> our<br />
function address. Our function at any place of its execution (since it do not contain any instruction moving<br />
stack pointer) has the following stack layout:<br />
∙ ESP — pointing <strong>to</strong> return address<br />
∙ ESP+4 — pointing <strong>to</strong> a variable<br />
On the other side, when we need <strong>to</strong> call printf() here, we need exactly the same stack layout, except of<br />
first printf() argument pointing <strong>to</strong> string. And that is what our code does.<br />
It replaces function’s first argument <strong>to</strong> different and jumping <strong>to</strong> printf(), as if not our function f() was<br />
called firstly, but immediately printf(). printf() printing some string <strong>to</strong> stdout and then execute RET<br />
instruction, which POPping return address from stack and control flow is returned not <strong>to</strong> f(), but <strong>to</strong> f()’s<br />
callee, escaping f().<br />
All it’s possible because printf() is called right at the end of f() in any case. In some way, it’s all<br />
similar <strong>to</strong> longjmp() 18 . And of course, it’s all done <strong>for</strong> the sake of speed.<br />
18 http://en.wikipedia.org/wiki/Setjmp.h<br />
23
1.8.2 A lot of cases<br />
If switch() statement contain a lot of case’s, it is not very handy <strong>for</strong> compiler <strong>to</strong> emit <strong>to</strong>o large code with<br />
a lot JE/JNE instructions.<br />
void f (int a)<br />
{<br />
switch (a)<br />
{<br />
case 0: printf ("zero\n"); break;<br />
case 1: printf ("one\n"); break;<br />
case 2: printf ("two\n"); break;<br />
case 3: printf ("three\n"); break;<br />
case 4: printf ("four\n"); break;<br />
default: printf ("something unknown\n"); break;<br />
};<br />
};<br />
We got (MSVC 2010):<br />
tv64 = -4 ; size = 4<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
mov eax, DWORD PTR _a$[ebp]<br />
mov DWORD PTR tv64[ebp], eax<br />
cmp DWORD PTR tv64[ebp], 4<br />
ja SHORT $LN1@f<br />
mov ecx, DWORD PTR tv64[ebp]<br />
jmp DWORD PTR $LN11@f[ecx*4]<br />
$LN6@f:<br />
push OFFSET $SG739 ; ’zero’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN9@f<br />
$LN5@f:<br />
push OFFSET $SG741 ; ’one’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN9@f<br />
$LN4@f:<br />
push OFFSET $SG743 ; ’two’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN9@f<br />
$LN3@f:<br />
push OFFSET $SG745 ; ’three’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
jmp SHORT $LN9@f<br />
$LN2@f:<br />
push OFFSET $SG747 ; ’four’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
24
jmp SHORT $LN9@f<br />
$LN1@f:<br />
push OFFSET $SG749 ; ’something unknown’, 0aH, 00H<br />
call _printf<br />
add esp, 4<br />
$LN9@f:<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
npad 2<br />
$LN11@f:<br />
DD $LN6@f ; 0<br />
DD $LN5@f ; 1<br />
DD $LN4@f ; 2<br />
DD $LN3@f ; 3<br />
DD $LN2@f ; 4<br />
_f ENDP<br />
OK, what we see here is: there are a set of printf() calls with various arguments. All them has not<br />
only addresses in process memory, but also internal symbolic labels generated by compiler. All these labels<br />
are also places in<strong>to</strong> $LN11@f internal table.<br />
At the function beginning, if a is greater than 4, control flow is passed <strong>to</strong> label $LN1@f, where printf()<br />
with argument ’something unknown’ is called.<br />
And if a value is less or equals <strong>to</strong> 4, let’s multiply it by 4 and add $LN1@f table address. That is how<br />
address inside of table is constructed, pointing exactly <strong>to</strong> the element we need. For example, let’s say a is<br />
equal <strong>to</strong> 2. 2*4 = 8 (all table elements are addresses within 32-bit process, that is why all elements contain<br />
4 bytes). Address of $LN11@f table + 8 = it will be table element where $LN4@f label is s<strong>to</strong>red. JMP fetch<br />
$LN4@f address from the table and jump <strong>to</strong> it.<br />
Then corresponding printf() is called with argument ’two’. Literally, jmp DWORD PTR $LN11@f[ecx*4]<br />
instruction mean jump <strong>to</strong> DWORD, which is s<strong>to</strong>red at address $LN11@f + ecx * 4.<br />
npad 2.3 is assembler macro, aligning next label so that it will be s<strong>to</strong>red at address aligned by 4 bytes<br />
(or 16). This is very suitable <strong>for</strong> processor, because it can fetch 32-bit values from memory through memory<br />
bus, cache memory, etc, in much effective way if it is aligned.<br />
Let’s see what GCC 4.4.1 generate:<br />
public f<br />
f proc near ; CODE XREF: main+10<br />
var_18 = dword ptr -18h<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 18h ; char *<br />
cmp [ebp+arg_0], 4<br />
ja short loc_8048444<br />
mov eax, [ebp+arg_0]<br />
shl eax, 2<br />
mov eax, ds:off_804855C[eax]<br />
jmp eax<br />
loc_80483FE: ; DATA XREF: .rodata:off_804855C<br />
mov [esp+18h+var_18], offset aZero ; "zero"<br />
call _puts<br />
jmp short locret_8048450<br />
25
loc_804840C: ; DATA XREF: .rodata:08048560<br />
mov [esp+18h+var_18], offset aOne ; "one"<br />
call _puts<br />
jmp short locret_8048450<br />
loc_804841A: ; DATA XREF: .rodata:08048564<br />
mov [esp+18h+var_18], offset aTwo ; "two"<br />
call _puts<br />
jmp short locret_8048450<br />
loc_8048428: ; DATA XREF: .rodata:08048568<br />
mov [esp+18h+var_18], offset aThree ; "three"<br />
call _puts<br />
jmp short locret_8048450<br />
loc_8048436: ; DATA XREF: .rodata:0804856C<br />
mov [esp+18h+var_18], offset aFour ; "four"<br />
call _puts<br />
jmp short locret_8048450<br />
loc_8048444: ; CODE XREF: f+A<br />
mov [esp+18h+var_18], offset aSomethingUnkno ; "something unknown"<br />
call _puts<br />
locret_8048450: ; CODE XREF: f+26<br />
; f+34...<br />
leave<br />
retn<br />
f endp<br />
off_804855C dd offset loc_80483FE ; DATA XREF: f+12<br />
dd offset loc_804840C<br />
dd offset loc_804841A<br />
dd offset loc_8048428<br />
dd offset loc_8048436<br />
It is almost the same, except little nuance: argument arg_0 is multiplied by 4 with by shifting it <strong>to</strong> left<br />
by 2 bits (it is almost the same as multiplication by 4) 1.14.3. Then label address is taken from off_804855C<br />
array, address calculated and s<strong>to</strong>red in<strong>to</strong> EAX, then JMP EAX do actual jump.<br />
26
1.9 Loops<br />
There is a special LOOP instruction in x86 instruction set, it is checking ECX register value and if it is not zero,<br />
do ECX decrement 19 and pass control flow <strong>to</strong> label mentioned in LOOP operand. Probably, this instruction<br />
is not very handy, so, I didn’t ever see any modern compiler emitting it au<strong>to</strong>matically. So, if you see the<br />
instruction somewhere in code, it is most likely this is hand-written piece of assembler code.<br />
By the way, as home exercise, you could try <strong>to</strong> explain, why this instruction is not very handy.<br />
In C/C++ loops are constructed using <strong>for</strong>(), while(), do/while() statements.<br />
Let’s start with <strong>for</strong>().<br />
This statement defines loop initialization (set loop counter <strong>to</strong> initial value), loop condition (is counter is<br />
bigger than a limit?), what is done at each iteration (increment/decrement) and of course loop body.<br />
<strong>for</strong> (initialization; condition; at each iteration)<br />
{<br />
loop_body;<br />
}<br />
So, generated code will be consisted of four parts <strong>to</strong>o.<br />
Let’s start with simple example:<br />
int main()<br />
{<br />
int i;<br />
};<br />
<strong>for</strong> (i=2; i
et 0<br />
_main ENDP<br />
Nothing very special, as we see.<br />
GCC 4.4.1 emitting almost the same code, with small difference:<br />
main proc near ; DATA XREF: _start+17<br />
var_20 = dword ptr -20h<br />
var_4 = dword ptr -4<br />
loc_8048465:<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 20h<br />
mov [esp+20h+var_4], 2 ; i initializing<br />
jmp short loc_8048476<br />
mov eax, [esp+20h+var_4]<br />
mov [esp+20h+var_20], eax<br />
call f<br />
add [esp+20h+var_4], 1 ; i increment<br />
loc_8048476:<br />
cmp [esp+20h+var_4], 9<br />
jle short loc_8048465 ; if i
main proc near ; DATA XREF: _start+17<br />
var_10 = dword ptr -10h<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 10h<br />
mov [esp+10h+var_10], 2<br />
call f<br />
mov [esp+10h+var_10], 3<br />
call f<br />
mov [esp+10h+var_10], 4<br />
call f<br />
mov [esp+10h+var_10], 5<br />
call f<br />
mov [esp+10h+var_10], 6<br />
call f<br />
mov [esp+10h+var_10], 7<br />
call f<br />
mov [esp+10h+var_10], 8<br />
call f<br />
mov [esp+10h+var_10], 9<br />
call f<br />
xor eax, eax<br />
leave<br />
retn<br />
main endp<br />
Huh, GCC just unwind our loop.<br />
Loop unwinding has advantage in these cases when there are not so much iterations and we could<br />
economy some execution speed by removing all loop supporting instructions. On the other side, resulting<br />
code is obviously larger.<br />
OK, let’s increase maximal value of i <strong>to</strong> 100 and try again. GCC resulting:<br />
public main<br />
main proc near<br />
var_20 = dword ptr -20h<br />
loc_80484D0:<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
push ebx<br />
mov ebx, 2 ; i=2<br />
sub esp, 1Ch<br />
nop ; aligning label loc_80484D0 (loop body begin) by 16-byte border<br />
mov [esp+20h+var_20], ebx ; pass i as first argument <strong>to</strong> f()<br />
add ebx, 1 ; i++<br />
call f<br />
cmp ebx, 64h ; i==100?<br />
jnz short loc_80484D0 ; if not, continue<br />
29
add esp, 1Ch<br />
xor eax, eax ; return 0<br />
pop ebx<br />
mov esp, ebp<br />
pop ebp<br />
retn<br />
main endp<br />
It’s quite similar <strong>to</strong> what MSVC 2010 with optimization (/Ox) produce. With the exception that EBX<br />
register will be fixed <strong>to</strong> i variable. GCC is sure this register will not be modified inside of f() function, and<br />
if it will, it will be saved at the function prologue and res<strong>to</strong>red at epilogue, just like here in main().<br />
30
1.10 strlen()<br />
Now let’s talk about loops one more time. Often, strlen() function 20 is implemented using while()<br />
statement. Here is how it’s done in MSVC standard libraries:<br />
int strlen (const char * str)<br />
{<br />
const char *eos = str;<br />
}<br />
while( *eos++ ) ;<br />
return( eos - str - 1 );<br />
Let’s compile:<br />
_eos$ = -4 ; size = 4<br />
_str$ = 8 ; size = 4<br />
_strlen PROC<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
mov eax, DWORD PTR _str$[ebp] ; place pointer <strong>to</strong> string from str<br />
mov DWORD PTR _eos$[ebp], eax ; place it <strong>to</strong> local varuable eos<br />
$LN2@strlen_:<br />
mov ecx, DWORD PTR _eos$[ebp] ; ECX=eos<br />
; take 8-bit byte from address in ECX and place it as 32-bit value <strong>to</strong> EDX with sign<br />
extension<br />
movsx edx, BYTE PTR [ecx]<br />
mov eax, DWORD PTR _eos$[ebp] ; EAX=eos<br />
add eax, 1 ; increment EAX<br />
mov DWORD PTR _eos$[ebp], eax ; place EAX back <strong>to</strong> eos<br />
test edx, edx ; EDX is zero?<br />
je SHORT $LN1@strlen_ ; yes, then finish loop<br />
jmp SHORT $LN2@strlen_ ; continue loop<br />
$LN1@strlen_:<br />
; here we calculate the difference between two pointers<br />
mov eax, DWORD PTR _eos$[ebp]<br />
sub eax, DWORD PTR _str$[ebp]<br />
sub eax, 1 ; subtract 1 and return result<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_strlen_ ENDP<br />
Two new instructions here: MOVSX 1.10 and TEST.<br />
About first: MOVSX 1.10 is intended <strong>to</strong> take byte from some place and s<strong>to</strong>re value in 32-bit register.<br />
MOVSX 1.10 meaning MOV with Sign-Extent. Rest bits starting at 8th till 31th MOVSX 1.10 will set <strong>to</strong> 1 if<br />
source byte in memory has minus sign or <strong>to</strong> 0 if plus.<br />
And here is why all this.<br />
20 counting characters in string in C language<br />
31
C/C++ standard defines char type as signed. If we have two values, one is char and another is int,<br />
(int is signed <strong>to</strong>o), and if first value contain -2 (it is coded as 0xFE) and we just copying this byte in<strong>to</strong> int<br />
container, there will be 0x000000FE, and this, from the point of signed int view is 254, but not -2. In signed<br />
int, -2 is coded as 0xFFFFFFFE. So if we need <strong>to</strong> transfer 0xFE value from variable of char type <strong>to</strong> int, we<br />
need <strong>to</strong> identify its sign and extend it. That is what MOVSX 1.10 does.<br />
See also in section “Signed number representations” 2.4.<br />
I’m not sure if compiler need <strong>to</strong> s<strong>to</strong>re char variable in EDX, it could take 8-bit register part (let’s say DL).<br />
But probably, compiler’s register alloca<strong>to</strong>r 21 works like that.<br />
Then we see TEST EDX, EDX. About TEST instruction, read more in section about bit fields 1.14. But<br />
here, this instruction just checking EDX value, if it is equals <strong>to</strong> 0.<br />
Let’s try GCC 4.4.1:<br />
public strlen<br />
strlen proc near<br />
eos = dword ptr -4<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 10h<br />
mov eax, [ebp+arg_0]<br />
mov [ebp+eos], eax<br />
loc_80483F0:<br />
mov eax, [ebp+eos]<br />
movzx eax, byte ptr [eax]<br />
test al, al<br />
setnz al<br />
add [ebp+eos], 1<br />
test al, al<br />
jnz short loc_80483F0<br />
mov edx, [ebp+eos]<br />
mov eax, [ebp+arg_0]<br />
mov ecx, edx<br />
sub ecx, eax<br />
mov eax, ecx<br />
sub<br />
leave<br />
retn<br />
eax, 1<br />
strlen endp<br />
The result almost the same as MSVC did, but here we see MOVZX isntead of MOVSX 1.10. MOVZX mean<br />
MOV with Zero-Extent. This instruction place 8-bit or 16-bit value in<strong>to</strong> 32-bit register and set the rest bits<br />
<strong>to</strong> zero. In fact, this instruction is handy only because it is able <strong>to</strong> replace two instructions at once: xor<br />
eax, eax / mov al, [...].<br />
On other hand, it is obvious <strong>to</strong> us that compiler could produce that code: mov al, byte ptr [eax] /<br />
test al, al — it is almost the same, however, highest EAX register bits will contain random noise. But let’s<br />
think it is compiler’s drawback — it can’t produce more understandable code. Strictly speaking, compiler<br />
is not obliged <strong>to</strong> emit understandable (<strong>to</strong> humans) code at all.<br />
Next new instruction <strong>for</strong> us is SETNZ. Here, if AL contain not zero, test al, al will set zero <strong>to</strong> ZF flag,<br />
but SETNZ, if ZF==0 (NZ mean not zero) will set 1 <strong>to</strong> AL. Speaking in natural language, if AL is not zero,<br />
let’s jump <strong>to</strong> loc_80483F0. Compiler emitted slightly redundant code, but let’s not <strong>for</strong>get that optimization<br />
21 compiler’s function assigning local variables <strong>to</strong> CPU registers<br />
32
is turned off.<br />
Now let’s compile all this in MSVC 2010, with optimization turned on (/Ox):<br />
_str$ = 8 ; size = 4<br />
_strlen PROC<br />
mov ecx, DWORD PTR _str$[esp-4] ; ECX -> pointer <strong>to</strong> the string<br />
mov eax, ecx ; move <strong>to</strong> EAX<br />
$LL2@strlen_:<br />
mov dl, BYTE PTR [eax] ; DL = *EAX<br />
inc eax ; EAX++<br />
test dl, dl ; DL==0?<br />
jne SHORT $LL2@strlen_ ; no, continue loop<br />
sub eax, ecx ; calculate pointers difference<br />
dec eax ; decrement EAX<br />
ret 0<br />
_strlen_ ENDP<br />
Now it’s all simpler. But it is needless <strong>to</strong> say that compiler could use registers such efficiently only in<br />
small functions with small number of local variables.<br />
INC/DEC — are increment/decrement instruction, in other words: add 1 <strong>to</strong> variable or subtract.<br />
Let’s check GCC 4.4.1 with optimization turned on (-O3 key):<br />
public strlen<br />
strlen proc near<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
mov ecx, [ebp+arg_0]<br />
mov eax, ecx<br />
loc_8048418:<br />
movzx edx, byte ptr [eax]<br />
add eax, 1<br />
test dl, dl<br />
jnz short loc_8048418<br />
not ecx<br />
add eax, ecx<br />
pop<br />
retn<br />
ebp<br />
strlen endp<br />
Here GCC is almost the same as MSVC, except of MOVZX presence.<br />
However, MOVZX could be replaced here <strong>to</strong> mov dl, byte ptr [eax].<br />
Probably, it is simpler <strong>for</strong> GCC compiler’s code genera<strong>to</strong>r <strong>to</strong> remember that whole register is allocated<br />
<strong>for</strong> char variable and it can be sure that highest bits will not contain noise at any point.<br />
After, we also see new instruction NOT. This instruction inverts all bits in operand. It can be said, it<br />
is synonym <strong>to</strong> XOR ECX, 0ffffffffh instruction. NOT and following ADD calculating pointer difference and<br />
subtracting 1. At the beginning ECX, where pointer <strong>to</strong> str is s<strong>to</strong>red, inverted and 1 is subtracted from it.<br />
See also: “Signed number representations” 2.4.<br />
In other words, at the end of function, just after loop body, these operations are executed:<br />
ecx=str;<br />
eax=eos;<br />
ecx=(-ecx)-1;<br />
33
eax=eax+ecx<br />
return eax<br />
... and this is equivalent <strong>to</strong>:<br />
ecx=str;<br />
eax=eos;<br />
eax=eax-ecx;<br />
eax=eax-1;<br />
return eax<br />
Why GCC decided it would be better? I cannot be sure. But I’m assure that both variants are equivalent<br />
in efficiency sense.<br />
34
1.11 Division by 9<br />
Very simple function:<br />
int f(int a)<br />
{<br />
return a/9;<br />
};<br />
Is compiled in a very predictable way:<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _a$[ebp]<br />
cdq ; sign extend EAX <strong>to</strong> EDX:EAX<br />
mov ecx, 9<br />
idiv ecx<br />
pop ebp<br />
ret 0<br />
_f ENDP<br />
IDIV divides 64-bit number s<strong>to</strong>red in EDX:EAX register pair by value in ECX register. As a result, EAX<br />
will contain quotient 22 , and EDX — remainder. Result is returning from f() function in EAX register, so, that<br />
value is not moved anymore after division operation, it is in right place already. Because IDIV require value<br />
in EDX:EAX register pair, CDQ instruction (be<strong>for</strong>e IDIV) extending EAX value <strong>to</strong> 64-bit value taking value sign<br />
in<strong>to</strong> account, just as MOVSX 1.10 does. If we turn optimization on (/Ox), we got:<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
mov ecx, DWORD PTR _a$[esp-4]<br />
mov eax, 954437177 ; 38e38e39H<br />
imul ecx<br />
sar edx, 1<br />
mov eax, edx<br />
shr eax, 31 ; 0000001fH<br />
add eax, edx<br />
ret 0<br />
_f ENDP<br />
This is — division using multiplication. Multiplication operation working much faster. And it is possible<br />
<strong>to</strong> use that trick 23 <strong>to</strong> produce a code which is equivalent and faster. GCC 4.4.1 even without optimization<br />
turned on, generate almost the same code as MSVC with optimization turned on:<br />
public f<br />
f proc near<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
mov ecx, [ebp+arg_0]<br />
mov edx, 954437177<br />
22 result of division<br />
23 More about division by multiplication: MSDN: Integer division by constants, http://www.nynaeve.net/?p=115<br />
35
mov eax, ecx<br />
imul edx<br />
sar edx, 1<br />
mov eax, ecx<br />
sar eax, 1Fh<br />
mov ecx, edx<br />
sub ecx, eax<br />
mov eax, ecx<br />
pop ebp<br />
retn<br />
f endp<br />
36
1.12 Work with FPU<br />
FPU (Floating-point unit) — is a device within main CPU specially designed <strong>to</strong> work with floating point<br />
numbers.<br />
It was called coprocessor in past. It looks like programmable calcula<strong>to</strong>r in some way and stay aside of<br />
main processor.<br />
It is worth <strong>to</strong> study stack machines 24 be<strong>for</strong>e FPU studying, or learn Forth language basics 25 .<br />
It is interesting <strong>to</strong> know that in past (be<strong>for</strong>e 80486 CPU) coprocessor was a separate chip and it was not<br />
always settled on motherboard. It was possible <strong>to</strong> buy it separately and install.<br />
From the 80486 CPU, FPU is always present in it.<br />
FPU has a stack capable <strong>to</strong> hold 8 80-bit registers, each register can hold a number in IEEE 754 26 <strong>for</strong>mat.<br />
C/C++ language offer at least two floating number types, float (single-precision 27 , 32 bits) 28 and double<br />
(double-precision 29 , 64 bits).<br />
GCC also supports long double type (extended precision 30 , 80 bit) but MSVC is not.<br />
float type require the same number of bits as int type in 32-bit environment, but number representation<br />
is completely different.<br />
Number consisting of sign, significand (also called fraction) and exponent.<br />
Function having float or double among argument list is getting that value via stack. If function is<br />
returning float or double value, it leave that value in ST(0) register — at <strong>to</strong>p of FPU stack.<br />
1.12.1 Simple example<br />
Let’s consider simple example:<br />
double f (double a, double b)<br />
{<br />
return a/3.14 + b*4.1;<br />
};<br />
Compile it in MSVC 2010:<br />
CONST SEGMENT<br />
__real@4010666666666666 DQ 04010666666666666r ; 4.1<br />
CONST ENDS<br />
CONST SEGMENT<br />
__real@40091eb851eb851f DQ 040091eb851eb851fr ; 3.14<br />
CONST ENDS<br />
_TEXT SEGMENT<br />
_a$ = 8 ; size = 8<br />
_b$ = 16 ; size = 8<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
fld QWORD PTR _a$[ebp]<br />
; current stack state: ST(0) = _a<br />
fdiv QWORD PTR __real@40091eb851eb851f<br />
24 http://en.wikipedia.org/wiki/Stack_machine<br />
25 http://en.wikipedia.org/wiki/Forth_(programming_language)<br />
26 http://en.wikipedia.org/wiki/IEEE_754-2008<br />
27 http://en.wikipedia.org/wiki/Single-precision_floating-point_<strong>for</strong>mat<br />
28 single precision float numbers <strong>for</strong>mat is also addressed in Working with the float type as with a structure 1.15.6 section<br />
29 http://en.wikipedia.org/wiki/Double-precision_floating-point_<strong>for</strong>mat<br />
30 http://en.wikipedia.org/wiki/Extended_precision<br />
37
; current stack state: ST(0) = result of _a divided by 3.13<br />
fld QWORD PTR _b$[ebp]<br />
; current stack state: ST(0) = _b; ST(1) = result of _a divided by 3.13<br />
fmul QWORD PTR __real@4010666666666666<br />
; current stack state: ST(0) = result of _b * 4.1; ST(1) = result of _a divided by 3.13<br />
faddp ST(1), ST(0)<br />
; current stack state: ST(0) = result of addition<br />
pop ebp<br />
ret 0<br />
_f ENDP<br />
FLD takes 8 bytes from stack and load the number in<strong>to</strong> ST(0) register, au<strong>to</strong>matically converting it in<strong>to</strong><br />
internal 80-bit <strong>for</strong>mat extended precision).<br />
FDIV divide value in register ST(0) by number s<strong>to</strong>ring at address __real@40091eb851eb851f — 3.14<br />
value is coded there. Assembler syntax missing floating point numbers, so, what we see here is hexadecimal<br />
representation of 3.14 number in 64-bit IEEE 754 encoded.<br />
After FDIV execution, ST(0) will hold quotient 31 .<br />
By the way, there are also FDIVP instruction, which divide ST(1) by ST(0), popping both these values<br />
from stack and then pushing result. If you know Forth language 32 , you will quickly understand that this is<br />
stack machine 33 .<br />
Next FLD instruction pushing b value in<strong>to</strong> stack.<br />
After that, quotient is placed <strong>to</strong> ST(1), and ST(0) will hold b value.<br />
Next FMUL instruction do multiplication: b from ST(0) register by value at __real@4010666666666666<br />
(4.1 number is there) and leaves result in ST(0).<br />
Very last FADDP instruction adds two values at <strong>to</strong>p of stack, placing result at ST(1) register and then<br />
popping value at ST(1), hereby leaving result at <strong>to</strong>p of stack in ST(0).<br />
The function must return result in ST(0) register, so, after FADDP there are no any other code except of<br />
function epilogue.<br />
GCC 4.4.1 (with -O3 option) emitting the same code, however, slightly different:<br />
public f<br />
f proc near<br />
arg_0 = qword ptr 8<br />
arg_8 = qword ptr 10h<br />
push ebp<br />
fld ds:dbl_8048608 ; 3.14<br />
; stack state now: ST(0) = 3.13<br />
mov ebp, esp<br />
fdivr [ebp+arg_0]<br />
31 division result<br />
32 http://en.wikipedia.org/wiki/Forth_(programming_language)<br />
33 http://en.wikipedia.org/wiki/Stack_machine<br />
38
; stack state now: ST(0) = result of division<br />
fld ds:dbl_8048610 ; 4.1<br />
; stack state now: ST(0) = 4.1, ST(1) = result of division<br />
fmul [ebp+arg_8]<br />
; stack state now: ST(0) = result of multiplication, ST(1) = result of division<br />
pop ebp<br />
faddp st(1), st<br />
; stack state now: ST(0) = result of addition<br />
retn<br />
f endp<br />
The difference is that, first of all, 3.14 is pushing <strong>to</strong> stack (in<strong>to</strong> ST(0)), and then value in arg_0 is dividing<br />
by what is in ST(0) register.<br />
FDIVR meaning Reverse Divide — <strong>to</strong> divide with divisor and dividend swapped with each other. There<br />
are no such instruction <strong>for</strong> multiplication, because multiplication is commutative operation, so we have just<br />
FMUL without its -R counterpart.<br />
FADDP adding two values but also popping one value from stack. After that operation, ST(0) will contain<br />
sum.<br />
This piece of disassembled code was produced using IDA 5 which named ST(0) register as ST <strong>for</strong> short.<br />
1.12.2 Passing floating point number via arguments<br />
int main ()<br />
{<br />
printf ("32.01 ^ 1.54 = %lf\n", pow (32.01,1.54));<br />
}<br />
return 0;<br />
Let’s see what we got in (MSVC 2010):<br />
CONST SEGMENT<br />
__real@40400147ae147ae1 DQ 040400147ae147ae1r ; 32.01<br />
__real@3ff8a3d70a3d70a4 DQ 03ff8a3d70a3d70a4r ; 1.54<br />
CONST ENDS<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 8 ; allocate place <strong>for</strong> the first variable<br />
fld QWORD PTR __real@3ff8a3d70a3d70a4<br />
fstp QWORD PTR [esp]<br />
sub esp, 8 ; allocate place <strong>for</strong> the second variable<br />
fld QWORD PTR __real@40400147ae147ae1<br />
fstp QWORD PTR [esp]<br />
call _pow<br />
add esp, 8 ; "return back" place of one variable.<br />
39
; in local stack here 8 bytes still reserved <strong>for</strong> us.<br />
; result now in ST(0)<br />
fstp QWORD PTR [esp] ; move result from ST(0) <strong>to</strong> local stack <strong>for</strong> printf()<br />
push OFFSET $SG2651<br />
call _printf<br />
add esp, 12<br />
xor eax, eax<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
FLD and FSTP are moving variables from/<strong>to</strong> data segment <strong>to</strong> FPU stack. pow() 34 taking both values from<br />
FPU-stack and returning result in ST(0). printf() takes 8 bytes from local stack and interpret them as<br />
double type variable.<br />
1.12.3 Comparison example<br />
Let’s try this:<br />
double d_max (double a, double b)<br />
{<br />
if (a>b)<br />
return a;<br />
};<br />
return b;<br />
Despite simplicity of that function, it will be harder <strong>to</strong> understand how it works.<br />
MSVC 2010 generated:<br />
PUBLIC _d_max<br />
_TEXT SEGMENT<br />
_a$ = 8 ; size = 8<br />
_b$ = 16 ; size = 8<br />
_d_max PROC<br />
push ebp<br />
mov ebp, esp<br />
fld QWORD PTR _b$[ebp]<br />
; current stack state: ST(0) = _b<br />
; compare _b (ST(0)) and _a, and pop register<br />
fcomp QWORD PTR _a$[ebp]<br />
; stack is empty here<br />
fnstsw ax<br />
test ah, 5<br />
jp SHORT $LN1@d_max<br />
; we are here only if a>b<br />
fld QWORD PTR _a$[ebp]<br />
34 standard C function, raises a number <strong>to</strong> the given power<br />
40
jmp SHORT $LN2@d_max<br />
$LN1@d_max:<br />
fld QWORD PTR _b$[ebp]<br />
$LN2@d_max:<br />
pop ebp<br />
ret 0<br />
_d_max ENDP<br />
So, FLD loading _b in<strong>to</strong> ST(0) register.<br />
FCOMP compares ST(0) register state with what is in _a value and set C3/C2/C0 bits in FPU status word<br />
register. This is 16-bit register reflecting current state of FPU.<br />
For now C3/C2/C0 bits are set, but un<strong>for</strong>tunately, CPU be<strong>for</strong>e Intel P6 35 hasn’t any conditional jumps<br />
instructions which are checking these bits. Probably, it is a matter of his<strong>to</strong>ry (remember: FPU was separate<br />
chip in past). Modern CPU starting at Intel P6 has FCOMI/FCOMIP/FUCOMI/FUCOMIP instructions — which<br />
does that same, but modifies CPU flags ZF/PF/CF.<br />
After bits are set, the FCOMP instruction popping one variable from stack. This is what differentiate if<br />
from FCOM, which is just comparing values, leaving the stack at the same state.<br />
FNSTSW copies FPU status word register <strong>to</strong> AX. Bits C3/C2/C0 are placed at positions 14/10/8, they will<br />
be at the same positions in AX registers and all them are placed in high part of AX — AH.<br />
∙ If b>a in our example, then C3/C2/C0 bits will be set as following: 0, 0, 0.<br />
∙ If a>b, then bits will be set: 0, 0, 1.<br />
∙ If a=b, then bits will be set: 1, 0, 0.<br />
After test ah, 5 execution, bits C3 and C1 will be set <strong>to</strong> 0, but at positions 0 и 2 (in AH registers) C0<br />
and C2 bits will be leaved.<br />
Now let’s talk about parity flag. Another notable epoch rudiment.<br />
One common reason <strong>to</strong> test the parity flag actually has nothing <strong>to</strong> do with parity. The<br />
FPU has four condition flags (C0 <strong>to</strong> C3), but they can not be tested directly, and must instead<br />
be first copied <strong>to</strong> the flags register. When this happens, C0 is placed in the carry flag, C2 in<br />
the parity flag and C3 in the zero flag. The C2 flag is set when e.g. incomparable floating<br />
point values (NaN or unsupported <strong>for</strong>mat) are compared with the FUCOM instructions. 36<br />
This flag is <strong>to</strong> be set <strong>to</strong> 1 if ones number is even. And <strong>to</strong> zero if odd.<br />
Thus, PF flag will be set <strong>to</strong> 1 if both C0 and C2 are set <strong>to</strong> zero or both are ones. And then following JP<br />
(jump if PF==1 ) will be triggered. If we remembered values of C3/C2/C0 <strong>for</strong> different cases, we will see that<br />
conditional jump JP will be triggered in two cases: if b>a or a==b (C3 bit is already not considering here,<br />
because it was cleared while execution of test ah, 5 instruction).<br />
It is all simple thereafter. If conditional jump was triggered, FLD will load _b value <strong>to</strong> ST(0), and if it’s<br />
not triggered, _a will be loaded.<br />
But it is not over yet!<br />
Now let’s compile it with MSVC 2010 with optimization option /Ox<br />
_a$ = 8 ; size = 8<br />
_b$ = 16 ; size = 8<br />
_d_max PROC<br />
fld QWORD PTR _b$[esp-4]<br />
35 Intel P6 is Pentium Pro, Pentium II, etc<br />
41
fld QWORD PTR _a$[esp-4]<br />
; current stack state: ST(0) = _a, ST(1) = _b<br />
fcom ST(1) ; compare _a and ST(1) = (_b)<br />
fnstsw ax<br />
test ah, 65 ; 00000041H<br />
jne SHORT $LN5@d_max<br />
fstp ST(1) ; copy ST(0) <strong>to</strong> ST(1) and pop register, leave (_a) on <strong>to</strong>p<br />
; current stack state: ST(0) = _a<br />
ret 0<br />
$LN5@d_max:<br />
fstp ST(0) ; copy ST(0) <strong>to</strong> ST(0) and pop register, leave (_b) on <strong>to</strong>p<br />
; current stack state: ST(0) = _b<br />
ret 0<br />
_d_max ENDP<br />
FCOM is different from FCOMP is that sense that it just comparing values and leave FPU stack in the same<br />
state. Unlike previous example, operands here in <strong>reverse</strong>d order. And that is why result of comparision in<br />
C3/C2/C0 will be different:<br />
∙ If a>b in our example, then C3/C2/C0 bits will be set as: 0, 0, 0.<br />
∙ If b>a, then bits will be set as: 0, 0, 1.<br />
∙ If a=b, then bits will be set as: 1, 0, 0.<br />
It can be said, test ah, 65 instruction just leave two bits — C3 и C0. Both will be zeroes if a>b: in<br />
that case JNE jump will not be triggered. Then FSTP ST(1) is following — this instruction copies ST(0)<br />
value in<strong>to</strong> operand and popping one value from FPU stack. In other words, that instruction copies ST(0)<br />
(where _a value now) in<strong>to</strong> ST(1). After that, two values of _a are at the <strong>to</strong>p of stack now. After that, one<br />
value is popping. After that, ST(0) will contain _a and function is finished.<br />
Conditional jump JNE is triggered in two cases: of b>a or a==b. ST(0) in<strong>to</strong> ST(0) will be copied, it<br />
is just like idle (NOP) operation, then one value is popping from stack and <strong>to</strong>p of stack (ST(0)) will contain<br />
what was in ST(1) be<strong>for</strong>e (that is _b). Then function finishes. That instruction used here probably because<br />
FPU has no instruction <strong>to</strong> pop value from stack and not <strong>to</strong> s<strong>to</strong>re it anywhere.<br />
Well, but it is still not over.<br />
GCC 4.4.1<br />
d_max proc near<br />
b = qword ptr -10h<br />
a = qword ptr -8<br />
a_first_half = dword ptr 8<br />
a_second_half = dword ptr 0Ch<br />
b_first_half = dword ptr 10h<br />
b_second_half = dword ptr 14h<br />
push ebp<br />
mov ebp, esp<br />
42
sub esp, 10h<br />
; put a and b <strong>to</strong> local stack:<br />
mov eax, [ebp+a_first_half]<br />
mov dword ptr [ebp+a], eax<br />
mov eax, [ebp+a_second_half]<br />
mov dword ptr [ebp+a+4], eax<br />
mov eax, [ebp+b_first_half]<br />
mov dword ptr [ebp+b], eax<br />
mov eax, [ebp+b_second_half]<br />
mov dword ptr [ebp+b+4], eax<br />
; load a and b <strong>to</strong> FPU stack:<br />
fld [ebp+a]<br />
fld [ebp+b]<br />
; current stack state: ST(0) - b; ST(1) - a<br />
fxch st(1) ; this instruction swapping ST(1) and ST(0)<br />
; current stack state: ST(0) - a; ST(1) - b<br />
fucompp ; compare a and b and pop two values from stack, i.e., a and b<br />
fnstsw ax ; s<strong>to</strong>re FPU status <strong>to</strong> AX<br />
sahf ; load SF, ZF, AF, PF, and CF flags state from AH<br />
setnbe al ; s<strong>to</strong>re 1 <strong>to</strong> AL if CF=0 and ZF=0<br />
test al, al ; AL==0 ?<br />
jz short loc_8048453 ; yes<br />
fld [ebp+a]<br />
jmp short locret_8048456<br />
loc_8048453:<br />
fld [ebp+b]<br />
locret_8048456:<br />
leave<br />
retn<br />
d_max endp<br />
FUCOMPP — is almost like FCOM, but popping both values from stack and handling “not-a-numbers” differently.<br />
More about not-a-numbers:<br />
FPU is able <strong>to</strong> work with special values which are not-a-numbers or NaNs 37 . These are infinity, result of<br />
dividing by zero, etc. Not-a-numbers can be “quiet” and “signalling”. It is possible <strong>to</strong> continue <strong>to</strong> work with<br />
“quiet” NaNs, but if one try <strong>to</strong> do some operation with “signalling” NaNs — an exception will be raised.<br />
FCOM will raise exception if any operand — NaN. FUCOM will raise exception only if any operand —<br />
signalling NaN (SNaN).<br />
The following instruction is SAHF — this is rare instruction in the code which is not use FPU. 8 bits from<br />
AH is movin<strong>to</strong> in<strong>to</strong> lower 8 bits of CPU flags in the following order: SF:ZF:-:AF:-:PF:-:CF
In other words, fnstsw ax / sahf instruction pair is moving C3/C2/C0 in<strong>to</strong> ZF, PF, CF CPU flags.<br />
Now let’s also remember, what values of C3/C2/C0 bits will be set:<br />
∙ If a is greater than b in our example, then C3/C2/C0 bits will be set as: 0, 0, 0.<br />
∙ if a is less than b, then bits will be set as: 0, 0, 1.<br />
∙ If a=b, then bits will be set: 1, 0, 0.<br />
In other words, after FUCOMPP/FNSTSW/SAHF instructions, we will have these CPU flags states:<br />
∙ If a>b, CPU flags will be set as: ZF=0, PF=0, CF=0.<br />
∙ If ab.<br />
Then one will be s<strong>to</strong>red <strong>to</strong> AL and the following JZ will not be triggered and function will return _a. On<br />
all other cases, _b will be returned.<br />
But it is still not over.<br />
GCC 4.4.1 with -O3 optimization turned on<br />
public d_max<br />
d_max proc near<br />
arg_0 = qword ptr 8<br />
arg_8 = qword ptr 10h<br />
push ebp<br />
mov ebp, esp<br />
fld [ebp+arg_0] ; _a<br />
fld [ebp+arg_8] ; _b<br />
; stack state now: ST(0) = _b, ST(1) = _a<br />
fxch st(1)<br />
; stack state now: ST(0) = _a, ST(1) = _b<br />
fucom st(1) ; compare _a and _b<br />
fnstsw ax<br />
sahf<br />
ja short loc_8048448<br />
; s<strong>to</strong>re ST(0) <strong>to</strong> ST(0) (idle operation), pop value at <strong>to</strong>p of stack, leave _b at <strong>to</strong>p<br />
fstp st<br />
jmp short loc_804844A<br />
loc_8048448:<br />
; s<strong>to</strong>re _a <strong>to</strong> ST(0), pop value at <strong>to</strong>p of stack, leave _a at <strong>to</strong>p<br />
fstp st(1)<br />
38 cc is condition code<br />
44
loc_804844A:<br />
pop ebp<br />
retn<br />
d_max endp<br />
It is almost the same except one: JA usage instead of SAHF. Actually, conditional jump instructions<br />
checking “larger”, “lesser” or “equal” <strong>for</strong> unsigned number comparison (JA, JAE, JBE, JBE, JE/JZ, JNA, JNAE,<br />
JNB, JNBE, JNE/JNZ) are checking only CF and ZF flags. And C3/C2/C0 bits after comparison are moving in<strong>to</strong><br />
these flags exactly in the same fashion so conditional jumps will work here. JA will work if both CF are ZF<br />
zero.<br />
Thereby, conditional jumps instructions listed here can be used after FNSTSW/SAHF instructions pair.<br />
It seems, FPU C3/C2/C0 status bits was placed there deliberately so <strong>to</strong> map them <strong>to</strong> base CPU flags<br />
without additional permutations.<br />
45
1.13 Arrays<br />
Array is just a set of variables in memory, always lying next <strong>to</strong> each other, always has same type.<br />
#include <br />
int main()<br />
{<br />
int a[20];<br />
int i;<br />
};<br />
<strong>for</strong> (i=0; i
push edx<br />
mov eax, DWORD PTR _i$[ebp]<br />
push eax<br />
push OFFSET $SG2463<br />
call _printf<br />
add esp, 12 ; 0000000cH<br />
jmp SHORT $LN2@main<br />
$LN1@main:<br />
xor eax, eax<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
Nothing very special, just two loops: first is filling loop and second is printing loop. shl ecx, 1 instruction<br />
is used <strong>for</strong> ECX value multiplication by 2, more about below 1.14.3.<br />
80 bytes are allocated in stack <strong>for</strong> array, that’s 20 elements of 4 bytes.<br />
Here is what GCC 4.4.1 does:<br />
public main<br />
main proc near ; DATA XREF: _start+17<br />
var_70 = dword ptr -70h<br />
var_6C = dword ptr -6Ch<br />
var_68 = dword ptr -68h<br />
i_2 = dword ptr -54h<br />
i = dword ptr -4<br />
loc_80483F7:<br />
loc_804840A:<br />
loc_804841B:<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 70h<br />
mov [esp+70h+i], 0 ; i=0<br />
jmp short loc_804840A<br />
mov eax, [esp+70h+i]<br />
mov edx, [esp+70h+i]<br />
add edx, edx ; edx=i*2<br />
mov [esp+eax*4+70h+i_2], edx<br />
add [esp+70h+i], 1 ; i++<br />
cmp [esp+70h+i], 13h<br />
jle short loc_80483F7<br />
mov [esp+70h+i], 0<br />
jmp short loc_8048441<br />
mov eax, [esp+70h+i]<br />
mov edx, [esp+eax*4+70h+i_2]<br />
mov eax, offset aADD ; "a[%d]=%d\n"<br />
mov [esp+70h+var_68], edx<br />
mov edx, [esp+70h+i]<br />
mov [esp+70h+var_6C], edx<br />
47
mov [esp+70h+var_70], eax<br />
call _printf<br />
add [esp+70h+i], 1<br />
loc_8048441:<br />
cmp [esp+70h+i], 13h<br />
jle short loc_804841B<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
1.13.1 Buffer overflow<br />
So, array indexing is just array[index]. If you study generated code closely, you’ll probably note missing<br />
index bounds checking, which could check index, if it is less than 20. What if index will be greater than 20?<br />
That’s the one C/C++ feature it’s often blamed <strong>for</strong>.<br />
Here is a code successfully compiling and working:<br />
#include <br />
int main()<br />
{<br />
int a[20];<br />
int i;<br />
};<br />
<strong>for</strong> (i=0; i
mov DWORD PTR _a$[ebp+edx*4], ecx<br />
jmp SHORT $LN2@main<br />
$LN1@main:<br />
mov eax, DWORD PTR _a$[ebp+400]<br />
push eax<br />
push OFFSET $SG2460<br />
call _printf<br />
add esp, 8<br />
xor eax, eax<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
I’m running it, and I got:<br />
a[100]=760826203<br />
It is just something, lying in the stack near <strong>to</strong> array, 400 bytes from its first element.<br />
Indeed, how it could be done differently? Compiler may incorporate some code, checking index value<br />
<strong>to</strong> be always in array’s bound, like in higher-level programming languages 39 , but this makes running code<br />
slower.<br />
OK, we read some values in stack illegally, but what if we could write something <strong>to</strong> it?<br />
Here is what we will write:<br />
#include <br />
int main()<br />
{<br />
int a[20];<br />
int i;<br />
};<br />
<strong>for</strong> (i=0; i
jge SHORT $LN1@main<br />
mov ecx, DWORD PTR _i$[ebp]<br />
mov edx, DWORD PTR _i$[ebp] ; that insruction is obviously redundant<br />
mov DWORD PTR _a$[ebp+ecx*4], edx ; ECX could be used as second operand here instead<br />
jmp SHORT $LN2@main<br />
$LN1@main:<br />
xor eax, eax<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
Run compiled program and its crashing. No wonder. Let’s see, where exactly it’s crashing.<br />
I’m not using debugger anymore, because I tired <strong>to</strong> run it each time, move mouse, etc, when I need just <strong>to</strong><br />
spot some register’s state at specific point. That’s why I wrote very minimalistic <strong>to</strong>ol <strong>for</strong> myself, tracer 5.0.1,<br />
which is enough <strong>for</strong> my tasks.<br />
I can also use it just <strong>to</strong> see, where debuggee is crashed. So let’s see:<br />
generic tracer 0.4 (WIN32), http://conus.info/gt<br />
New process: C:\PRJ\...\1.exe, PID=7988<br />
EXCEPTION_ACCESS_VIOLATION: 0x15 (),<br />
ExceptionIn<strong>for</strong>mation[0]=8<br />
EAX=0x00000000 EBX=0x7EFDE000 ECX=0x0000001D EDX=0x0000001D<br />
ESI=0x00000000 EDI=0x00000000 EBP=0x00000014 ESP=0x0018FF48<br />
EIP=0x00000015<br />
FLAGS=PF ZF IF RF<br />
PID=7988|Process exit, return code -1073740791<br />
So, please keep an eye on registers.<br />
Exception occured at address 0x15. It’s not legal address <strong>for</strong> code — at least <strong>for</strong> win32 code! We trapped<br />
there somehow against our will. It’s also interesting fact that EBP register contain 0x14, ECX and EDX — 0x1D.<br />
Let’s study stack layout more.<br />
After control flow was passed in<strong>to</strong> main(), EBP register value was saved in<strong>to</strong> stack. Then, 84 bytes was<br />
allocated <strong>for</strong> array and i variable. That’s (20+1)*sizeof(int). ESP pointing now <strong>to</strong> _i variable in local<br />
stack and after execution of next PUSH something, something will be appeared next <strong>to</strong> _i.<br />
That’s stack layout while control is inside _main:<br />
ESP 4 bytes <strong>for</strong> i<br />
ESP+4 80 bytes <strong>for</strong> a[20] array<br />
ESP+84 saved EBP value<br />
ESP+88 returning address<br />
Instruction a[19]=something writes last int in array bounds (in bounds yet!)<br />
Instruction a[20]=something writes something <strong>to</strong> the place where EBP value is saved.<br />
Please take a look at registers state at the crash moment. In our case, number 20 was written <strong>to</strong> 20th<br />
element. By the function ending, function epilogue res<strong>to</strong>re EBP value. (20 in decimal system is 0x14 in<br />
hexadecimal). Then, RET instruction was executed, which is equivalent <strong>to</strong> POP EIP instruction.<br />
RET instruction taking returning adddress from stack (that’s address in some CRT 40 -function, which was<br />
called _main), and 21 was s<strong>to</strong>red there (0x15 in hexadecimal). The CPU trapped at the address 0x15, but<br />
there are no executable code, so exception was raised.<br />
Welcome! It’s called buffer overflow 41 .<br />
40 C Run-Time<br />
41 http://en.wikipedia.org/wiki/Stack_buffer_overflow<br />
50
Replace int array by string (char array), create a long string deliberately, pass it <strong>to</strong> the program, <strong>to</strong><br />
the function which is not checking string length and copies it <strong>to</strong> short buffer, and you’ll able <strong>to</strong> point <strong>to</strong> a<br />
program an address <strong>to</strong> which it should jump. Not that simple in reality, but that’s how it was started 42 .<br />
There are several methods <strong>to</strong> protect against it, regardless of C/C++ programmers’ negligence. MSVC<br />
has options like 43 :<br />
/RTCs Stack Frame runtime checking<br />
/GZ Enable stack checks (/RTCs)<br />
One of the methods is <strong>to</strong> write random value among local variables <strong>to</strong> stack at function prologue and<br />
<strong>to</strong> check it in function epilogue be<strong>for</strong>e function exiting. And if value is not the same, do not execute last<br />
instruction RET, but halt (or hang). Process will hang, but that’s much better then remote attack <strong>to</strong> your<br />
host.<br />
Let’s try the same code in GCC 4.4.1. We got:<br />
public main<br />
main proc near<br />
a = dword ptr -54h<br />
i = dword ptr -4<br />
loc_80483C3:<br />
loc_80483D1:<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 60h<br />
mov [ebp+i], 0<br />
jmp short loc_80483D1<br />
mov eax, [ebp+i]<br />
mov edx, [ebp+i]<br />
mov [ebp+eax*4+a], edx<br />
add [ebp+i], 1<br />
cmp [ebp+i], 1Dh<br />
jle short loc_80483C3<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
Running this in Linux will produce: Segmentation fault.<br />
If we run this in GDB debugger, we getting this:<br />
(gdb) r<br />
Starting program: /home/dennis/RE/1<br />
Program received signal SIGSEGV, Segmentation fault.<br />
0x00000016 in ?? ()<br />
(gdb) info registers<br />
eax 0x0 0<br />
ecx 0xd2f96388 -755407992<br />
edx 0x1d 29<br />
ebx 0x26eff4 2551796<br />
esp 0xbffff4b0 0xbffff4b0<br />
42 Classic article about it: Smashing The Stack For Fun And Profit<br />
43 Wikipedia: compiler-side buffer overflow protection methods<br />
51
ebp 0x15 0x15<br />
esi 0x0 0<br />
edi 0x0 0<br />
eip 0x16 0x16<br />
eflags 0x10202 [ IF RF ]<br />
cs 0x73 115<br />
ss 0x7b 123<br />
ds 0x7b 123<br />
es 0x7b 123<br />
fs 0x0 0<br />
gs 0x33 51<br />
(gdb)<br />
Register values are slightly different then in win32 example, that’s because stack layout is slightly different<br />
<strong>to</strong>o.<br />
1.13.2 One more word about arrays<br />
Now we understand, why it’s not possible <strong>to</strong> write something like that in C/C++ code 44 :<br />
void f(int size)<br />
{<br />
int a[size];<br />
...<br />
};<br />
That’s just because compiler should know exact array size <strong>to</strong> allocate place <strong>for</strong> it in local stack layout or<br />
in data segment (in case of global variable) on compiling stage.<br />
If you need array of arbitrary size, allocate it by malloc(), then access allocated memory block as array<br />
of variables of type you need.<br />
1.13.3 Multidimensional arrays<br />
Internally, multidimensional array is essentially the same thing as linear array.<br />
Let’s see:<br />
#include <br />
int a[10][20][30];<br />
void insert(int x, int y, int z, int value)<br />
{<br />
a[x][y][z]=value;<br />
};<br />
We got (MSVC 2010):<br />
_DATA SEGMENT<br />
COMM _a:DWORD:01770H<br />
_DATA ENDS<br />
PUBLIC _insert<br />
_TEXT SEGMENT<br />
_x$ = 8 ; size = 4<br />
_y$ = 12 ; size = 4<br />
_z$ = 16 ; size = 4<br />
44 GCC can actually do this by allocating array dynammically in stack (like alloca()), but it’s not standard langauge extension<br />
52
_value$ = 20 ; size = 4<br />
_insert PROC<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _x$[ebp]<br />
imul eax, 2400 ; 00000960H<br />
mov ecx, DWORD PTR _y$[ebp]<br />
imul ecx, 120 ; 00000078H<br />
lea edx, DWORD PTR _a[eax+ecx]<br />
mov eax, DWORD PTR _z$[ebp]<br />
mov ecx, DWORD PTR _value$[ebp]<br />
mov DWORD PTR [edx+eax*4], ecx<br />
pop ebp<br />
ret 0<br />
_insert ENDP<br />
_TEXT ENDS<br />
Nothing special. For index calculation, three input arguments are multiplying in such way <strong>to</strong> represent<br />
array as multidimensional.<br />
GCC 4.4.1:<br />
public insert<br />
insert proc near<br />
x = dword ptr 8<br />
y = dword ptr 0Ch<br />
z = dword ptr 10h<br />
value = dword ptr 14h<br />
push ebp<br />
mov ebp, esp<br />
push ebx<br />
mov ebx, [ebp+x]<br />
mov eax, [ebp+y]<br />
mov ecx, [ebp+z]<br />
lea edx, [eax+eax]<br />
mov eax, edx<br />
shl eax, 4<br />
sub eax, edx<br />
imul edx, ebx, 600<br />
add eax, edx<br />
lea edx, [eax+ecx]<br />
mov eax, [ebp+value]<br />
mov dword ptr ds:a[edx*4], eax<br />
pop ebx<br />
pop ebp<br />
retn<br />
insert endp<br />
53
1.14 Bit fields<br />
A lot of functions defining input flags in arguments using bit fields. Of course, it could be substituted by<br />
bool-typed variables set, but it’s not frugally.<br />
1.14.1 Specific bit checking<br />
Win32 API example:<br />
HANDLE fh;<br />
fh=CreateFile ("file", GENERIC_WRITE | GENERIC_READ, FILE_SHARE_READ, NULL,<br />
OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);<br />
We got (MSVC 2010):<br />
push 0<br />
push 128 ; 00000080H<br />
push 4<br />
push 0<br />
push 1<br />
push -1073741824 ; c0000000H<br />
push OFFSET $SG78813<br />
call DWORD PTR __imp__CreateFileA@28<br />
mov DWORD PTR _fh$[ebp], eax<br />
Let’s take a look in<strong>to</strong> WinNT.h:<br />
#define GENERIC_READ (0x80000000L)<br />
#define GENERIC_WRITE (0x40000000L)<br />
#define GENERIC_EXECUTE (0x20000000L)<br />
#define GENERIC_ALL (0x10000000L)<br />
Everything is clear, GENERIC_READ | GENERIC_WRITE = 0x80000000 | 0x40000000 = 0xC0000000, and<br />
that’s value is used as second argument <strong>for</strong> CreateFile() 45 function.<br />
How CreateFile() will check flags?<br />
Let’s take a look in<strong>to</strong> KERNEL32.DLL in Windows XP SP3 x86 and we’ll find this piece of code in the<br />
function CreateFileW:<br />
.text:7C83D429 test byte ptr [ebp+dwDesiredAccess+3], 40h<br />
.text:7C83D42D mov [ebp+var_8], 1<br />
.text:7C83D434 jz short loc_7C83D417<br />
.text:7C83D436 jmp loc_7C810817<br />
Here we see TEST instruction, it takes, however, not the whole second argument, but only most significant<br />
byte (ebp+dwDesiredAccess+3) and checks it <strong>for</strong> 0x40 flag (meaning GENERIC_WRITE flag here)<br />
TEST is merely the same instruction as AND, but without result saving (recall the fact CMP instruction is<br />
merely the same as SUB, but without result saving 1.4.2).<br />
This piece of code logic is as follows:<br />
if ((dwDesiredAccess&0x40000000) == 0) go<strong>to</strong> loc_7C83D417<br />
If AND instruction leaving this bit, ZF flag will be cleared and JZ conditional jump will not be triggered.<br />
Conditional jump is possible only if 0x40000000 bit is absent in dwDesiredAccess variable — then AND result<br />
will be 0, ZF flag will be set and conditional jump is <strong>to</strong> be triggered.<br />
Let’s try GCC 4.4.1 and Linux:<br />
45 MSDN: CreateFile function<br />
54
#include <br />
#include <br />
void main()<br />
{<br />
int handle;<br />
};<br />
We got:<br />
handle=open ("file", O_RDWR | O_CREAT);<br />
public main<br />
main proc near<br />
var_20 = dword ptr -20h<br />
var_1C = dword ptr -1Ch<br />
var_4 = dword ptr -4<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 20h<br />
mov [esp+20h+var_1C], 42h<br />
mov [esp+20h+var_20], offset aFile ; "file"<br />
call _open<br />
mov [esp+20h+var_4], eax<br />
leave<br />
retn<br />
main endp<br />
Let’s take a look in<strong>to</strong> open() function in libc.so.6 library, but there is only syscall calling:<br />
.text:000BE69B mov edx, [esp+4+mode] ; mode<br />
.text:000BE69F mov ecx, [esp+4+flags] ; flags<br />
.text:000BE6A3 mov ebx, [esp+4+filename] ; filename<br />
.text:000BE6A7 mov eax, 5<br />
.text:000BE6AC int 80h ; LINUX - sys_open<br />
So, open() bit fields are probably checked somewhere in Linux kernel.<br />
Of course, it is easily <strong>to</strong> download both Glibc and Linux kernel source code, but we are interesting <strong>to</strong><br />
understand the matter without it.<br />
So, as of Linux 2.6, when sys_open syscall is called, control eventually passed in<strong>to</strong> do_sys_open kernel<br />
function. From there — <strong>to</strong> do_filp_open() function (this function located in kernel source tree in the file<br />
fs/namei.c).<br />
Important note. Aside from usual passing arguments via stack, there are also method <strong>to</strong> pass some of<br />
them via registers. This is also called fastcall 2.5.3. This works faster, because CPU not needed <strong>to</strong> access a<br />
stack in memory <strong>to</strong> read argument values. GCC has option regparm 46 , and it’s possible <strong>to</strong> set a number of<br />
arguments which might be passed via registers.<br />
Linux 2.6 kernel compiled with -mregparm=3 option 47 48 .<br />
What it means <strong>to</strong> us, the first 3 arguments will be passed via EAX, EDX and ECX registers, the other ones<br />
via stack. Of course, if arguments number is less than 3, only part of registers will be used.<br />
46 http://ohse.de/uwe/articles/gcc-attributes.html#func-regparm<br />
47 http://kernelnewbies.org/Linux_2_6_20#head-042c62f290834eb1fe0a1942bbf5bb9a4accbc8f<br />
48 See also arch\x86\include\asm\calling.h file in kernel tree<br />
55
So, let’s download Linux Kernel 2.6.31, compile it in Ubuntu: make vmlinux, open it in IDA 5, find the<br />
do_filp_open() function. At the beginning, we will see (comments are mine):<br />
do_filp_open proc near<br />
...<br />
push ebp<br />
mov ebp, esp<br />
push edi<br />
push esi<br />
push ebx<br />
mov ebx, ecx<br />
add ebx, 1<br />
sub esp, 98h<br />
mov esi, [ebp+arg_4] ; acc_mode (5th arg)<br />
test bl, 3<br />
mov [ebp+var_80], eax ; dfd (1th arg)<br />
mov [ebp+var_7C], edx ; pathname (2th arg)<br />
mov [ebp+var_78], ecx ; open_flag (3th arg)<br />
jnz short loc_C01EF684<br />
mov ebx, ecx ; ebx
We got (MSVC 2010):<br />
_rt$ = -4 ; size = 4<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
mov eax, DWORD PTR _a$[ebp]<br />
mov DWORD PTR _rt$[ebp], eax<br />
mov ecx, DWORD PTR _rt$[ebp]<br />
or ecx, 16384 ; 00004000H<br />
mov DWORD PTR _rt$[ebp], ecx<br />
mov edx, DWORD PTR _rt$[ebp]<br />
and edx, -513 ; fffffdffH<br />
mov DWORD PTR _rt$[ebp], edx<br />
mov eax, DWORD PTR _rt$[ebp]<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_f ENDP<br />
OR instruction adding one more bit <strong>to</strong> value, ignoring others.<br />
AND resetting one bit. It can be said, AND just copies all bits except one. Indeed, in the second AND<br />
operand only those bits are set, which are needed <strong>to</strong> be saved, except one bit we wouldn’t like <strong>to</strong> copy (which<br />
is 0 in bitmask). It’s easier way <strong>to</strong> memorize the logic.<br />
If we compile it in MSVC with optimization turned on (/Ox), the code will be even shorter:<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
mov eax, DWORD PTR _a$[esp-4]<br />
and eax, -513 ; fffffdffH<br />
or eax, 16384 ; 00004000H<br />
ret 0<br />
_f ENDP<br />
Let’s try GCC 4.4.1 without optimization:<br />
public f<br />
f proc near<br />
var_4 = dword ptr -4<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 10h<br />
mov eax, [ebp+arg_0]<br />
mov [ebp+var_4], eax<br />
or [ebp+var_4], 4000h<br />
and [ebp+var_4], 0FFFFFDFFh<br />
mov eax, [ebp+var_4]<br />
leave<br />
retn<br />
f endp<br />
57
There are some redundant code present, however, it’s shorter then MSVC version without optimization.<br />
Now let’s try GCC with optimization turned on -O3:<br />
public f<br />
f proc near<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
mov eax, [ebp+arg_0]<br />
pop ebp<br />
or ah, 40h<br />
and ah, 0FDh<br />
retn<br />
f endp<br />
That’s shorter. It is important <strong>to</strong> note that compiler works with EAX register part via AH register —<br />
that’s EAX register part from 8th <strong>to</strong> 15th bits inclusive.<br />
Important note: 16-bit CPU 8086 accumula<strong>to</strong>r was named AX and consisted of two 8-bit halves — AL<br />
(lower byte) and AH (higher byte). In 80386 almost all regsiters were extended <strong>to</strong> 32-bit, accumula<strong>to</strong>r was<br />
named EAX, but <strong>for</strong> the sake of compatibility, its older parts may be still accessed as AX/AH/AL registers.<br />
Because all x86 CPUs are 16-bit 8086 CPU successors, these older 16-bit opcodes are shorter than newer<br />
32-bit opcodes. That’s why or ah, 40h instruction occupying only 3 bytes. It would be more logical way<br />
<strong>to</strong> emit here or eax, 04000h, but that’s 5 bytes, or even 6 (if register in first operand is not EAX).<br />
It would be even shorter if <strong>to</strong> turn on -O3 optimization flag and also set regparm=3.<br />
public f<br />
f proc near<br />
push ebp<br />
or ah, 40h<br />
mov ebp, esp<br />
and ah, 0FDh<br />
pop ebp<br />
retn<br />
f endp<br />
Indeed — first argument is already loaded in<strong>to</strong> EAX, so it’s possible <strong>to</strong> work with it in-place. It’s worth<br />
noting that both function prologue (push ebp / mov ebp,esp) and epilogue can easily be omitted here, but<br />
GCC probably isn’t good enough <strong>for</strong> such code size optimizations. However, such short functions are better<br />
<strong>to</strong> be inlined functions 49 .<br />
1.14.3 Shifts<br />
Bit shifts in C/C++ are implemented via ≪ and ≫ opera<strong>to</strong>rs.<br />
Here is a simple example of function, calculating number of 1 bits in input variable:<br />
#define IS_SET(flag, bit) ((flag) & (bit))<br />
int f(unsigned int a)<br />
{<br />
int i;<br />
int rt=0;<br />
<strong>for</strong> (i=0; i
t = dword ptr -0Ch<br />
i = dword ptr -8<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
push ebx<br />
sub esp, 10h<br />
mov [ebp+rt], 0<br />
mov [ebp+i], 0<br />
jmp short loc_80483EF<br />
loc_80483D0:<br />
mov eax, [ebp+i]<br />
mov edx, 1<br />
mov ebx, edx<br />
mov ecx, eax<br />
shl ebx, cl<br />
mov eax, ebx<br />
and eax, [ebp+arg_0]<br />
test eax, eax<br />
jz short loc_80483EB<br />
add [ebp+rt], 1<br />
loc_80483EB:<br />
add [ebp+i], 1<br />
loc_80483EF:<br />
cmp [ebp+i], 1Fh<br />
jle short loc_80483D0<br />
mov eax, [ebp+rt]<br />
add esp, 10h<br />
pop ebx<br />
pop<br />
retn<br />
ebp<br />
f endp<br />
Shift instructions are often used in division and multiplications by power of two numbers (1, 2, 4, 8, etc).<br />
For example:<br />
unsigned int f(unsigned int a)<br />
{<br />
return a/4;<br />
};<br />
We got (MSVC 2010):<br />
_a$ = 8 ; size = 4<br />
_f PROC<br />
mov eax, DWORD PTR _a$[esp-4]<br />
shr eax, 2<br />
ret 0<br />
_f ENDP<br />
SHR (SHift Right) instruction is shifting a number by 2 bits right. Two freed bits at left (e.g., two most<br />
significant bits) are set <strong>to</strong> zero. Two least significant bits are dropped. In fact, these two dropped bits —<br />
division operation remainder.<br />
It can be easily unders<strong>to</strong>od if <strong>to</strong> imagine decimal numeral system and number 23. 23 can be easily divided<br />
60
y 10 just by dropping last digit (3 — is division remainder). 2 is leaving after operation as a quotient 50 .<br />
The same s<strong>to</strong>ry about multiplication. Multiplication by 4 is just shifting the number <strong>to</strong> the left by 2 bits,<br />
inserting 2 zero bits at right (as the last two bits).<br />
1.14.4 CRC32 calculation example<br />
This is very popular table-based CRC32 hash calculation method 51 .<br />
/* By Bob Jenkins, (c) 2006, Public Domain */<br />
#include <br />
#include <br />
#include <br />
typedef unsigned long ub4;<br />
typedef unsigned char ub1;<br />
static const ub4 crctab[256] = {<br />
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,<br />
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,<br />
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,<br />
0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,<br />
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,<br />
0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,<br />
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,<br />
0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,<br />
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,<br />
0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,<br />
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,<br />
0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,<br />
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,<br />
0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,<br />
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,<br />
0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,<br />
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,<br />
0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,<br />
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,<br />
0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,<br />
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,<br />
0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,<br />
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,<br />
0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,<br />
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,<br />
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,<br />
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,<br />
0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,<br />
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,<br />
0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,<br />
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,<br />
0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,<br />
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,<br />
0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,<br />
50 division result<br />
51 Source code was taken here: http://burtleburtle.net/bob/c/crc.c<br />
61
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,<br />
0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,<br />
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,<br />
0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,<br />
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,<br />
0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,<br />
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,<br />
0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,<br />
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d,<br />
};<br />
/* how <strong>to</strong> derive the values in crctab[] from polynomial 0xedb88320 */<br />
void build_table()<br />
{<br />
ub4 i, j;<br />
<strong>for</strong> (i=0; i>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
j = (j>>1) ^ ((j&1) ? 0xedb88320 : 0);<br />
printf("0x%.8lx, ", j);<br />
if (i%6 == 5) printf("\n");<br />
}<br />
}<br />
/* the hash function */<br />
ub4 crc(const void *key, ub4 len, ub4 hash)<br />
{<br />
ub4 i;<br />
const ub1 *k = key;<br />
<strong>for</strong> (hash=len, i=0; i> 8) ^ crctab[(hash & 0xff) ^ k[i]];<br />
return hash;<br />
}<br />
/* To use, try "gcc -O crc.c -o crc; crc < crc.c" */<br />
int main()<br />
{<br />
char s[1000];<br />
while (gets(s)) printf("%.8lx\n", crc(s, strlen(s), 0));<br />
return 0;<br />
}<br />
We are interesting in crc() function only. By the way, please note: programmer used two loop initializers<br />
in <strong>for</strong>() statement: hash=len, i=0. C/C++ standard allows this, of course. Emited code will contain two<br />
operations in loop initialization part instead of usual one.<br />
Let’s compile it in MSVC with optimization (/Ox). For the sake of brevity, only crc() function is listed<br />
here, with my comments.<br />
62
_key$ = 8 ; size = 4<br />
_len$ = 12 ; size = 4<br />
_hash$ = 16 ; size = 4<br />
_crc PROC<br />
mov edx, DWORD PTR _len$[esp-4]<br />
xor ecx, ecx ; i will be s<strong>to</strong>red in ECX<br />
mov eax, edx<br />
test edx, edx<br />
jbe SHORT $LN1@crc<br />
push ebx<br />
push esi<br />
mov esi, DWORD PTR _key$[esp+4] ; ESI = key<br />
push edi<br />
$LL3@crc:<br />
; work with bytes using only 32-bit registers. byte from address key+i we s<strong>to</strong>re in<strong>to</strong> EDI<br />
movzx edi, BYTE PTR [ecx+esi]<br />
mov ebx, eax ; EBX = (hash = len)<br />
and ebx, 255 ; EBX = hash & 0xff<br />
; XOR EDI, EBX (EDI=EDI^EBX) - this operation uses all 32 bits of each register<br />
; but other bits (8-31) are cleared all time, so it’s OK<br />
; these are cleared because, as <strong>for</strong> EDI, it was done by MOVZX instruction above<br />
; high bits of EBX was cleared by AND EBX, 255 instruction above (255 = 0xff)<br />
xor edi, ebx<br />
; EAX=EAX>>8; bits 24-31 taken "from nowhere" will be cleared<br />
shr eax, 8<br />
; EAX=EAX^crctab[EDI*4] - choose EDI-th element from crctab[] table<br />
xor eax, DWORD PTR _crctab[edi*4]<br />
inc ecx ; i++<br />
cmp ecx, edx ; i
loc_80484B8:<br />
loc_80484D3:<br />
pop ebx<br />
pop esi<br />
pop ebp<br />
retn<br />
crc endp<br />
\<br />
push esi<br />
mov esi, [ebp+key]<br />
push ebx<br />
mov ebx, [ebp+hash]<br />
test ebx, ebx<br />
mov eax, ebx<br />
jz short loc_80484D3<br />
nop ; padding<br />
lea esi, [esi+0] ; padding; ESI doesn’t changing here<br />
mov ecx, eax ; save previous state of hash <strong>to</strong> ECX<br />
xor al, [esi+edx] ; AL=*(key+i)<br />
add edx, 1 ; i++<br />
shr ecx, 8 ; ECX=hash>>8<br />
movzx eax, al ; EAX=*(key+i)<br />
mov eax, dword ptr ds:crctab[eax*4] ; EAX=crctab[EAX]<br />
xor eax, ecx ; hash=EAX^ECX<br />
cmp ebx, edx<br />
ja short loc_80484B8<br />
GCC aligned loop start by 8-byte border by adding NOP and lea esi, [esi+0] (that’s idle operation<br />
<strong>to</strong>o). Read more about it in npad section 2.3.<br />
64
1.15 Structures<br />
It can be defined that C/C++ structure, with some assumptions, just a set of variables, always s<strong>to</strong>red in<br />
memory <strong>to</strong>gether, not necessary of the same type.<br />
1.15.1 SYSTEMTIME example<br />
Let’s take SYSTEMTIME 52 win32 structure describing time.<br />
That’s how it’s defined:<br />
typedef struct _SYSTEMTIME {<br />
WORD wYear;<br />
WORD wMonth;<br />
WORD wDayOfWeek;<br />
WORD wDay;<br />
WORD wHour;<br />
WORD wMinute;<br />
WORD wSecond;<br />
WORD wMilliseconds;<br />
} SYSTEMTIME, *PSYSTEMTIME;<br />
Let’s write a C function <strong>to</strong> get current time:<br />
#include <br />
#include <br />
void main()<br />
{<br />
SYSTEMTIME t;<br />
GetSystemTime (&t);<br />
};<br />
printf ("%04d-%02d-%02d %02d:%02d:%02d\n",<br />
t.wYear, t.wMonth, t.wDay,<br />
t.wHour, t.wMinute, t.wSecond);<br />
return;<br />
We got (MSVC 2010):<br />
_t$ = -16 ; size = 16<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 16 ; 00000010H<br />
lea eax, DWORD PTR _t$[ebp]<br />
push eax<br />
call DWORD PTR __imp__GetSystemTime@4<br />
movzx ecx, WORD PTR _t$[ebp+12] ; wSecond<br />
push ecx<br />
movzx edx, WORD PTR _t$[ebp+10] ; wMinute<br />
push edx<br />
movzx eax, WORD PTR _t$[ebp+8] ; wHour<br />
push eax<br />
movzx ecx, WORD PTR _t$[ebp+6] ; wDay<br />
52 MSDN: SYSTEMTIME structure<br />
65
push ecx<br />
movzx edx, WORD PTR _t$[ebp+2] ; wMonth<br />
push edx<br />
movzx eax, WORD PTR _t$[ebp] ; wYear<br />
push eax<br />
push OFFSET $SG78811 ; ’%04d-%02d-%02d %02d:%02d:%02d’, 0aH, 00H<br />
call _printf<br />
add esp, 28 ; 0000001cH<br />
xor eax, eax<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
16 bytes are allocated <strong>for</strong> this structure in local stack — that’s exactly sizeof(WORD)*8 (there are 8<br />
WORD variables in the structure).<br />
Take a note: the structure beginning with wYear field. It can be said, an pointer <strong>to</strong> SYSTEMTIME<br />
structure is passed <strong>to</strong> GetSystemTime() 53 , but it’s also can be said, pointer <strong>to</strong> wYear field is passed, and<br />
that’s the same! GetSystemTime() writting current year <strong>to</strong> the WORD pointer pointing <strong>to</strong>, then shifting 2<br />
bytes ahead, then writting current month, etc, etc.<br />
1.15.2 Let’s allocate place <strong>for</strong> structure using malloc()<br />
However, sometimes it’s simpler <strong>to</strong> place structures not in local stack, but in heap:<br />
#include <br />
#include <br />
void main()<br />
{<br />
SYSTEMTIME *t;<br />
};<br />
t=(SYSTEMTIME *)malloc (sizeof (SYSTEMTIME));<br />
GetSystemTime (t);<br />
printf ("%04d-%02d-%02d %02d:%02d:%02d\n",<br />
t->wYear, t->wMonth, t->wDay,<br />
t->wHour, t->wMinute, t->wSecond);<br />
free (t);<br />
return;<br />
Let’s compile it now with optimization (/Ox) so <strong>to</strong> easily see what we need.<br />
_main PROC<br />
push esi<br />
push 16 ; 00000010H<br />
call _malloc<br />
add esp, 4<br />
mov esi, eax<br />
push esi<br />
53 MSDN: SYSTEMTIME structure<br />
66
call DWORD PTR __imp__GetSystemTime@4<br />
movzx eax, WORD PTR [esi+12] ; wSecond<br />
movzx ecx, WORD PTR [esi+10] ; wMinute<br />
movzx edx, WORD PTR [esi+8] ; wHour<br />
push eax<br />
movzx eax, WORD PTR [esi+6] ; wDay<br />
push ecx<br />
movzx ecx, WORD PTR [esi+2] ; wMonth<br />
push edx<br />
movzx edx, WORD PTR [esi] ; wYear<br />
push eax<br />
push ecx<br />
push edx<br />
push OFFSET $SG78833<br />
call _printf<br />
push esi<br />
call _free<br />
add esp, 32 ; 00000020H<br />
xor eax, eax<br />
pop esi<br />
ret 0<br />
_main ENDP<br />
So, sizeof(SYSTEMTIME) = 16, that’s exact number of bytes <strong>to</strong> be allocated by malloc(). It return the<br />
pointer <strong>to</strong> freshly allocated memory block in EAX, which is then moved in<strong>to</strong> ESI. GetSystemTime() win32<br />
function undertake <strong>to</strong> save ESI value, and that’s why it is not saved here and continue <strong>to</strong> be used after<br />
GetSystemTime() call.<br />
New instruction — MOVZX (Move with Zero eXtent). It may be used almost in those cases as MOVSX 1.10,<br />
but, it clearing other bits <strong>to</strong> 0. That’s because printf() require 32-bit int, but we got WORD in structure<br />
— that’s 16-bit unsigned type. That’s why by copying value from WORD in<strong>to</strong> int, bits from 16 <strong>to</strong> 31 should<br />
be cleared, because there will be random noise otherwise, leaved from previous operations on registers.<br />
1.15.3 Linux<br />
As of Linux, let’s take tm structure from time.h <strong>for</strong> example:<br />
#include <br />
#include <br />
void main()<br />
{<br />
struct tm t;<br />
time_t unix_time;<br />
};<br />
unix_time=time(NULL);<br />
localtime_r (&unix_time, &t);<br />
printf ("Year: %d\n", t.tm_year+1900);<br />
printf ("Month: %d\n", t.tm_mon);<br />
printf ("Day: %d\n", t.tm_mday);<br />
printf ("Hour: %d\n", t.tm_hour);<br />
printf ("Minutes: %d\n", t.tm_min);<br />
printf ("Seconds: %d\n", t.tm_sec);<br />
67
Let’s compile it in GCC 4.4.1:<br />
main proc near<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 40h<br />
mov dword ptr [esp], 0 ; first argument <strong>for</strong> time()<br />
call time<br />
mov [esp+3Ch], eax<br />
lea eax, [esp+3Ch] ; take pointer <strong>to</strong> what time() returned<br />
lea edx, [esp+10h] ; at ESP+10h struct tm will begin<br />
mov [esp+4], edx ; pass pointer <strong>to</strong> the structure begin<br />
mov [esp], eax ; pass pointer <strong>to</strong> result of time()<br />
call localtime_r<br />
mov eax, [esp+24h] ; tm_year<br />
lea edx, [eax+76Ch] ; edx=eax+1900<br />
mov eax, offset <strong>for</strong>mat ; "Year: %d\n"<br />
mov [esp+4], edx<br />
mov [esp], eax<br />
call printf<br />
mov edx, [esp+20h] ; tm_mon<br />
mov eax, offset aMonthD ; "Month: %d\n"<br />
mov [esp+4], edx<br />
mov [esp], eax<br />
call printf<br />
mov edx, [esp+1Ch] ; tm_mday<br />
mov eax, offset aDayD ; "Day: %d\n"<br />
mov [esp+4], edx<br />
mov [esp], eax<br />
call printf<br />
mov edx, [esp+18h] ; tm_hour<br />
mov eax, offset aHourD ; "Hour: %d\n"<br />
mov [esp+4], edx<br />
mov [esp], eax<br />
call printf<br />
mov edx, [esp+14h] ; tm_min<br />
mov eax, offset aMinutesD ; "Minutes: %d\n"<br />
mov [esp+4], edx<br />
mov [esp], eax<br />
call printf<br />
mov edx, [esp+10h]<br />
mov eax, offset aSecondsD ; "Seconds: %d\n"<br />
mov [esp+4], edx ; tm_sec<br />
mov [esp], eax<br />
call printf<br />
leave<br />
retn<br />
main endp<br />
Somehow, IDA 5 didn’t created local variables names in local stack. But since we already experienced<br />
<strong>reverse</strong> engineers :-) we may do it without this in<strong>for</strong>mation in this simple example.<br />
Please also pay attention <strong>to</strong> lea edx, [eax+76Ch] — this instruction just adding 0x76C <strong>to</strong> EAX, but not<br />
modify any flags. See also relevant section about LEA 2.1.<br />
68
1.15.4 Fields packing in structure<br />
One important thing is fields packing in structures 54 .<br />
Let’s take a simple example:<br />
#include <br />
struct s<br />
{<br />
char a;<br />
int b;<br />
char c;<br />
int d;<br />
};<br />
void f(struct s s)<br />
{<br />
printf ("a=%d; b=%d; c=%d; d=%d\n", s.a, s.b, s.c, s.d);<br />
};<br />
As we see, we have two char fields (each is exactly one byte) and two more — int (each - 4 bytes).<br />
That’s all compiling in<strong>to</strong>:<br />
_s$ = 8 ; size = 16<br />
?f@@YAXUs@@@Z PROC ; f<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _s$[ebp+12]<br />
push eax<br />
movsx ecx, BYTE PTR _s$[ebp+8]<br />
push ecx<br />
mov edx, DWORD PTR _s$[ebp+4]<br />
push edx<br />
movsx eax, BYTE PTR _s$[ebp]<br />
push eax<br />
push OFFSET $SG3842<br />
call _printf<br />
add esp, 20 ; 00000014H<br />
pop ebp<br />
ret 0<br />
?f@@YAXUs@@@Z ENDP ; f<br />
_TEXT ENDS<br />
As we can see, each field’s address is aligned by 4-bytes border. That’s why each char using 4 bytes here,<br />
like int. Why? Thus it’s easier <strong>for</strong> CPU <strong>to</strong> access memory at aligned addresses and <strong>to</strong> cache data from it.<br />
However, it’s not very economical in size sense.<br />
Let’s try <strong>to</strong> compile it with option (/Zp1) (/Zp[n] pack structs on n-byte boundary).<br />
_TEXT SEGMENT<br />
_s$ = 8 ; size = 10<br />
?f@@YAXUs@@@Z PROC ; f<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _s$[ebp+6]<br />
push eax<br />
54 See also: Wikipedia: Data structure alignment<br />
69
movsx ecx, BYTE PTR _s$[ebp+5]<br />
push ecx<br />
mov edx, DWORD PTR _s$[ebp+1]<br />
push edx<br />
movsx eax, BYTE PTR _s$[ebp]<br />
push eax<br />
push OFFSET $SG3842<br />
call _printf<br />
add esp, 20 ; 00000014H<br />
pop ebp<br />
ret 0<br />
?f@@YAXUs@@@Z ENDP ; f<br />
Now the structure takes only 10 bytes and each char value takes 1 byte. What it give <strong>to</strong> us? Size<br />
economy. And as drawback — CPU will access these fields without maximal per<strong>for</strong>mance it can.<br />
As it can be easily guessed, if the structure is used in many source and object files, all these should be<br />
compiled with the same convention about structures packing.<br />
Aside from MSVC /Zp option which set how <strong>to</strong> align each structure field, here is also #pragma pack<br />
compiler option, it can be defined right in source code. It’s available in both MSVC 55 and GCC 56 .<br />
Let’s back <strong>to</strong> SYSTEMTIME structure consisting in 16-bit fields. How our compiler know <strong>to</strong> pack them on<br />
1-byte alignment method?<br />
WinNT.h file has this:<br />
#include "pshpack1.h"<br />
And this:<br />
#include "pshpack4.h" // 4 byte packing is the default<br />
The file PshPack1.h looks like:<br />
#if ! (defined(lint) || defined(RC_INVOKED))<br />
#if ( _MSC_VER >= 800 && !defined(_M_I86)) || defined(_PUSHPOP_SUPPORTED)<br />
#pragma warning(disable:4103)<br />
#if !(defined( MIDL_PASS )) || defined( __midl )<br />
#pragma pack(push,1)<br />
#else<br />
#pragma pack(1)<br />
#endif<br />
#else<br />
#pragma pack(1)<br />
#endif<br />
#endif /* ! (defined(lint) || defined(RC_INVOKED)) */<br />
That’s how compiler will pack structures defined after #pragma pack.<br />
1.15.5 Nested structures<br />
Now what about situations when one structure define another structure inside?<br />
#include <br />
struct inner_struct<br />
{<br />
int a;<br />
55 MSDN: Working with Packing Structures<br />
56 Structure-Packing Pragmas<br />
70
};<br />
int b;<br />
struct outer_struct<br />
{<br />
char a;<br />
int b;<br />
struct inner_struct c;<br />
char d;<br />
int e;<br />
};<br />
void f(struct outer_struct s)<br />
{<br />
printf ("a=%d; b=%d; c.a=%d; c.b=%d; d=%d; e=%d\n",<br />
s.a, s.b, s.c.a, s.c.b, s.d, s.e);<br />
};<br />
... in this case, both inner_struct fields will be placed between a,b and d,e fields of outer_struct.<br />
Let’s compile (MSVC 2010):<br />
_s$ = 8 ; size = 24<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _s$[ebp+20] ; e<br />
push eax<br />
movsx ecx, BYTE PTR _s$[ebp+16] ; d<br />
push ecx<br />
mov edx, DWORD PTR _s$[ebp+12] ; c.b<br />
push edx<br />
mov eax, DWORD PTR _s$[ebp+8] ; c.a<br />
push eax<br />
mov ecx, DWORD PTR _s$[ebp+4] ; b<br />
push ecx<br />
movsx edx, BYTE PTR _s$[ebp] ;a<br />
push edx<br />
push OFFSET $SG2466<br />
call _printf<br />
add esp, 28 ; 0000001cH<br />
pop ebp<br />
ret 0<br />
_f ENDP<br />
One curious point here is that by looking on<strong>to</strong> this assembler code, we do not even see that another<br />
structure was used inside of it! Thus, we would say, nested structures are finally unfolds in<strong>to</strong> linear or<br />
one-dimensional structure.<br />
Of course, if <strong>to</strong> replace struct inner_struct c; declaration <strong>to</strong> struct inner_struct *c; (thus making<br />
a pointer here) situation will be significally different.<br />
1.15.6 Bit fields in structure<br />
CPUID example<br />
C/C++ language allow <strong>to</strong> define exact number of bits <strong>for</strong> each structure fields. It’s very useful if one need <strong>to</strong><br />
save memory space. For example, one bit is enough <strong>for</strong> variable of bool type. But of course, it’s not rational<br />
71
if speed is important.<br />
Let’s consider CPUID 57 instruction example. This instruction return in<strong>for</strong>mation about current CPU and<br />
its features.<br />
If EAX is set <strong>to</strong> 1 be<strong>for</strong>e instruction execution, CPUID will return this in<strong>for</strong>mation packed in<strong>to</strong> EAX register:<br />
3:0 Stepping<br />
7:4 Model<br />
11:8 Family<br />
13:12 Processor Type<br />
19:16 Extended Model<br />
27:20 Extended Family<br />
MSVC 2010 has CPUID macro, but GCC 4.4.1 — hasn’t. So let’s make this function by yourself <strong>for</strong> GCC,<br />
using its built-in assembler 58 .<br />
#include <br />
#ifdef __GNUC__<br />
static inline void cpuid(int code, int *a, int *b, int *c, int *d) {<br />
asm volatile("cpuid":"=a"(*a),"=b"(*b),"=c"(*c),"=d"(*d):"a"(code));<br />
}<br />
#endif<br />
#ifdef _MSC_VER<br />
#include <br />
#endif<br />
struct CPUID_1_EAX<br />
{<br />
unsigned int stepping:4;<br />
unsigned int model:4;<br />
unsigned int family_id:4;<br />
unsigned int processor_type:2;<br />
unsigned int reserved1:2;<br />
unsigned int extended_model_id:4;<br />
unsigned int extended_family_id:8;<br />
unsigned int reserved2:4;<br />
};<br />
int main()<br />
{<br />
struct CPUID_1_EAX *tmp;<br />
int b[4];<br />
#ifdef _MSC_VER<br />
__cpuid(b,1);<br />
#endif<br />
#ifdef __GNUC__<br />
cpuid (1, &b[0], &b[1], &b[2], &b[3]);<br />
#endif<br />
tmp=(struct CPUID_1_EAX *)&b[0];<br />
57 http://en.wikipedia.org/wiki/CPUID<br />
58 More about internal GCC assembler<br />
72
};<br />
printf ("stepping=%d\n", tmp->stepping);<br />
printf ("model=%d\n", tmp->model);<br />
printf ("family_id=%d\n", tmp->family_id);<br />
printf ("processor_type=%d\n", tmp->processor_type);<br />
printf ("extended_model_id=%d\n", tmp->extended_model_id);<br />
printf ("extended_family_id=%d\n", tmp->extended_family_id);<br />
return 0;<br />
After CPUID will fill EAX/EBX/ECX/EDX, these registers will be reflected in b[] array. Then, we have a<br />
pointer <strong>to</strong> CPUID_1_EAX structure and we point it <strong>to</strong> EAX value from b[] array.<br />
In other words, we treat 32-bit int value as a structure.<br />
Then we read from the stucture.<br />
Let’s compile it in MSVC 2008 with /Ox option:<br />
_b$ = -16 ; size = 16<br />
_main PROC<br />
sub esp, 16 ; 00000010H<br />
push ebx<br />
xor ecx, ecx<br />
mov eax, 1<br />
cpuid<br />
push esi<br />
lea esi, DWORD PTR _b$[esp+24]<br />
mov DWORD PTR [esi], eax<br />
mov DWORD PTR [esi+4], ebx<br />
mov DWORD PTR [esi+8], ecx<br />
mov DWORD PTR [esi+12], edx<br />
mov esi, DWORD PTR _b$[esp+24]<br />
mov eax, esi<br />
and eax, 15 ; 0000000fH<br />
push eax<br />
push OFFSET $SG15435 ; ’stepping=%d’, 0aH, 00H<br />
call _printf<br />
mov ecx, esi<br />
shr ecx, 4<br />
and ecx, 15 ; 0000000fH<br />
push ecx<br />
push OFFSET $SG15436 ; ’model=%d’, 0aH, 00H<br />
call _printf<br />
mov edx, esi<br />
shr edx, 8<br />
and edx, 15 ; 0000000fH<br />
push edx<br />
push OFFSET $SG15437 ; ’family_id=%d’, 0aH, 00H<br />
call _printf<br />
mov eax, esi<br />
shr eax, 12 ; 0000000cH<br />
73
and eax, 3<br />
push eax<br />
push OFFSET $SG15438 ; ’processor_type=%d’, 0aH, 00H<br />
call _printf<br />
mov ecx, esi<br />
shr ecx, 16 ; 00000010H<br />
and ecx, 15 ; 0000000fH<br />
push ecx<br />
push OFFSET $SG15439 ; ’extended_model_id=%d’, 0aH, 00H<br />
call _printf<br />
shr esi, 20 ; 00000014H<br />
and esi, 255 ; 000000ffH<br />
push esi<br />
push OFFSET $SG15440 ; ’extended_family_id=%d’, 0aH, 00H<br />
call _printf<br />
add esp, 48 ; 00000030H<br />
pop esi<br />
xor eax, eax<br />
pop ebx<br />
add esp, 16 ; 00000010H<br />
ret 0<br />
_main ENDP<br />
SHR instruction shifting value in EAX by number of bits should be skipped, e.g., we ignore some bits at<br />
right.<br />
AND instruction clearing not needed bits at left, or, in other words, leave only those bits in EAX we need<br />
now.<br />
Let’s try GCC 4.4.1 with -O3 option.<br />
main proc near ; DATA XREF: _start+17<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
push esi<br />
mov esi, 1<br />
push ebx<br />
mov eax, esi<br />
sub esp, 18h<br />
cpuid<br />
mov esi, eax<br />
and eax, 0Fh<br />
mov [esp+8], eax<br />
mov dword ptr [esp+4], offset aSteppingD ; "stepping=%d\n"<br />
mov dword ptr [esp], 1<br />
call ___printf_chk<br />
mov eax, esi<br />
shr eax, 4<br />
and eax, 0Fh<br />
mov [esp+8], eax<br />
mov dword ptr [esp+4], offset aModelD ; "model=%d\n"<br />
mov dword ptr [esp], 1<br />
74
call ___printf_chk<br />
mov eax, esi<br />
shr eax, 8<br />
and eax, 0Fh<br />
mov [esp+8], eax<br />
mov dword ptr [esp+4], offset aFamily_idD ; "family_id=%d\n"<br />
mov dword ptr [esp], 1<br />
call ___printf_chk<br />
mov eax, esi<br />
shr eax, 0Ch<br />
and eax, 3<br />
mov [esp+8], eax<br />
mov dword ptr [esp+4], offset aProcessor_type ; "processor_type=%d\n"<br />
mov dword ptr [esp], 1<br />
call ___printf_chk<br />
mov eax, esi<br />
shr eax, 10h<br />
shr esi, 14h<br />
and eax, 0Fh<br />
and esi, 0FFh<br />
mov [esp+8], eax<br />
mov dword ptr [esp+4], offset aExtended_model ; "extended_model_id=%d\n"<br />
mov dword ptr [esp], 1<br />
call ___printf_chk<br />
mov [esp+8], esi<br />
mov dword ptr [esp+4], offset unk_80486D0<br />
mov dword ptr [esp], 1<br />
call ___printf_chk<br />
add esp, 18h<br />
xor eax, eax<br />
pop ebx<br />
pop esi<br />
mov esp, ebp<br />
pop ebp<br />
retn<br />
main endp<br />
Almost the same. The only thing <strong>to</strong> note is that GCC somehow united calculation of extended_model_id<br />
and extended_family_id in<strong>to</strong> one block, instead of calculating them separately, be<strong>for</strong>e corresponding each<br />
printf() call.<br />
Working with the float type as with a structure<br />
As it was already noted in section about FPU 1.12, both float and double types consisted of sign, significand<br />
(or fraction) and exponent. But will we able <strong>to</strong> work with these fields directly? Let’s try with float.<br />
#include <br />
#include <br />
Figure 1.1: float value <strong>for</strong>mat (illustration taken from wikipedia)<br />
75
#include <br />
#include <br />
struct float_as_struct<br />
{<br />
unsigned int fraction : 23; // fractional part<br />
unsigned int exponent : 8; // exponent + 0x3FF<br />
unsigned int sign : 1; // sign bit<br />
};<br />
float f(float _in)<br />
{<br />
float f=_in;<br />
struct float_as_struct t;<br />
};<br />
assert (sizeof (struct float_as_struct) == sizeof (float));<br />
memcpy (&t, &f, sizeof (float));<br />
t.sign=1; // set negative sign<br />
t.exponent=t.exponent+2; // multiple d by 2^n (n here is 2)<br />
memcpy (&f, &t, sizeof (float));<br />
return f;<br />
int main()<br />
{<br />
printf ("%f\n", f(1.234));<br />
};<br />
float_as_struct structure occupies as much space is memory as float, e.g., 4 bytes or 32 bits.<br />
Now we setting negative sign in input value and also by addding 2 <strong>to</strong> exponent we thereby multiplicating<br />
the whole number by 2 2 , e.g., by 4.<br />
Let’s compile in MSVC 2008 without optimization:<br />
_t$ = -8 ; size = 4<br />
_f$ = -4 ; size = 4<br />
__in$ = 8 ; size = 4<br />
?f@@YAMM@Z PROC ; f<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 8<br />
fld DWORD PTR __in$[ebp]<br />
fstp DWORD PTR _f$[ebp]<br />
push 4<br />
lea eax, DWORD PTR _f$[ebp]<br />
push eax<br />
lea ecx, DWORD PTR _t$[ebp]<br />
push ecx<br />
call _memcpy<br />
add esp, 12 ; 0000000cH<br />
76
mov edx, DWORD PTR _t$[ebp]<br />
or edx, -2147483648 ; 80000000H - set minus sign<br />
mov DWORD PTR _t$[ebp], edx<br />
mov eax, DWORD PTR _t$[ebp]<br />
shr eax, 23 ; 00000017H - drop significand<br />
and eax, 255 ; 000000ffH - leave here only exponent<br />
add eax, 2 ; add 2 <strong>to</strong> it<br />
and eax, 255 ; 000000ffH<br />
shl eax, 23 ; 00000017H - shift result <strong>to</strong> place of bits 30:23<br />
mov ecx, DWORD PTR _t$[ebp]<br />
and ecx, -2139095041 ; 807fffffH - drop exponent<br />
or ecx, eax ; add original value without exponent with new calculated<br />
explonent<br />
mov DWORD PTR _t$[ebp], ecx<br />
push 4<br />
lea edx, DWORD PTR _t$[ebp]<br />
push edx<br />
lea eax, DWORD PTR _f$[ebp]<br />
push eax<br />
call _memcpy<br />
add esp, 12 ; 0000000cH<br />
fld DWORD PTR _f$[ebp]<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
?f@@YAMM@Z ENDP ; f<br />
Redundant <strong>for</strong> a bit. If it compiled with /Ox flag there are no memcpy() call, f variable is used directly.<br />
But it’s easier <strong>to</strong> understand it all considering unoptimized version.<br />
What GCC 4.4.1 with -O3 will do?<br />
; f(float)<br />
public _Z1ff<br />
_Z1ff proc near<br />
var_4 = dword ptr -4<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 4<br />
mov eax, [ebp+arg_0]<br />
or eax, 80000000h ; set minus sign<br />
mov edx, eax<br />
and eax, 807FFFFFh ; leave only significand and exponent in EAX<br />
shr edx, 23 ; prepare exponent<br />
add edx, 2 ; add 2<br />
movzx edx, dl ; clear all bits except 7:0 in EAX<br />
shl edx, 23 ; shift new calculated exponent <strong>to</strong> its place<br />
or eax, edx ; add newe exponent and original value without exponent<br />
77
mov [ebp+var_4], eax<br />
fld [ebp+var_4]<br />
leave<br />
retn<br />
_Z1ff endp<br />
public main<br />
main proc near<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 10h<br />
fld ds:dword_8048614 ; -4.936<br />
fstp qword ptr [esp+8]<br />
mov dword ptr [esp+4], offset asc_8048610 ; "%f\n"<br />
mov dword ptr [esp], 1<br />
call ___printf_chk<br />
xor eax, eax<br />
leave<br />
retn<br />
main endp<br />
The f() function is almost understandable. However, what is interesting, GCC was able <strong>to</strong> calculate<br />
f(1.234) result during compilation stage despite all this hodge-podge with structure fields and prepared<br />
this argument <strong>to</strong> printf() as precalculated!<br />
78
1.16 C++ classes<br />
I placed a C++ classes description here intentionally after structures description, because internally, C++<br />
classes representation is almost the same as structures representation.<br />
Let’s try an example with two variables, couple of construc<strong>to</strong>rs and one method:<br />
#include <br />
class c<br />
{<br />
private:<br />
int v1;<br />
int v2;<br />
public:<br />
c() // c<strong>to</strong>r<br />
{<br />
v1=667;<br />
v2=999;<br />
};<br />
};<br />
c(int a, int b) // c<strong>to</strong>r<br />
{<br />
v1=a;<br />
v2=b;<br />
};<br />
void dump()<br />
{<br />
printf ("%d; %d\n", v1, v2);<br />
};<br />
int main()<br />
{<br />
class c c1;<br />
class c c2(5,6);<br />
};<br />
c1.dump();<br />
c2.dump();<br />
return 0;<br />
Here is how main() function looks like translated in<strong>to</strong> assembler:<br />
_c2$ = -16 ; size = 8<br />
_c1$ = -8 ; size = 8<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 16 ; 00000010H<br />
lea ecx, DWORD PTR _c1$[ebp]<br />
call ??0c@@QAE@XZ ; c::c<br />
push 6<br />
push 5<br />
79
lea ecx, DWORD PTR _c2$[ebp]<br />
call ??0c@@QAE@HH@Z ; c::c<br />
lea ecx, DWORD PTR _c1$[ebp]<br />
call ?dump@c@@QAEXXZ ; c::dump<br />
lea ecx, DWORD PTR _c2$[ebp]<br />
call ?dump@c@@QAEXXZ ; c::dump<br />
xor eax, eax<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
So what’s going on. For each object (instance of class c) 8 bytes allocated, that’s exactly size of 2 variables<br />
s<strong>to</strong>rage.<br />
For c1 a default argumentless construc<strong>to</strong>r ??0c@@QAE@XZ is called. For c2 another construc<strong>to</strong>r ??0c@@QAE@HH@Z<br />
is called and two numbers are passed as arguments.<br />
A pointer <strong>to</strong> object (this in C++ terminology) is passed in ECX register. This is called thiscall 2.5.4 —<br />
a pointer <strong>to</strong> object passing method.<br />
MSVC doing it using ECX register. Needless <strong>to</strong> say, it’s not a standardized method, other compilers could<br />
do it differently, <strong>for</strong> example, via first function argument (like GCC).<br />
Why these functions has so odd names? That’s name mangling 59 .<br />
C++ class may contain several methods sharing the same name but having different arguments — that’s<br />
polymorphism. And of course, different classes may own methods sharing the same name.<br />
Name mangling allows <strong>to</strong> encode class name + method name + all method argument types in one ASCIIstring,<br />
which will be used as internal function name. That’s all because neither linker, nor DLL operation<br />
system loader (mangled names may be among DLL exports as well) knows nothing about C++ or OOP.<br />
dump() function called two times after.<br />
Now let’s see construc<strong>to</strong>rs’ code:<br />
_this$ = -4 ; size = 4<br />
??0c@@QAE@XZ PROC ; c::c, COMDAT<br />
; _this$ = ecx<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
mov DWORD PTR _this$[ebp], ecx<br />
mov eax, DWORD PTR _this$[ebp]<br />
mov DWORD PTR [eax], 667 ; 0000029bH<br />
mov ecx, DWORD PTR _this$[ebp]<br />
mov DWORD PTR [ecx+4], 999 ; 000003e7H<br />
mov eax, DWORD PTR _this$[ebp]<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
??0c@@QAE@XZ ENDP ; c::c<br />
_this$ = -4 ; size = 4<br />
_a$ = 8 ; size = 4<br />
_b$ = 12 ; size = 4<br />
??0c@@QAE@HH@Z PROC ; c::c, COMDAT<br />
; _this$ = ecx<br />
push ebp<br />
mov ebp, esp<br />
59 Wikipedia: Name mangling<br />
80
push ecx<br />
mov DWORD PTR _this$[ebp], ecx<br />
mov eax, DWORD PTR _this$[ebp]<br />
mov ecx, DWORD PTR _a$[ebp]<br />
mov DWORD PTR [eax], ecx<br />
mov edx, DWORD PTR _this$[ebp]<br />
mov eax, DWORD PTR _b$[ebp]<br />
mov DWORD PTR [edx+4], eax<br />
mov eax, DWORD PTR _this$[ebp]<br />
mov esp, ebp<br />
pop ebp<br />
ret 8<br />
??0c@@QAE@HH@Z ENDP ; c::c<br />
Construc<strong>to</strong>rs are just functions, they use pointer <strong>to</strong> structure in ECX, moving the pointer in<strong>to</strong> own local<br />
variable, however, it’s not necessary.<br />
Now dump() method:<br />
_this$ = -4 ; size = 4<br />
?dump@c@@QAEXXZ PROC ; c::dump, COMDAT<br />
; _this$ = ecx<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
mov DWORD PTR _this$[ebp], ecx<br />
mov eax, DWORD PTR _this$[ebp]<br />
mov ecx, DWORD PTR [eax+4]<br />
push ecx<br />
mov edx, DWORD PTR _this$[ebp]<br />
mov eax, DWORD PTR [edx]<br />
push eax<br />
push OFFSET ??_C@_07NJBDCIEC@?$CFd?$DL?5?$CFd?6?$AA@<br />
call _printf<br />
add esp, 12 ; 0000000cH<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
?dump@c@@QAEXXZ ENDP ; c::dump<br />
Simple enough: dump() taking pointer <strong>to</strong> the structure containing two int’s in ECX, takes two values from<br />
it and passing it in<strong>to</strong> printf().<br />
The code is much shorter if compiled with optimization (/Ox):<br />
??0c@@QAE@XZ PROC ; c::c, COMDAT<br />
; _this$ = ecx<br />
mov eax, ecx<br />
mov DWORD PTR [eax], 667 ; 0000029bH<br />
mov DWORD PTR [eax+4], 999 ; 000003e7H<br />
ret 0<br />
??0c@@QAE@XZ ENDP ; c::c<br />
_a$ = 8 ; size = 4<br />
_b$ = 12 ; size = 4<br />
??0c@@QAE@HH@Z PROC ; c::c, COMDAT<br />
; _this$ = ecx<br />
mov edx, DWORD PTR _b$[esp-4]<br />
81
mov eax, ecx<br />
mov ecx, DWORD PTR _a$[esp-4]<br />
mov DWORD PTR [eax], ecx<br />
mov DWORD PTR [eax+4], edx<br />
ret 8<br />
??0c@@QAE@HH@Z ENDP ; c::c<br />
?dump@c@@QAEXXZ PROC ; c::dump, COMDAT<br />
; _this$ = ecx<br />
mov eax, DWORD PTR [ecx+4]<br />
mov ecx, DWORD PTR [ecx]<br />
push eax<br />
push ecx<br />
push OFFSET ??_C@_07NJBDCIEC@?$CFd?$DL?5?$CFd?6?$AA@<br />
call _printf<br />
add esp, 12 ; 0000000cH<br />
ret 0<br />
?dump@c@@QAEXXZ ENDP ; c::dump<br />
That’s all. One more thing <strong>to</strong> say is that stack pointer after second construc<strong>to</strong>r calling wasn’t corrected<br />
with add esp, X. Please also note that, construc<strong>to</strong>r has ret 8 instead of RET at the end.<br />
That’s all because here used thiscall 2.5.4 calling convention, the method of passing values through the<br />
stack, which is, <strong>to</strong>gether with stdcall 2.5.2 method, offers <strong>to</strong> correct stack <strong>to</strong> callee rather then <strong>to</strong> caller. ret<br />
x instruction adding X <strong>to</strong> ESP, then passes control <strong>to</strong> caller function.<br />
See also section about calling conventions 2.5.<br />
It’s also should be noted that compiler deciding when <strong>to</strong> call construc<strong>to</strong>r and destruc<strong>to</strong>r — but that’s<br />
we already know from C++ language basics.<br />
1.16.1 GCC<br />
It’s almost the same situation in GCC 4.4.1, with few exceptions.<br />
public main<br />
main proc near ; DATA XREF: _start+17<br />
var_20 = dword ptr -20h<br />
var_1C = dword ptr -1Ch<br />
var_18 = dword ptr -18h<br />
var_10 = dword ptr -10h<br />
var_8 = dword ptr -8<br />
push ebp<br />
mov ebp, esp<br />
and esp, 0FFFFFFF0h<br />
sub esp, 20h<br />
lea eax, [esp+20h+var_8]<br />
mov [esp+20h+var_20], eax<br />
call _ZN1cC1Ev<br />
mov [esp+20h+var_18], 6<br />
mov [esp+20h+var_1C], 5<br />
lea eax, [esp+20h+var_10]<br />
mov [esp+20h+var_20], eax<br />
call _ZN1cC1Eii<br />
lea eax, [esp+20h+var_8]<br />
mov [esp+20h+var_20], eax<br />
82
call _ZN1c4dumpEv<br />
lea eax, [esp+20h+var_10]<br />
mov [esp+20h+var_20], eax<br />
call _ZN1c4dumpEv<br />
mov eax, 0<br />
leave<br />
retn<br />
main endp<br />
Here we see another name mangling style, specific <strong>to</strong> GNU 60 standards. It’s also can be noted that<br />
pointer <strong>to</strong> object is passed as first function argument — hiddenly from programmer, of course.<br />
First construc<strong>to</strong>r:<br />
public _ZN1cC1Ev ; weak<br />
_ZN1cC1Ev proc near ; CODE XREF: main+10<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
mov eax, [ebp+arg_0]<br />
mov dword ptr [eax], 667<br />
mov eax, [ebp+arg_0]<br />
mov dword ptr [eax+4], 999<br />
pop ebp<br />
retn<br />
_ZN1cC1Ev endp<br />
What it does is just writes two numbers using pointer passed in first (and sole) argument.<br />
Second construc<strong>to</strong>r:<br />
public _ZN1cC1Eii<br />
_ZN1cC1Eii proc near<br />
arg_0 = dword ptr 8<br />
arg_4 = dword ptr 0Ch<br />
arg_8 = dword ptr 10h<br />
push ebp<br />
mov ebp, esp<br />
mov eax, [ebp+arg_0]<br />
mov edx, [ebp+arg_4]<br />
mov [eax], edx<br />
mov eax, [ebp+arg_0]<br />
mov edx, [ebp+arg_8]<br />
mov [eax+4], edx<br />
pop ebp<br />
retn<br />
_ZN1cC1Eii endp<br />
This is a function, analog of which could be looks like:<br />
void ZN1cC1Eii (int *obj, int a, int b)<br />
{<br />
60 One more document about different compilers name mangling types: http://www.agner.org/optimize/calling_<br />
conventions.pdf<br />
83
*obj=a;<br />
*(obj+1)=b;<br />
};<br />
... and that’s completely predictable.<br />
Now dump() function:<br />
public _ZN1c4dumpEv<br />
_ZN1c4dumpEv proc near<br />
var_18 = dword ptr -18h<br />
var_14 = dword ptr -14h<br />
var_10 = dword ptr -10h<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 18h<br />
mov eax, [ebp+arg_0]<br />
mov edx, [eax+4]<br />
mov eax, [ebp+arg_0]<br />
mov eax, [eax]<br />
mov [esp+18h+var_10], edx<br />
mov [esp+18h+var_14], eax<br />
mov [esp+18h+var_18], offset aDD ; "%d; %d\n"<br />
call _printf<br />
leave<br />
retn<br />
_ZN1c4dumpEv endp<br />
This function in its internal representation has sole argument, used as pointer <strong>to</strong> the object (this).<br />
Thus, if <strong>to</strong> base our judgment on these simple examples, the difference between MSVC and GCC is style<br />
of function names encoding (name mangling) and passing pointer <strong>to</strong> object (via ECX register or via first<br />
argument).<br />
84
1.17 Unions<br />
1.17.1 Pseudo-random number genera<strong>to</strong>r example<br />
If we need float random numbers from 0 <strong>to</strong> 1, the most simplest thing is <strong>to</strong> use random numbers genera<strong>to</strong>r<br />
like Mersenne twister producing random 32-bit values in DWORD <strong>for</strong>m, trans<strong>for</strong>m this value <strong>to</strong> float and<br />
then divide it by RAND_MAX (0xffffffff in our case) — value we got will be in 0..1 interval.<br />
But as we know, division operation is almost always very slow. Will it be possible <strong>to</strong> get rid of it, as in<br />
case of division by multiplication? 1.11<br />
Let’s remember what float number consisted of: sign bit, significand bits and exponent bits. We need<br />
just <strong>to</strong> s<strong>to</strong>re random bits <strong>to</strong> significand bits <strong>for</strong> getting float number!<br />
Exponent cannot be zero (number will be denormalized in this case), so we will s<strong>to</strong>re 01111111 <strong>to</strong><br />
exponent — this mean exponent will be 1. Then fill significand with random bits, set sign bit <strong>to</strong> 0 (which<br />
mean positive number) and voilà. Generated numbers will be in 1 <strong>to</strong> 2 interval, so we also should subtract<br />
1 from it.<br />
Very simple linear congruential random numbers genera<strong>to</strong>r is used in my example 61 , producing 32-bit<br />
numbers. The PRNG initializing by current time in UNIX-style.<br />
Then, float type represented as union — that is the C/C++ construction allowing us <strong>to</strong> interpret piece<br />
of memory differently typed. In our case, we are able <strong>to</strong> create a variable of union type and then access <strong>to</strong><br />
it as it’s float or as it’s uint32_t. It can be said, it’s just a hack. A dirty one.<br />
#include <br />
#include <br />
#include <br />
union uint32_t_float<br />
{<br />
uint32_t i;<br />
float f;<br />
};<br />
// from the Numerical Recipes book<br />
const uint32_t RNG_a=1664525;<br />
const uint32_t RNG_c=1013904223;<br />
int main()<br />
{<br />
uint32_t_float tmp;<br />
};<br />
uint32_t RNG_state=time(NULL); // initial seed<br />
<strong>for</strong> (int i=0; i
__real@3ff0000000000000 DQ 03ff0000000000000r ; 1<br />
tv140 = -4 ; size = 4<br />
_tmp$ = -4 ; size = 4<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
and esp, -64 ; ffffffc0H<br />
sub esp, 56 ; 00000038H<br />
push esi<br />
push edi<br />
push 0<br />
call __time64<br />
add esp, 4<br />
mov esi, eax<br />
mov edi, 100 ; 00000064H<br />
$LN3@main:<br />
; let’s generate random 32-bit number<br />
imul esi, 1664525 ; 0019660dH<br />
add esi, 1013904223 ; 3c6ef35fH<br />
mov eax, esi<br />
; leave bits <strong>for</strong> significand only<br />
and eax, 8388607 ; 007fffffH<br />
; set exponent <strong>to</strong> 1<br />
or eax, 1065353216 ; 3f800000H<br />
; s<strong>to</strong>re this value as int<br />
mov DWORD PTR _tmp$[esp+64], eax<br />
sub esp, 8<br />
; load this value as float<br />
fld DWORD PTR _tmp$[esp+72]<br />
; subtract one from it<br />
fsub QWORD PTR __real@3ff0000000000000<br />
fstp DWORD PTR tv140[esp+72]<br />
fld DWORD PTR tv140[esp+72]<br />
fstp QWORD PTR [esp]<br />
push OFFSET $SG4232<br />
call _printf<br />
add esp, 12 ; 0000000cH<br />
dec edi<br />
jne SHORT $LN3@main<br />
pop edi<br />
86
xor eax, eax<br />
pop esi<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
_main ENDP<br />
_TEXT ENDS<br />
END<br />
GCC producing very same code.<br />
87
1.18 Pointers <strong>to</strong> functions<br />
Pointer <strong>to</strong> function, as any other pointer, is just an address of function beginning in its code segment.<br />
It is often used in callbacks 62 .<br />
Well-known examples are:<br />
∙ qsort() 63 , atexit() 64 from the standard C library;<br />
∙ signals in *NIX OS 65 ;<br />
∙ thread starting: CreateThread() (win32), pthread_create() (POSIX);<br />
∙ a lot of win32 functions, <strong>for</strong> example EnumChildWindows() 66 .<br />
So, qsort() function is a C/C++ standard library quicksort implemenation. The functions is able <strong>to</strong><br />
sort anything, any types of data, if you have a function <strong>for</strong> two elements comparison and qsort() is able <strong>to</strong><br />
call it.<br />
The comparison function can be defined as:<br />
int (*compare)(const void *, const void *)<br />
Let’s use slightly modified example I found here:<br />
/* ex3 Sorting ints with qsort */<br />
#include <br />
#include <br />
int comp(const void * _a, const void * _b)<br />
{<br />
const int *a=(const int *)_a;<br />
const int *b=(const int *)_b;<br />
}<br />
if (*a==*b)<br />
return 0;<br />
else<br />
if (*a < *b)<br />
return -1;<br />
else<br />
return 1;<br />
int main(int argc, char* argv[])<br />
{<br />
int numbers[10]={1892,45,200,-98,4087,5,-12345,1087,88,-100000};<br />
int i;<br />
}<br />
/* Sort the array */<br />
qsort(numbers,10,sizeof(int),comp) ;<br />
<strong>for</strong> (i=0;i
Let’s compile it in MSVC 2010 (I omitted some parts <strong>for</strong> the sake of brefity) with /Ox option:<br />
__a$ = 8 ; size = 4<br />
__b$ = 12 ; size = 4<br />
_comp PROC<br />
mov eax, DWORD PTR __a$[esp-4]<br />
mov ecx, DWORD PTR __b$[esp-4]<br />
mov eax, DWORD PTR [eax]<br />
mov ecx, DWORD PTR [ecx]<br />
cmp eax, ecx<br />
jne SHORT $LN4@comp<br />
xor eax, eax<br />
ret 0<br />
$LN4@comp:<br />
xor edx, edx<br />
cmp eax, ecx<br />
setge dl<br />
lea eax, DWORD PTR [edx+edx-1]<br />
ret 0<br />
_comp ENDP<br />
...<br />
_numbers$ = -44 ; size = 40<br />
_i$ = -4 ; size = 4<br />
_argc$ = 8 ; size = 4<br />
_argv$ = 12 ; size = 4<br />
_main PROC<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 44 ; 0000002cH<br />
mov DWORD PTR _numbers$[ebp], 1892 ; 00000764H<br />
mov DWORD PTR _numbers$[ebp+4], 45 ; 0000002dH<br />
mov DWORD PTR _numbers$[ebp+8], 200 ; 000000c8H<br />
mov DWORD PTR _numbers$[ebp+12], -98 ; ffffff9eH<br />
mov DWORD PTR _numbers$[ebp+16], 4087 ; 00000ff7H<br />
mov DWORD PTR _numbers$[ebp+20], 5<br />
mov DWORD PTR _numbers$[ebp+24], -12345 ; ffffcfc7H<br />
mov DWORD PTR _numbers$[ebp+28], 1087 ; 0000043fH<br />
mov DWORD PTR _numbers$[ebp+32], 88 ; 00000058H<br />
mov DWORD PTR _numbers$[ebp+36], -100000 ; fffe7960H<br />
push OFFSET _comp<br />
push 4<br />
push 10 ; 0000000aH<br />
lea eax, DWORD PTR _numbers$[ebp]<br />
push eax<br />
call _qsort<br />
add esp, 16 ; 00000010H<br />
...<br />
Nothing surprising so far. As a fourth argument, an address of label _comp is passed, that’s just a place<br />
where function comp() located.<br />
How qsort() calling it?<br />
Let’s take a look in<strong>to</strong> this function located in MSVCR80.DLL (a MSVC DLL module with C standard<br />
89
library functions):<br />
.text:7816CBF0 ; void __cdecl qsort(void *, unsigned int, unsigned int, int (__cdecl *)(<br />
const void *, const void *))<br />
.text:7816CBF0 public _qsort<br />
.text:7816CBF0 _qsort proc near<br />
.text:7816CBF0<br />
.text:7816CBF0 lo = dword ptr -104h<br />
.text:7816CBF0 hi = dword ptr -100h<br />
.text:7816CBF0 var_FC = dword ptr -0FCh<br />
.text:7816CBF0 stkptr = dword ptr -0F8h<br />
.text:7816CBF0 lostk = dword ptr -0F4h<br />
.text:7816CBF0 histk = dword ptr -7Ch<br />
.text:7816CBF0 base = dword ptr 4<br />
.text:7816CBF0 num = dword ptr 8<br />
.text:7816CBF0 width = dword ptr 0Ch<br />
.text:7816CBF0 comp = dword ptr 10h<br />
.text:7816CBF0<br />
.text:7816CBF0 sub esp, 100h<br />
....<br />
.text:7816CCE0 loc_7816CCE0: ; CODE XREF: _qsort+B1<br />
.text:7816CCE0 shr eax, 1<br />
.text:7816CCE2 imul eax, ebp<br />
.text:7816CCE5 add eax, ebx<br />
.text:7816CCE7 mov edi, eax<br />
.text:7816CCE9 push edi<br />
.text:7816CCEA push ebx<br />
.text:7816CCEB call [esp+118h+comp]<br />
.text:7816CCF2 add esp, 8<br />
.text:7816CCF5 test eax, eax<br />
.text:7816CCF7 jle short loc_7816CD04<br />
comp — is fourth function argument. Here the control is just passed <strong>to</strong> the address in comp. Be<strong>for</strong>e it,<br />
two arguments prepared <strong>for</strong> comp(). Its result is checked after its execution.<br />
That’s why it’s dangerous <strong>to</strong> use pointers <strong>to</strong> functions. First of all, if you call qsort() with incorrect<br />
pointer <strong>to</strong> function, qsort() may pass control <strong>to</strong> incorrect place, process may crash and this bug will be<br />
hard <strong>to</strong> find.<br />
Second reason is that callback function types should comply strictly, calling wrong function with wrong<br />
arguments of wrong types may lead <strong>to</strong> serious problems, however, process crashing is not a big problem —<br />
big problem is <strong>to</strong> determine a reason of crashing — because compiler may be silent about potential trouble<br />
while compiling.<br />
1.18.1 GCC<br />
Not a big difference:<br />
lea eax, [esp+40h+var_28]<br />
mov [esp+40h+var_40], eax<br />
mov [esp+40h+var_28], 764h<br />
mov [esp+40h+var_24], 2Dh<br />
mov [esp+40h+var_20], 0C8h<br />
mov [esp+40h+var_1C], 0FFFFFF9Eh<br />
mov [esp+40h+var_18], 0FF7h<br />
90
comp() function:<br />
public comp<br />
comp proc near<br />
mov [esp+40h+var_14], 5<br />
mov [esp+40h+var_10], 0FFFFCFC7h<br />
mov [esp+40h+var_C], 43Fh<br />
mov [esp+40h+var_8], 58h<br />
mov [esp+40h+var_4], 0FFFE7960h<br />
mov [esp+40h+var_34], offset comp<br />
mov [esp+40h+var_38], 4<br />
mov [esp+40h+var_3C], 0Ah<br />
call _qsort<br />
arg_0 = dword ptr 8<br />
arg_4 = dword ptr 0Ch<br />
loc_8048458:<br />
push ebp<br />
mov ebp, esp<br />
mov eax, [ebp+arg_4]<br />
mov ecx, [ebp+arg_0]<br />
mov edx, [eax]<br />
xor eax, eax<br />
cmp [ecx], edx<br />
jnz short loc_8048458<br />
pop ebp<br />
retn<br />
setnl al<br />
movzx eax, al<br />
lea eax, [eax+eax-1]<br />
pop ebp<br />
retn<br />
comp endp<br />
qsort() implementation is located in libc.so.6 and it is in fact just a wrapper <strong>for</strong> qsort_r().<br />
It will call then quicksort(), where our defined function will be called via passed pointer:<br />
(File libc.so.6, glibc version — 2.10.1)<br />
.text:0002DDF6 mov edx, [ebp+arg_10]<br />
.text:0002DDF9 mov [esp+4], esi<br />
.text:0002DDFD mov [esp], edi<br />
.text:0002DE00 mov [esp+8], edx<br />
.text:0002DE04 call [ebp+arg_C]<br />
...<br />
91
1.19 SIMD<br />
SIMD is just acronym: Single Instruction, Multiple Data.<br />
As it’s said, it’s multiple data processing using only one instruction.<br />
Just as FPU, that CPU subsystem looks like separate processor inside x86.<br />
SIMD began as MMX in x86. 8 new 64-bit registers appeared: MM0-MM7.<br />
Each MMX register may hold 2 32-bit values, 4 16-bit values or 8 bytes. For example, it is possible <strong>to</strong><br />
add 8 8-bit values (bytes) simultaneously by adding two values in MMX-registers.<br />
One simple example is graphics edi<strong>to</strong>r, representing image as a two dimensional array. When user change<br />
image brightness, the edi<strong>to</strong>r should add some coefficient <strong>to</strong> each pixel value, or <strong>to</strong> subtract. For the sake of<br />
brevity, our image may be grayscale and each pixel defined by one 8-bit byte, then it’s possible <strong>to</strong> change<br />
brightness of 8 pixels simultaneously.<br />
When MMX appeared, these registers was actually located in FPU registers. It was possible <strong>to</strong> use either<br />
FPU or MMX at the same time. One might think, Intel saved on transis<strong>to</strong>rs, but in fact, the reason of<br />
such symbiosis is simpler — older operation system may not aware of additional CPU registers wouldn’t<br />
save them at the context switching, but will save FPU registers. Thus, MMX-enabled CPU + old operation<br />
system + process using MMX = that all will work <strong>to</strong>gether.<br />
SSE — is extension of SIMD registers up <strong>to</strong> 128 bits, now separately from FPU.<br />
AVX — another extension <strong>to</strong> 256 bits.<br />
Now about practical usage.<br />
Of course, memory copying (memcpy), memory comparing (memcmp) and so on.<br />
One more example: we got DES encryption algorithm, it takes 64-bit block, 56-bit key, encrypt block<br />
and produce 64-bit result. DES algorithm may be considered as a very large electronic circuit, with wires<br />
and AND/OR/NOT gates.<br />
Bitslice DES 67 — is an idea of processing group of blocks and keys simultaneously. Let’s say, variable of<br />
type unsigned int on x86 may hold up <strong>to</strong> 32 bits, so, it’s possible <strong>to</strong> s<strong>to</strong>re there intermediate results <strong>for</strong> 32<br />
blocks-keys pairs simultaneously, using 64+56 variables of unsigned int type.<br />
I wrote an utility <strong>to</strong> brute-<strong>for</strong>ce Oracle RDBMS passwords/hashes (ones based on DES), slightly modified<br />
bitslice DES algorithm <strong>for</strong> SSE2 and AVX — now it’s possible <strong>to</strong> encrypt 128 or 256 block-keys pairs<br />
simultaneously.<br />
http://conus.info/utils/ops_SIMD/<br />
1.19.1 Vec<strong>to</strong>rization<br />
Vec<strong>to</strong>rization 68 , <strong>for</strong> example, is when you have a loop taking couple of arrays at input and producing one<br />
array. Loop body takes values from input arrays, do something and put result in<strong>to</strong> output array. It’s<br />
important that there is only one single operation applied <strong>to</strong> each element. Vec<strong>to</strong>rization — is <strong>to</strong> process<br />
several elements simultaneously.<br />
For example:<br />
<strong>for</strong> (i = 0; i < 1024; i++)<br />
{<br />
C[i] = A[i]*B[i];<br />
}<br />
This piece of code takes elements from A and B, multiplies them and save result in<strong>to</strong> C.<br />
If each array element we have is 32-bit int, then it’s possible <strong>to</strong> load 4 elements from A in<strong>to</strong> 128-bit<br />
XMM-register, from B <strong>to</strong> another XMM-registers, and by executing PMULLD (Multiply Packed Signed<br />
Dword Integers and S<strong>to</strong>re Low Result)and PMULHW (Multiply Packed Signed Integers and S<strong>to</strong>re High Result),<br />
it’s possible <strong>to</strong> get 4 64-bit products 69 at once.<br />
Thus, loop body count is 1024/4 instead of 1024, that’s 4 times less and, of course, faster.<br />
67 http://www.darkside.com.au/bitslice/<br />
68 Wikipedia: vec<strong>to</strong>rization<br />
69 multiplication result<br />
92
Some compilers can do vec<strong>to</strong>rization au<strong>to</strong>matically in some simple cases, <strong>for</strong> example, Intel C++ 70 .<br />
I wrote tiny function:<br />
int f (int sz, int *ar1, int *ar2, int *ar3)<br />
{<br />
<strong>for</strong> (int i=0; i
lea ecx, ds:0[edx*4]<br />
cmp esi, ecx<br />
jb loc_143<br />
loc_55: ; CODE XREF: f(int,int *,int *,int *)+34<br />
cmp eax, [esp+10h+ar1]<br />
jbe short loc_67<br />
mov esi, [esp+10h+ar1]<br />
sub esi, eax<br />
neg esi<br />
cmp ecx, esi<br />
jbe short loc_7F<br />
loc_67: ; CODE XREF: f(int,int *,int *,int *)+59<br />
cmp eax, [esp+10h+ar1]<br />
jnb loc_143<br />
mov esi, [esp+10h+ar1]<br />
sub esi, eax<br />
cmp esi, ecx<br />
jb loc_143<br />
loc_7F: ; CODE XREF: f(int,int *,int *,int *)+65<br />
mov edi, eax ; edi = ar1<br />
and edi, 0Fh ; is ar1 16-byte aligned?<br />
jz short loc_9A ; yes<br />
test edi, 3<br />
jnz loc_162<br />
neg edi<br />
add edi, 10h<br />
shr edi, 2<br />
loc_9A: ; CODE XREF: f(int,int *,int *,int *)+84<br />
lea ecx, [edi+4]<br />
cmp edx, ecx<br />
jl loc_162<br />
mov ecx, edx<br />
sub ecx, edi<br />
and ecx, 3<br />
neg ecx<br />
add ecx, edx<br />
test edi, edi<br />
jbe short loc_D6<br />
mov ebx, [esp+10h+ar2]<br />
mov [esp+10h+var_10], ecx<br />
mov ecx, [esp+10h+ar1]<br />
xor esi, esi<br />
loc_C1: ; CODE XREF: f(int,int *,int *,int *)+CD<br />
mov edx, [ecx+esi*4]<br />
add edx, [ebx+esi*4]<br />
mov [eax+esi*4], edx<br />
inc esi<br />
cmp esi, edi<br />
jb short loc_C1<br />
94
mov ecx, [esp+10h+var_10]<br />
mov edx, [esp+10h+sz]<br />
loc_D6: ; CODE XREF: f(int,int *,int *,int *)+B2<br />
mov esi, [esp+10h+ar2]<br />
lea esi, [esi+edi*4] ; is ar2+i*4 16-byte aligned?<br />
test esi, 0Fh<br />
jz short loc_109 ; yes!<br />
mov ebx, [esp+10h+ar1]<br />
mov esi, [esp+10h+ar2]<br />
loc_ED: ; CODE XREF: f(int,int *,int *,int *)+105<br />
movdqu xmm1, xmmword ptr [ebx+edi*4]<br />
movdqu xmm0, xmmword ptr [esi+edi*4] ; ar2+i*4 is not 16-byte aligned, so<br />
load it <strong>to</strong> xmm0<br />
paddd xmm1, xmm0<br />
movdqa xmmword ptr [eax+edi*4], xmm1<br />
add edi, 4<br />
cmp edi, ecx<br />
jb short loc_ED<br />
jmp short loc_127<br />
; ---------------------------------------------------------------------------<br />
loc_109: ; CODE XREF: f(int,int *,int *,int *)+E3<br />
mov ebx, [esp+10h+ar1]<br />
mov esi, [esp+10h+ar2]<br />
loc_111: ; CODE XREF: f(int,int *,int *,int *)+125<br />
movdqu xmm0, xmmword ptr [ebx+edi*4]<br />
paddd xmm0, xmmword ptr [esi+edi*4]<br />
movdqa xmmword ptr [eax+edi*4], xmm0<br />
add edi, 4<br />
cmp edi, ecx<br />
jb short loc_111<br />
loc_127: ; CODE XREF: f(int,int *,int *,int *)+107<br />
; f(int,int *,int *,int *)+164<br />
cmp ecx, edx<br />
jnb short loc_15B<br />
mov esi, [esp+10h+ar1]<br />
mov edi, [esp+10h+ar2]<br />
loc_133: ; CODE XREF: f(int,int *,int *,int *)+13F<br />
mov ebx, [esi+ecx*4]<br />
add ebx, [edi+ecx*4]<br />
mov [eax+ecx*4], ebx<br />
inc ecx<br />
cmp ecx, edx<br />
jb short loc_133<br />
jmp short loc_15B<br />
; ---------------------------------------------------------------------------<br />
loc_143: ; CODE XREF: f(int,int *,int *,int *)+17<br />
; f(int,int *,int *,int *)+3A ...<br />
95
mov esi, [esp+10h+ar1]<br />
mov edi, [esp+10h+ar2]<br />
xor ecx, ecx<br />
loc_14D: ; CODE XREF: f(int,int *,int *,int *)+159<br />
mov ebx, [esi+ecx*4]<br />
add ebx, [edi+ecx*4]<br />
mov [eax+ecx*4], ebx<br />
inc ecx<br />
cmp ecx, edx<br />
jb short loc_14D<br />
loc_15B: ; CODE XREF: f(int,int *,int *,int *)+A<br />
; f(int,int *,int *,int *)+129 ...<br />
xor eax, eax<br />
pop ecx<br />
pop ebx<br />
pop esi<br />
pop edi<br />
retn<br />
; ---------------------------------------------------------------------------<br />
loc_162: ; CODE XREF: f(int,int *,int *,int *)+8C<br />
; f(int,int *,int *,int *)+9F<br />
xor ecx, ecx<br />
jmp short loc_127<br />
?f@@YAHHPAH00@Z endp<br />
SSE2-related instructions are:<br />
∙ MOVDQU (Move Unaligned Double Quadword) — it just load 16 bytes from memory in<strong>to</strong> XMM-register.<br />
∙ PADDD (Add Packed Integers) — adding 4 pairs of 32-bit numbers and leaving result in first operand.<br />
By the way, no exception raised in case of overflow and no flags will be set, just low 32-bit of result<br />
will be s<strong>to</strong>red. If one of PADDD operands — address of value in memory, address should be aligned by<br />
16-byte border. If it’s not aligned, exception will be raised 71 .<br />
∙ MOVDQA (Move Aligned Double Quadword) — the same as MOVDQU, but requires address of value in<br />
memory <strong>to</strong> be aligned by 16-bit border. If it’s not aligned, exception will be raised. MOVDQA works<br />
faster than MOVDQU, but requires a<strong>for</strong>esaid.<br />
So, these SSE2-instructions will be executed only in case if there are more 4 pairs <strong>to</strong> work on plus pointer<br />
ar3 is aligned on 16-byte border.<br />
More than that, if ar2 is aligned on 16-byte border <strong>to</strong>o, this piece of code will be executed:<br />
movdqu xmm0, xmmword ptr [ebx+edi*4] ; ar1+i*4<br />
paddd xmm0, xmmword ptr [esi+edi*4] ; ar2+i*4<br />
movdqa xmmword ptr [eax+edi*4], xmm0 ; ar3+i*4<br />
Otherwise, value from ar2 will be loaded <strong>to</strong> XMM0 using MOVDQU, it doesn’t require aligned pointer, but<br />
may work slower:<br />
movdqu xmm1, xmmword ptr [ebx+edi*4] ; ar1+i*4<br />
movdqu xmm0, xmmword ptr [esi+edi*4] ; ar2+i*4 is not 16-byte aligned, so load it <strong>to</strong> xmm0<br />
paddd xmm1, xmm0<br />
movdqa xmmword ptr [eax+edi*4], xmm1 ; ar3+i*4<br />
71 More about data aligning: Wikipedia: Data structure alignment<br />
96
GCC<br />
In all other cases, non-SSE2 code will be executed.<br />
GCC may also vec<strong>to</strong>rize in some simple cases 72 , if <strong>to</strong> use -O3 option and <strong>to</strong> turn on SSE2 support: -msse2.<br />
What we got (GCC 4.4.1):<br />
; f(int, int *, int *, int *)<br />
public _Z1fiPiS_S_<br />
_Z1fiPiS_S_ proc near<br />
var_18 = dword ptr -18h<br />
var_14 = dword ptr -14h<br />
var_10 = dword ptr -10h<br />
arg_0 = dword ptr 8<br />
arg_4 = dword ptr 0Ch<br />
arg_8 = dword ptr 10h<br />
arg_C = dword ptr 14h<br />
push ebp<br />
mov ebp, esp<br />
push edi<br />
push esi<br />
push ebx<br />
sub esp, 0Ch<br />
mov ecx, [ebp+arg_0]<br />
mov esi, [ebp+arg_4]<br />
mov edi, [ebp+arg_8]<br />
mov ebx, [ebp+arg_C]<br />
test ecx, ecx<br />
jle short loc_80484D8<br />
cmp ecx, 6<br />
lea eax, [ebx+10h]<br />
ja short loc_80484E8<br />
loc_80484C1: ; CODE XREF: f(int,int *,int *,int *)+4B<br />
; f(int,int *,int *,int *)+61 ...<br />
xor eax, eax<br />
nop<br />
lea esi, [esi+0]<br />
loc_80484C8: ; CODE XREF: f(int,int *,int *,int *)+36<br />
mov edx, [edi+eax*4]<br />
add edx, [esi+eax*4]<br />
mov [ebx+eax*4], edx<br />
add eax, 1<br />
cmp eax, ecx<br />
jnz short loc_80484C8<br />
loc_80484D8: ; CODE XREF: f(int,int *,int *,int *)+17<br />
; f(int,int *,int *,int *)+A5<br />
add esp, 0Ch<br />
xor eax, eax<br />
72 More about GCC vec<strong>to</strong>rization support: http://gcc.gnu.org/projects/tree-ssa/vec<strong>to</strong>rization.html<br />
97
pop ebx<br />
pop esi<br />
pop edi<br />
pop ebp<br />
retn<br />
; --------------------------------------------------------------------------align<br />
8<br />
loc_80484E8: ; CODE XREF: f(int,int *,int *,int *)+1F<br />
test bl, 0Fh<br />
jnz short loc_80484C1<br />
lea edx, [esi+10h]<br />
cmp ebx, edx<br />
jbe loc_8048578<br />
loc_80484F8: ; CODE XREF: f(int,int *,int *,int *)+E0<br />
lea edx, [edi+10h]<br />
cmp ebx, edx<br />
ja short loc_8048503<br />
cmp edi, eax<br />
jbe short loc_80484C1<br />
loc_8048503: ; CODE XREF: f(int,int *,int *,int *)+5D<br />
mov eax, ecx<br />
shr eax, 2<br />
mov [ebp+var_14], eax<br />
shl eax, 2<br />
test eax, eax<br />
mov [ebp+var_10], eax<br />
jz short loc_8048547<br />
mov [ebp+var_18], ecx<br />
mov ecx, [ebp+var_14]<br />
xor eax, eax<br />
xor edx, edx<br />
nop<br />
loc_8048520: ; CODE XREF: f(int,int *,int *,int *)+9B<br />
movdqu xmm1, xmmword ptr [edi+eax]<br />
movdqu xmm0, xmmword ptr [esi+eax]<br />
add edx, 1<br />
paddd xmm0, xmm1<br />
movdqa xmmword ptr [ebx+eax], xmm0<br />
add eax, 10h<br />
cmp edx, ecx<br />
jb short loc_8048520<br />
mov ecx, [ebp+var_18]<br />
mov eax, [ebp+var_10]<br />
cmp ecx, eax<br />
jz short loc_80484D8<br />
loc_8048547: ; CODE XREF: f(int,int *,int *,int *)+73<br />
lea edx, ds:0[eax*4]<br />
add esi, edx<br />
add edi, edx<br />
98
add ebx, edx<br />
lea esi, [esi+0]<br />
loc_8048558: ; CODE XREF: f(int,int *,int *,int *)+CC<br />
mov edx, [edi]<br />
add eax, 1<br />
add edi, 4<br />
add edx, [esi]<br />
add esi, 4<br />
mov [ebx], edx<br />
add ebx, 4<br />
cmp ecx, eax<br />
jg short loc_8048558<br />
add esp, 0Ch<br />
xor eax, eax<br />
pop ebx<br />
pop esi<br />
pop edi<br />
pop ebp<br />
retn<br />
; ---------------------------------------------------------------------------<br />
loc_8048578: ; CODE XREF: f(int,int *,int *,int *)+52<br />
cmp eax, esi<br />
jnb loc_80484C1<br />
jmp loc_80484F8<br />
_Z1fiPiS_S_ endp<br />
Almost the same, however, not as meticulously as Intel C++ doing it.<br />
1.19.2 SIMD strlen() implementation<br />
It should be noted that SIMD-instructions may be inserted in<strong>to</strong> C/C++ code via special macros 73 . As of<br />
MSVC, some of them are located in intrin.h file.<br />
It is possible <strong>to</strong> implement strlen() function 74 using SIMD-instructions, working 2-2.5 times faster than<br />
usual implementation. This function will load 16 characters in<strong>to</strong> XMM-register and check each against zero.<br />
size_t strlen_sse2(const char *str)<br />
{<br />
register size_t len = 0;<br />
const char *s=str;<br />
bool str_is_aligned=(((unsigned int)str)&0xFFFFFFF0) == (unsigned int)str;<br />
if (str_is_aligned==false)<br />
return strlen (str);<br />
__m128i xmm0 = _mm_setzero_si128();<br />
__m128i xmm1;<br />
int mask = 0;<br />
<strong>for</strong> (;;)<br />
{<br />
xmm1 = _mm_load_si128((__m128i *)s);<br />
73 MSDN: MMX, SSE, and SSE2 Intrinsics<br />
74 strlen() — standard C library function <strong>for</strong> calculating string length<br />
99
}<br />
};<br />
xmm1 = _mm_cmpeq_epi8(xmm1, xmm0);<br />
if ((mask = _mm_movemask_epi8(xmm1)) != 0)<br />
{<br />
unsigned long pos;<br />
_BitScanForward(&pos, mask);<br />
len += (size_t)pos;<br />
break;<br />
}<br />
s += sizeof(__m128i);<br />
len += sizeof(__m128i);<br />
return len;<br />
(the example is based on source code from there).<br />
Let’s compile in MSVC 2010 with /Ox option:<br />
_pos$75552 = -4 ; size = 4<br />
_str$ = 8 ; size = 4<br />
?strlen_sse2@@YAIPBD@Z PROC ; strlen_sse2<br />
push ebp<br />
mov ebp, esp<br />
and esp, -16 ; fffffff0H<br />
mov eax, DWORD PTR _str$[ebp]<br />
sub esp, 12 ; 0000000cH<br />
push esi<br />
mov esi, eax<br />
and esi, -16 ; fffffff0H<br />
xor edx, edx<br />
mov ecx, eax<br />
cmp esi, eax<br />
je SHORT $LN4@strlen_sse<br />
lea edx, DWORD PTR [eax+1]<br />
npad 3<br />
$LL11@strlen_sse:<br />
mov cl, BYTE PTR [eax]<br />
inc eax<br />
test cl, cl<br />
jne SHORT $LL11@strlen_sse<br />
sub eax, edx<br />
pop esi<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
$LN4@strlen_sse:<br />
movdqa xmm1, XMMWORD PTR [eax]<br />
pxor xmm0, xmm0<br />
pcmpeqb xmm1, xmm0<br />
pmovmskb eax, xmm1<br />
test eax, eax<br />
jne SHORT $LN9@strlen_sse<br />
$LL3@strlen_sse:<br />
100
movdqa xmm1, XMMWORD PTR [ecx+16]<br />
add ecx, 16 ; 00000010H<br />
pcmpeqb xmm1, xmm0<br />
add edx, 16 ; 00000010H<br />
pmovmskb eax, xmm1<br />
test eax, eax<br />
je SHORT $LL3@strlen_sse<br />
$LN9@strlen_sse:<br />
bsf eax, eax<br />
mov ecx, eax<br />
mov DWORD PTR _pos$75552[esp+16], eax<br />
lea eax, DWORD PTR [ecx+edx]<br />
pop esi<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
?strlen_sse2@@YAIPBD@Z ENDP ; strlen_sse2<br />
First of all, we check str pointer, if it’s aligned by 16-byte border. If not, let’s call usual strlen()<br />
implementation.<br />
Then, load next 16 bytes in<strong>to</strong> XMM1 register using MOVDQA instruction.<br />
Observant reader might ask, why MOVDQU cannot be used here, because it can load data from the memory<br />
regardless the fact if the pointer aligned or not.<br />
Yes, it might be done in this way: if pointer is aligned, load data using MOVDQA, if not — use slower<br />
MOVDQU.<br />
But here we are may stick in<strong>to</strong> hard <strong>to</strong> notice problem:<br />
In Windows NT line of operation systems 75 , but not limited <strong>to</strong> it, memory allocated by pages of 4 KiB<br />
(4096 bytes). Each win32-process have ostensibly 4 GiB, but in fact, only some parts of address space are<br />
connected <strong>to</strong> real physical memory. If the process accessing <strong>to</strong> the absent memory block, exception will be<br />
raised. That’s how virtual memory works 76 .<br />
So, some function loading 16 bytes at once, may step over a border of allocated memory block. Let’s<br />
consider, OS allocated 8192 (0x2000) bytes at the address 0x008c0000. Thus, the block is the bytes starting<br />
from address 0x008c0000 <strong>to</strong> 0x008c1fff inclusive.<br />
After that block, that is, starting from address 0x008c2008 there are nothing at all, e.g., OS not allocated<br />
any memory there. Attempt <strong>to</strong> access a memory starting from that address will raise exception.<br />
And let’s consider, the program holding some string containing 5 characters almost at the end of block,<br />
and that’s not a crime.<br />
0x008c1ff8 ’h’<br />
0x008c1ff9 ’e’<br />
0x008c1ffa ’l’<br />
0x008c1ffb ’l’<br />
0x008c1ffc ’o’<br />
0x008c1ffd ’\x00’<br />
0x008c1ffe random noise<br />
0x008c1fff random noise<br />
So, in usual conditions the program calling strlen() passing it a pointer <strong>to</strong> string ’hello’ lying in<br />
memory at address 0x008c1ff8. strlen() will read one byte at a time until 0x008c1ffd, where zero-byte, and<br />
so here it will s<strong>to</strong>p working.<br />
Now if we implement own strlen() reading 16 byte at once, starting at any address, will it be alligned<br />
or not, MOVDQU may attempt <strong>to</strong> load 16 bytes at once at address 0x008c1ff8 up <strong>to</strong> 0x008c2008, and then<br />
exception will be raised. That’s the situation <strong>to</strong> be avoided, of course.<br />
75 Windows NT, 2000, XP, Vista, 7, 8<br />
76 http://en.wikipedia.org/wiki/Page_(computer_memory)<br />
101
So then we’ll work only with the addresses aligned by 16 byte border, what in combination with a<br />
knowledge of operation system page size is usually aligned by 16 byte <strong>to</strong>o, give us some warranty our<br />
function will not read from unallocated memory.<br />
Let’s back <strong>to</strong> our function.<br />
_mm_setzero_si128() — is a macro generating pxor xmm0, xmm0 — instruction just clear /XMMZERO<br />
register<br />
_mm_load_si128() — is a macro <strong>for</strong> MOVDQA, it just loading 16 bytes from the address in XMM1.<br />
_mm_cmpeq_epi8() — is a macro <strong>for</strong> PCMPEQB, is an instruction comparing two XMM-registers bytewise.<br />
And if some byte was equals <strong>to</strong> other, there will be 0xff at this place in result or 0 if otherwise.<br />
For example.<br />
XMM1: 11223344556677880000000000000000<br />
XMM0: 11ab3444007877881111111111111111<br />
After pcmpeqb xmm1, xmm0 execution, XMM1 register will contain:<br />
XMM1: ff0000ff0000ffff0000000000000000<br />
In our case, this instruction comparing each 16-byte block with the block of 16 zero-bytes, was set in<br />
XMM0 by pxor xmm0, xmm0.<br />
The next macro is _mm_movemask_epi8() — that is PMOVMSKB instruction.<br />
It is very useful if <strong>to</strong> use it with PCMPEQB.<br />
pmovmskb eax, xmm1<br />
This instruction will set first EAX bit in<strong>to</strong> 1 if most significant bit of the first byte in XMM1 is 1. In other<br />
words, if first byte of XMM1 register is 0xff, first EAX bit will be set <strong>to</strong> 1 <strong>to</strong>o.<br />
If second byte in XMM1 register is 0xff, then second EAX bit will be set <strong>to</strong> 1 <strong>to</strong>o. In other words, the<br />
instruction is answer <strong>to</strong> the question which bytes in XMM1 are 0xff? And will prepare 16 bits in EAX. Other<br />
EAX bits will be cleared.<br />
By the way, do not <strong>for</strong>get about this feature of our algorithm:<br />
There might be 16 bytes on input like hello\x00garbage\x00ab<br />
It’s a ’hello’ string, terminating zero after and some random noise in memory.<br />
If we load these 16 bytes in<strong>to</strong> XMM1 and compare them with zeroed XMM0, we will get something like (I<br />
use here order from MSB 77 <strong>to</strong> LSB 78 ):<br />
XMM1: 0000ff00000000000000ff0000000000<br />
This mean, the instruction found two zero bytes, and that’s not surprising.<br />
PMOVMSKB in our case will prepare EAX like (in binary representation): 0010000000100000b.<br />
Obviously, our function should consider only first zero and ignore others.<br />
The next instruction — BSF (Bit Scan Forward). This instruction find first bit set <strong>to</strong> 1 and s<strong>to</strong>res its<br />
position in<strong>to</strong> first operand.<br />
EAX=0010000000100000b<br />
After bsf eax, eax instruction execution, EAX will contain 5, this mean, 1 found at 5th bit position<br />
(starting from zero).<br />
MSVC has a macro <strong>for</strong> this instruction: _BitScanForward.<br />
Now it’s simple. If zero byte found, its position added <strong>to</strong> what we already counted and now we have<br />
ready <strong>to</strong> return result.<br />
Almost all.<br />
By the way, it’s also should be noted, MSVC compiler emitted two loop bodies side by side, <strong>for</strong> optimization.<br />
By the way, SSE 4.2 (appeared in Intel Core i7) offers more instructions where these string manipulations<br />
might be even easier: http://www.strchr.com/strcmp_and_strlen_using_sse_4.2<br />
77 most significant bit<br />
78 least significant bit<br />
102
1.20 x86-64<br />
It’s a 64-bit extension <strong>to</strong> x86-architecture.<br />
From the <strong>reverse</strong> engineer’s perspective, most important differences are:<br />
∙ Almost all registers (except FPU and SIMD) are extended <strong>to</strong> 64 bits and got r- prefix. 8 additional<br />
registers added. Now general purpose registers are: rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi, r8, r9,<br />
r10, r11, r12, r13, r14, r15.<br />
It’s still possible <strong>to</strong> access <strong>to</strong> older register parts as usual. For example, it’s possible <strong>to</strong> access lower<br />
32-bit part of RAX using EAX.<br />
New r8-r15 registers also has its lower parts: r8d-r15d (lower 32-bit parts), r8w-r15w (lower 16-bit<br />
parts), r8b-r15b (lower 8-bit parts).<br />
SIMD-registers number are doubled: from 8 <strong>to</strong> 16: XMM0-XMM15.<br />
∙ In Win64, function calling convention is slightly different, somewhat resembling fastcall 2.5.3. First 4<br />
arguments s<strong>to</strong>red in RCX, RDX, R8, R9 registers, others — in stack. Caller function should also allocate<br />
32 bytes so the callee may save there 4 first arguments and use these registers <strong>for</strong> own needs. Short<br />
functions may use arguments just from registers, but larger may save their values in<strong>to</strong> stack.<br />
See also section about calling conventions 2.5.<br />
∙ C int type is still 32-bit <strong>for</strong> compatibility.<br />
∙ All pointers are 64-bit now.<br />
Since now registers number are doubled, compilers has more space now <strong>for</strong> maneuvering calling register<br />
allocation 79 . What it meanings <strong>for</strong> us, emitted code will contain less local variables.<br />
For example, function calculating first S-box of DES encryption algorithm, it processing 32/64/128/256<br />
values at once (depending on DES_type type (uint32, uint64, SSE2 or AVX)) using bitslice DES method<br />
(read more about this method here 1.19):<br />
/*<br />
* Generated S-box files.<br />
*<br />
* This software may be modified, redistributed, and used <strong>for</strong> any purpose,<br />
* so long as its origin is acknowledged.<br />
*<br />
* Produced by Matthew Kwan - March 1998<br />
*/<br />
#ifdef _WIN64<br />
#define DES_type unsigned __int64<br />
#else<br />
#define DES_type unsigned int<br />
#endif<br />
void<br />
s1 (<br />
DES_type a1,<br />
DES_type a2,<br />
DES_type a3,<br />
DES_type a4,<br />
DES_type a5,<br />
DES_type a6,<br />
79 assigning variables <strong>to</strong> registers<br />
103
) {<br />
DES_type *out1,<br />
DES_type *out2,<br />
DES_type *out3,<br />
DES_type *out4<br />
DES_type x1, x2, x3, x4, x5, x6, x7, x8;<br />
DES_type x9, x10, x11, x12, x13, x14, x15, x16;<br />
DES_type x17, x18, x19, x20, x21, x22, x23, x24;<br />
DES_type x25, x26, x27, x28, x29, x30, x31, x32;<br />
DES_type x33, x34, x35, x36, x37, x38, x39, x40;<br />
DES_type x41, x42, x43, x44, x45, x46, x47, x48;<br />
DES_type x49, x50, x51, x52, x53, x54, x55, x56;<br />
x1 = a3 & ~a5;<br />
x2 = x1 ^ a4;<br />
x3 = a3 & ~a4;<br />
x4 = x3 | a5;<br />
x5 = a6 & x4;<br />
x6 = x2 ^ x5;<br />
x7 = a4 & ~a5;<br />
x8 = a3 ^ a4;<br />
x9 = a6 & ~x8;<br />
x10 = x7 ^ x9;<br />
x11 = a2 | x10;<br />
x12 = x6 ^ x11;<br />
x13 = a5 ^ x5;<br />
x14 = x13 & x8;<br />
x15 = a5 & ~a4;<br />
x16 = x3 ^ x14;<br />
x17 = a6 | x16;<br />
x18 = x15 ^ x17;<br />
x19 = a2 | x18;<br />
x20 = x14 ^ x19;<br />
x21 = a1 & x20;<br />
x22 = x12 ^ ~x21;<br />
*out2 ^= x22;<br />
x23 = x1 | x5;<br />
x24 = x23 ^ x8;<br />
x25 = x18 & ~x2;<br />
x26 = a2 & ~x25;<br />
x27 = x24 ^ x26;<br />
x28 = x6 | x7;<br />
x29 = x28 ^ x25;<br />
x30 = x9 ^ x24;<br />
x31 = x18 & ~x30;<br />
x32 = a2 & x31;<br />
x33 = x29 ^ x32;<br />
x34 = a1 & x33;<br />
x35 = x27 ^ x34;<br />
*out4 ^= x35;<br />
x36 = a3 & x28;<br />
x37 = x18 & ~x36;<br />
x38 = a2 | x3;<br />
x39 = x37 ^ x38;<br />
104
}<br />
x40 = a3 | x31;<br />
x41 = x24 & ~x37;<br />
x42 = x41 | x3;<br />
x43 = x42 & ~a2;<br />
x44 = x40 ^ x43;<br />
x45 = a1 & ~x44;<br />
x46 = x39 ^ ~x45;<br />
*out1 ^= x46;<br />
x47 = x33 & ~x9;<br />
x48 = x47 ^ x39;<br />
x49 = x4 ^ x36;<br />
x50 = x49 & ~x5;<br />
x51 = x42 | x18;<br />
x52 = x51 ^ a5;<br />
x53 = a2 & ~x52;<br />
x54 = x50 ^ x53;<br />
x55 = a1 | x54;<br />
x56 = x48 ^ ~x55;<br />
*out3 ^= x56;<br />
There is a lot of local variables. Of course, not them all will be in local stack. Let’s compile it with<br />
MSVC 2008 with /Ox option:<br />
PUBLIC _s1<br />
; Function compile flags: /Ogtpy<br />
_TEXT SEGMENT<br />
_x6$ = -20 ; size = 4<br />
_x3$ = -16 ; size = 4<br />
_x1$ = -12 ; size = 4<br />
_x8$ = -8 ; size = 4<br />
_x4$ = -4 ; size = 4<br />
_a1$ = 8 ; size = 4<br />
_a2$ = 12 ; size = 4<br />
_a3$ = 16 ; size = 4<br />
_x33$ = 20 ; size = 4<br />
_x7$ = 20 ; size = 4<br />
_a4$ = 20 ; size = 4<br />
_a5$ = 24 ; size = 4<br />
tv326 = 28 ; size = 4<br />
_x36$ = 28 ; size = 4<br />
_x28$ = 28 ; size = 4<br />
_a6$ = 28 ; size = 4<br />
_out1$ = 32 ; size = 4<br />
_x24$ = 36 ; size = 4<br />
_out2$ = 36 ; size = 4<br />
_out3$ = 40 ; size = 4<br />
_out4$ = 44 ; size = 4<br />
_s1 PROC<br />
sub esp, 20 ; 00000014H<br />
mov edx, DWORD PTR _a5$[esp+16]<br />
push ebx<br />
mov ebx, DWORD PTR _a4$[esp+20]<br />
push ebp<br />
push esi<br />
105
mov esi, DWORD PTR _a3$[esp+28]<br />
push edi<br />
mov edi, ebx<br />
not edi<br />
mov ebp, edi<br />
and edi, DWORD PTR _a5$[esp+32]<br />
mov ecx, edx<br />
not ecx<br />
and ebp, esi<br />
mov eax, ecx<br />
and eax, esi<br />
and ecx, ebx<br />
mov DWORD PTR _x1$[esp+36], eax<br />
xor eax, ebx<br />
mov esi, ebp<br />
or esi, edx<br />
mov DWORD PTR _x4$[esp+36], esi<br />
and esi, DWORD PTR _a6$[esp+32]<br />
mov DWORD PTR _x7$[esp+32], ecx<br />
mov edx, esi<br />
xor edx, eax<br />
mov DWORD PTR _x6$[esp+36], edx<br />
mov edx, DWORD PTR _a3$[esp+32]<br />
xor edx, ebx<br />
mov ebx, esi<br />
xor ebx, DWORD PTR _a5$[esp+32]<br />
mov DWORD PTR _x8$[esp+36], edx<br />
and ebx, edx<br />
mov ecx, edx<br />
mov edx, ebx<br />
xor edx, ebp<br />
or edx, DWORD PTR _a6$[esp+32]<br />
not ecx<br />
and ecx, DWORD PTR _a6$[esp+32]<br />
xor edx, edi<br />
mov edi, edx<br />
or edi, DWORD PTR _a2$[esp+32]<br />
mov DWORD PTR _x3$[esp+36], ebp<br />
mov ebp, DWORD PTR _a2$[esp+32]<br />
xor edi, ebx<br />
and edi, DWORD PTR _a1$[esp+32]<br />
mov ebx, ecx<br />
xor ebx, DWORD PTR _x7$[esp+32]<br />
not edi<br />
or ebx, ebp<br />
xor edi, ebx<br />
mov ebx, edi<br />
mov edi, DWORD PTR _out2$[esp+32]<br />
xor ebx, DWORD PTR [edi]<br />
not eax<br />
xor ebx, DWORD PTR _x6$[esp+36]<br />
and eax, edx<br />
mov DWORD PTR [edi], ebx<br />
mov ebx, DWORD PTR _x7$[esp+32]<br />
106
or ebx, DWORD PTR _x6$[esp+36]<br />
mov edi, esi<br />
or edi, DWORD PTR _x1$[esp+36]<br />
mov DWORD PTR _x28$[esp+32], ebx<br />
xor edi, DWORD PTR _x8$[esp+36]<br />
mov DWORD PTR _x24$[esp+32], edi<br />
xor edi, ecx<br />
not edi<br />
and edi, edx<br />
mov ebx, edi<br />
and ebx, ebp<br />
xor ebx, DWORD PTR _x28$[esp+32]<br />
xor ebx, eax<br />
not eax<br />
mov DWORD PTR _x33$[esp+32], ebx<br />
and ebx, DWORD PTR _a1$[esp+32]<br />
and eax, ebp<br />
xor eax, ebx<br />
mov ebx, DWORD PTR _out4$[esp+32]<br />
xor eax, DWORD PTR [ebx]<br />
xor eax, DWORD PTR _x24$[esp+32]<br />
mov DWORD PTR [ebx], eax<br />
mov eax, DWORD PTR _x28$[esp+32]<br />
and eax, DWORD PTR _a3$[esp+32]<br />
mov ebx, DWORD PTR _x3$[esp+36]<br />
or edi, DWORD PTR _a3$[esp+32]<br />
mov DWORD PTR _x36$[esp+32], eax<br />
not eax<br />
and eax, edx<br />
or ebx, ebp<br />
xor ebx, eax<br />
not eax<br />
and eax, DWORD PTR _x24$[esp+32]<br />
not ebp<br />
or eax, DWORD PTR _x3$[esp+36]<br />
not esi<br />
and ebp, eax<br />
or eax, edx<br />
xor eax, DWORD PTR _a5$[esp+32]<br />
mov edx, DWORD PTR _x36$[esp+32]<br />
xor edx, DWORD PTR _x4$[esp+36]<br />
xor ebp, edi<br />
mov edi, DWORD PTR _out1$[esp+32]<br />
not eax<br />
and eax, DWORD PTR _a2$[esp+32]<br />
not ebp<br />
and ebp, DWORD PTR _a1$[esp+32]<br />
and edx, esi<br />
xor eax, edx<br />
or eax, DWORD PTR _a1$[esp+32]<br />
not ebp<br />
xor ebp, DWORD PTR [edi]<br />
not ecx<br />
and ecx, DWORD PTR _x33$[esp+32]<br />
107
xor ebp, ebx<br />
not eax<br />
mov DWORD PTR [edi], ebp<br />
xor eax, ecx<br />
mov ecx, DWORD PTR _out3$[esp+32]<br />
xor eax, DWORD PTR [ecx]<br />
pop edi<br />
pop esi<br />
xor eax, ebx<br />
pop ebp<br />
mov DWORD PTR [ecx], eax<br />
pop ebx<br />
add esp, 20 ; 00000014H<br />
ret 0<br />
_s1 ENDP<br />
5 variables was allocated in local stack by compiler.<br />
Now let’s try the same thing in 64-bit version of MSVC 2008:<br />
a1$ = 56<br />
a2$ = 64<br />
a3$ = 72<br />
a4$ = 80<br />
x36$1$ = 88<br />
a5$ = 88<br />
a6$ = 96<br />
out1$ = 104<br />
out2$ = 112<br />
out3$ = 120<br />
out4$ = 128<br />
s1 PROC<br />
$LN3:<br />
mov QWORD PTR [rsp+24], rbx<br />
mov QWORD PTR [rsp+32], rbp<br />
mov QWORD PTR [rsp+16], rdx<br />
mov QWORD PTR [rsp+8], rcx<br />
push rsi<br />
push rdi<br />
push r12<br />
push r13<br />
push r14<br />
push r15<br />
mov r15, QWORD PTR a5$[rsp]<br />
mov rcx, QWORD PTR a6$[rsp]<br />
mov rbp, r8<br />
mov r10, r9<br />
mov rax, r15<br />
mov rdx, rbp<br />
not rax<br />
xor rdx, r9<br />
not r10<br />
mov r11, rax<br />
and rax, r9<br />
mov rsi, r10<br />
mov QWORD PTR x36$1$[rsp], rax<br />
108
and r11, r8<br />
and rsi, r8<br />
and r10, r15<br />
mov r13, rdx<br />
mov rbx, r11<br />
xor rbx, r9<br />
mov r9, QWORD PTR a2$[rsp]<br />
mov r12, rsi<br />
or r12, r15<br />
not r13<br />
and r13, rcx<br />
mov r14, r12<br />
and r14, rcx<br />
mov rax, r14<br />
mov r8, r14<br />
xor r8, rbx<br />
xor rax, r15<br />
not rbx<br />
and rax, rdx<br />
mov rdi, rax<br />
xor rdi, rsi<br />
or rdi, rcx<br />
xor rdi, r10<br />
and rbx, rdi<br />
mov rcx, rdi<br />
or rcx, r9<br />
xor rcx, rax<br />
mov rax, r13<br />
xor rax, QWORD PTR x36$1$[rsp]<br />
and rcx, QWORD PTR a1$[rsp]<br />
or rax, r9<br />
not rcx<br />
xor rcx, rax<br />
mov rax, QWORD PTR out2$[rsp]<br />
xor rcx, QWORD PTR [rax]<br />
xor rcx, r8<br />
mov QWORD PTR [rax], rcx<br />
mov rax, QWORD PTR x36$1$[rsp]<br />
mov rcx, r14<br />
or rax, r8<br />
or rcx, r11<br />
mov r11, r9<br />
xor rcx, rdx<br />
mov QWORD PTR x36$1$[rsp], rax<br />
mov r8, rsi<br />
mov rdx, rcx<br />
xor rdx, r13<br />
not rdx<br />
and rdx, rdi<br />
mov r10, rdx<br />
and r10, r9<br />
xor r10, rax<br />
xor r10, rbx<br />
not rbx<br />
109
and rbx, r9<br />
mov rax, r10<br />
and rax, QWORD PTR a1$[rsp]<br />
xor rbx, rax<br />
mov rax, QWORD PTR out4$[rsp]<br />
xor rbx, QWORD PTR [rax]<br />
xor rbx, rcx<br />
mov QWORD PTR [rax], rbx<br />
mov rbx, QWORD PTR x36$1$[rsp]<br />
and rbx, rbp<br />
mov r9, rbx<br />
not r9<br />
and r9, rdi<br />
or r8, r11<br />
mov rax, QWORD PTR out1$[rsp]<br />
xor r8, r9<br />
not r9<br />
and r9, rcx<br />
or rdx, rbp<br />
mov rbp, QWORD PTR [rsp+80]<br />
or r9, rsi<br />
xor rbx, r12<br />
mov rcx, r11<br />
not rcx<br />
not r14<br />
not r13<br />
and rcx, r9<br />
or r9, rdi<br />
and rbx, r14<br />
xor r9, r15<br />
xor rcx, rdx<br />
mov rdx, QWORD PTR a1$[rsp]<br />
not r9<br />
not rcx<br />
and r13, r10<br />
and r9, r11<br />
and rcx, rdx<br />
xor r9, rbx<br />
mov rbx, QWORD PTR [rsp+72]<br />
not rcx<br />
xor rcx, QWORD PTR [rax]<br />
or r9, rdx<br />
not r9<br />
xor rcx, r8<br />
mov QWORD PTR [rax], rcx<br />
mov rax, QWORD PTR out3$[rsp]<br />
xor r9, r13<br />
xor r9, QWORD PTR [rax]<br />
xor r9, r8<br />
mov QWORD PTR [rax], r9<br />
pop r15<br />
pop r14<br />
pop r13<br />
pop r12<br />
110
pop rdi<br />
pop rsi<br />
ret 0<br />
s1 ENDP<br />
Nothing allocated in local stack by compiler, x36 is synonym <strong>for</strong> a5.<br />
By the way, we can see here, the function saved RCX and RDX registers in allocated by caller space, but<br />
R8 and R9 are not saved but used from the beginning.<br />
By the way, there are CPUs with much more general purpose registers, Itanium, <strong>for</strong> example — 128<br />
registers.<br />
111
Chapter 2<br />
Couple things <strong>to</strong> add<br />
112
2.1 LEA instruction<br />
LEA (Load Effective Address) is instruction intended not <strong>for</strong> values summing but <strong>for</strong> address <strong>for</strong>ming, <strong>for</strong><br />
example, <strong>for</strong> <strong>for</strong>ming address of array element by adding array address, element index, with multiplication<br />
of element size 1 .<br />
Important property of LEA instruction is that it do not alter processor flags.<br />
int f(int a, int b)<br />
{<br />
return a*8+b;<br />
};<br />
MSVC 2010 with /Ox option:<br />
_a$ = 8 ; size = 4<br />
_b$ = 12 ; size = 4<br />
_f PROC<br />
mov eax, DWORD PTR _b$[esp-4]<br />
mov ecx, DWORD PTR _a$[esp-4]<br />
lea eax, DWORD PTR [eax+ecx*8]<br />
ret 0<br />
_f ENDP<br />
1 See also: http://en.wikipedia.org/wiki/Addressing_mode<br />
113
2.2 Function prologue and epilogue<br />
Function prologue is instructions at function start. It is often something like this:<br />
push ebp<br />
mov ebp, esp<br />
sub esp, X<br />
What these instruction do: save EBP register value, set EBP <strong>to</strong> ESP and then allocate space in stack <strong>for</strong><br />
local variables.<br />
EBP value is fixed over a period of function execution and it will be used <strong>for</strong> local variables and arguments<br />
access. One can use ESP, but it changing over time and it is not handy.<br />
Function epilogue annuled allocated space in stack, returning EBP value <strong>to</strong> initial state and returning<br />
control flow <strong>to</strong> callee:<br />
mov esp, ebp<br />
pop ebp<br />
ret 0<br />
Epilogue and prologue can make recursion per<strong>for</strong>mance worse.<br />
For example, once upon a time I wrote a function <strong>to</strong> seek right tree node. As a recursive function it would<br />
look stylish but because some time was spent at each function call <strong>for</strong> prologue/epilogue, it was working<br />
couple of times slower than the implementation without recursion.<br />
By the way, that is the reason of tail call 2 existence: when compiler (or interpreter) trans<strong>for</strong>ms recursion<br />
(with which it’s possible: tail recursion) in<strong>to</strong> iteration <strong>for</strong> efficiency.<br />
2 http://en.wikipedia.org/wiki/Tail_call<br />
114
2.3 npad<br />
It’s an assembler macro <strong>for</strong> label aligning by some specific border.<br />
That’s often need <strong>for</strong> the busy labels <strong>to</strong> where control flow is often passed, <strong>for</strong> example, loop begin. So<br />
the CPU will effectively load data or code from the memory, through memory bus, cache lines, etc.<br />
Taken from listing.inc (MSVC):<br />
By the way, it’s curious example of different NOP variations. All these instructions has no effects at all,<br />
but has different size.<br />
;; LISTING.INC<br />
;;<br />
;; This file contains assembler macros and is included by the files created<br />
;; with the -FA compiler switch <strong>to</strong> be assembled by MASM (Microsoft Macro<br />
;; Assembler).<br />
;;<br />
;; Copyright (c) 1993-2003, Microsoft Corporation. All rights reserved.<br />
;; non destructive nops<br />
npad macro size<br />
if size eq 1<br />
nop<br />
else<br />
if size eq 2<br />
mov edi, edi<br />
else<br />
if size eq 3<br />
; lea ecx, [ecx+00]<br />
DB 8DH, 49H, 00H<br />
else<br />
if size eq 4<br />
; lea esp, [esp+00]<br />
DB 8DH, 64H, 24H, 00H<br />
else<br />
if size eq 5<br />
add eax, DWORD PTR 0<br />
else<br />
if size eq 6<br />
; lea ebx, [ebx+00000000]<br />
DB 8DH, 9BH, 00H, 00H, 00H, 00H<br />
else<br />
if size eq 7<br />
; lea esp, [esp+00000000]<br />
DB 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H<br />
else<br />
if size eq 8<br />
; jmp .+8; .npad 6<br />
DB 0EBH, 06H, 8DH, 9BH, 00H, 00H, 00H, 00H<br />
else<br />
if size eq 9<br />
; jmp .+9; .npad 7<br />
DB 0EBH, 07H, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H<br />
else<br />
if size eq 10<br />
; jmp .+A; .npad 7; .npad 1<br />
115
00H<br />
DB 0EBH, 08H, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H, 90H<br />
else<br />
if size eq 11<br />
; jmp .+B; .npad 7; .npad 2<br />
DB 0EBH, 09H, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H, 8BH, 0FFH<br />
else<br />
if size eq 12<br />
; jmp .+C; .npad 7; .npad 3<br />
DB 0EBH, 0AH, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H, 8DH, 49H, 00H<br />
else<br />
if size eq 13<br />
; jmp .+D; .npad 7; .npad 4<br />
DB 0EBH, 0BH, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H, 8DH, 64H, 24H, 00H<br />
else<br />
if size eq 14<br />
; jmp .+E; .npad 7; .npad 5<br />
DB 0EBH, 0CH, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H, 05H, 00H, 00H, 00H, 00H<br />
else<br />
if size eq 15<br />
; jmp .+F; .npad 7; .npad 6<br />
DB 0EBH, 0DH, 8DH, 0A4H, 24H, 00H, 00H, 00H, 00H, 8DH, 9BH, 00H, 00H, 00H,<br />
else<br />
%out error: unsupported npad size<br />
.err<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endif<br />
endm<br />
116
2.4 Signed number representations<br />
There are several methods of representing signed numbers 3 , but in x86 architecture used “two’s complement”.<br />
Difference between signed and unsigned numbers is that if we represent 0xFFFFFFFE and 0x0000002<br />
as unsigned, then first number (4294967294) is bigger than second (2). If <strong>to</strong> represent them both as signed,<br />
first will be -2, and it is lesser than second (2). That is the reason why conditional jumps 1.7 are present<br />
both <strong>for</strong> signed (<strong>for</strong> example, JG, JL) and unsigned (JA, JBE) operations.<br />
2.4.1 Integer overflow<br />
It is worth noting that incorrect representation of number can lead integer overflow vulnerability.<br />
For example, we have some network service, it receives network packets. In that packets there are also<br />
field where subpacket length is coded. It is 32-bit value. After network packet received, service checking that<br />
field, and if it is larger than, <strong>for</strong> example, some MAX_PACKET_SIZE (let’s say, 10 kilobytes), packet ignored as<br />
incorrect. Comparison is signed. Intruder set this value <strong>to</strong> 0xFFFFFFFF. While comparison, this number<br />
is considered as signed -1 and it’s lesser than 10 kilobytes. No error here. Service would like <strong>to</strong> copy that<br />
subpacket <strong>to</strong> another place in memory and call memcpy (dst, src, 0xFFFFFFFF) function: this operation,<br />
rapidly scratching a lot of inside of process memory.<br />
More about it: http://www.phrack.org/issues.html?issue=60&id=10<br />
3 http://en.wikipedia.org/wiki/Signed_number_representations<br />
117
2.5 Arguments passing methods (calling conventions)<br />
2.5.1 cdecl<br />
This is the most popular method <strong>for</strong> arguments passing <strong>to</strong> functions in C/C++ languages.<br />
Caller pushing arguments <strong>to</strong> stack in <strong>reverse</strong> order: last argument, then penultimate element and finally<br />
— first argument. Caller also should return back ESP <strong>to</strong> its initial state after callee function exit.<br />
push arg3<br />
push arg2<br />
push arg3<br />
call function<br />
add esp, 12 ; return ESP<br />
2.5.2 stdcall<br />
Almost the same thing as cdecl, with the exception that callee set ESP <strong>to</strong> initial state executing RET x<br />
instruction instead of RET, where x = arguments number * sizeof(int) 4 . Caller will not adjust stack<br />
pointer by add esp, x instruction.<br />
push arg3<br />
push arg2<br />
push arg1<br />
call function<br />
function:<br />
... do something ...<br />
ret 12<br />
This method is ubiqui<strong>to</strong>us in win32 standard libraries, but not in win64 (see below about win64).<br />
Variable arguments number functions<br />
printf()-like functions are, probably, the only case of variable arguments functions in C/C++, but it’s<br />
easy <strong>to</strong> illustrate an important difference between cdecl and stdcall with help of it. Let’s start with the idea<br />
that compiler knows argument count of each printf() function calling. However, called printf(), which is<br />
already compiled and located in MSVCRT.DLL (if <strong>to</strong> talk about Windows), do not have in<strong>for</strong>mation about<br />
how much arguments were passed, however it can determine it from <strong>for</strong>mat string. Thus, if printf() would<br />
be stdcall-function and res<strong>to</strong>red stack pointer <strong>to</strong> its initial state by counting number of arguments in <strong>for</strong>mat<br />
string, this could be dangerous situation, when one programmer’s typo may provoke sudden program crash.<br />
Thus it’s not suitable <strong>for</strong> such functions <strong>to</strong> use stdcall, cdecl is better.<br />
2.5.3 fastcall<br />
That’s general naming <strong>for</strong> a method <strong>for</strong> passing some of arguments via registers and all others — via stack.<br />
It worked faster than cdecl/stdcall on older CPUs. It’s not a standardized way, so, different compilers may<br />
do it differently. Of course, if you have two DLLs, one use another, and they are built by different compilers<br />
with fastcall calling conventions, there will be a problems.<br />
Both MSVC and GCC passing first and second argument via ECX and EDX and other arguments via stack.<br />
Caller should res<strong>to</strong>re stack pointer in<strong>to</strong> initial state.<br />
Stack pointer should be res<strong>to</strong>red <strong>to</strong> initial state by callee, like in stdcall.<br />
push arg3<br />
mov edx, arg2<br />
mov ecx, arg1<br />
4 Size of int type variable is 4 in x86 systems and 8 in x64 systems<br />
118
call function<br />
function:<br />
.. do something ..<br />
ret 4<br />
GCC regparm<br />
It’s fastcall evolution 5 is some sense. With the -mregparm option it’s possible <strong>to</strong> set, how many arguments<br />
will be passed via registers. 3 at maximum. Thus, EAX, EDX and ECX registers will be used.<br />
Of course, if number of arguments is less then 3, not all registers 3 will be used.<br />
Caller res<strong>to</strong>res stack pointer <strong>to</strong> its initial state.<br />
2.5.4 thiscall<br />
In C++, it’s a this pointer <strong>to</strong> object passing in<strong>to</strong> function-method.<br />
In MSVC, this is usually passed in ECX register.<br />
In GCC, this pointer is passed as a first function-method argument. Thus it will be seen that internally<br />
all function-methods has extra argument <strong>for</strong> it.<br />
2.5.5 x86-64<br />
win64<br />
The method of arguments passing in Win64 is somewhat resembling <strong>to</strong> fastcall. First 4 arguments are<br />
passed via RCX, RDX, R8, R9, others — via stack. Caller also must prepare a place <strong>for</strong> 32 bytes or 4 64-bit<br />
values, so then callee can save there first 4 arguments. Short functions may use argument values just from<br />
registers, but larger may save its values <strong>for</strong> further use.<br />
Caller also should return stack pointer in<strong>to</strong> initial state.<br />
This calling convention is also used in Windows x86-64 system DLLs (instead if stdcall in win32).<br />
2.5.6 Returning values of float and double type<br />
In all conventions except of Win64, values of type float or double returning via FPU register ST(0).<br />
In Win64, return values of float and double types are returned in XMM0 register instead of ST(0).<br />
5 http://www.ohse.de/uwe/articles/gcc-attributes.html#func-regparm<br />
119
Chapter 3<br />
Finding important/interesting stuff in the<br />
code<br />
Minimalism it’s not a significant feature of modern software.<br />
But not because programmers wrote a lot, but because all libraries are usually linked statically <strong>to</strong> executable<br />
files. If all external libraries were shifted in<strong>to</strong> external DLL files, the world would be different.<br />
Thus, it’s very important <strong>to</strong> determine origin of some function, if it’s from standard library or well-known<br />
library (like Boost 1 , libpng 2 ), and which one — is related <strong>to</strong> what we are trying <strong>to</strong> find in the code.<br />
It’s just absurdly <strong>to</strong> rewrite all code <strong>to</strong> C/C++ <strong>to</strong> find what we looking <strong>for</strong>.<br />
One of the primary <strong>reverse</strong> engineer’s task is <strong>to</strong> find quickly in the code what is needed.<br />
IDA 5 disassembler can search among text strings, byte sequences, constants. It’s even possible <strong>to</strong> export<br />
the code in<strong>to</strong> .lst or .asm text file and then use grep, awk, etc.<br />
When you try <strong>to</strong> understand what some code is doing, this easily could be some open-source library like<br />
libpng. So when you see some constants or text strings looks familiar, it’s always worth <strong>to</strong> google it. And<br />
if you find the opensource project where it’s used, then it will be enough just <strong>to</strong> compare the functions. It<br />
may solve some part of problem.<br />
For example, once upon a time I tried <strong>to</strong> understand how SAP 6.0 network packets compression/decompression<br />
is working. It’s a huge software, but a detailed .PDB with debugging in<strong>for</strong>mation is present, and<br />
that’s cosily. I finally came <strong>to</strong> idea that one of the functions doing decompressing of network packet called<br />
CsDecomprLZC(). Immediately I tried <strong>to</strong> google its name and I quickly found that the function named as<br />
the same is used in MaxDB (it’s open-source SAP project) 3 .<br />
http://www.google.com/search?q=CsDecomprLZC<br />
As<strong>to</strong>undingly, MaxDB and SAP 6.0 software shared the same code <strong>for</strong> network packets compression/decompression.<br />
3.1 Communication with the outer world<br />
First what <strong>to</strong> look on is which functions from operation system API and standard libraries are used.<br />
If the program is divided in<strong>to</strong> main executable file and a group of DLL-files, sometimes, these function’s<br />
names may be helpful.<br />
If we are interesting, what exactly may lead <strong>to</strong> MessageBox() call with specific text, first what we can<br />
try <strong>to</strong> do: find this text in data segment, find references <strong>to</strong> it and find the points from which a control may<br />
be passed <strong>to</strong> MessageBox() call we’re interesting in.<br />
If we are talking about some game and we’re interesting, which events are more or less random in it, we<br />
may try <strong>to</strong> find rand() function or its replacement (like Mersenne twister algorithm) and find a places from<br />
which this function called and most important: how the results are used.<br />
But if it’s not a game, but rand() is used, it’s also interesing, why. There are cases of unexpected rand()<br />
usage in data compression algorithm (<strong>for</strong> encryption imitation): http://blog.yurichev.com/node/44.<br />
1 http://www.boost.org/<br />
2 http://www.libpng.org/pub/png/libpng.html<br />
3 More about it in releval section 7.2<br />
120
3.2 String<br />
Debugging messages are often very helpful if present. In some sense, debugging messages are reporting about<br />
what’s going on in program right now. Often these are printf()-like functions, which writes <strong>to</strong> log-files, and<br />
sometimes, not writing anything but calls are still present, because this build is not debug build but release<br />
one. If local or global variables are dumped in debugging messages, it might be helpful as well because it’s<br />
possible <strong>to</strong> get variable names at least. For example, one of such functions in Oracle RDBMS is ksdwrt().<br />
Sometimes assert() macro presence is useful <strong>to</strong>o: usually, this macro leave in code source file name, line<br />
number and condition.<br />
Meaningful text strings are often helpful. IDA 5 disassembler may show from which function and from<br />
which point this specific string is used. Funny cases sometimes happen.<br />
Paradoxically, but error messages may help us as well. In Oracle RDBMS, errors are reporting using<br />
group of functions. More about it.<br />
It’s possible <strong>to</strong> find very quickly, which functions reporting about errors and in which conditions. By<br />
the way, it’s often a reason why copy-protection systems has inarticulate cryptic error messages or just error<br />
numbers. No one happy when software cracker quickly understand why copy-protection is triggered just by<br />
error message.<br />
3.3 Constants<br />
Some algorithms, especially cryp<strong>to</strong>graphical, use distinct constants, which is easy <strong>to</strong> find in code using IDA 5.<br />
For example, MD5 4 algorithm initializes its own internal variables like:<br />
var int h0 := 0x67452301<br />
var int h1 := 0xEFCDAB89<br />
var int h2 := 0x98BADCFE<br />
var int h3 := 0x10325476<br />
If you find these four constants usage in the code in a row — it’s very high probability this function is<br />
related <strong>to</strong> MD5.<br />
3.3.1 Magic numbers<br />
A lot of file <strong>for</strong>mats defining a standard file header where magic number 5 is used.<br />
For example, all Win32 and MS-DOS executables are started with two characters “MZ” 6 .<br />
At the MIDI-file beginning “MThd” signature must be present. If we have a program that using MIDI-files<br />
<strong>for</strong> something, very likely, it will check MIDI-files <strong>for</strong> validity by checking at least first 4 bytes.<br />
This could be done like:<br />
(buf pointing <strong>to</strong> the beginning of loaded file in<strong>to</strong> memory)<br />
cmp [buf], 0x6468544D ; "MThd"<br />
jnz _error_not_a_MIDI_file<br />
... or by calling function <strong>for</strong> comparing memory blocks memcmp() or any other equivalent code up <strong>to</strong><br />
CMPSB instruction.<br />
When you find such place you already may say where MIDI-file loading is beginning, also, we could see<br />
a location of MIDI-file contents buffer and what is used from that buffer and how.<br />
4 http://en.wikipedia.org/wiki/MD5<br />
5 http://en.wikipedia.org/wiki/Magic_number_(programming)<br />
6 http://en.wikipedia.org/wiki/DOS_MZ_executable<br />
121
DHCP<br />
This applies <strong>to</strong> network pro<strong>to</strong>cols as well. For example, DHCP pro<strong>to</strong>col network packets contains so called<br />
magic cookie: 0x63538263. Any code generating DHCP pro<strong>to</strong>col packets somewhere and somehow should<br />
embed this constant in<strong>to</strong> packet. If we find it in the code we may find where it happen and not only this.<br />
Something that received DHCP packet should check magic cookie, comparing it with the constant.<br />
For example, let’s take dhcpcore.dll file from Windows 7 x64 and search <strong>for</strong> the constant. And we found it,<br />
two times: it seems, that constant is used in two functions eloquently named as DhcpExtractOptionsForValidation()<br />
and DhcpExtractFullOptions():<br />
.rdata:000007FF6483CBE8 dword_7FF6483CBE8 dd 63538263h ; DATA XREF:<br />
DhcpExtractOptionsForValidation+79<br />
.rdata:000007FF6483CBEC dword_7FF6483CBEC dd 63538263h ; DATA XREF:<br />
DhcpExtractFullOptions+97<br />
And the places where these constants accessed:<br />
.text:000007FF6480875F mov eax, [rsi]<br />
.text:000007FF64808761 cmp eax, cs:dword_7FF6483CBE8<br />
.text:000007FF64808767 jnz loc_7FF64817179<br />
And:<br />
.text:000007FF648082C7 mov eax, [r12]<br />
.text:000007FF648082CB cmp eax, cs:dword_7FF6483CBEC<br />
.text:000007FF648082D1 jnz loc_7FF648173AF<br />
3.4 Finding the right instructions<br />
If the program is using FPU instructions and there are very few of them in a code, one can try <strong>to</strong> check each<br />
by debugger.<br />
For example, we may be interesting, how Microsoft Excel calculating <strong>for</strong>mulae entered by user. For<br />
example, division operation.<br />
If <strong>to</strong> load excel.exe (from Office 2010) version 14.0.4756.1000 in<strong>to</strong> IDA 5, then make a full listing and <strong>to</strong><br />
find each FDIV instructions (except ones which use constants as a second operand — obviously, it’s not suits<br />
us):<br />
cat EXCEL.lst | grep fdiv | grep -v dbl_ > EXCEL.fdiv<br />
... then we realizing they are just 144.<br />
We can enter string like =(1/3) in Excel and check each instruction.<br />
Checking each instruction in debugger or tracer 5.0.1 (one may check 4 instruction at a time), it seems,<br />
we are lucky here and sought-<strong>for</strong> instruction is just 14th:<br />
.text:3011E919 DC 33 fdiv qword ptr [ebx]<br />
PID=13944|TID=28744|(0) 0x2f64e919 (Excel.exe!BASE+0x11e919)<br />
EAX=0x02088006 EBX=0x02088018 ECX=0x00000001 EDX=0x00000001<br />
ESI=0x02088000 EDI=0x00544804 EBP=0x0274FA3C ESP=0x0274F9F8<br />
EIP=0x2F64E919<br />
FLAGS=PF IF<br />
FPU ControlWord=IC RC=NEAR PC=64bits PM UM OM ZM DM IM<br />
FPU StatusWord=<br />
FPU ST(0): 1.000000<br />
ST(0) holding first argument (1) and second one is in [ebx].<br />
Next instruction after FDIV writes result in<strong>to</strong> memory:<br />
122
.text:3011E91B DD 1E fstp qword ptr [esi]<br />
If <strong>to</strong> set breakpoint on it, we may see result:<br />
PID=32852|TID=36488|(0) 0x2f40e91b (Excel.exe!BASE+0x11e91b)<br />
EAX=0x00598006 EBX=0x00598018 ECX=0x00000001 EDX=0x00000001<br />
ESI=0x00598000 EDI=0x00294804 EBP=0x026CF93C ESP=0x026CF8F8<br />
EIP=0x2F40E91B<br />
FLAGS=PF IF<br />
FPU ControlWord=IC RC=NEAR PC=64bits PM UM OM ZM DM IM<br />
FPU StatusWord=C1 P<br />
FPU ST(0): 0.333333<br />
Also, as a practical joke, we can modify it on-fly:<br />
gt -l:excel.exe bpx=excel.exe!base+0x11E91B,set(st0,666)<br />
PID=36540|TID=24056|(0) 0x2f40e91b (Excel.exe!BASE+0x11e91b)<br />
EAX=0x00680006 EBX=0x00680018 ECX=0x00000001 EDX=0x00000001<br />
ESI=0x00680000 EDI=0x00395404 EBP=0x0290FD9C ESP=0x0290FD58<br />
EIP=0x2F40E91B<br />
FLAGS=PF IF<br />
FPU ControlWord=IC RC=NEAR PC=64bits PM UM OM ZM DM IM<br />
FPU StatusWord=C1 P<br />
FPU ST(0): 0.333333<br />
Set ST0 register <strong>to</strong> 666.000000<br />
Excel showing 666 in that cell what finally convincing us we find the right place.<br />
Figure 3.1: Practical joke worked<br />
If <strong>to</strong> try the same Excel version, but x64, we’ll see there are only 12 FDIV instructions, and the one we<br />
looking <strong>for</strong> — third.<br />
gt64.exe -l:excel.exe bpx=excel.exe!base+0x1B7FCC,set(st0,666)<br />
It seems, a lot of division operations of float and double types, compiler replaced by SSE-instructions like<br />
DIVSD (DIVSD present here 268 in <strong>to</strong>tal).<br />
3.5 Suspicious code patterns<br />
Modern compilers do not emit LOOP and RCL instructions. On the other hand, these instructions are wellknown<br />
<strong>to</strong> coders who like <strong>to</strong> code in straight assembler. If you spot these, it can be said, with a big probability,<br />
this piece of code is hand-written.<br />
123
Chapter 4<br />
Tasks<br />
There are two questions almost <strong>for</strong> every task, if otherwise isn’t specified:<br />
1) What this function does? Answer in one-sentence <strong>for</strong>m.<br />
2) Rewrite this function in<strong>to</strong> C/C++.<br />
Hints and solutions are in the appendix of this brochure.<br />
4.1 Easy level<br />
4.1.1 Task 1.1<br />
This is standard C library function. Source code taken from OpenWatcom. Compiled in MSVC 2010.<br />
_TEXT SEGMENT<br />
_input$ = 8 ; size = 1<br />
_f PROC<br />
push ebp<br />
mov ebp, esp<br />
movsx eax, BYTE PTR _input$[ebp]<br />
cmp eax, 97 ; 00000061H<br />
jl SHORT $LN1@f<br />
movsx ecx, BYTE PTR _input$[ebp]<br />
cmp ecx, 122 ; 0000007aH<br />
jg SHORT $LN1@f<br />
movsx edx, BYTE PTR _input$[ebp]<br />
sub edx, 32 ; 00000020H<br />
mov BYTE PTR _input$[ebp], dl<br />
$LN1@f:<br />
mov al, BYTE PTR _input$[ebp]<br />
pop ebp<br />
ret 0<br />
_f ENDP<br />
_TEXT ENDS<br />
It is the same code compiled by GCC 4.4.1 with -O3 option (maximum optimization):<br />
_f proc near<br />
input = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
movzx eax, byte ptr [ebp+input]<br />
lea edx, [eax-61h]<br />
124
loc_80483F2:<br />
pop ebp<br />
retn<br />
_f endp<br />
4.1.2 Task 1.2<br />
cmp dl, 19h<br />
ja short loc_80483F2<br />
sub eax, 20h<br />
This is also standard C library function. Source code is taken from OpenWatcom and modified slightly.<br />
Compiled in MSVC 2010 with /Ox optimization flag.<br />
This function also use these standard C functions: isspace() and isdigit().<br />
EXTRN _isdigit:PROC<br />
EXTRN _isspace:PROC<br />
EXTRN ___ptr_check:PROC<br />
; Function compile flags: /Ogtpy<br />
_TEXT SEGMENT<br />
_p$ = 8 ; size = 4<br />
_f PROC<br />
push ebx<br />
push esi<br />
mov esi, DWORD PTR _p$[esp+4]<br />
push edi<br />
push 0<br />
push esi<br />
call ___ptr_check<br />
mov eax, DWORD PTR [esi]<br />
push eax<br />
call _isspace<br />
add esp, 12 ; 0000000cH<br />
test eax, eax<br />
je SHORT $LN6@f<br />
npad 2<br />
$LL7@f:<br />
mov ecx, DWORD PTR [esi+4]<br />
add esi, 4<br />
push ecx<br />
call _isspace<br />
add esp, 4<br />
test eax, eax<br />
jne SHORT $LL7@f<br />
$LN6@f:<br />
mov bl, BYTE PTR [esi]<br />
cmp bl, 43 ; 0000002bH<br />
je SHORT $LN4@f<br />
cmp bl, 45 ; 0000002dH<br />
jne SHORT $LN5@f<br />
$LN4@f:<br />
add esi, 4<br />
$LN5@f:<br />
mov edx, DWORD PTR [esi]<br />
125
push edx<br />
xor edi, edi<br />
call _isdigit<br />
add esp, 4<br />
test eax, eax<br />
je SHORT $LN2@f<br />
$LL3@f:<br />
mov ecx, DWORD PTR [esi]<br />
mov edx, DWORD PTR [esi+4]<br />
add esi, 4<br />
lea eax, DWORD PTR [edi+edi*4]<br />
push edx<br />
lea edi, DWORD PTR [ecx+eax*2-48]<br />
call _isdigit<br />
add esp, 4<br />
test eax, eax<br />
jne SHORT $LL3@f<br />
$LN2@f:<br />
cmp bl, 45 ; 0000002dH<br />
jne SHORT $LN14@f<br />
neg edi<br />
$LN14@f:<br />
mov eax, edi<br />
pop edi<br />
pop esi<br />
pop ebx<br />
ret 0<br />
_f ENDP<br />
_TEXT ENDS<br />
Same code compiled in GCC 4.4.1. This task is sligthly harder because GCC compiled isspace() and<br />
isdigit() functions like inline-functions and inserted their bodies right in<strong>to</strong> code.<br />
_f proc near<br />
var_10 = dword ptr -10h<br />
var_9 = byte ptr -9<br />
input = dword ptr 8<br />
loc_804840C:<br />
loc_8048410:<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 18h<br />
jmp short loc_8048410<br />
add [ebp+input], 4<br />
call ___ctype_b_loc<br />
mov edx, [eax]<br />
mov eax, [ebp+input]<br />
mov eax, [eax]<br />
add eax, eax<br />
lea eax, [edx+eax]<br />
movzx eax, word ptr [eax]<br />
movzx eax, ax<br />
126
loc_8048444:<br />
loc_8048448:<br />
loc_8048451:<br />
loc_8048471:<br />
and eax, 2000h<br />
test eax, eax<br />
jnz short loc_804840C<br />
mov eax, [ebp+input]<br />
mov eax, [eax]<br />
mov [ebp+var_9], al<br />
cmp [ebp+var_9], ’+’<br />
jz short loc_8048444<br />
cmp [ebp+var_9], ’-’<br />
jnz short loc_8048448<br />
add [ebp+input], 4<br />
mov [ebp+var_10], 0<br />
jmp short loc_8048471<br />
mov edx, [ebp+var_10]<br />
mov eax, edx<br />
shl eax, 2<br />
add eax, edx<br />
add eax, eax<br />
mov edx, eax<br />
mov eax, [ebp+input]<br />
mov eax, [eax]<br />
lea eax, [edx+eax]<br />
sub eax, 30h<br />
mov [ebp+var_10], eax<br />
add [ebp+input], 4<br />
call ___ctype_b_loc<br />
mov edx, [eax]<br />
mov eax, [ebp+input]<br />
mov eax, [eax]<br />
add eax, eax<br />
lea eax, [edx+eax]<br />
movzx eax, word ptr [eax]<br />
movzx eax, ax<br />
and eax, 800h<br />
test eax, eax<br />
jnz short loc_8048451<br />
cmp [ebp+var_9], 2Dh<br />
jnz short loc_804849A<br />
neg [ebp+var_10]<br />
loc_804849A:<br />
mov eax, [ebp+var_10]<br />
leave<br />
retn<br />
_f endp<br />
127
4.1.3 Task 1.3<br />
This is standard C function <strong>to</strong>o, actually, two functions working in pair. Source code taken from MSVC<br />
2010 and modified sligthly.<br />
The matter of modification is that this function can work properly in multi-threaded environment, and<br />
I removed its support <strong>for</strong> simplification (or <strong>for</strong> confusion).<br />
Compiled in MSVC 2010 with /Ox flag.<br />
_BSS SEGMENT<br />
_v DD 01H DUP (?)<br />
_BSS ENDS<br />
_TEXT SEGMENT<br />
_s$ = 8 ; size = 4<br />
f1 PROC<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _s$[ebp]<br />
mov DWORD PTR _v, eax<br />
pop ebp<br />
ret 0<br />
f1 ENDP<br />
_TEXT ENDS<br />
PUBLIC f2<br />
_TEXT SEGMENT<br />
f2 PROC<br />
push ebp<br />
mov ebp, esp<br />
mov eax, DWORD PTR _v<br />
imul eax, 214013 ; 000343fdH<br />
add eax, 2531011 ; 00269ec3H<br />
mov DWORD PTR _v, eax<br />
mov eax, DWORD PTR _v<br />
shr eax, 16 ; 00000010H<br />
and eax, 32767 ; 00007fffH<br />
pop ebp<br />
ret 0<br />
f2 ENDP<br />
_TEXT ENDS<br />
END<br />
Same code compiled in GCC 4.4.1:<br />
public f1<br />
f1 proc near<br />
arg_0 = dword ptr 8<br />
push ebp<br />
mov ebp, esp<br />
mov eax, [ebp+arg_0]<br />
mov ds:v, eax<br />
pop ebp<br />
retn<br />
f1 endp<br />
128
public f2<br />
f2 proc near<br />
push ebp<br />
mov ebp, esp<br />
mov eax, ds:v<br />
imul eax, 343FDh<br />
add eax, 269EC3h<br />
mov ds:v, eax<br />
mov eax, ds:v<br />
shr eax, 10h<br />
and eax, 7FFFh<br />
pop ebp<br />
retn<br />
f2 endp<br />
bss segment dword public ’BSS’ use32<br />
assume cs:_bss<br />
dd ?<br />
bss ends<br />
4.1.4 Task 1.4<br />
This is standard C library function. Source code taken from MSVC 2010. Compiled in MSVC 2010 with<br />
/Ox flag.<br />
PUBLIC _f<br />
_TEXT SEGMENT<br />
_arg1$ = 8 ; size = 4<br />
_arg2$ = 12 ; size = 4<br />
_f PROC<br />
push esi<br />
mov esi, DWORD PTR _arg1$[esp]<br />
push edi<br />
mov edi, DWORD PTR _arg2$[esp+4]<br />
cmp BYTE PTR [edi], 0<br />
mov eax, esi<br />
je SHORT $LN7@f<br />
mov dl, BYTE PTR [esi]<br />
push ebx<br />
test dl, dl<br />
je SHORT $LN4@f<br />
sub esi, edi<br />
npad 6<br />
$LL5@f:<br />
mov ecx, edi<br />
test dl, dl<br />
je SHORT $LN2@f<br />
$LL3@f:<br />
mov dl, BYTE PTR [ecx]<br />
test dl, dl<br />
je SHORT $LN14@f<br />
movsx ebx, BYTE PTR [esi+ecx]<br />
movsx edx, dl<br />
129
sub ebx, edx<br />
jne SHORT $LN2@f<br />
inc ecx<br />
cmp BYTE PTR [esi+ecx], bl<br />
jne SHORT $LL3@f<br />
$LN2@f:<br />
cmp BYTE PTR [ecx], 0<br />
je SHORT $LN14@f<br />
mov dl, BYTE PTR [eax+1]<br />
inc eax<br />
inc esi<br />
test dl, dl<br />
jne SHORT $LL5@f<br />
xor eax, eax<br />
pop ebx<br />
pop edi<br />
pop esi<br />
ret 0<br />
_f ENDP<br />
_TEXT ENDS<br />
END<br />
Same code compiled in GCC 4.4.1:<br />
public f<br />
f proc near<br />
var_C = dword ptr -0Ch<br />
var_8 = dword ptr -8<br />
var_4 = dword ptr -4<br />
arg_0 = dword ptr 8<br />
arg_4 = dword ptr 0Ch<br />
loc_80483F4:<br />
loc_8048402:<br />
push ebp<br />
mov ebp, esp<br />
sub esp, 10h<br />
mov eax, [ebp+arg_0]<br />
mov [ebp+var_4], eax<br />
mov eax, [ebp+arg_4]<br />
movzx eax, byte ptr [eax]<br />
test al, al<br />
jnz short loc_8048443<br />
mov eax, [ebp+arg_0]<br />
jmp short locret_8048453<br />
mov eax, [ebp+var_4]<br />
mov [ebp+var_8], eax<br />
mov eax, [ebp+arg_4]<br />
mov [ebp+var_C], eax<br />
jmp short loc_804840A<br />
add [ebp+var_8], 1<br />
add [ebp+var_C], 1<br />
130
loc_804840A:<br />
loc_804842E:<br />
loc_804843D:<br />
loc_8048443:<br />
loc_8048444:<br />
mov eax, [ebp+var_8]<br />
movzx eax, byte ptr [eax]<br />
test al, al<br />
jz short loc_804842E<br />
mov eax, [ebp+var_C]<br />
movzx eax, byte ptr [eax]<br />
test al, al<br />
jz short loc_804842E<br />
mov eax, [ebp+var_8]<br />
movzx edx, byte ptr [eax]<br />
mov eax, [ebp+var_C]<br />
movzx eax, byte ptr [eax]<br />
cmp dl, al<br />
jz short loc_8048402<br />
mov eax, [ebp+var_C]<br />
movzx eax, byte ptr [eax]<br />
test al, al<br />
jnz short loc_804843D<br />
mov eax, [ebp+var_4]<br />
jmp short locret_8048453<br />
add [ebp+var_4], 1<br />
jmp short loc_8048444<br />
nop<br />
locret_8048453:<br />
leave<br />
retn<br />
f endp<br />
4.1.5 Task 1.5<br />
mov eax, [ebp+var_4]<br />
movzx eax, byte ptr [eax]<br />
test al, al<br />
jnz short loc_80483F4<br />
mov eax, 0<br />
This task is rather on knowledge than on reading code.<br />
The function is taken from OpenWatcom. Compiled in MSVC 2010 with /Ox flag.<br />
_DATA SEGMENT<br />
COMM __v:DWORD<br />
_DATA ENDS<br />
PUBLIC __real@3e45798ee2308c3a<br />
131
PUBLIC __real@4147ffff80000000<br />
PUBLIC __real@4150017ec0000000<br />
PUBLIC _f<br />
EXTRN __fltused:DWORD<br />
CONST SEGMENT<br />
__real@3e45798ee2308c3a DQ 03e45798ee2308c3ar ; 1e-008<br />
__real@4147ffff80000000 DQ 04147ffff80000000r ; 3.14573e+006<br />
__real@4150017ec0000000 DQ 04150017ec0000000r ; 4.19584e+006<br />
CONST ENDS<br />
_TEXT SEGMENT<br />
_v1$ = -16 ; size = 8<br />
_v2$ = -8 ; size = 8<br />
_f PROC<br />
sub esp, 16 ; 00000010H<br />
fld QWORD PTR __real@4150017ec0000000<br />
fstp QWORD PTR _v1$[esp+16]<br />
fld QWORD PTR __real@4147ffff80000000<br />
fstp QWORD PTR _v2$[esp+16]<br />
fld QWORD PTR _v1$[esp+16]<br />
fld QWORD PTR _v1$[esp+16]<br />
fdiv QWORD PTR _v2$[esp+16]<br />
fmul QWORD PTR _v2$[esp+16]<br />
fsubp ST(1), ST(0)<br />
fcomp QWORD PTR __real@3e45798ee2308c3a<br />
fnstsw ax<br />
test ah, 65 ; 00000041H<br />
jne SHORT $LN1@f<br />
or DWORD PTR __v, 1<br />
$LN1@f:<br />
add esp, 16 ; 00000010H<br />
ret 0<br />
_f ENDP<br />
_TEXT ENDS<br />
4.1.6 Task 1.6<br />
Compiled in MSVC 2010 with /Ox option.<br />
PUBLIC _f<br />
; Function compile flags: /Ogtpy<br />
_TEXT SEGMENT<br />
_k0$ = -12 ; size = 4<br />
_k3$ = -8 ; size = 4<br />
_k2$ = -4 ; size = 4<br />
_v$ = 8 ; size = 4<br />
_k1$ = 12 ; size = 4<br />
_k$ = 12 ; size = 4<br />
_f PROC<br />
sub esp, 12 ; 0000000cH<br />
mov ecx, DWORD PTR _v$[esp+8]<br />
mov eax, DWORD PTR [ecx]<br />
mov ecx, DWORD PTR [ecx+4]<br />
push ebx<br />
push esi<br />
132
mov esi, DWORD PTR _k$[esp+16]<br />
push edi<br />
mov edi, DWORD PTR [esi]<br />
mov DWORD PTR _k0$[esp+24], edi<br />
mov edi, DWORD PTR [esi+4]<br />
mov DWORD PTR _k1$[esp+20], edi<br />
mov edi, DWORD PTR [esi+8]<br />
mov esi, DWORD PTR [esi+12]<br />
xor edx, edx<br />
mov DWORD PTR _k2$[esp+24], edi<br />
mov DWORD PTR _k3$[esp+24], esi<br />
lea edi, DWORD PTR [edx+32]<br />
$LL8@f:<br />
mov esi, ecx<br />
shr esi, 5<br />
add esi, DWORD PTR _k1$[esp+20]<br />
mov ebx, ecx<br />
shl ebx, 4<br />
add ebx, DWORD PTR _k0$[esp+24]<br />
sub edx, 1640531527 ; 61c88647H<br />
xor esi, ebx<br />
lea ebx, DWORD PTR [edx+ecx]<br />
xor esi, ebx<br />
add eax, esi<br />
mov esi, eax<br />
shr esi, 5<br />
add esi, DWORD PTR _k3$[esp+24]<br />
mov ebx, eax<br />
shl ebx, 4<br />
add ebx, DWORD PTR _k2$[esp+24]<br />
xor esi, ebx<br />
lea ebx, DWORD PTR [edx+eax]<br />
xor esi, ebx<br />
add ecx, esi<br />
dec edi<br />
jne SHORT $LL8@f<br />
mov edx, DWORD PTR _v$[esp+20]<br />
pop edi<br />
pop esi<br />
mov DWORD PTR [edx], eax<br />
mov DWORD PTR [edx+4], ecx<br />
pop ebx<br />
add esp, 12 ; 0000000cH<br />
ret 0<br />
_f ENDP<br />
4.1.7 Task 1.7<br />
This function is taken from Linux 2.6 kernel.<br />
Compiled in MSVC 2010 with /Ox option:<br />
_table db 000h, 080h, 040h, 0c0h, 020h, 0a0h, 060h, 0e0h<br />
db 010h, 090h, 050h, 0d0h, 030h, 0b0h, 070h, 0f0h<br />
db 008h, 088h, 048h, 0c8h, 028h, 0a8h, 068h, 0e8h<br />
133
db 018h, 098h, 058h, 0d8h, 038h, 0b8h, 078h, 0f8h<br />
db 004h, 084h, 044h, 0c4h, 024h, 0a4h, 064h, 0e4h<br />
db 014h, 094h, 054h, 0d4h, 034h, 0b4h, 074h, 0f4h<br />
db 00ch, 08ch, 04ch, 0cch, 02ch, 0ach, 06ch, 0ech<br />
db 01ch, 09ch, 05ch, 0dch, 03ch, 0bch, 07ch, 0fch<br />
db 002h, 082h, 042h, 0c2h, 022h, 0a2h, 062h, 0e2h<br />
db 012h, 092h, 052h, 0d2h, 032h, 0b2h, 072h, 0f2h<br />
db 00ah, 08ah, 04ah, 0cah, 02ah, 0aah, 06ah, 0eah<br />
db 01ah, 09ah, 05ah, 0dah, 03ah, 0bah, 07ah, 0fah<br />
db 006h, 086h, 046h, 0c6h, 026h, 0a6h, 066h, 0e6h<br />
db 016h, 096h, 056h, 0d6h, 036h, 0b6h, 076h, 0f6h<br />
db 00eh, 08eh, 04eh, 0ceh, 02eh, 0aeh, 06eh, 0eeh<br />
db 01eh, 09eh, 05eh, 0deh, 03eh, 0beh, 07eh, 0feh<br />
db 001h, 081h, 041h, 0c1h, 021h, 0a1h, 061h, 0e1h<br />
db 011h, 091h, 051h, 0d1h, 031h, 0b1h, 071h, 0f1h<br />
db 009h, 089h, 049h, 0c9h, 029h, 0a9h, 069h, 0e9h<br />
db 019h, 099h, 059h, 0d9h, 039h, 0b9h, 079h, 0f9h<br />
db 005h, 085h, 045h, 0c5h, 025h, 0a5h, 065h, 0e5h<br />
db 015h, 095h, 055h, 0d5h, 035h, 0b5h, 075h, 0f5h<br />
db 00dh, 08dh, 04dh, 0cdh, 02dh, 0adh, 06dh, 0edh<br />
db 01dh, 09dh, 05dh, 0ddh, 03dh, 0bdh, 07dh, 0fdh<br />
db 003h, 083h, 043h, 0c3h, 023h, 0a3h, 063h, 0e3h<br />
db 013h, 093h, 053h, 0d3h, 033h, 0b3h, 073h, 0f3h<br />
db 00bh, 08bh, 04bh, 0cbh, 02bh, 0abh, 06bh, 0ebh<br />
db 01bh, 09bh, 05bh, 0dbh, 03bh, 0bbh, 07bh, 0fbh<br />
db 007h, 087h, 047h, 0c7h, 027h, 0a7h, 067h, 0e7h<br />
db 017h, 097h, 057h, 0d7h, 037h, 0b7h, 077h, 0f7h<br />
db 00fh, 08fh, 04fh, 0cfh, 02fh, 0afh, 06fh, 0efh<br />
db 01fh, 09fh, 05fh, 0dfh, 03fh, 0bfh, 07fh, 0ffh<br />
f proc near<br />
arg_0 = dword ptr 4<br />
mov edx, [esp+arg_0]<br />
movzx eax, dl<br />
movzx eax, _table[eax]<br />
mov ecx, edx<br />
shr edx, 8<br />
movzx edx, dl<br />
movzx edx, _table[edx]<br />
shl ax, 8<br />
movzx eax, ax<br />
or eax, edx<br />
shr ecx, 10h<br />
movzx edx, cl<br />
movzx edx, _table[edx]<br />
shr ecx, 8<br />
movzx ecx, cl<br />
movzx ecx, _table[ecx]<br />
shl dx, 8<br />
movzx edx, dx<br />
shl eax, 10h<br />
or edx, ecx<br />
134
or eax, edx<br />
retn<br />
f endp<br />
4.1.8 Task 1.8<br />
Compiled in MSVC 2010 with /O1 option 1 :<br />
_a$ = 8 ; size = 4<br />
_b$ = 12 ; size = 4<br />
_c$ = 16 ; size = 4<br />
?s@@YAXPAN00@Z PROC ; s, COMDAT<br />
mov eax, DWORD PTR _b$[esp-4]<br />
mov ecx, DWORD PTR _a$[esp-4]<br />
mov edx, DWORD PTR _c$[esp-4]<br />
push esi<br />
push edi<br />
sub ecx, eax<br />
sub edx, eax<br />
mov edi, 200 ; 000000c8H<br />
$LL6@s:<br />
push 100 ; 00000064H<br />
pop esi<br />
$LL3@s:<br />
fld QWORD PTR [ecx+eax]<br />
fadd QWORD PTR [eax]<br />
fstp QWORD PTR [edx+eax]<br />
add eax, 8<br />
dec esi<br />
jne SHORT $LL3@s<br />
dec edi<br />
jne SHORT $LL6@s<br />
pop edi<br />
pop esi<br />
ret 0<br />
?s@@YAXPAN00@Z ENDP ; s<br />
4.1.9 Task 1.9<br />
Compiled in MSVC 2010 with /O1 option:<br />
tv315 = -8 ; size = 4<br />
tv291 = -4 ; size = 4<br />
_a$ = 8 ; size = 4<br />
_b$ = 12 ; size = 4<br />
_c$ = 16 ; size = 4<br />
?m@@YAXPAN00@Z PROC ; m, COMDAT<br />
push ebp<br />
mov ebp, esp<br />
push ecx<br />
push ecx<br />
mov edx, DWORD PTR _a$[ebp]<br />
push ebx<br />
1 /O1: minimize space<br />
135
mov ebx, DWORD PTR _c$[ebp]<br />
push esi<br />
mov esi, DWORD PTR _b$[ebp]<br />
sub edx, esi<br />
push edi<br />
sub esi, ebx<br />
mov DWORD PTR tv315[ebp], 100 ; 00000064H<br />
$LL9@m:<br />
mov eax, ebx<br />
mov DWORD PTR tv291[ebp], 300 ; 0000012cH<br />
$LL6@m:<br />
fldz<br />
lea ecx, DWORD PTR [esi+eax]<br />
fstp QWORD PTR [eax]<br />
mov edi, 200 ; 000000c8H<br />
$LL3@m:<br />
dec edi<br />
fld QWORD PTR [ecx+edx]<br />
fmul QWORD PTR [ecx]<br />
fadd QWORD PTR [eax]<br />
fstp QWORD PTR [eax]<br />
jne HORT $LL3@m<br />
add eax, 8<br />
dec DWORD PTR tv291[ebp]<br />
jne SHORT $LL6@m<br />
add ebx, 800 ; 00000320H<br />
dec DWORD PTR tv315[ebp]<br />
jne SHORT $LL9@m<br />
pop edi<br />
pop esi<br />
pop ebx<br />
leave<br />
ret 0<br />
?m@@YAXPAN00@Z ENDP ; m<br />
4.1.10 Task 1.10<br />
If <strong>to</strong> compile this piece of code and run, some number will be printed. Where it came from? Where it came<br />
from if <strong>to</strong> compile it in MSVC with optimization (/Ox)?<br />
#include <br />
int main()<br />
{<br />
printf ("%d\n");<br />
};<br />
return 0;<br />
136
4.2 Middle level<br />
4.2.1 Task 2.1<br />
Well-known algorithm, also included in standard C library. Source code was taken from glibc 2.11.1. Compiled<br />
in GCC 4.4.1 with -Os option (code size optimization). Listing was done by IDA 4.9 disassembler from<br />
ELF-file generated by GCC and linker.<br />
For those who wants use IDA while learning, here you may find .elf and .idb files, .idb can be opened<br />
with freeware IDA 4.9:<br />
http://conus.info/RE-tasks/middle/1/<br />
f proc near<br />
var_150 = dword ptr -150h<br />
var_14C = dword ptr -14Ch<br />
var_13C = dword ptr -13Ch<br />
var_138 = dword ptr -138h<br />
var_134 = dword ptr -134h<br />
var_130 = dword ptr -130h<br />
var_128 = dword ptr -128h<br />
var_124 = dword ptr -124h<br />
var_120 = dword ptr -120h<br />
var_11C = dword ptr -11Ch<br />
var_118 = dword ptr -118h<br />
var_114 = dword ptr -114h<br />
var_110 = dword ptr -110h<br />
var_C = dword ptr -0Ch<br />
arg_0 = dword ptr 8<br />
arg_4 = dword ptr 0Ch<br />
arg_8 = dword ptr 10h<br />
arg_C = dword ptr 14h<br />
arg_10 = dword ptr 18h<br />
push ebp<br />
mov ebp, esp<br />
push edi<br />
push esi<br />
push ebx<br />
sub esp, 14Ch<br />
mov ebx, [ebp+arg_8]<br />
cmp [ebp+arg_4], 0<br />
jz loc_804877D<br />
cmp [ebp+arg_4], 4<br />
lea eax, ds:0[ebx*4]<br />
mov [ebp+var_130], eax<br />
jbe loc_804864C<br />
mov eax, [ebp+arg_4]<br />
mov ecx, ebx<br />
mov esi, [ebp+arg_0]<br />
lea edx, [ebp+var_110]<br />
neg ecx<br />
mov [ebp+var_118], 0<br />
mov [ebp+var_114], 0<br />
dec eax<br />
imul eax, ebx<br />
137
add eax, [ebp+arg_0]<br />
mov [ebp+var_11C], edx<br />
mov [ebp+var_134], ecx<br />
mov [ebp+var_124], eax<br />
lea eax, [ebp+var_118]<br />
mov [ebp+var_14C], eax<br />
mov [ebp+var_120], ebx<br />
loc_8048433: ; CODE XREF: f+28C<br />
mov eax, [ebp+var_124]<br />
xor edx, edx<br />
push edi<br />
push [ebp+arg_10]<br />
sub eax, esi<br />
div [ebp+var_120]<br />
push esi<br />
shr eax, 1<br />
imul eax, [ebp+var_120]<br />
lea edx, [esi+eax]<br />
push edx<br />
mov [ebp+var_138], edx<br />
call [ebp+arg_C]<br />
add esp, 10h<br />
mov edx, [ebp+var_138]<br />
test eax, eax<br />
jns short loc_8048482<br />
xor eax, eax<br />
loc_804846D: ; CODE XREF: f+CC<br />
mov cl, [edx+eax]<br />
mov bl, [esi+eax]<br />
mov [edx+eax], bl<br />
mov [esi+eax], cl<br />
inc eax<br />
cmp [ebp+var_120], eax<br />
jnz short loc_804846D<br />
loc_8048482: ; CODE XREF: f+B5<br />
push ebx<br />
push [ebp+arg_10]<br />
mov [ebp+var_138], edx<br />
push edx<br />
push [ebp+var_124]<br />
call [ebp+arg_C]<br />
mov edx, [ebp+var_138]<br />
add esp, 10h<br />
test eax, eax<br />
jns short loc_80484F6<br />
mov ecx, [ebp+var_124]<br />
xor eax, eax<br />
loc_80484AB: ; CODE XREF: f+10D<br />
movzx edi, byte ptr [edx+eax]<br />
mov bl, [ecx+eax]<br />
138
mov [edx+eax], bl<br />
mov ebx, edi<br />
mov [ecx+eax], bl<br />
inc eax<br />
cmp [ebp+var_120], eax<br />
jnz short loc_80484AB<br />
push ecx<br />
push [ebp+arg_10]<br />
mov [ebp+var_138], edx<br />
push esi<br />
push edx<br />
call [ebp+arg_C]<br />
add esp, 10h<br />
mov edx, [ebp+var_138]<br />
test eax, eax<br />
jns short loc_80484F6<br />
xor eax, eax<br />
loc_80484E1: ; CODE XREF: f+140<br />
mov cl, [edx+eax]<br />
mov bl, [esi+eax]<br />
mov [edx+eax], bl<br />
mov [esi+eax], cl<br />
inc eax<br />
cmp [ebp+var_120], eax<br />
jnz short loc_80484E1<br />
loc_80484F6: ; CODE XREF: f+ED<br />
; f+129<br />
mov eax, [ebp+var_120]<br />
mov edi, [ebp+var_124]<br />
add edi, [ebp+var_134]<br />
lea ebx, [esi+eax]<br />
jmp short loc_8048513<br />
; ---------------------------------------------------------------------------<br />
loc_804850D: ; CODE XREF: f+17B<br />
add ebx, [ebp+var_120]<br />
loc_8048513: ; CODE XREF: f+157<br />
; f+1F9<br />
push eax<br />
push [ebp+arg_10]<br />
mov [ebp+var_138], edx<br />
push edx<br />
push ebx<br />
call [ebp+arg_C]<br />
add esp, 10h<br />
mov edx, [ebp+var_138]<br />
test eax, eax<br />
jns short loc_8048537<br />
jmp short loc_804850D<br />
; ---------------------------------------------------------------------------<br />
139
loc_8048531: ; CODE XREF: f+19D<br />
add edi, [ebp+var_134]<br />
loc_8048537: ; CODE XREF: f+179<br />
push ecx<br />
push [ebp+arg_10]<br />
mov [ebp+var_138], edx<br />
push edi<br />
push edx<br />
call [ebp+arg_C]<br />
add esp, 10h<br />
mov edx, [ebp+var_138]<br />
test eax, eax<br />
js short loc_8048531<br />
cmp ebx, edi<br />
jnb short loc_8048596<br />
xor eax, eax<br />
mov [ebp+var_128], edx<br />
loc_804855F: ; CODE XREF: f+1BE<br />
mov cl, [ebx+eax]<br />
mov dl, [edi+eax]<br />
mov [ebx+eax], dl<br />
mov [edi+eax], cl<br />
inc eax<br />
cmp [ebp+var_120], eax<br />
jnz short loc_804855F<br />
mov edx, [ebp+var_128]<br />
cmp edx, ebx<br />
jnz short loc_8048582<br />
mov edx, edi<br />
jmp short loc_8048588<br />
; ---------------------------------------------------------------------------<br />
loc_8048582: ; CODE XREF: f+1C8<br />
cmp edx, edi<br />
jnz short loc_8048588<br />
mov edx, ebx<br />
loc_8048588: ; CODE XREF: f+1CC<br />
; f+1D0<br />
add ebx, [ebp+var_120]<br />
add edi, [ebp+var_134]<br />
jmp short loc_80485AB<br />
; ---------------------------------------------------------------------------<br />
loc_8048596: ; CODE XREF: f+1A1<br />
jnz short loc_80485AB<br />
mov ecx, [ebp+var_134]<br />
mov eax, [ebp+var_120]<br />
lea edi, [ebx+ecx]<br />
add ebx, eax<br />
jmp short loc_80485B3<br />
; ---------------------------------------------------------------------------<br />
140
loc_80485AB: ; CODE XREF: f+1E0<br />
; f:loc_8048596<br />
cmp ebx, edi<br />
jbe loc_8048513<br />
loc_80485B3: ; CODE XREF: f+1F5<br />
mov eax, edi<br />
sub eax, esi<br />
cmp eax, [ebp+var_130]<br />
ja short loc_80485EB<br />
mov eax, [ebp+var_124]<br />
mov esi, ebx<br />
sub eax, ebx<br />
cmp eax, [ebp+var_130]<br />
ja short loc_8048634<br />
sub [ebp+var_11C], 8<br />
mov edx, [ebp+var_11C]<br />
mov ecx, [edx+4]<br />
mov esi, [edx]<br />
mov [ebp+var_124], ecx<br />
jmp short loc_8048634<br />
; ---------------------------------------------------------------------------<br />
loc_80485EB: ; CODE XREF: f+209<br />
mov edx, [ebp+var_124]<br />
sub edx, ebx<br />
cmp edx, [ebp+var_130]<br />
jbe short loc_804862E<br />
cmp eax, edx<br />
mov edx, [ebp+var_11C]<br />
lea eax, [edx+8]<br />
jle short loc_8048617<br />
mov [edx], esi<br />
mov esi, ebx<br />
mov [edx+4], edi<br />
mov [ebp+var_11C], eax<br />
jmp short loc_8048634<br />
; ---------------------------------------------------------------------------<br />
loc_8048617: ; CODE XREF: f+252<br />
mov ecx, [ebp+var_11C]<br />
mov [ebp+var_11C], eax<br />
mov [ecx], ebx<br />
mov ebx, [ebp+var_124]<br />
mov [ecx+4], ebx<br />
loc_804862E: ; CODE XREF: f+245<br />
mov [ebp+var_124], edi<br />
loc_8048634: ; CODE XREF: f+21B<br />
; f+235 ...<br />
mov eax, [ebp+var_14C]<br />
cmp [ebp+var_11C], eax<br />
141
ja loc_8048433<br />
mov ebx, [ebp+var_120]<br />
loc_804864C: ; CODE XREF: f+2A<br />
mov eax, [ebp+arg_4]<br />
mov ecx, [ebp+arg_0]<br />
add ecx, [ebp+var_130]<br />
dec eax<br />
imul eax, ebx<br />
add eax, [ebp+arg_0]<br />
cmp ecx, eax<br />
mov [ebp+var_120], eax<br />
jbe short loc_804866B<br />
mov ecx, eax<br />
loc_804866B: ; CODE XREF: f+2B3<br />
mov esi, [ebp+arg_0]<br />
mov edi, [ebp+arg_0]<br />
add esi, ebx<br />
mov edx, esi<br />
jmp short loc_80486A3<br />
; ---------------------------------------------------------------------------<br />
loc_8048677: ; CODE XREF: f+2F1<br />
push eax<br />
push [ebp+arg_10]<br />
mov [ebp+var_138], edx<br />
mov [ebp+var_13C], ecx<br />
push edi<br />
push edx<br />
call [ebp+arg_C]<br />
add esp, 10h<br />
mov edx, [ebp+var_138]<br />
mov ecx, [ebp+var_13C]<br />
test eax, eax<br />
jns short loc_80486A1<br />
mov edi, edx<br />
loc_80486A1: ; CODE XREF: f+2E9<br />
add edx, ebx<br />
loc_80486A3: ; CODE XREF: f+2C1<br />
cmp edx, ecx<br />
jbe short loc_8048677<br />
cmp edi, [ebp+arg_0]<br />
jz loc_8048762<br />
xor eax, eax<br />
loc_80486B2: ; CODE XREF: f+313<br />
mov ecx, [ebp+arg_0]<br />
mov dl, [edi+eax]<br />
mov cl, [ecx+eax]<br />
mov [edi+eax], cl<br />
mov ecx, [ebp+arg_0]<br />
142
mov [ecx+eax], dl<br />
inc eax<br />
cmp ebx, eax<br />
jnz short loc_80486B2<br />
jmp loc_8048762<br />
; ---------------------------------------------------------------------------<br />
loc_80486CE: ; CODE XREF: f+3C3<br />
lea edx, [esi+edi]<br />
jmp short loc_80486D5<br />
; ---------------------------------------------------------------------------<br />
loc_80486D3: ; CODE XREF: f+33B<br />
add edx, edi<br />
loc_80486D5: ; CODE XREF: f+31D<br />
push eax<br />
push [ebp+arg_10]<br />
mov [ebp+var_138], edx<br />
push edx<br />
push esi<br />
call [ebp+arg_C]<br />
add esp, 10h<br />
mov edx, [ebp+var_138]<br />
test eax, eax<br />
js short loc_80486D3<br />
add edx, ebx<br />
cmp edx, esi<br />
mov [ebp+var_124], edx<br />
jz short loc_804876F<br />
mov edx, [ebp+var_134]<br />
lea eax, [esi+ebx]<br />
add edx, eax<br />
mov [ebp+var_11C], edx<br />
jmp short loc_804875B<br />
; ---------------------------------------------------------------------------<br />
loc_8048710: ; CODE XREF: f+3AA<br />
mov cl, [eax]<br />
mov edx, [ebp+var_11C]<br />
mov [ebp+var_150], eax<br />
mov byte ptr [ebp+var_130], cl<br />
mov ecx, eax<br />
jmp short loc_8048733<br />
; ---------------------------------------------------------------------------<br />
loc_8048728: ; CODE XREF: f+391<br />
mov al, [edx+ebx]<br />
mov [ecx], al<br />
mov ecx, [ebp+var_128]<br />
loc_8048733: ; CODE XREF: f+372<br />
mov [ebp+var_128], edx<br />
add edx, edi<br />
143
mov eax, edx<br />
sub eax, edi<br />
cmp [ebp+var_124], eax<br />
jbe short loc_8048728<br />
mov dl, byte ptr [ebp+var_130]<br />
mov eax, [ebp+var_150]<br />
mov [ecx], dl<br />
dec [ebp+var_11C]<br />
loc_804875B: ; CODE XREF: f+35A<br />
dec eax<br />
cmp eax, esi<br />
jnb short loc_8048710<br />
jmp short loc_804876F<br />
; ---------------------------------------------------------------------------<br />
loc_8048762: ; CODE XREF: f+2F6<br />
; f+315<br />
mov edi, ebx<br />
neg edi<br />
lea ecx, [edi-1]<br />
mov [ebp+var_134], ecx<br />
loc_804876F: ; CODE XREF: f+347<br />
; f+3AC<br />
add esi, ebx<br />
cmp esi, [ebp+var_120]<br />
jbe loc_80486CE<br />
loc_804877D: ; CODE XREF: f+13<br />
lea esp, [ebp-0Ch]<br />
pop ebx<br />
pop esi<br />
pop edi<br />
pop ebp<br />
retn<br />
f endp<br />
4.3 crackme / keygenme<br />
Couple of my keygenmes 2 :<br />
http://crackmes.de/users/yonkie/<br />
2 program which imitates fictional software protection, <strong>for</strong> which one need <strong>to</strong> make a keys/licenses genera<strong>to</strong>r<br />
144
Chapter 5<br />
Tools<br />
∙ IDA as disassembler. Older freeware version is available <strong>for</strong> downloading: http://www.hex-rays.com/<br />
idapro/idadownfreeware.htm.<br />
∙ Microsoft Visual Studio Express 1 : Stripped-down Visual Studio version, convenient <strong>for</strong> simple expreiments.<br />
∙ Hiew 2 <strong>for</strong> small modifications of code in binary files.<br />
5.0.1 Debugger<br />
tracer 3 instead of debugger.<br />
I s<strong>to</strong>pped <strong>to</strong> use debugger eventually, because all I need from it is <strong>to</strong> spot some function’s arguments<br />
while execution, or registers’ state at some point. To load debugger each time is <strong>to</strong>o much, so I wrote a small<br />
utility tracer. It has console-interface, working from command-line, allow <strong>to</strong> intercept function execution,<br />
set breakpoints at arbitrary places, spot registers’ state, modify it, etc.<br />
However, as <strong>for</strong> learning, it’s highly advisable <strong>to</strong> trace code in debugger manually, watch how register’s<br />
state changing (<strong>for</strong> example, classic SoftICE, OllyDbg, WinDbg highlighting changed registers), flags, data,<br />
change them manually, watch reaction, etc.<br />
1 http://www.microsoft.com/express/Downloads/<br />
2 http://www.hiew.ru/<br />
3 http://conus.info/gt/<br />
145
Chapter 6<br />
Books/blogs worth reading<br />
6.1 Books<br />
6.1.1 Windows<br />
∙ Windows R○ Internals (Mark E. Russinovich and David A. Solomon with Alex Ionescu) 1<br />
6.1.2 C/C++<br />
∙ C++ language standard: ISO/IEC 14882:2003 2<br />
6.1.3 x86 / x86-64<br />
∙ Intel manuals: http://www.intel.com/products/processor/manuals/<br />
∙ AMD manuals: http://developer.amd.com/documentation/guides/Pages/default.aspx#manuals<br />
6.2 Blogs<br />
6.2.1 Windows<br />
∙ Microsoft: Raymond Chen<br />
∙ http://www.nynaeve.net/<br />
1 http://www.microsoft.com/learning/en/us/book.aspx?ID=12069&locale=en-us<br />
2 http://www.iso.org/iso/catalogue_detail.htm?csnumber=38110<br />
146
Chapter 7<br />
More examples<br />
147
7.1 “QR9”: Rubik’s cube inspired amateur cryp<strong>to</strong>-algorithm<br />
Sometimes amateur cryp<strong>to</strong>systems appear <strong>to</strong> be pretty bizarre.<br />
I was asked <strong>to</strong> revese engineer an amateur cryp<strong>to</strong>algorithm of some data crypting utility, source code of<br />
which was lost 1 .<br />
Here is also IDA 5 exported listing from original crypting utility:<br />
.text:00541000 set_bit proc near ; CODE XREF: rotate1+42<br />
.text:00541000 ; rotate2+42 ...<br />
.text:00541000<br />
.text:00541000 arg_0 = dword ptr 4<br />
.text:00541000 arg_4 = dword ptr 8<br />
.text:00541000 arg_8 = dword ptr 0Ch<br />
.text:00541000 arg_C = byte ptr 10h<br />
.text:00541000<br />
.text:00541000 mov al, [esp+arg_C]<br />
.text:00541004 mov ecx, [esp+arg_8]<br />
.text:00541008 push esi<br />
.text:00541009 mov esi, [esp+4+arg_0]<br />
.text:0054100D test al, al<br />
.text:0054100F mov eax, [esp+4+arg_4]<br />
.text:00541013 mov dl, 1<br />
.text:00541015 jz short loc_54102B<br />
.text:00541017 shl dl, cl<br />
.text:00541019 mov cl, cube64[eax+esi*8]<br />
.text:00541020 or cl, dl<br />
.text:00541022 mov cube64[eax+esi*8], cl<br />
.text:00541029 pop esi<br />
.text:0054102A retn<br />
.text:0054102B ;<br />
---------------------------------------------------------------------------<br />
.text:0054102B<br />
.text:0054102B loc_54102B: ; CODE XREF: set_bit+15<br />
.text:0054102B shl dl, cl<br />
.text:0054102D mov cl, cube64[eax+esi*8]<br />
.text:00541034 not dl<br />
.text:00541036 and cl, dl<br />
.text:00541038 mov cube64[eax+esi*8], cl<br />
.text:0054103F pop esi<br />
.text:00541040 retn<br />
.text:00541040 set_bit endp<br />
.text:00541040<br />
.text:00541040 ;<br />
---------------------------------------------------------------------------<br />
.text:00541041 align 10h<br />
.text:00541050<br />
.text:00541050 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:00541050<br />
.text:00541050<br />
.text:00541050 get_bit proc near ; CODE XREF: rotate1+16<br />
.text:00541050 ; rotate2+16 ...<br />
.text:00541050<br />
1 I also got permit from cus<strong>to</strong>mer <strong>to</strong> publish the algorithm details<br />
148
.text:00541050 arg_0 = dword ptr 4<br />
.text:00541050 arg_4 = dword ptr 8<br />
.text:00541050 arg_8 = byte ptr 0Ch<br />
.text:00541050<br />
.text:00541050 mov eax, [esp+arg_4]<br />
.text:00541054 mov ecx, [esp+arg_0]<br />
.text:00541058 mov al, cube64[eax+ecx*8]<br />
.text:0054105F mov cl, [esp+arg_8]<br />
.text:00541063 shr al, cl<br />
.text:00541065 and al, 1<br />
.text:00541067 retn<br />
.text:00541067 get_bit endp<br />
.text:00541067<br />
.text:00541067 ;<br />
---------------------------------------------------------------------------<br />
.text:00541068 align 10h<br />
.text:00541070<br />
.text:00541070 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:00541070<br />
.text:00541070<br />
.text:00541070 rotate1 proc near ; CODE XREF:<br />
rotate_all_with_password+8E<br />
.text:00541070<br />
.text:00541070 internal_array_64= byte ptr -40h<br />
.text:00541070 arg_0 = dword ptr 4<br />
.text:00541070<br />
.text:00541070 sub esp, 40h<br />
.text:00541073 push ebx<br />
.text:00541074 push ebp<br />
.text:00541075 mov ebp, [esp+48h+arg_0]<br />
.text:00541079 push esi<br />
.text:0054107A push edi<br />
.text:0054107B xor edi, edi ; EDI is loop1 counter<br />
.text:0054107D lea ebx, [esp+50h+internal_array_64]<br />
.text:00541081<br />
.text:00541081 first_loop1_begin: ; CODE XREF: rotate1+2E<br />
.text:00541081 xor esi, esi ; ESI is loop2 counter<br />
.text:00541083<br />
.text:00541083 first_loop2_begin: ; CODE XREF: rotate1+25<br />
.text:00541083 push ebp ; arg_0<br />
.text:00541084 push esi<br />
.text:00541085 push edi<br />
.text:00541086 call get_bit<br />
.text:0054108B add esp, 0Ch<br />
.text:0054108E mov [ebx+esi], al ; s<strong>to</strong>re <strong>to</strong> internal array<br />
.text:00541091 inc esi<br />
.text:00541092 cmp esi, 8<br />
.text:00541095 jl short first_loop2_begin<br />
.text:00541097 inc edi<br />
.text:00541098 add ebx, 8<br />
.text:0054109B cmp edi, 8<br />
.text:0054109E jl short first_loop1_begin<br />
.text:005410A0 lea ebx, [esp+50h+internal_array_64]<br />
149
.text:005410A4 mov edi, 7 ; EDI is loop1 counter, initial<br />
state is 7<br />
.text:005410A9<br />
.text:005410A9 second_loop1_begin: ; CODE XREF: rotate1+57<br />
.text:005410A9 xor esi, esi ; ESI is loop2 counter<br />
.text:005410AB<br />
.text:005410AB second_loop2_begin: ; CODE XREF: rotate1+4E<br />
.text:005410AB mov al, [ebx+esi] ; value from internal array<br />
.text:005410AE push eax<br />
.text:005410AF push ebp ; arg_0<br />
.text:005410B0 push edi<br />
.text:005410B1 push esi<br />
.text:005410B2 call set_bit<br />
.text:005410B7 add esp, 10h<br />
.text:005410BA inc esi ; increment loop2 counter<br />
.text:005410BB cmp esi, 8<br />
.text:005410BE jl short second_loop2_begin<br />
.text:005410C0 dec edi ; decrement loop2 counter<br />
.text:005410C1 add ebx, 8<br />
.text:005410C4 cmp edi, 0FFFFFFFFh<br />
.text:005410C7 jg short second_loop1_begin<br />
.text:005410C9 pop edi<br />
.text:005410CA pop esi<br />
.text:005410CB pop ebp<br />
.text:005410CC pop ebx<br />
.text:005410CD add esp, 40h<br />
.text:005410D0 retn<br />
.text:005410D0 rotate1 endp<br />
.text:005410D0<br />
.text:005410D0 ;<br />
---------------------------------------------------------------------------<br />
.text:005410D1 align 10h<br />
.text:005410E0<br />
.text:005410E0 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:005410E0<br />
.text:005410E0<br />
.text:005410E0 rotate2 proc near ; CODE XREF:<br />
rotate_all_with_password+7A<br />
.text:005410E0<br />
.text:005410E0 internal_array_64= byte ptr -40h<br />
.text:005410E0 arg_0 = dword ptr 4<br />
.text:005410E0<br />
.text:005410E0 sub esp, 40h<br />
.text:005410E3 push ebx<br />
.text:005410E4 push ebp<br />
.text:005410E5 mov ebp, [esp+48h+arg_0]<br />
.text:005410E9 push esi<br />
.text:005410EA push edi<br />
.text:005410EB xor edi, edi ; loop1 counter<br />
.text:005410ED lea ebx, [esp+50h+internal_array_64]<br />
.text:005410F1<br />
.text:005410F1 loc_5410F1: ; CODE XREF: rotate2+2E<br />
.text:005410F1 xor esi, esi ; loop2 counter<br />
150
.text:005410F3<br />
.text:005410F3 loc_5410F3: ; CODE XREF: rotate2+25<br />
.text:005410F3 push esi ; loop2<br />
.text:005410F4 push edi ; loop1<br />
.text:005410F5 push ebp ; arg_0<br />
.text:005410F6 call get_bit<br />
.text:005410FB add esp, 0Ch<br />
.text:005410FE mov [ebx+esi], al ; s<strong>to</strong>re <strong>to</strong> internal array<br />
.text:00541101 inc esi ; increment loop1 counter<br />
.text:00541102 cmp esi, 8<br />
.text:00541105 jl short loc_5410F3<br />
.text:00541107 inc edi ; increment loop2 counter<br />
.text:00541108 add ebx, 8<br />
.text:0054110B cmp edi, 8<br />
.text:0054110E jl short loc_5410F1<br />
.text:00541110 lea ebx, [esp+50h+internal_array_64]<br />
.text:00541114 mov edi, 7 ; loop1 counter is initial state 7<br />
.text:00541119<br />
.text:00541119 loc_541119: ; CODE XREF: rotate2+57<br />
.text:00541119 xor esi, esi ; loop2 counter<br />
.text:0054111B<br />
.text:0054111B loc_54111B: ; CODE XREF: rotate2+4E<br />
.text:0054111B mov al, [ebx+esi] ; get byte from internal array<br />
.text:0054111E push eax<br />
.text:0054111F push edi ; loop1 counter<br />
.text:00541120 push esi ; loop2 counter<br />
.text:00541121 push ebp ; arg_0<br />
.text:00541122 call set_bit<br />
.text:00541127 add esp, 10h<br />
.text:0054112A inc esi ; increment loop2 counter<br />
.text:0054112B cmp esi, 8<br />
.text:0054112E jl short loc_54111B<br />
.text:00541130 dec edi ; decrement loop2 counter<br />
.text:00541131 add ebx, 8<br />
.text:00541134 cmp edi, 0FFFFFFFFh<br />
.text:00541137 jg short loc_541119<br />
.text:00541139 pop edi<br />
.text:0054113A pop esi<br />
.text:0054113B pop ebp<br />
.text:0054113C pop ebx<br />
.text:0054113D add esp, 40h<br />
.text:00541140 retn<br />
.text:00541140 rotate2 endp<br />
.text:00541140<br />
.text:00541140 ;<br />
---------------------------------------------------------------------------<br />
.text:00541141 align 10h<br />
.text:00541150<br />
.text:00541150 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:00541150<br />
.text:00541150<br />
.text:00541150 rotate3 proc near ; CODE XREF:<br />
rotate_all_with_password+66<br />
151
.text:00541150<br />
.text:00541150 var_40 = byte ptr -40h<br />
.text:00541150 arg_0 = dword ptr 4<br />
.text:00541150<br />
.text:00541150 sub esp, 40h<br />
.text:00541153 push ebx<br />
.text:00541154 push ebp<br />
.text:00541155 mov ebp, [esp+48h+arg_0]<br />
.text:00541159 push esi<br />
.text:0054115A push edi<br />
.text:0054115B xor edi, edi<br />
.text:0054115D lea ebx, [esp+50h+var_40]<br />
.text:00541161<br />
.text:00541161 loc_541161: ; CODE XREF: rotate3+2E<br />
.text:00541161 xor esi, esi<br />
.text:00541163<br />
.text:00541163 loc_541163: ; CODE XREF: rotate3+25<br />
.text:00541163 push esi<br />
.text:00541164 push ebp<br />
.text:00541165 push edi<br />
.text:00541166 call get_bit<br />
.text:0054116B add esp, 0Ch<br />
.text:0054116E mov [ebx+esi], al<br />
.text:00541171 inc esi<br />
.text:00541172 cmp esi, 8<br />
.text:00541175 jl short loc_541163<br />
.text:00541177 inc edi<br />
.text:00541178 add ebx, 8<br />
.text:0054117B cmp edi, 8<br />
.text:0054117E jl short loc_541161<br />
.text:00541180 xor ebx, ebx<br />
.text:00541182 lea edi, [esp+50h+var_40]<br />
.text:00541186<br />
.text:00541186 loc_541186: ; CODE XREF: rotate3+54<br />
.text:00541186 mov esi, 7<br />
.text:0054118B<br />
.text:0054118B loc_54118B: ; CODE XREF: rotate3+4E<br />
.text:0054118B mov al, [edi]<br />
.text:0054118D push eax<br />
.text:0054118E push ebx<br />
.text:0054118F push ebp<br />
.text:00541190 push esi<br />
.text:00541191 call set_bit<br />
.text:00541196 add esp, 10h<br />
.text:00541199 inc edi<br />
.text:0054119A dec esi<br />
.text:0054119B cmp esi, 0FFFFFFFFh<br />
.text:0054119E jg short loc_54118B<br />
.text:005411A0 inc ebx<br />
.text:005411A1 cmp ebx, 8<br />
.text:005411A4 jl short loc_541186<br />
.text:005411A6 pop edi<br />
.text:005411A7 pop esi<br />
.text:005411A8 pop ebp<br />
152
.text:005411A9 pop ebx<br />
.text:005411AA add esp, 40h<br />
.text:005411AD retn<br />
.text:005411AD rotate3 endp<br />
.text:005411AD<br />
.text:005411AD ;<br />
---------------------------------------------------------------------------<br />
.text:005411AE align 10h<br />
.text:005411B0<br />
.text:005411B0 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:005411B0<br />
.text:005411B0<br />
.text:005411B0 rotate_all_with_password proc near ; CODE XREF: crypt+1F<br />
.text:005411B0 ; decrypt+36<br />
.text:005411B0<br />
.text:005411B0 arg_0 = dword ptr 4<br />
.text:005411B0 arg_4 = dword ptr 8<br />
.text:005411B0<br />
.text:005411B0 mov eax, [esp+arg_0]<br />
.text:005411B4 push ebp<br />
.text:005411B5 mov ebp, eax<br />
.text:005411B7 cmp byte ptr [eax], 0<br />
.text:005411BA jz exit<br />
.text:005411C0 push ebx<br />
.text:005411C1 mov ebx, [esp+8+arg_4]<br />
.text:005411C5 push esi<br />
.text:005411C6 push edi<br />
.text:005411C7<br />
.text:005411C7 loop_begin: ; CODE XREF:<br />
rotate_all_with_password+9F<br />
.text:005411C7 movsx eax, byte ptr [ebp+0]<br />
.text:005411CB push eax ; C<br />
.text:005411CC call _<strong>to</strong>lower<br />
.text:005411D1 add esp, 4<br />
.text:005411D4 cmp al, ’a’<br />
.text:005411D6 jl short next_character_in_password<br />
.text:005411D8 cmp al, ’z’<br />
.text:005411DA jg short next_character_in_password<br />
.text:005411DC movsx ecx, al<br />
.text:005411DF sub ecx, ’a’<br />
.text:005411E2 cmp ecx, 24<br />
.text:005411E5 jle short skip_subtracting<br />
.text:005411E7 sub ecx, 24<br />
.text:005411EA<br />
.text:005411EA skip_subtracting: ; CODE XREF:<br />
rotate_all_with_password+35<br />
.text:005411EA mov eax, 55555556h<br />
.text:005411EF imul ecx<br />
.text:005411F1 mov eax, edx<br />
.text:005411F3 shr eax, 1Fh<br />
.text:005411F6 add edx, eax<br />
.text:005411F8 mov eax, ecx<br />
.text:005411FA mov esi, edx<br />
153
.text:005411FC mov ecx, 3<br />
.text:00541201 cdq<br />
.text:00541202 idiv ecx<br />
.text:00541204 sub edx, 0<br />
.text:00541207 jz short call_rotate1<br />
.text:00541209 dec edx<br />
.text:0054120A jz short call_rotate2<br />
.text:0054120C dec edx<br />
.text:0054120D jnz short next_character_in_password<br />
.text:0054120F test ebx, ebx<br />
.text:00541211 jle short next_character_in_password<br />
.text:00541213 mov edi, ebx<br />
.text:00541215<br />
.text:00541215 call_rotate3: ; CODE XREF:<br />
rotate_all_with_password+6F<br />
.text:00541215 push esi<br />
.text:00541216 call rotate3<br />
.text:0054121B add esp, 4<br />
.text:0054121E dec edi<br />
.text:0054121F jnz short call_rotate3<br />
.text:00541221 jmp short next_character_in_password<br />
.text:00541223 ;<br />
---------------------------------------------------------------------------<br />
.text:00541223<br />
.text:00541223 call_rotate2: ; CODE XREF:<br />
rotate_all_with_password+5A<br />
.text:00541223 test ebx, ebx<br />
.text:00541225 jle short next_character_in_password<br />
.text:00541227 mov edi, ebx<br />
.text:00541229<br />
.text:00541229 loc_541229: ; CODE XREF:<br />
rotate_all_with_password+83<br />
.text:00541229 push esi<br />
.text:0054122A call rotate2<br />
.text:0054122F add esp, 4<br />
.text:00541232 dec edi<br />
.text:00541233 jnz short loc_541229<br />
.text:00541235 jmp short next_character_in_password<br />
.text:00541237 ;<br />
---------------------------------------------------------------------------<br />
.text:00541237<br />
.text:00541237 call_rotate1: ; CODE XREF:<br />
rotate_all_with_password+57<br />
.text:00541237 test ebx, ebx<br />
.text:00541239 jle short next_character_in_password<br />
.text:0054123B mov edi, ebx<br />
.text:0054123D<br />
.text:0054123D loc_54123D: ; CODE XREF:<br />
rotate_all_with_password+97<br />
.text:0054123D push esi<br />
.text:0054123E call rotate1<br />
.text:00541243 add esp, 4<br />
.text:00541246 dec edi<br />
.text:00541247 jnz short loc_54123D<br />
154
.text:00541249<br />
.text:00541249 next_character_in_password: ; CODE XREF:<br />
rotate_all_with_password+26<br />
.text:00541249 ; rotate_all_with_password+2A ...<br />
.text:00541249 mov al, [ebp+1]<br />
.text:0054124C inc ebp<br />
.text:0054124D test al, al<br />
.text:0054124F jnz loop_begin<br />
.text:00541255 pop edi<br />
.text:00541256 pop esi<br />
.text:00541257 pop ebx<br />
.text:00541258<br />
.text:00541258 exit: ; CODE XREF:<br />
rotate_all_with_password+A<br />
.text:00541258 pop ebp<br />
.text:00541259 retn<br />
.text:00541259 rotate_all_with_password endp<br />
.text:00541259<br />
.text:00541259 ;<br />
---------------------------------------------------------------------------<br />
.text:0054125A align 10h<br />
.text:00541260<br />
.text:00541260 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:00541260<br />
.text:00541260<br />
.text:00541260 crypt proc near ; CODE XREF: crypt_file+8A<br />
.text:00541260<br />
.text:00541260 arg_0 = dword ptr 4<br />
.text:00541260 arg_4 = dword ptr 8<br />
.text:00541260 arg_8 = dword ptr 0Ch<br />
.text:00541260<br />
.text:00541260 push ebx<br />
.text:00541261 mov ebx, [esp+4+arg_0]<br />
.text:00541265 push ebp<br />
.text:00541266 push esi<br />
.text:00541267 push edi<br />
.text:00541268 xor ebp, ebp<br />
.text:0054126A<br />
.text:0054126A loc_54126A: ; CODE XREF: crypt+41<br />
.text:0054126A mov eax, [esp+10h+arg_8]<br />
.text:0054126E mov ecx, 10h<br />
.text:00541273 mov esi, ebx<br />
.text:00541275 mov edi, offset cube64<br />
.text:0054127A push 1<br />
.text:0054127C push eax<br />
.text:0054127D rep movsd<br />
.text:0054127F call rotate_all_with_password<br />
.text:00541284 mov eax, [esp+18h+arg_4]<br />
.text:00541288 mov edi, ebx<br />
.text:0054128A add ebp, 40h<br />
.text:0054128D add esp, 8<br />
.text:00541290 mov ecx, 10h<br />
.text:00541295 mov esi, offset cube64<br />
155
.text:0054129A add ebx, 40h<br />
.text:0054129D cmp ebp, eax<br />
.text:0054129F rep movsd<br />
.text:005412A1 jl short loc_54126A<br />
.text:005412A3 pop edi<br />
.text:005412A4 pop esi<br />
.text:005412A5 pop ebp<br />
.text:005412A6 pop ebx<br />
.text:005412A7 retn<br />
.text:005412A7 crypt endp<br />
.text:005412A7<br />
.text:005412A7 ;<br />
---------------------------------------------------------------------------<br />
.text:005412A8 align 10h<br />
.text:005412B0<br />
.text:005412B0 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:005412B0<br />
.text:005412B0<br />
.text:005412B0 ; int __cdecl decrypt(int, int, void *Src)<br />
.text:005412B0 decrypt proc near ; CODE XREF: decrypt_file+99<br />
.text:005412B0<br />
.text:005412B0 arg_0 = dword ptr 4<br />
.text:005412B0 arg_4 = dword ptr 8<br />
.text:005412B0 Src = dword ptr 0Ch<br />
.text:005412B0<br />
.text:005412B0 mov eax, [esp+Src]<br />
.text:005412B4 push ebx<br />
.text:005412B5 push ebp<br />
.text:005412B6 push esi<br />
.text:005412B7 push edi<br />
.text:005412B8 push eax ; Src<br />
.text:005412B9 call __strdup<br />
.text:005412BE push eax ; Str<br />
.text:005412BF mov [esp+18h+Src], eax<br />
.text:005412C3 call __strrev<br />
.text:005412C8 mov ebx, [esp+18h+arg_0]<br />
.text:005412CC add esp, 8<br />
.text:005412CF xor ebp, ebp<br />
.text:005412D1<br />
.text:005412D1 loc_5412D1: ; CODE XREF: decrypt+58<br />
.text:005412D1 mov ecx, 10h<br />
.text:005412D6 mov esi, ebx<br />
.text:005412D8 mov edi, offset cube64<br />
.text:005412DD push 3<br />
.text:005412DF rep movsd<br />
.text:005412E1 mov ecx, [esp+14h+Src]<br />
.text:005412E5 push ecx<br />
.text:005412E6 call rotate_all_with_password<br />
.text:005412EB mov eax, [esp+18h+arg_4]<br />
.text:005412EF mov edi, ebx<br />
.text:005412F1 add ebp, 40h<br />
.text:005412F4 add esp, 8<br />
.text:005412F7 mov ecx, 10h<br />
156
.text:005412FC mov esi, offset cube64<br />
.text:00541301 add ebx, 40h<br />
.text:00541304 cmp ebp, eax<br />
.text:00541306 rep movsd<br />
.text:00541308 jl short loc_5412D1<br />
.text:0054130A mov edx, [esp+10h+Src]<br />
.text:0054130E push edx ; Memory<br />
.text:0054130F call _free<br />
.text:00541314 add esp, 4<br />
.text:00541317 pop edi<br />
.text:00541318 pop esi<br />
.text:00541319 pop ebp<br />
.text:0054131A pop ebx<br />
.text:0054131B retn<br />
.text:0054131B decrypt endp<br />
.text:0054131B<br />
.text:0054131B ;<br />
---------------------------------------------------------------------------<br />
.text:0054131C align 10h<br />
.text:00541320<br />
.text:00541320 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:00541320<br />
.text:00541320<br />
.text:00541320 ; int __cdecl crypt_file(int Str, char *Filename, int password)<br />
.text:00541320 crypt_file proc near ; CODE XREF: _main+42<br />
.text:00541320<br />
.text:00541320 Str = dword ptr 4<br />
.text:00541320 Filename = dword ptr 8<br />
.text:00541320 password = dword ptr 0Ch<br />
.text:00541320<br />
.text:00541320 mov eax, [esp+Str]<br />
.text:00541324 push ebp<br />
.text:00541325 push offset Mode ; "rb"<br />
.text:0054132A push eax ; Filename<br />
.text:0054132B call _fopen ; open file<br />
.text:00541330 mov ebp, eax<br />
.text:00541332 add esp, 8<br />
.text:00541335 test ebp, ebp<br />
.text:00541337 jnz short loc_541348<br />
.text:00541339 push offset Format ; "Cannot open input file!\n"<br />
.text:0054133E call _printf<br />
.text:00541343 add esp, 4<br />
.text:00541346 pop ebp<br />
.text:00541347 retn<br />
.text:00541348 ;<br />
---------------------------------------------------------------------------<br />
.text:00541348<br />
.text:00541348 loc_541348: ; CODE XREF: crypt_file+17<br />
.text:00541348 push ebx<br />
.text:00541349 push esi<br />
.text:0054134A push edi<br />
.text:0054134B push 2 ; Origin<br />
.text:0054134D push 0 ; Offset<br />
157
.text:0054134F push ebp ; File<br />
.text:00541350 call _fseek<br />
.text:00541355 push ebp ; File<br />
.text:00541356 call _ftell ; get file size<br />
.text:0054135B push 0 ; Origin<br />
.text:0054135D push 0 ; Offset<br />
.text:0054135F push ebp ; File<br />
.text:00541360 mov [esp+2Ch+Str], eax<br />
.text:00541364 call _fseek ; rewind <strong>to</strong> start<br />
.text:00541369 mov esi, [esp+2Ch+Str]<br />
.text:0054136D and esi, 0FFFFFFC0h ; reset all lowest 6 bits<br />
.text:00541370 add esi, 40h ; align size <strong>to</strong> 64-byte border<br />
.text:00541373 push esi ; Size<br />
.text:00541374 call _malloc<br />
.text:00541379 mov ecx, esi<br />
.text:0054137B mov ebx, eax ; allocated buffer pointer -> <strong>to</strong><br />
EBX<br />
.text:0054137D mov edx, ecx<br />
.text:0054137F xor eax, eax<br />
.text:00541381 mov edi, ebx<br />
.text:00541383 push ebp ; File<br />
.text:00541384 shr ecx, 2<br />
.text:00541387 rep s<strong>to</strong>sd<br />
.text:00541389 mov ecx, edx<br />
.text:0054138B push 1 ; Count<br />
.text:0054138D and ecx, 3<br />
.text:00541390 rep s<strong>to</strong>sb ; memset (buffer, 0, aligned_size)<br />
.text:00541392 mov eax, [esp+38h+Str]<br />
.text:00541396 push eax ; ElementSize<br />
.text:00541397 push ebx ; DstBuf<br />
.text:00541398 call _fread ; read file<br />
.text:0054139D push ebp ; File<br />
.text:0054139E call _fclose<br />
.text:005413A3 mov ecx, [esp+44h+password]<br />
.text:005413A7 push ecx ; password<br />
.text:005413A8 push esi ; aligned size<br />
.text:005413A9 push ebx ; buffer<br />
.text:005413AA call crypt ; do crypt<br />
.text:005413AF mov edx, [esp+50h+Filename]<br />
.text:005413B3 add esp, 40h<br />
.text:005413B6 push offset aWb ; "wb"<br />
.text:005413BB push edx ; Filename<br />
.text:005413BC call _fopen<br />
.text:005413C1 mov edi, eax<br />
.text:005413C3 push edi ; File<br />
.text:005413C4 push 1 ; Count<br />
.text:005413C6 push 3 ; Size<br />
.text:005413C8 push offset aQr9 ; "QR9"<br />
.text:005413CD call _fwrite ; write file signature<br />
.text:005413D2 push edi ; File<br />
.text:005413D3 push 1 ; Count<br />
.text:005413D5 lea eax, [esp+30h+Str]<br />
.text:005413D9 push 4 ; Size<br />
.text:005413DB push eax ; Str<br />
158
.text:005413DC call _fwrite ; write original file size<br />
.text:005413E1 push edi ; File<br />
.text:005413E2 push 1 ; Count<br />
.text:005413E4 push esi ; Size<br />
.text:005413E5 push ebx ; Str<br />
.text:005413E6 call _fwrite ; write crypted file<br />
.text:005413EB push edi ; File<br />
.text:005413EC call _fclose<br />
.text:005413F1 push ebx ; Memory<br />
.text:005413F2 call _free<br />
.text:005413F7 add esp, 40h<br />
.text:005413FA pop edi<br />
.text:005413FB pop esi<br />
.text:005413FC pop ebx<br />
.text:005413FD pop ebp<br />
.text:005413FE retn<br />
.text:005413FE crypt_file endp<br />
.text:005413FE<br />
.text:005413FE ;<br />
---------------------------------------------------------------------------<br />
.text:005413FF align 10h<br />
.text:00541400<br />
.text:00541400 ; =============== S U B R O U T I N E<br />
=======================================<br />
.text:00541400<br />
.text:00541400<br />
.text:00541400 ; int __cdecl decrypt_file(char *Filename, int, void *Src)<br />
.text:00541400 decrypt_file proc near ; CODE XREF: _main+6E<br />
.text:00541400<br />
.text:00541400 Filename = dword ptr 4<br />
.text:00541400 arg_4 = dword ptr 8<br />
.text:00541400 Src = dword ptr 0Ch<br />
.text:00541400<br />
.text:00541400 mov eax, [esp+Filename]<br />
.text:00541404 push ebx<br />
.text:00541405 push ebp<br />
.text:00541406 push esi<br />
.text:00541407 push edi<br />
.text:00541408 push offset aRb ; "rb"<br />
.text:0054140D push eax ; Filename<br />
.text:0054140E call _fopen<br />
.text:00541413 mov esi, eax<br />
.text:00541415 add esp, 8<br />
.text:00541418 test esi, esi<br />
.text:0054141A jnz short loc_54142E<br />
.text:0054141C push offset aCannotOpenIn_0 ; "Cannot open input file!\n<br />
"<br />
.text:00541421 call _printf<br />
.text:00541426 add esp, 4<br />
.text:00541429 pop edi<br />
.text:0054142A pop esi<br />
.text:0054142B pop ebp<br />
.text:0054142C pop ebx<br />
.text:0054142D retn<br />
159
.text:0054142E ;<br />
---------------------------------------------------------------------------<br />
.text:0054142E<br />
.text:0054142E loc_54142E: ; CODE XREF: decrypt_file+1A<br />
.text:0054142E push 2 ; Origin<br />
.text:00541430 push 0 ; Offset<br />
.text:00541432 push esi ; File<br />
.text:00541433 call _fseek<br />
.text:00541438 push esi ; File<br />
.text:00541439 call _ftell<br />
.text:0054143E push 0 ; Origin<br />
.text:00541440 push 0 ; Offset<br />
.text:00541442 push esi ; File<br />
.text:00541443 mov ebp, eax<br />
.text:00541445 call _fseek<br />
.text:0054144A push ebp ; Size<br />
.text:0054144B call _malloc<br />
.text:00541450 push esi ; File<br />
.text:00541451 mov ebx, eax<br />
.text:00541453 push 1 ; Count<br />
.text:00541455 push ebp ; ElementSize<br />
.text:00541456 push ebx ; DstBuf<br />
.text:00541457 call _fread<br />
.text:0054145C push esi ; File<br />
.text:0054145D call _fclose<br />
.text:00541462 add esp, 34h<br />
.text:00541465 mov ecx, 3<br />
.text:0054146A mov edi, offset aQr9_0 ; "QR9"<br />
.text:0054146F mov esi, ebx<br />
.text:00541471 xor edx, edx<br />
.text:00541473 repe cmpsb<br />
.text:00541475 jz short loc_541489<br />
.text:00541477 push offset aFileIsNotCrypt ; "File is not crypted!\n"<br />
.text:0054147C call _printf<br />
.text:00541481 add esp, 4<br />
.text:00541484 pop edi<br />
.text:00541485 pop esi<br />
.text:00541486 pop ebp<br />
.text:00541487 pop ebx<br />
.text:00541488 retn<br />
.text:00541489 ;<br />
---------------------------------------------------------------------------<br />
.text:00541489<br />
.text:00541489 loc_541489: ; CODE XREF: decrypt_file+75<br />
.text:00541489 mov eax, [esp+10h+Src]<br />
.text:0054148D mov edi, [ebx+3]<br />
.text:00541490 add ebp, 0FFFFFFF9h<br />
.text:00541493 lea esi, [ebx+7]<br />
.text:00541496 push eax ; Src<br />
.text:00541497 push ebp ; int<br />
.text:00541498 push esi ; int<br />
.text:00541499 call decrypt<br />
.text:0054149E mov ecx, [esp+1Ch+arg_4]<br />
.text:005414A2 push offset aWb_0 ; "wb"<br />
160
.text:005414A7 push ecx ; Filename<br />
.text:005414A8 call _fopen<br />
.text:005414AD mov ebp, eax<br />
.text:005414AF push ebp ; File<br />
.text:005414B0 push 1 ; Count<br />
.text:005414B2 push edi ; Size<br />
.text:005414B3 push esi ; Str<br />
.text:005414B4 call _fwrite<br />
.text:005414B9 push ebp ; File<br />
.text:005414BA call _fclose<br />
.text:005414BF push ebx ; Memory<br />
.text:005414C0 call _free<br />
.text:005414C5 add esp, 2Ch<br />
.text:005414C8 pop edi<br />
.text:005414C9 pop esi<br />
.text:005414CA pop ebp<br />
.text:005414CB pop ebx<br />
.text:005414CC retn<br />
.text:005414CC decrypt_file endp<br />
All function and label names are given by me while analysis.<br />
I started from <strong>to</strong>p. Here is a function taking two file names and password.<br />
.text:00541320 ; int __cdecl crypt_file(int Str, char *Filename, int password)<br />
.text:00541320 crypt_file proc near<br />
.text:00541320<br />
.text:00541320 Str = dword ptr 4<br />
.text:00541320 Filename = dword ptr 8<br />
.text:00541320 password = dword ptr 0Ch<br />
.text:00541320<br />
Open file and report error in case of error:<br />
.text:00541320 mov eax, [esp+Str]<br />
.text:00541324 push ebp<br />
.text:00541325 push offset Mode ; "rb"<br />
.text:0054132A push eax ; Filename<br />
.text:0054132B call _fopen ; open file<br />
.text:00541330 mov ebp, eax<br />
.text:00541332 add esp, 8<br />
.text:00541335 test ebp, ebp<br />
.text:00541337 jnz short loc_541348<br />
.text:00541339 push offset Format ; "Cannot open input file!\n"<br />
.text:0054133E call _printf<br />
.text:00541343 add esp, 4<br />
.text:00541346 pop ebp<br />
.text:00541347 retn<br />
.text:00541348 ;<br />
---------------------------------------------------------------------------<br />
.text:00541348<br />
.text:00541348 loc_541348:<br />
Get file size via fseek()/ftell():<br />
.text:00541348 push ebx<br />
.text:00541349 push esi<br />
161
.text:0054134A push edi<br />
.text:0054134B push 2 ; Origin<br />
.text:0054134D push 0 ; Offset<br />
.text:0054134F push ebp ; File<br />
; move current file position <strong>to</strong> the end<br />
.text:00541350 call _fseek<br />
.text:00541355 push ebp ; File<br />
.text:00541356 call _ftell ; get current file position<br />
.text:0054135B push 0 ; Origin<br />
.text:0054135D push 0 ; Offset<br />
.text:0054135F push ebp ; File<br />
.text:00541360 mov [esp+2Ch+Str], eax<br />
; move current file position <strong>to</strong> the start<br />
.text:00541364 call _fseek<br />
This piece of code calculates file size aligned <strong>to</strong> 64-byte border. This is because this cryp<strong>to</strong>algorithm<br />
works with only 64-byte blocks. Its operation is pretty simple: divide file size by 64, <strong>for</strong>get about remainder<br />
and add 1, then multiple by 64. The following code removes remainder as if value was already divided by 64<br />
and adds 64. It is almost the same.<br />
.text:00541369 mov esi, [esp+2Ch+Str]<br />
.text:0054136D and esi, 0FFFFFFC0h ; reset all lowest 6 bits<br />
.text:00541370 add esi, 40h ; align size <strong>to</strong> 64-byte border<br />
Allocate buffer with aligned size:<br />
.text:00541373 push esi ; Size<br />
.text:00541374 call _malloc<br />
Call memset(), e,g, clear allocated buffer 2 .<br />
.text:00541379 mov ecx, esi<br />
.text:0054137B mov ebx, eax ; allocated buffer pointer -> <strong>to</strong> EBX<br />
.text:0054137D mov edx, ecx<br />
.text:0054137F xor eax, eax<br />
.text:00541381 mov edi, ebx<br />
.text:00541383 push ebp ; File<br />
.text:00541384 shr ecx, 2<br />
.text:00541387 rep s<strong>to</strong>sd<br />
.text:00541389 mov ecx, edx<br />
.text:0054138B push 1 ; Count<br />
.text:0054138D and ecx, 3<br />
.text:00541390 rep s<strong>to</strong>sb ; memset (buffer, 0, aligned_size)<br />
Read file via standard C function fread().<br />
.text:00541392 mov eax, [esp+38h+Str]<br />
.text:00541396 push eax ; ElementSize<br />
.text:00541397 push ebx ; DstBuf<br />
.text:00541398 call _fread ; read file<br />
.text:0054139D push ebp ; File<br />
.text:0054139E call _fclose<br />
Call crypt(). This function takes buffer, buffer size (aligned) and password string.<br />
2 malloc() + memset() could be replaced by calloc()<br />
162
.text:005413A3 mov ecx, [esp+44h+password]<br />
.text:005413A7 push ecx ; password<br />
.text:005413A8 push esi ; aligned size<br />
.text:005413A9 push ebx ; buffer<br />
.text:005413AA call crypt ; do crypt<br />
Create output file. By the way, developer <strong>for</strong>got <strong>to</strong> check if it is was created correctly! File opening result<br />
is being checked though.<br />
.text:005413AF mov edx, [esp+50h+Filename]<br />
.text:005413B3 add esp, 40h<br />
.text:005413B6 push offset aWb ; "wb"<br />
.text:005413BB push edx ; Filename<br />
.text:005413BC call _fopen<br />
.text:005413C1 mov edi, eax<br />
Newly created file handle is in EDI register now. Write signature “QR9”.<br />
.text:005413C3 push edi ; File<br />
.text:005413C4 push 1 ; Count<br />
.text:005413C6 push 3 ; Size<br />
.text:005413C8 push offset aQr9 ; "QR9"<br />
.text:005413CD call _fwrite ; write file signature<br />
Write actual file size (not aligned):<br />
.text:005413D2 push edi ; File<br />
.text:005413D3 push 1 ; Count<br />
.text:005413D5 lea eax, [esp+30h+Str]<br />
.text:005413D9 push 4 ; Size<br />
.text:005413DB push eax ; Str<br />
.text:005413DC call _fwrite ; write original file size<br />
Write crypted buffer:<br />
.text:005413E1 push edi ; File<br />
.text:005413E2 push 1 ; Count<br />
.text:005413E4 push esi ; Size<br />
.text:005413E5 push ebx ; Str<br />
.text:005413E6 call _fwrite ; write crypted file<br />
Close file and free allocated buffer:<br />
.text:005413EB push edi ; File<br />
.text:005413EC call _fclose<br />
.text:005413F1 push ebx ; Memory<br />
.text:005413F2 call _free<br />
.text:005413F7 add esp, 40h<br />
.text:005413FA pop edi<br />
.text:005413FB pop esi<br />
.text:005413FC pop ebx<br />
.text:005413FD pop ebp<br />
.text:005413FE retn<br />
.text:005413FE crypt_file endp<br />
Here is reconstructed C-code:<br />
163
void crypt_file(char *fin, char* fout, char *pw)<br />
{<br />
FILE *f;<br />
int flen, flen_aligned;<br />
BYTE *buf;<br />
};<br />
f=fopen(fin, "rb");<br />
if (f==NULL)<br />
{<br />
printf ("Cannot open input file!\n");<br />
return;<br />
};<br />
fseek (f, 0, SEEK_END);<br />
flen=ftell (f);<br />
fseek (f, 0, SEEK_SET);<br />
flen_aligned=(flen&0xFFFFFFC0)+0x40;<br />
buf=(BYTE*)malloc (flen_aligned);<br />
memset (buf, 0, flen_aligned);<br />
fread (buf, flen, 1, f);<br />
fclose (f);<br />
crypt (buf, flen_aligned, pw);<br />
f=fopen(fout, "wb");<br />
fwrite ("QR9", 3, 1, f);<br />
fwrite (&flen, 4, 1, f);<br />
fwrite (buf, flen_aligned, 1, f);<br />
fclose (f);<br />
free (buf);<br />
Decrypting procedure is almost the same:<br />
.text:00541400 ; int __cdecl decrypt_file(char *Filename, int, void *Src)<br />
.text:00541400 decrypt_file proc near<br />
.text:00541400<br />
.text:00541400 Filename = dword ptr 4<br />
.text:00541400 arg_4 = dword ptr 8<br />
.text:00541400 Src = dword ptr 0Ch<br />
.text:00541400<br />
.text:00541400 mov eax, [esp+Filename]<br />
.text:00541404 push ebx<br />
.text:00541405 push ebp<br />
.text:00541406 push esi<br />
.text:00541407 push edi<br />
164
.text:00541408 push offset aRb ; "rb"<br />
.text:0054140D push eax ; Filename<br />
.text:0054140E call _fopen<br />
.text:00541413 mov esi, eax<br />
.text:00541415 add esp, 8<br />
.text:00541418 test esi, esi<br />
.text:0054141A jnz short loc_54142E<br />
.text:0054141C push offset aCannotOpenIn_0 ; "Cannot open input file!\n<br />
"<br />
.text:00541421 call _printf<br />
.text:00541426 add esp, 4<br />
.text:00541429 pop edi<br />
.text:0054142A pop esi<br />
.text:0054142B pop ebp<br />
.text:0054142C pop ebx<br />
.text:0054142D retn<br />
.text:0054142E ;<br />
---------------------------------------------------------------------------<br />
.text:0054142E<br />
.text:0054142E loc_54142E:<br />
.text:0054142E push 2 ; Origin<br />
.text:00541430 push 0 ; Offset<br />
.text:00541432 push esi ; File<br />
.text:00541433 call _fseek<br />
.text:00541438 push esi ; File<br />
.text:00541439 call _ftell<br />
.text:0054143E push 0 ; Origin<br />
.text:00541440 push 0 ; Offset<br />
.text:00541442 push esi ; File<br />
.text:00541443 mov ebp, eax<br />
.text:00541445 call _fseek<br />
.text:0054144A push ebp ; Size<br />
.text:0054144B call _malloc<br />
.text:00541450 push esi ; File<br />
.text:00541451 mov ebx, eax<br />
.text:00541453 push 1 ; Count<br />
.text:00541455 push ebp ; ElementSize<br />
.text:00541456 push ebx ; DstBuf<br />
.text:00541457 call _fread<br />
.text:0054145C push esi ; File<br />
.text:0054145D call _fclose<br />
Check signature (first 3 bytes):<br />
.text:00541462 add esp, 34h<br />
.text:00541465 mov ecx, 3<br />
.text:0054146A mov edi, offset aQr9_0 ; "QR9"<br />
.text:0054146F mov esi, ebx<br />
.text:00541471 xor edx, edx<br />
.text:00541473 repe cmpsb<br />
.text:00541475 jz short loc_541489<br />
Report an error if signature is absent:<br />
.text:00541477 push offset aFileIsNotCrypt ; "File is not crypted!\n"<br />
.text:0054147C call _printf<br />
165
.text:00541481 add esp, 4<br />
.text:00541484 pop edi<br />
.text:00541485 pop esi<br />
.text:00541486 pop ebp<br />
.text:00541487 pop ebx<br />
.text:00541488 retn<br />
.text:00541489 ;<br />
---------------------------------------------------------------------------<br />
.text:00541489<br />
.text:00541489 loc_541489:<br />
Call decrypt().<br />
.text:00541489 mov eax, [esp+10h+Src]<br />
.text:0054148D mov edi, [ebx+3]<br />
.text:00541490 add ebp, 0FFFFFFF9h<br />
.text:00541493 lea esi, [ebx+7]<br />
.text:00541496 push eax ; Src<br />
.text:00541497 push ebp ; int<br />
.text:00541498 push esi ; int<br />
.text:00541499 call decrypt<br />
.text:0054149E mov ecx, [esp+1Ch+arg_4]<br />
.text:005414A2 push offset aWb_0 ; "wb"<br />
.text:005414A7 push ecx ; Filename<br />
.text:005414A8 call _fopen<br />
.text:005414AD mov ebp, eax<br />
.text:005414AF push ebp ; File<br />
.text:005414B0 push 1 ; Count<br />
.text:005414B2 push edi ; Size<br />
.text:005414B3 push esi ; Str<br />
.text:005414B4 call _fwrite<br />
.text:005414B9 push ebp ; File<br />
.text:005414BA call _fclose<br />
.text:005414BF push ebx ; Memory<br />
.text:005414C0 call _free<br />
.text:005414C5 add esp, 2Ch<br />
.text:005414C8 pop edi<br />
.text:005414C9 pop esi<br />
.text:005414CA pop ebp<br />
.text:005414CB pop ebx<br />
.text:005414CC retn<br />
.text:005414CC decrypt_file endp<br />
Here is reconstructed C-code:<br />
void decrypt_file(char *fin, char* fout, char *pw)<br />
{<br />
FILE *f;<br />
int real_flen, flen;<br />
BYTE *buf;<br />
f=fopen(fin, "rb");<br />
if (f==NULL)<br />
{<br />
printf ("Cannot open input file!\n");<br />
166
};<br />
};<br />
return;<br />
fseek (f, 0, SEEK_END);<br />
flen=ftell (f);<br />
fseek (f, 0, SEEK_SET);<br />
buf=(BYTE*)malloc (flen);<br />
fread (buf, flen, 1, f);<br />
fclose (f);<br />
if (memcmp (buf, "QR9", 3)!=0)<br />
{<br />
printf ("File is not crypted!\n");<br />
return;<br />
};<br />
memcpy (&real_flen, buf+3, 4);<br />
decrypt (buf+(3+4), flen-(3+4), pw);<br />
f=fopen(fout, "wb");<br />
fwrite (buf+(3+4), real_flen, 1, f);<br />
fclose (f);<br />
free (buf);<br />
OK, now let’s go deeper.<br />
Function crypt():<br />
.text:00541260 crypt proc near<br />
.text:00541260<br />
.text:00541260 arg_0 = dword ptr 4<br />
.text:00541260 arg_4 = dword ptr 8<br />
.text:00541260 arg_8 = dword ptr 0Ch<br />
.text:00541260<br />
.text:00541260 push ebx<br />
.text:00541261 mov ebx, [esp+4+arg_0]<br />
.text:00541265 push ebp<br />
.text:00541266 push esi<br />
.text:00541267 push edi<br />
.text:00541268 xor ebp, ebp<br />
.text:0054126A<br />
.text:0054126A loc_54126A:<br />
This piece of code copies part of input buffer <strong>to</strong> internal array I named later “cube64”. The size is in ECX<br />
register. MOVSD means move 32-bit dword, so, 16 of 32-bit dwords are exactly 64 bytes.<br />
.text:0054126A mov eax, [esp+10h+arg_8]<br />
.text:0054126E mov ecx, 10h<br />
.text:00541273 mov esi, ebx ; EBX is pointer within input buffer<br />
167
.text:00541275 mov edi, offset cube64<br />
.text:0054127A push 1<br />
.text:0054127C push eax<br />
.text:0054127D rep movsd<br />
Call rotate_all_with_password():<br />
.text:0054127F call rotate_all_with_password<br />
Copy crypted contents back from “cube64” <strong>to</strong> buffer:<br />
.text:00541284 mov eax, [esp+18h+arg_4]<br />
.text:00541288 mov edi, ebx<br />
.text:0054128A add ebp, 40h<br />
.text:0054128D add esp, 8<br />
.text:00541290 mov ecx, 10h<br />
.text:00541295 mov esi, offset cube64<br />
.text:0054129A add ebx, 40h ; add 64 <strong>to</strong> input buffer pointer<br />
.text:0054129D cmp ebp, eax ; EBP contain ammount of crypted data.<br />
.text:0054129F rep movsd<br />
If EBP is not bigger that input argument size, then continue <strong>to</strong> next block.<br />
.text:005412A1 jl short loc_54126A<br />
.text:005412A3 pop edi<br />
.text:005412A4 pop esi<br />
.text:005412A5 pop ebp<br />
.text:005412A6 pop ebx<br />
.text:005412A7 retn<br />
.text:005412A7 crypt endp<br />
Reconstructed crypt() function:<br />
void crypt (BYTE *buf, int sz, char *pw)<br />
{<br />
int i=0;<br />
};<br />
do<br />
{<br />
}<br />
while (i
.text:005411B5 mov ebp, eax<br />
Check <strong>for</strong> character in password. If it is zero, exit:<br />
.text:005411B7 cmp byte ptr [eax], 0<br />
.text:005411BA jz exit<br />
.text:005411C0 push ebx<br />
.text:005411C1 mov ebx, [esp+8+arg_4]<br />
.text:005411C5 push esi<br />
.text:005411C6 push edi<br />
.text:005411C7<br />
.text:005411C7 loop_begin:<br />
Call <strong>to</strong>lower(), standard C function.<br />
.text:005411C7 movsx eax, byte ptr [ebp+0]<br />
.text:005411CB push eax ; C<br />
.text:005411CC call _<strong>to</strong>lower<br />
.text:005411D1 add esp, 4<br />
Hmm, if password contains non-alphabetical latin character, it is skipped! Indeed, if we run crypting<br />
utility and try non-alphabetical latin characters in password, they seem <strong>to</strong> be ignored.<br />
.text:005411D4 cmp al, ’a’<br />
.text:005411D6 jl short next_character_in_password<br />
.text:005411D8 cmp al, ’z’<br />
.text:005411DA jg short next_character_in_password<br />
.text:005411DC movsx ecx, al<br />
Subtract “a” value (97) from character.<br />
.text:005411DF sub ecx, ’a’ ; 97<br />
After subtracting, we’ll get 0 <strong>for</strong> “a” here, 1 <strong>for</strong> “b”, etc. And 25 <strong>for</strong> “z”.<br />
.text:005411E2 cmp ecx, 24<br />
.text:005411E5 jle short skip_subtracting<br />
.text:005411E7 sub ecx, 24<br />
It seems, “y” and “z” are exceptional characters <strong>to</strong>o. After that piece of code, “y” becomes 0 and “z” —<br />
1. This means, 26 latin alphabet symbols will become values in range 0..23, (24 in <strong>to</strong>tal).<br />
.text:005411EA<br />
.text:005411EA skip_subtracting: ; CODE XREF:<br />
rotate_all_with_password+35<br />
This is actually division via multiplication. Read more about it in “Division by 9” section 1.11.<br />
The code actually divides password character value by 3.<br />
.text:005411EA mov eax, 55555556h<br />
.text:005411EF imul ecx<br />
.text:005411F1 mov eax, edx<br />
.text:005411F3 shr eax, 1Fh<br />
.text:005411F6 add edx, eax<br />
.text:005411F8 mov eax, ecx<br />
.text:005411FA mov esi, edx<br />
.text:005411FC mov ecx, 3<br />
.text:00541201 cdq<br />
.text:00541202 idiv ecx<br />
169
EDX is the remainder of division.<br />
.text:00541204 sub edx, 0<br />
.text:00541207 jz short call_rotate1 ; if remainder is zero, go <strong>to</strong> rotate1<br />
.text:00541209 dec edx<br />
.text:0054120A jz short call_rotate2 ; .. it it is 1, go <strong>to</strong> rotate2<br />
.text:0054120C dec edx<br />
.text:0054120D jnz short next_character_in_password<br />
.text:0054120F test ebx, ebx<br />
.text:00541211 jle short next_character_in_password<br />
.text:00541213 mov edi, ebx<br />
If remainder is 2, call rotate3(). EDI is a second argument of rotate_all_with_password(). As I<br />
already wrote, 1 is <strong>for</strong> crypting operations and 3 is <strong>for</strong> decrypting. So, here is a loop. When crypting,<br />
rotate1/2/3 will be called the same number of times as given in the first argument.<br />
.text:00541215 call_rotate3:<br />
.text:00541215 push esi<br />
.text:00541216 call rotate3<br />
.text:0054121B add esp, 4<br />
.text:0054121E dec edi<br />
.text:0054121F jnz short call_rotate3<br />
.text:00541221 jmp short next_character_in_password<br />
.text:00541223<br />
.text:00541223 call_rotate2:<br />
.text:00541223 test ebx, ebx<br />
.text:00541225 jle short next_character_in_password<br />
.text:00541227 mov edi, ebx<br />
.text:00541229<br />
.text:00541229 loc_541229:<br />
.text:00541229 push esi<br />
.text:0054122A call rotate2<br />
.text:0054122F add esp, 4<br />
.text:00541232 dec edi<br />
.text:00541233 jnz short loc_541229<br />
.text:00541235 jmp short next_character_in_password<br />
.text:00541237<br />
.text:00541237 call_rotate1:<br />
.text:00541237 test ebx, ebx<br />
.text:00541239 jle short next_character_in_password<br />
.text:0054123B mov edi, ebx<br />
.text:0054123D<br />
.text:0054123D loc_54123D:<br />
.text:0054123D push esi<br />
.text:0054123E call rotate1<br />
.text:00541243 add esp, 4<br />
.text:00541246 dec edi<br />
.text:00541247 jnz short loc_54123D<br />
.text:00541249<br />
Fetch next character from password string.<br />
.text:00541249 next_character_in_password:<br />
.text:00541249 mov al, [ebp+1]<br />
Increment character pointer within password string:<br />
170
.text:0054124C inc ebp<br />
.text:0054124D test al, al<br />
.text:0054124F jnz loop_begin<br />
.text:00541255 pop edi<br />
.text:00541256 pop esi<br />
.text:00541257 pop ebx<br />
.text:00541258<br />
.text:00541258 exit:<br />
.text:00541258 pop ebp<br />
.text:00541259 retn<br />
.text:00541259 rotate_all_with_password endp<br />
Here is reconstructed C code:<br />
void rotate_all (char *pwd, int v)<br />
{<br />
char *p=pwd;<br />
};<br />
while (*p)<br />
{<br />
char c=*p;<br />
int q;<br />
};<br />
c=<strong>to</strong>lower (c);<br />
if (c>=’a’ && c24)<br />
q-=24;<br />
};<br />
p++;<br />
int quotient=q/3;<br />
int remainder=q % 3;<br />
switch (remainder)<br />
{<br />
case 0: <strong>for</strong> (int i=0; i
.text:00541050<br />
.text:00541050 mov eax, [esp+arg_4]<br />
.text:00541054 mov ecx, [esp+arg_0]<br />
.text:00541058 mov al, cube64[eax+ecx*8]<br />
.text:0054105F mov cl, [esp+arg_8]<br />
.text:00541063 shr al, cl<br />
.text:00541065 and al, 1<br />
.text:00541067 retn<br />
.text:00541067 get_bit endp<br />
... in other words: calculate an index in the array cube64: arg_4 + arg_0 * 8. Then shift a byte from<br />
an array by arg_8 bits right. Isolate lowest bit and return it.<br />
Let’s see another function, set_bit():<br />
.text:00541000 set_bit proc near<br />
.text:00541000<br />
.text:00541000 arg_0 = dword ptr 4<br />
.text:00541000 arg_4 = dword ptr 8<br />
.text:00541000 arg_8 = dword ptr 0Ch<br />
.text:00541000 arg_C = byte ptr 10h<br />
.text:00541000<br />
.text:00541000 mov al, [esp+arg_C]<br />
.text:00541004 mov ecx, [esp+arg_8]<br />
.text:00541008 push esi<br />
.text:00541009 mov esi, [esp+4+arg_0]<br />
.text:0054100D test al, al<br />
.text:0054100F mov eax, [esp+4+arg_4]<br />
.text:00541013 mov dl, 1<br />
.text:00541015 jz short loc_54102B<br />
DL is 1 here. Shift left it by arg_8. For example, if arg_8 is 4, DL register value became 0x10 or 1000<br />
in binary <strong>for</strong>m.<br />
.text:00541017 shl dl, cl<br />
.text:00541019 mov cl, cube64[eax+esi*8]<br />
Get bit from array and explicitly set one.<br />
.text:00541020 or cl, dl<br />
S<strong>to</strong>re it back:<br />
.text:00541022 mov cube64[eax+esi*8], cl<br />
.text:00541029 pop esi<br />
.text:0054102A retn<br />
.text:0054102B ;<br />
---------------------------------------------------------------------------<br />
.text:0054102B<br />
.text:0054102B loc_54102B:<br />
.text:0054102B shl dl, cl<br />
If arg_C is not zero...<br />
.text:0054102D mov cl, cube64[eax+esi*8]<br />
... invert DL. For example, if DL state after shift was 0x10 or 1000 in binary <strong>for</strong>m, there will be 0xEF<br />
after NOT instruction or 11101111 in binary <strong>for</strong>m.<br />
.text:00541034 not dl<br />
172
This instruction clears bit, in other words, it saves all bits in CL which are also set in DL except those<br />
in DL which are cleared. This means that if DL is, <strong>for</strong> example, 11101111 in binary <strong>for</strong>m, all bits will be<br />
saved except 5th (counting from lowest bit).<br />
.text:00541036 and cl, dl<br />
S<strong>to</strong>re it back:<br />
.text:00541038 mov cube64[eax+esi*8], cl<br />
.text:0054103F pop esi<br />
.text:00541040 retn<br />
.text:00541040 set_bit endp<br />
It is almost the same as get_bit(), except, if arg_C is zero, the function clears specific bit in array, or<br />
sets it otherwise.<br />
We also know that array size is 64. First two arguments both in set_bit() and get_bit() could be seen<br />
as 2D cooridnates. Then array will be 8*8 matrix.<br />
Here is C representation of what we already know:<br />
#define IS_SET(flag, bit) ((flag) & (bit))<br />
#define SET_BIT(var, bit) ((var) |= (bit))<br />
#define REMOVE_BIT(var, bit) ((var) &= ~(bit))<br />
char cube[8][8];<br />
void set_bit (int x, int y, int shift, int bit)<br />
{<br />
if (bit)<br />
SET_BIT (cube[x][y], 1
EBX is a pointer <strong>to</strong> internal array:<br />
.text:0054107D lea ebx, [esp+50h+internal_array_64]<br />
.text:00541081<br />
Two nested loops are here:<br />
.text:00541081 first_loop1_begin:<br />
.text:00541081 xor esi, esi ; ESI is loop 2 counter<br />
.text:00541083<br />
.text:00541083 first_loop2_begin:<br />
.text:00541083 push ebp ; arg_0<br />
.text:00541084 push esi ; loop 1 counter<br />
.text:00541085 push edi ; loop 2 counter<br />
.text:00541086 call get_bit<br />
.text:0054108B add esp, 0Ch<br />
.text:0054108E mov [ebx+esi], al ; s<strong>to</strong>re <strong>to</strong> internal array<br />
.text:00541091 inc esi ; increment loop 1 counter<br />
.text:00541092 cmp esi, 8<br />
.text:00541095 jl short first_loop2_begin<br />
.text:00541097 inc edi ; increment loop 2 counter<br />
.text:00541098 add ebx, 8 ; increment internal array pointer by 8 at each<br />
loop 1 iteration<br />
.text:0054109B cmp edi, 8<br />
.text:0054109E jl short first_loop1_begin<br />
... we see that both loop counters are in range 0..7. Also, they are used as first and second arguments of<br />
get_bit(). Third argument of get_bit() is the only argument of rotate1(). What get_bit() returns, is<br />
being placed in<strong>to</strong> internal array.<br />
Prepare pointer <strong>to</strong> internal array again:<br />
.text:005410A0 lea ebx, [esp+50h+internal_array_64]<br />
.text:005410A4 mov edi, 7 ; EDI is loop 1 counter, initial state is 7<br />
.text:005410A9<br />
.text:005410A9 second_loop1_begin:<br />
.text:005410A9 xor esi, esi ; ESI is loop 2 counter<br />
.text:005410AB<br />
.text:005410AB second_loop2_begin:<br />
.text:005410AB mov al, [ebx+esi] ; value from internal array<br />
.text:005410AE push eax<br />
.text:005410AF push ebp ; arg_0<br />
.text:005410B0 push edi ; loop 1 counter<br />
.text:005410B1 push esi ; loop 2 counter<br />
.text:005410B2 call set_bit<br />
.text:005410B7 add esp, 10h<br />
.text:005410BA inc esi ; increment loop 2 counter<br />
.text:005410BB cmp esi, 8<br />
.text:005410BE jl short second_loop2_begin<br />
.text:005410C0 dec edi ; decrement loop 2 counter<br />
.text:005410C1 add ebx, 8 ; increment pointer in internal array<br />
.text:005410C4 cmp edi, 0FFFFFFFFh<br />
.text:005410C7 jg short second_loop1_begin<br />
.text:005410C9 pop edi<br />
.text:005410CA pop esi<br />
.text:005410CB pop ebp<br />
.text:005410CC pop ebx<br />
174
.text:005410CD add esp, 40h<br />
.text:005410D0 retn<br />
.text:005410D0 rotate1 endp<br />
... this code is placing contents from internal array <strong>to</strong> cube global array via set_bit() function, but, in<br />
different order! Now loop 1 counter is in range 7 <strong>to</strong> 0, decrementing at each iteration!<br />
C code representation looks like:<br />
void rotate1 (int v)<br />
{<br />
bool tmp[8][8]; // internal array<br />
int i, j;<br />
};<br />
<strong>for</strong> (i=0; i
.text:00541114 mov edi, 7 ; loop 1 counter is initial state 7<br />
.text:00541119<br />
.text:00541119 loc_541119:<br />
.text:00541119 xor esi, esi ; loop 2 counter<br />
.text:0054111B<br />
.text:0054111B loc_54111B:<br />
.text:0054111B mov al, [ebx+esi] ; get byte from internal array<br />
.text:0054111E push eax<br />
.text:0054111F push edi ; loop 1 counter<br />
.text:00541120 push esi ; loop 2 counter<br />
.text:00541121 push ebp ; arg_0<br />
.text:00541122 call set_bit<br />
.text:00541127 add esp, 10h<br />
.text:0054112A inc esi ; increment loop 2 counter<br />
.text:0054112B cmp esi, 8<br />
.text:0054112E jl short loc_54111B<br />
.text:00541130 dec edi ; decrement loop 2 counter<br />
.text:00541131 add ebx, 8<br />
.text:00541134 cmp edi, 0FFFFFFFFh<br />
.text:00541137 jg short loc_541119<br />
.text:00541139 pop edi<br />
.text:0054113A pop esi<br />
.text:0054113B pop ebp<br />
.text:0054113C pop ebx<br />
.text:0054113D add esp, 40h<br />
.text:00541140 retn<br />
.text:00541140 rotate2 endp<br />
It is almost the same, except of different order of arguments of get_bit() and set_bit(). Let’s rewrite<br />
it in C-like code:<br />
void rotate2 (int v)<br />
{<br />
bool tmp[8][8]; // internal array<br />
int i, j;<br />
};<br />
<strong>for</strong> (i=0; i
};<br />
<strong>for</strong> (i=0; i
Now it is much simpler <strong>to</strong> understand: each password character defines side (one of three) and plane (one<br />
of 8). 3*8 = 24, that’s why two last characters of latin alphabet are remapped <strong>to</strong> fit an alphabet of exactly<br />
24 elements.<br />
The algorithm is clearly weak: in case of short passwords, one can see, that in crypted file there are some<br />
original bytes of original file in binary files edi<strong>to</strong>r.<br />
Here is reconstructed whole source code:<br />
#include <br />
#include <br />
#include <br />
#define IS_SET(flag, bit) ((flag) & (bit))<br />
#define SET_BIT(var, bit) ((var) |= (bit))<br />
#define REMOVE_BIT(var, bit) ((var) &= ~(bit))<br />
static BYTE cube[8][8];<br />
void set_bit (int x, int y, int z, bool bit)<br />
{<br />
if (bit)<br />
SET_BIT (cube[x][y], 1
};<br />
tmp[y][z]=get_bit (row, y, z);<br />
<strong>for</strong> (y=0; y
{<br />
};<br />
int i=0;<br />
do<br />
{<br />
}<br />
while (i
};<br />
fread (buf, flen, 1, f);<br />
fclose (f);<br />
crypt (buf, flen_aligned, pw);<br />
f=fopen(fout, "wb");<br />
fwrite ("QR9", 3, 1, f);<br />
fwrite (&flen, 4, 1, f);<br />
fwrite (buf, flen_aligned, 1, f);<br />
fclose (f);<br />
free (buf);<br />
void decrypt_file(char *fin, char* fout, char *pw)<br />
{<br />
FILE *f;<br />
int real_flen, flen;<br />
BYTE *buf;<br />
f=fopen(fin, "rb");<br />
if (f==NULL)<br />
{<br />
printf ("Cannot open input file!\n");<br />
return;<br />
};<br />
fseek (f, 0, SEEK_END);<br />
flen=ftell (f);<br />
fseek (f, 0, SEEK_SET);<br />
buf=(BYTE*)malloc (flen);<br />
fread (buf, flen, 1, f);<br />
fclose (f);<br />
if (memcmp (buf, "QR9", 3)!=0)<br />
{<br />
printf ("File is not crypted!\n");<br />
return;<br />
};<br />
memcpy (&real_flen, buf+3, 4);<br />
decrypt (buf+(3+4), flen-(3+4), pw);<br />
f=fopen(fout, "wb");<br />
181
};<br />
fwrite (buf+(3+4), real_flen, 1, f);<br />
fclose (f);<br />
free (buf);<br />
// run: input output 0/1 password<br />
// 0 <strong>for</strong> encrypt, 1 <strong>for</strong> decrypt<br />
int main(int argc, char *argv[])<br />
{<br />
if (argc!=5)<br />
{<br />
printf ("Incorrect parameters!\n");<br />
return 1;<br />
};<br />
};<br />
if (strcmp (argv[3], "0")==0)<br />
crypt_file (argv[1], argv[2], argv[4]);<br />
else<br />
if (strcmp (argv[3], "1")==0)<br />
decrypt_file (argv[1], argv[2], argv[4]);<br />
else<br />
printf ("Wrong param %s\n", argv[3]);<br />
return 0;<br />
182
7.2 SAP client network traffic compression case<br />
(Tracing connection between TDW_NOCOMPRESS SAPGUI 3 environment variable <strong>to</strong> nagging pop-up<br />
window and actual data compression routine.)<br />
It’s known that network traffic between SAPGUI and SAP is not crypted by default, it’s rather compressed<br />
(read here and here).<br />
It’s also known that by setting environment variable TDW_NOCOMPRESS <strong>to</strong> 1, it’s possible <strong>to</strong> turn<br />
network packets compression off.<br />
But you will see a nagging pop-up windows that cannot be closed:<br />
Figure 7.1: Screenshot<br />
Let’s see, if we can remove that window somehow.<br />
But be<strong>for</strong>e this, let’s see what we already know. First: we know that environment variable TDW_NOCOMPRESS<br />
is checked somewhere inside of SAPGUI client. Second: string like “data compression switched off” must be<br />
present somewhere <strong>to</strong>o. With the help of FAR file manager I found that both of these strings are s<strong>to</strong>red in<br />
the SAPguilib.dll file.<br />
So let’s open SAPguilib.dll in IDA 5 and search <strong>for</strong> “TDW_NOCOMPRESS” string. Yes, it is present<br />
and there is only one reference <strong>to</strong> it.<br />
We see the following piece of code (all file offsets are valid <strong>for</strong> SAPGUI 720 win32, SAPguilib.dll file<br />
version 7200,1,0,9009):<br />
.text:6440D51B lea eax, [ebp+2108h+var_211C]<br />
.text:6440D51E push eax ; int<br />
.text:6440D51F push offset aTdw_nocompress ; "TDW_NOCOMPRESS"<br />
.text:6440D524 mov byte ptr [edi+15h], 0<br />
.text:6440D528 call chk_env<br />
.text:6440D52D pop ecx<br />
.text:6440D52E pop ecx<br />
.text:6440D52F push offset byte_64443AF8<br />
.text:6440D534 lea ecx, [ebp+2108h+var_211C]<br />
; demangled name: int ATL::CStringT::Compare(char const *)const<br />
.text:6440D537 call ds:mfc90_1603<br />
3 SAP GUI client<br />
183
.text:6440D53D test eax, eax<br />
.text:6440D53F jz short loc_6440D55A<br />
.text:6440D541 lea ecx, [ebp+2108h+var_211C]<br />
; demangled name: const char* ATL::CSimpleStringT::opera<strong>to</strong>r PCXSTR<br />
.text:6440D544 call ds:mfc90_910<br />
.text:6440D54A push eax ; Str<br />
.text:6440D54B call ds:a<strong>to</strong>i<br />
.text:6440D551 test eax, eax<br />
.text:6440D553 setnz al<br />
.text:6440D556 pop ecx<br />
.text:6440D557 mov [edi+15h], al<br />
String returned by chk_env() via second argument is then handled by MFC string functions and then<br />
a<strong>to</strong>i() 4 is called. After that, numerical value is s<strong>to</strong>red <strong>to</strong> edi+15h.<br />
Also, take a look on<strong>to</strong> chk_env() function (I gave a name <strong>to</strong> it):<br />
.text:64413F20 ; int __cdecl chk_env(char *VarName, int)<br />
.text:64413F20 chk_env proc near<br />
.text:64413F20<br />
.text:64413F20 DstSize = dword ptr -0Ch<br />
.text:64413F20 var_8 = dword ptr -8<br />
.text:64413F20 DstBuf = dword ptr -4<br />
.text:64413F20 VarName = dword ptr 8<br />
.text:64413F20 arg_4 = dword ptr 0Ch<br />
.text:64413F20<br />
.text:64413F20 push ebp<br />
.text:64413F21 mov ebp, esp<br />
.text:64413F23 sub esp, 0Ch<br />
.text:64413F26 mov [ebp+DstSize], 0<br />
.text:64413F2D mov [ebp+DstBuf], 0<br />
.text:64413F34 push offset unk_6444C88C<br />
.text:64413F39 mov ecx, [ebp+arg_4]<br />
; (demangled name) ATL::CStringT::opera<strong>to</strong>r=(char const *)<br />
.text:64413F3C call ds:mfc90_820<br />
.text:64413F42 mov eax, [ebp+VarName]<br />
.text:64413F45 push eax ; VarName<br />
.text:64413F46 mov ecx, [ebp+DstSize]<br />
.text:64413F49 push ecx ; DstSize<br />
.text:64413F4A mov edx, [ebp+DstBuf]<br />
.text:64413F4D push edx ; DstBuf<br />
.text:64413F4E lea eax, [ebp+DstSize]<br />
.text:64413F51 push eax ; ReturnSize<br />
.text:64413F52 call ds:getenv_s<br />
.text:64413F58 add esp, 10h<br />
.text:64413F5B mov [ebp+var_8], eax<br />
.text:64413F5E cmp [ebp+var_8], 0<br />
.text:64413F62 jz short loc_64413F68<br />
.text:64413F64 xor eax, eax<br />
.text:64413F66 jmp short loc_64413FBC<br />
.text:64413F68 ;<br />
---------------------------------------------------------------------------<br />
4 standard C library function, coverting number in string in<strong>to</strong> number<br />
184
.text:64413F68<br />
.text:64413F68 loc_64413F68:<br />
.text:64413F68 cmp [ebp+DstSize], 0<br />
.text:64413F6C jnz short loc_64413F72<br />
.text:64413F6E xor eax, eax<br />
.text:64413F70 jmp short loc_64413FBC<br />
.text:64413F72 ;<br />
---------------------------------------------------------------------------<br />
.text:64413F72<br />
.text:64413F72 loc_64413F72:<br />
.text:64413F72 mov ecx, [ebp+DstSize]<br />
.text:64413F75 push ecx<br />
.text:64413F76 mov ecx, [ebp+arg_4]<br />
; demangled name: ATL::CSimpleStringT::Preallocate(int)<br />
.text:64413F79 call ds:mfc90_2691<br />
.text:64413F7F mov [ebp+DstBuf], eax<br />
.text:64413F82 mov edx, [ebp+VarName]<br />
.text:64413F85 push edx ; VarName<br />
.text:64413F86 mov eax, [ebp+DstSize]<br />
.text:64413F89 push eax ; DstSize<br />
.text:64413F8A mov ecx, [ebp+DstBuf]<br />
.text:64413F8D push ecx ; DstBuf<br />
.text:64413F8E lea edx, [ebp+DstSize]<br />
.text:64413F91 push edx ; ReturnSize<br />
.text:64413F92 call ds:getenv_s<br />
.text:64413F98 add esp, 10h<br />
.text:64413F9B mov [ebp+var_8], eax<br />
.text:64413F9E push 0FFFFFFFFh<br />
.text:64413FA0 mov ecx, [ebp+arg_4]<br />
; demangled name: ATL::CSimpleStringT::ReleaseBuffer(int)<br />
.text:64413FA3 call ds:mfc90_5835<br />
.text:64413FA9 cmp [ebp+var_8], 0<br />
.text:64413FAD jz short loc_64413FB3<br />
.text:64413FAF xor eax, eax<br />
.text:64413FB1 jmp short loc_64413FBC<br />
.text:64413FB3 ;<br />
---------------------------------------------------------------------------<br />
.text:64413FB3<br />
.text:64413FB3 loc_64413FB3:<br />
.text:64413FB3 mov ecx, [ebp+arg_4]<br />
; demangled name: const char* ATL::CSimpleStringT::opera<strong>to</strong>r PCXSTR<br />
.text:64413FB6 call ds:mfc90_910<br />
.text:64413FBC<br />
.text:64413FBC loc_64413FBC:<br />
.text:64413FBC<br />
.text:64413FBC mov esp, ebp<br />
.text:64413FBE pop ebp<br />
.text:64413FBF retn<br />
.text:64413FBF chk_env endp<br />
185
Yes. getenv_s() 5 function is Microsoft security-enhanced version of getenv() 6 .<br />
There are also some MFC string manipulations.<br />
Lots of other environment variables are checked as well. Here is a list of all variables being checked and<br />
what SAPGUI could write <strong>to</strong> trace log when logging is turned on:<br />
DPTRACE “GUI-OPTION: Trace set <strong>to</strong> %d”<br />
TDW_HEXDUMP “GUI-OPTION: Hexdump enabled”<br />
TDW_WORKDIR “GUI-OPTION: working direc<strong>to</strong>ry ‘%s ´ ’’<br />
TDW_SPLASHSRCEENOFF “GUI-OPTION: Splash Screen Off” / “GUI-OPTION: Splash Screen On”<br />
TDW_REPLYTIMEOUT “GUI-OPTION: reply timeout %d milliseconds”<br />
TDW_PLAYBACKTIMEOUT “GUI-OPTION: PlaybackTimeout set <strong>to</strong> %d milliseconds”<br />
TDW_NOCOMPRESS “GUI-OPTION: no compression read”<br />
TDW_EXPERT “GUI-OPTION: expert mode”<br />
TDW_PLAYBACKPROGRESS “GUI-OPTION: PlaybackProgress”<br />
TDW_PLAYBACKNETTRAFFIC “GUI-OPTION: PlaybackNetTraffic”<br />
TDW_PLAYLOG “GUI-OPTION: /PlayLog is YES, file %s”<br />
TDW_PLAYTIME “GUI-OPTION: /PlayTime set <strong>to</strong> %d milliseconds”<br />
TDW_LOGFILE “GUI-OPTION: TDW_LOGFILE ‘%s ´ ’’<br />
TDW_WAN “GUI-OPTION: WAN - low speed connection enabled”<br />
TDW_FULLMENU “GUI-OPTION: FullMenu enabled”<br />
SAP_CP / SAP_CODEPAGE “GUI-OPTION: SAP_CODEPAGE ‘%d ´ ’’<br />
UPDOWNLOAD_CP “GUI-OPTION: UPDOWNLOAD_CP ‘%d ´ ’’<br />
SNC_PARTNERNAME “GUI-OPTION: SNC name ‘%s ´ ’’<br />
SNC_QOP “GUI-OPTION: SNC_QOP ‘%s ´ ’’<br />
SNC_LIB “GUI-OPTION: SNC is set <strong>to</strong>: %s”<br />
SAPGUI_INPLACE “GUI-OPTION: environment variable SAPGUI_INPLACE is on”<br />
Settings <strong>for</strong> each variable are written <strong>to</strong> the array via pointer in EDI register. EDI is being set be<strong>for</strong>e that<br />
function call:<br />
.text:6440EE00 lea edi, [ebp+2884h+var_2884] ; options here like +0x15<br />
...<br />
.text:6440EE03 lea ecx, [esi+24h]<br />
.text:6440EE06 call load_command_line<br />
.text:6440EE0B mov edi, eax<br />
.text:6440EE0D xor ebx, ebx<br />
.text:6440EE0F cmp edi, ebx<br />
.text:6440EE11 jz short loc_6440EE42<br />
.text:6440EE13 push edi<br />
.text:6440EE14 push offset aSapguiS<strong>to</strong>ppedA ; "Sapgui s<strong>to</strong>pped after<br />
commandline interp"...<br />
.text:6440EE19 push dword_644F93E8<br />
.text:6440EE1F call FEWTraceError<br />
Now, can we find “data record mode switched on” string? Yes, and here is the only reference in function<br />
CDwsGui::PrepareInfoWindow(). How do I know class/method names? There is a lot of special debugging<br />
calls writing <strong>to</strong> log-files like:<br />
.text:64405160 push dword ptr [esi+2854h]<br />
.text:64405166 push offset aCdwsguiPrepare ; "\nCDwsGui::<br />
PrepareInfoWindow: sapgui env"...<br />
.text:6440516B push dword ptr [esi+2848h]<br />
5 http://msdn.microsoft.com/en-us/library/tb2sfw2z(VS.80).aspx<br />
6 Standard C library returning environment variable<br />
186
.text:64405171 call dbg<br />
.text:64405176 add esp, 0Ch<br />
... or:<br />
.text:6440237A push eax<br />
.text:6440237B push offset aCclientStart_6 ; "CClient::Start: set<br />
shortcut user <strong>to</strong> ’\%"...<br />
.text:64402380 push dword ptr [edi+4]<br />
.text:64402383 call dbg<br />
.text:64402388 add esp, 0Ch<br />
It’s very useful.<br />
So let’s see contents of that nagging pop-up window function:<br />
.text:64404F4F CDwsGui__PrepareInfoWindow proc near<br />
.text:64404F4F<br />
.text:64404F4F pvParam = byte ptr -3Ch<br />
.text:64404F4F var_38 = dword ptr -38h<br />
.text:64404F4F var_34 = dword ptr -34h<br />
.text:64404F4F rc = tagRECT ptr -2Ch<br />
.text:64404F4F cy = dword ptr -1Ch<br />
.text:64404F4F h = dword ptr -18h<br />
.text:64404F4F var_14 = dword ptr -14h<br />
.text:64404F4F var_10 = dword ptr -10h<br />
.text:64404F4F var_4 = dword ptr -4<br />
.text:64404F4F<br />
.text:64404F4F push 30h<br />
.text:64404F51 mov eax, offset loc_64438E00<br />
.text:64404F56 call __EH_prolog3<br />
.text:64404F5B mov esi, ecx ; ECX is pointer <strong>to</strong> object<br />
.text:64404F5D xor ebx, ebx<br />
.text:64404F5F lea ecx, [ebp+var_14]<br />
.text:64404F62 mov [ebp+var_10], ebx<br />
; demangled name: ATL::CStringT(void)<br />
.text:64404F65 call ds:mfc90_316<br />
.text:64404F6B mov [ebp+var_4], ebx<br />
.text:64404F6E lea edi, [esi+2854h]<br />
.text:64404F74 push offset aEnvironmentInf ; "Environment in<strong>for</strong>mation:\<br />
n"<br />
.text:64404F79 mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r=(char const *)<br />
.text:64404F7B call ds:mfc90_820<br />
.text:64404F81 cmp [esi+38h], ebx<br />
.text:64404F84 mov ebx, ds:mfc90_2539<br />
.text:64404F8A jbe short loc_64404FA9<br />
.text:64404F8C push dword ptr [esi+34h]<br />
.text:64404F8F lea eax, [ebp+var_14]<br />
.text:64404F92 push offset aWorkingDirec<strong>to</strong> ; "working direc<strong>to</strong>ry: ’\%s’\<br />
n"<br />
.text:64404F97 push eax<br />
; demangled name: ATL::CStringT::Format(char const *,...)<br />
.text:64404F98 call ebx ; mfc90_2539<br />
187
.text:64404F9A add esp, 0Ch<br />
.text:64404F9D lea eax, [ebp+var_14]<br />
.text:64404FA0 push eax<br />
.text:64404FA1 mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(class ATL::CSimpleStringT const &)<br />
.text:64404FA3 call ds:mfc90_941<br />
.text:64404FA9<br />
.text:64404FA9 loc_64404FA9:<br />
.text:64404FA9 mov eax, [esi+38h]<br />
.text:64404FAC test eax, eax<br />
.text:64404FAE jbe short loc_64404FD3<br />
.text:64404FB0 push eax<br />
.text:64404FB1 lea eax, [ebp+var_14]<br />
.text:64404FB4 push offset aTraceLevelDAct ; "trace level \%d activated<br />
\n"<br />
.text:64404FB9 push eax<br />
; demangled name: ATL::CStringT::Format(char const *,...)<br />
.text:64404FBA call ebx ; mfc90_2539<br />
.text:64404FBC add esp, 0Ch<br />
.text:64404FBF lea eax, [ebp+var_14]<br />
.text:64404FC2 push eax<br />
.text:64404FC3 mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(class ATL::CSimpleStringT const &)<br />
.text:64404FC5 call ds:mfc90_941<br />
.text:64404FCB xor ebx, ebx<br />
.text:64404FCD inc ebx<br />
.text:64404FCE mov [ebp+var_10], ebx<br />
.text:64404FD1 jmp short loc_64404FD6<br />
.text:64404FD3 ;<br />
---------------------------------------------------------------------------<br />
.text:64404FD3<br />
.text:64404FD3 loc_64404FD3:<br />
.text:64404FD3 xor ebx, ebx<br />
.text:64404FD5 inc ebx<br />
.text:64404FD6<br />
.text:64404FD6 loc_64404FD6:<br />
.text:64404FD6 cmp [esi+38h], ebx<br />
.text:64404FD9 jbe short loc_64404FF1<br />
.text:64404FDB cmp dword ptr [esi+2978h], 0<br />
.text:64404FE2 jz short loc_64404FF1<br />
.text:64404FE4 push offset aHexdumpInTrace ; "hexdump in trace<br />
activated\n"<br />
.text:64404FE9 mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(char const *)<br />
.text:64404FEB call ds:mfc90_945<br />
.text:64404FF1<br />
.text:64404FF1 loc_64404FF1:<br />
.text:64404FF1<br />
.text:64404FF1 cmp byte ptr [esi+78h], 0<br />
.text:64404FF5 jz short loc_64405007<br />
188
.text:64404FF7 push offset aLoggingActivat ; "logging activated\n"<br />
.text:64404FFC mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(char const *)<br />
.text:64404FFE call ds:mfc90_945<br />
.text:64405004 mov [ebp+var_10], ebx<br />
.text:64405007<br />
.text:64405007 loc_64405007:<br />
.text:64405007 cmp byte ptr [esi+3Dh], 0<br />
.text:6440500B jz short bypass<br />
.text:6440500D push offset aDataCompressio ; "data compression switched<br />
off\n"<br />
.text:64405012 mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(char const *)<br />
.text:64405014 call ds:mfc90_945<br />
.text:6440501A mov [ebp+var_10], ebx<br />
.text:6440501D<br />
.text:6440501D bypass:<br />
.text:6440501D mov eax, [esi+20h]<br />
.text:64405020 test eax, eax<br />
.text:64405022 jz short loc_6440503A<br />
.text:64405024 cmp dword ptr [eax+28h], 0<br />
.text:64405028 jz short loc_6440503A<br />
.text:6440502A push offset aDataRecordMode ; "data record mode switched<br />
on\n"<br />
.text:6440502F mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(char const *)<br />
.text:64405031 call ds:mfc90_945<br />
.text:64405037 mov [ebp+var_10], ebx<br />
.text:6440503A<br />
.text:6440503A loc_6440503A:<br />
.text:6440503A<br />
.text:6440503A mov ecx, edi<br />
.text:6440503C cmp [ebp+var_10], ebx<br />
.text:6440503F jnz loc_64405142<br />
.text:64405045 push offset aForMaximumData ; "\nFor maximum data<br />
security delete\nthe s"...<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(char const *)<br />
.text:6440504A call ds:mfc90_945<br />
.text:64405050 xor edi, edi<br />
.text:64405052 push edi ; fWinIni<br />
.text:64405053 lea eax, [ebp+pvParam]<br />
.text:64405056 push eax ; pvParam<br />
.text:64405057 push edi ; uiParam<br />
.text:64405058 push 30h ; uiAction<br />
.text:6440505A call ds:SystemParametersInfoA<br />
.text:64405060 mov eax, [ebp+var_34]<br />
.text:64405063 cmp eax, 1600<br />
.text:64405068 jle short loc_64405072<br />
.text:6440506A cdq<br />
.text:6440506B sub eax, edx<br />
189
.text:6440506D sar eax, 1<br />
.text:6440506F mov [ebp+var_34], eax<br />
.text:64405072<br />
.text:64405072 loc_64405072:<br />
.text:64405072 push edi ; hWnd<br />
.text:64405073 mov [ebp+cy], 0A0h<br />
.text:6440507A call ds:GetDC<br />
.text:64405080 mov [ebp+var_10], eax<br />
.text:64405083 mov ebx, 12Ch<br />
.text:64405088 cmp eax, edi<br />
.text:6440508A jz loc_64405113<br />
.text:64405090 push 11h ; i<br />
.text:64405092 call ds:GetS<strong>to</strong>ckObject<br />
.text:64405098 mov edi, ds:SelectObject<br />
.text:6440509E push eax ; h<br />
.text:6440509F push [ebp+var_10] ; hdc<br />
.text:644050A2 call edi ; SelectObject<br />
.text:644050A4 and [ebp+rc.left], 0<br />
.text:644050A8 and [ebp+rc.<strong>to</strong>p], 0<br />
.text:644050AC mov [ebp+h], eax<br />
.text:644050AF push 401h ; <strong>for</strong>mat<br />
.text:644050B4 lea eax, [ebp+rc]<br />
.text:644050B7 push eax ; lprc<br />
.text:644050B8 lea ecx, [esi+2854h]<br />
.text:644050BE mov [ebp+rc.right], ebx<br />
.text:644050C1 mov [ebp+rc.bot<strong>to</strong>m], 0B4h<br />
; demangled name: ATL::CSimpleStringT::GetLength(void)<br />
.text:644050C8 call ds:mfc90_3178<br />
.text:644050CE push eax ; cchText<br />
.text:644050CF lea ecx, [esi+2854h]<br />
; demangled name: const char* ATL::CSimpleStringT::opera<strong>to</strong>r PCXSTR<br />
.text:644050D5 call ds:mfc90_910<br />
.text:644050DB push eax ; lpchText<br />
.text:644050DC push [ebp+var_10] ; hdc<br />
.text:644050DF call ds:DrawTextA<br />
.text:644050E5 push 4 ; nIndex<br />
.text:644050E7 call ds:GetSystemMetrics<br />
.text:644050ED mov ecx, [ebp+rc.bot<strong>to</strong>m]<br />
.text:644050F0 sub ecx, [ebp+rc.<strong>to</strong>p]<br />
.text:644050F3 cmp [ebp+h], 0<br />
.text:644050F7 lea eax, [eax+ecx+28h]<br />
.text:644050FB mov [ebp+cy], eax<br />
.text:644050FE jz short loc_64405108<br />
.text:64405100 push [ebp+h] ; h<br />
.text:64405103 push [ebp+var_10] ; hdc<br />
.text:64405106 call edi ; SelectObject<br />
.text:64405108<br />
.text:64405108 loc_64405108:<br />
.text:64405108 push [ebp+var_10] ; hDC<br />
.text:6440510B push 0 ; hWnd<br />
.text:6440510D call ds:ReleaseDC<br />
.text:64405113<br />
190
.text:64405113 loc_64405113:<br />
.text:64405113 mov eax, [ebp+var_38]<br />
.text:64405116 push 80h ; uFlags<br />
.text:6440511B push [ebp+cy] ; cy<br />
.text:6440511E inc eax<br />
.text:6440511F push ebx ; cx<br />
.text:64405120 push eax ; Y<br />
.text:64405121 mov eax, [ebp+var_34]<br />
.text:64405124 add eax, 0FFFFFED4h<br />
.text:64405129 cdq<br />
.text:6440512A sub eax, edx<br />
.text:6440512C sar eax, 1<br />
.text:6440512E push eax ; X<br />
.text:6440512F push 0 ; hWndInsertAfter<br />
.text:64405131 push dword ptr [esi+285Ch] ; hWnd<br />
.text:64405137 call ds:SetWindowPos<br />
.text:6440513D xor ebx, ebx<br />
.text:6440513F inc ebx<br />
.text:64405140 jmp short loc_6440514D<br />
.text:64405142 ;<br />
---------------------------------------------------------------------------<br />
.text:64405142<br />
.text:64405142 loc_64405142:<br />
.text:64405142 push offset byte_64443AF8<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r=(char const *)<br />
.text:64405147 call ds:mfc90_820<br />
.text:6440514D<br />
.text:6440514D loc_6440514D:<br />
.text:6440514D cmp dword_6450B970, ebx<br />
.text:64405153 jl short loc_64405188<br />
.text:64405155 call sub_6441C910<br />
.text:6440515A mov dword_644F858C, ebx<br />
.text:64405160 push dword ptr [esi+2854h]<br />
.text:64405166 push offset aCdwsguiPrepare ; "\nCDwsGui::<br />
PrepareInfoWindow: sapgui env"...<br />
.text:6440516B push dword ptr [esi+2848h]<br />
.text:64405171 call dbg<br />
.text:64405176 add esp, 0Ch<br />
.text:64405179 mov dword_644F858C, 2<br />
.text:64405183 call sub_6441C920<br />
.text:64405188<br />
.text:64405188 loc_64405188:<br />
.text:64405188 or [ebp+var_4], 0FFFFFFFFh<br />
.text:6440518C lea ecx, [ebp+var_14]<br />
; demangled name: ATL::CStringT::~CStringT()<br />
.text:6440518F call ds:mfc90_601<br />
.text:64405195 call __EH_epilog3<br />
.text:6440519A retn<br />
.text:6440519A CDwsGui__PrepareInfoWindow endp<br />
ECX at function start gets pointer <strong>to</strong> object (because it is thiscall 2.5.4-type of function). In our case,<br />
that object obviously has class type CDwsGui. Depends of option turned on in the object, specific message<br />
191
part will be concatenated <strong>to</strong> resulting message.<br />
If value at this+0x3D address is not zero, compression is off:<br />
.text:64405007 loc_64405007:<br />
.text:64405007 cmp byte ptr [esi+3Dh], 0<br />
.text:6440500B jz short bypass<br />
.text:6440500D push offset aDataCompressio ; "data compression switched<br />
off\n"<br />
.text:64405012 mov ecx, edi<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r+=(char const *)<br />
.text:64405014 call ds:mfc90_945<br />
.text:6440501A mov [ebp+var_10], ebx<br />
.text:6440501D<br />
.text:6440501D bypass:<br />
It is interesting, that finally, var_10 variable state defines whether the message is <strong>to</strong> be shown at all:<br />
.text:6440503C cmp [ebp+var_10], ebx<br />
.text:6440503F jnz exit ; bypass drawing<br />
; add strings "For maximum data security delete" / "the setting(s) as soon as possible !":<br />
.text:64405045 push offset aForMaximumData ; "\nFor maximum data<br />
security delete\nthe s"...<br />
.text:6440504A call ds:mfc90_945 ; ATL::CStringT::opera<strong>to</strong>r+=(char const<br />
*)<br />
.text:64405050 xor edi, edi<br />
.text:64405052 push edi ; fWinIni<br />
.text:64405053 lea eax, [ebp+pvParam]<br />
.text:64405056 push eax ; pvParam<br />
.text:64405057 push edi ; uiParam<br />
.text:64405058 push 30h ; uiAction<br />
.text:6440505A call ds:SystemParametersInfoA<br />
.text:64405060 mov eax, [ebp+var_34]<br />
.text:64405063 cmp eax, 1600<br />
.text:64405068 jle short loc_64405072<br />
.text:6440506A cdq<br />
.text:6440506B sub eax, edx<br />
.text:6440506D sar eax, 1<br />
.text:6440506F mov [ebp+var_34], eax<br />
.text:64405072<br />
.text:64405072 loc_64405072:<br />
start drawing:<br />
.text:64405072 push edi ; hWnd<br />
.text:64405073 mov [ebp+cy], 0A0h<br />
.text:6440507A call ds:GetDC<br />
Let’s check our theory on practice.<br />
JNZ at this line ...<br />
.text:6440503F jnz exit ; bypass drawing<br />
... replace it with just JMP, and get SAPGUI working without that nagging pop-up window appearing!<br />
192
Now let’s dig deeper and find connection between 0x15 offset in load_command_line() (I gave that name<br />
<strong>to</strong> that function) and this+0x3D variable in CDwsGui::PrepareInfoWindow. Are we sure that the value is<br />
the same?<br />
I’m starting <strong>to</strong> search <strong>for</strong> all occurences of 0x15 value in code. For some small programs like SAPGUI, it<br />
sometimes works. Here is the first occurence I got:<br />
.text:64404C19 sub_64404C19 proc near<br />
.text:64404C19<br />
.text:64404C19 arg_0 = dword ptr 4<br />
.text:64404C19<br />
.text:64404C19 push ebx<br />
.text:64404C1A push ebp<br />
.text:64404C1B push esi<br />
.text:64404C1C push edi<br />
.text:64404C1D mov edi, [esp+10h+arg_0]<br />
.text:64404C21 mov eax, [edi]<br />
.text:64404C23 mov esi, ecx ; ESI/ECX are pointers <strong>to</strong> some unknown<br />
object.<br />
.text:64404C25 mov [esi], eax<br />
.text:64404C27 mov eax, [edi+4]<br />
.text:64404C2A mov [esi+4], eax<br />
.text:64404C2D mov eax, [edi+8]<br />
.text:64404C30 mov [esi+8], eax<br />
.text:64404C33 lea eax, [edi+0Ch]<br />
.text:64404C36 push eax<br />
.text:64404C37 lea ecx, [esi+0Ch]<br />
; demangled name: ATL::CStringT::opera<strong>to</strong>r=(class ATL::CStringT ... &)<br />
.text:64404C3A call ds:mfc90_817<br />
.text:64404C40 mov eax, [edi+10h]<br />
.text:64404C43 mov [esi+10h], eax<br />
.text:64404C46 mov al, [edi+14h]<br />
.text:64404C49 mov [esi+14h], al<br />
.text:64404C4C mov al, [edi+15h] ; copy byte from 0x15 offset<br />
.text:64404C4F mov [esi+15h], al ; <strong>to</strong> 0x15 offset in CDwsGui object<br />
That function was called from the function named CDwsGui::CopyOptions! And thanks again <strong>for</strong> debugging<br />
in<strong>for</strong>mation.<br />
But the real answer in the function CDwsGui::Init():<br />
.text:6440B0BF loc_6440B0BF:<br />
.text:6440B0BF mov eax, [ebp+arg_0]<br />
.text:6440B0C2 push [ebp+arg_4]<br />
.text:6440B0C5 mov [esi+2844h], eax<br />
.text:6440B0CB lea eax, [esi+28h] ; ESI is pointer <strong>to</strong> CDwsGui object<br />
.text:6440B0CE push eax<br />
.text:6440B0CF call CDwsGui__CopyOptions<br />
Finally, we understand: array filled in load_command_line() is actually placed in CDwsGui class but<br />
on this+0x28 address. 0x15 + 0x28 is exactly 0x3D. OK, we found the place where the value is copied <strong>to</strong>.<br />
Let’s also find other places where 0x3D offset is used. Here is one of them in CDwsGui::SapguiRun<br />
function (again, thanks <strong>to</strong> debugging calls):<br />
.text:64409D58 cmp [esi+3Dh], bl ; ESI is pointer <strong>to</strong> CDwsGui object<br />
.text:64409D5B lea ecx, [esi+2B8h]<br />
.text:64409D61 setz al<br />
193
.text:64409D64 push eax ; arg_10 of CConnectionContext::<br />
CreateNetwork<br />
.text:64409D65 push dword ptr [esi+64h]<br />
; demangled name: const char* ATL::CSimpleStringT::opera<strong>to</strong>r PCXSTR<br />
.text:64409D68 call ds:mfc90_910<br />
.text:64409D68 ; no arguments<br />
.text:64409D6E push eax<br />
.text:64409D6F lea ecx, [esi+2BCh]<br />
; demangled name: const char* ATL::CSimpleStringT::opera<strong>to</strong>r PCXSTR<br />
.text:64409D75 call ds:mfc90_910<br />
.text:64409D75 ; no arguments<br />
.text:64409D7B push eax<br />
.text:64409D7C push esi<br />
.text:64409D7D lea ecx, [esi+8]<br />
.text:64409D80 call CConnectionContext__CreateNetwork<br />
Let’s check our findings. Replace setz al here <strong>to</strong> xor eax, eax / nop, clear TDW_NOCOMPRESS<br />
environment variable and run SAPGUI. Wow! There is no nagging window (just as expected, because variable<br />
is not set), but in Wireshark we can see that the network packets are not compressed anymore! Obviously,<br />
this is the place where compression flag is <strong>to</strong> be set in CConnectionContext object.<br />
So, compression flag is passed in the 5th argument of function CConnectionContext::CreateNetwork.<br />
Inside that function, another one is called:<br />
...<br />
.text:64403476 push [ebp+compression]<br />
.text:64403479 push [ebp+arg_C]<br />
.text:6440347C push [ebp+arg_8]<br />
.text:6440347F push [ebp+arg_4]<br />
.text:64403482 push [ebp+arg_0]<br />
.text:64403485 call CNetwork__CNetwork<br />
Compression flag is passing here in the 5th argument <strong>to</strong> CNetwork::CNetwork construc<strong>to</strong>r.<br />
And here is how CNetwork construc<strong>to</strong>r sets some flag in CNetwork object according <strong>to</strong> the 5th argument<br />
and some another variable which probably could affect network packets compression <strong>to</strong>o.<br />
.text:64411DF1 cmp [ebp+compression], esi<br />
.text:64411DF7 jz short set_EAX_<strong>to</strong>_0<br />
.text:64411DF9 mov al, [ebx+78h] ; another value may affect<br />
compression?<br />
.text:64411DFC cmp al, ’3’<br />
.text:64411DFE jz short set_EAX_<strong>to</strong>_1<br />
.text:64411E00 cmp al, ’4’<br />
.text:64411E02 jnz short set_EAX_<strong>to</strong>_0<br />
.text:64411E04<br />
.text:64411E04 set_EAX_<strong>to</strong>_1:<br />
.text:64411E04 xor eax, eax<br />
.text:64411E06 inc eax ; EAX -> 1<br />
.text:64411E07 jmp short loc_64411E0B<br />
.text:64411E09 ;<br />
---------------------------------------------------------------------------<br />
.text:64411E09<br />
.text:64411E09 set_EAX_<strong>to</strong>_0:<br />
.text:64411E09<br />
.text:64411E09 xor eax, eax ; EAX -> 0<br />
194
.text:64411E0B<br />
.text:64411E0B loc_64411E0B:<br />
.text:64411E0B mov [ebx+3A4h], eax ; EBX is pointer <strong>to</strong> CNetwork object<br />
At this point we know that compression flag is s<strong>to</strong>red in CNetwork class at this+0x3A4 address.<br />
Now let’s dig across SAPguilib.dll <strong>for</strong> 0x3A4 value. And here is the second occurence in CDwsGui::OnClientMessage<br />
(endless thanks <strong>for</strong> debugging in<strong>for</strong>mation):<br />
.text:64406F76 loc_64406F76:<br />
.text:64406F76 mov ecx, [ebp+7728h+var_7794]<br />
.text:64406F79 cmp dword ptr [ecx+3A4h], 1<br />
.text:64406F80 jnz compression_flag_is_zero<br />
.text:64406F86 mov byte ptr [ebx+7], 1<br />
.text:64406F8A mov eax, [esi+18h]<br />
.text:64406F8D mov ecx, eax<br />
.text:64406F8F test eax, eax<br />
.text:64406F91 ja short loc_64406FFF<br />
.text:64406F93 mov ecx, [esi+14h]<br />
.text:64406F96 mov eax, [esi+20h]<br />
.text:64406F99<br />
.text:64406F99 loc_64406F99:<br />
.text:64406F99 push dword ptr [edi+2868h] ; int<br />
.text:64406F9F lea edx, [ebp+7728h+var_77A4]<br />
.text:64406FA2 push edx ; int<br />
.text:64406FA3 push 30000 ; int<br />
.text:64406FA8 lea edx, [ebp+7728h+Dst]<br />
.text:64406FAB push edx ; Dst<br />
.text:64406FAC push ecx ; int<br />
.text:64406FAD push eax ; Src<br />
.text:64406FAE push dword ptr [edi+28C0h] ; int<br />
.text:64406FB4 call sub_644055C5 ; actual compression routine<br />
.text:64406FB9 add esp, 1Ch<br />
.text:64406FBC cmp eax, 0FFFFFFF6h<br />
.text:64406FBF jz short loc_64407004<br />
.text:64406FC1 cmp eax, 1<br />
.text:64406FC4 jz loc_6440708C<br />
.text:64406FCA cmp eax, 2<br />
.text:64406FCD jz short loc_64407004<br />
.text:64406FCF push eax<br />
.text:64406FD0 push offset aCompressionErr ; "compression error [rc =<br />
\%d]- program wi"...<br />
.text:64406FD5 push offset aGui_err_compre ; "GUI_ERR_COMPRESS"<br />
.text:64406FDA push dword ptr [edi+28D0h]<br />
.text:64406FE0 call SapPcTxtRead<br />
Let’s take a look in<strong>to</strong> sub_644055C5. In it we can only see call <strong>to</strong> memcpy() and some other function<br />
named (by IDA 5) sub_64417440.<br />
And, let’s take a look inside sub_64417440. What we see is:<br />
.text:6441747C push offset aErrorCsrcompre ; "\nERROR: CsRCompress:<br />
invalid handle"<br />
.text:64417481 call eax ; dword_644F94C8<br />
.text:64417483 add esp, 4<br />
Voilà! We’ve found the function which actually compresses data. As I revealed in past, this function is<br />
used in SAP and also open-source MaxDB project. So it is available in sources.<br />
195
Doing last check here:<br />
.text:64406F79 cmp dword ptr [ecx+3A4h], 1<br />
.text:64406F80 jnz compression_flag_is_zero<br />
Replace JNZ here <strong>for</strong> unconditional JMP. Remove environment variable TDW_NOCOMPRESS. Voilà! In<br />
Wireshark we see that client messages are not compressed. Server responses, however, are compressed.<br />
So we found exact connection between environment variable and the point where data compression routine<br />
may be called or may be bypassed.<br />
196
Chapter 8<br />
Other things<br />
8.1 Compiler’s anomalies<br />
Intel C++ 10.1, which was used <strong>for</strong> Oracle RDBMS 11.2 Linux86 compilation, may emit two JZ in row, and<br />
there are no references <strong>to</strong> the second JZ. Second JZ is thus senseless.<br />
For example, kdli.o from libserver11.a:<br />
.text:08114CF1 loc_8114CF1: ; CODE XREF:<br />
__PGOSF539_kdlimemSer+89A<br />
.text:08114CF1 ;<br />
__PGOSF539_kdlimemSer+3994<br />
.text:08114CF1 8B 45 08 mov eax, [ebp+arg_0]<br />
.text:08114CF4 0F B6 50 14 movzx edx, byte ptr [eax+14h]<br />
.text:08114CF8 F6 C2 01 test dl, 1<br />
.text:08114CFB 0F 85 17 08 00 00 jnz loc_8115518<br />
.text:08114D01 85 C9 test ecx, ecx<br />
.text:08114D03 0F 84 8A 00 00 00 jz loc_8114D93<br />
.text:08114D09 0F 84 09 08 00 00 jz loc_8115518<br />
.text:08114D0F 8B 53 08 mov edx, [ebx+8]<br />
.text:08114D12 89 55 FC mov [ebp+var_4], edx<br />
.text:08114D15 31 C0 xor eax, eax<br />
.text:08114D17 89 45 F4 mov [ebp+var_C], eax<br />
.text:08114D1A 50 push eax<br />
.text:08114D1B 52 push edx<br />
.text:08114D1C E8 03 54 00 00 call len2nbytes<br />
.text:08114D21 83 C4 08 add esp, 8<br />
From the same code:<br />
.text:0811A2A5 loc_811A2A5: ; CODE XREF:<br />
kdliSerLengths+11C<br />
.text:0811A2A5 ; kdliSerLengths<br />
+1C1<br />
.text:0811A2A5 8B 7D 08 mov edi, [ebp+arg_0]<br />
.text:0811A2A8 8B 7F 10 mov edi, [edi+10h]<br />
.text:0811A2AB 0F B6 57 14 movzx edx, byte ptr [edi+14h]<br />
.text:0811A2AF F6 C2 01 test dl, 1<br />
.text:0811A2B2 75 3E jnz short loc_811A2F2<br />
.text:0811A2B4 83 E0 01 and eax, 1<br />
.text:0811A2B7 74 1F jz short loc_811A2D8<br />
.text:0811A2B9 74 37 jz short loc_811A2F2<br />
.text:0811A2BB 6A 00 push 0<br />
.text:0811A2BD FF 71 08 push dword ptr [ecx+8]<br />
197
.text:0811A2C0 E8 5F FE FF FF call len2nbytes<br />
It’s probably code genera<strong>to</strong>r bug wasn’t found by tests, because, resulting code is working correctly<br />
anyway.<br />
198
Chapter 9<br />
Tasks solutions<br />
9.1 Easy level<br />
9.1.1 Task 1.1<br />
Solution: <strong>to</strong>upper().<br />
C source code:<br />
char <strong>to</strong>upper ( char c )<br />
{<br />
if( c >= ’a’ && c
}<br />
9.1.3 Task 1.3<br />
Solution: srand() / rand().<br />
C source code:<br />
static unsigned int v;<br />
void srand (unsigned int s)<br />
{<br />
v = s;<br />
}<br />
int rand ()<br />
{<br />
return( ((v = v * 214013L<br />
+ 2531011L) >> 16) & 0x7fff );<br />
}<br />
9.1.4 Task 1.4<br />
Solution: strstr().<br />
C source code:<br />
char * strstr (<br />
const char * str1,<br />
const char * str2<br />
)<br />
{<br />
char *cp = (char *) str1;<br />
char *s1, *s2;<br />
}<br />
if ( !*str2 )<br />
return((char *)str1);<br />
while (*cp)<br />
{<br />
s1 = cp;<br />
s2 = (char *) str2;<br />
}<br />
while ( *s1 && *s2 && !(*s1-*s2) )<br />
s1++, s2++;<br />
if (!*s2)<br />
return(cp);<br />
cp++;<br />
return(NULL);<br />
200
9.1.5 Task 1.5<br />
Hint #1: Keep in mind that __v — global variable.<br />
Hint #2: That function is called in startup code, be<strong>for</strong>e main() execution.<br />
Solution: early Pentium CPU FDIV bug checking 1 .<br />
C source code:<br />
unsigned _v; // _v<br />
enum e {<br />
PROB_P5_DIV = 0x0001<br />
};<br />
void f( void ) // __verify_pentium_fdiv_bug<br />
{<br />
/*<br />
Verify we have got the Pentium FDIV problem.<br />
The volatiles are <strong>to</strong> scare the optimizer away.<br />
*/<br />
volatile double v1 = 4195835;<br />
volatile double v2 = 3145727;<br />
}<br />
if( (v1 - (v1/v2)*v2) > 1.0e-8 ) {<br />
_v |= PROB_P5_DIV;<br />
}<br />
9.1.6 Task 1.6<br />
Hint: it might be helpful <strong>to</strong> google a constant used here.<br />
Solution: TEA encryption algorithm 2 .<br />
C source code (taken from http://en.wikipedia.org/wiki/Tiny_Encryption_Algorithm):<br />
void f (unsigned int* v, unsigned int* k) {<br />
unsigned int v0=v[0], v1=v[1], sum=0, i; /* set up */<br />
unsigned int delta=0x9e3779b9; /* a key schedule constant */<br />
unsigned int k0=k[0], k1=k[1], k2=k[2], k3=k[3]; /* cache key */<br />
<strong>for</strong> (i=0; i < 32; i++) { /* basic cycle start */<br />
sum += delta;<br />
v0 += ((v15) + k1);<br />
v1 += ((v05) + k3);<br />
} /* end cycle */<br />
v[0]=v0; v[1]=v1;<br />
}<br />
9.1.7 Task 1.7<br />
Hint: the table contain pre-calculated values. It’s possible <strong>to</strong> implement the function without it, but it will<br />
work slower, though.<br />
Solution: this function <strong>reverse</strong> all bits in input 32-bit integer. It’s lib/bitrev.c from Linux kernel.<br />
C source code:<br />
1 http://en.wikipedia.org/wiki/Pentium_FDIV_bug<br />
2 Tiny Encryption Algorithm<br />
201
const unsigned char byte_rev_table[256] = {<br />
0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0,<br />
0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0,<br />
0x08, 0x88, 0x48, 0xc8, 0x28, 0xa8, 0x68, 0xe8,<br />
0x18, 0x98, 0x58, 0xd8, 0x38, 0xb8, 0x78, 0xf8,<br />
0x04, 0x84, 0x44, 0xc4, 0x24, 0xa4, 0x64, 0xe4,<br />
0x14, 0x94, 0x54, 0xd4, 0x34, 0xb4, 0x74, 0xf4,<br />
0x0c, 0x8c, 0x4c, 0xcc, 0x2c, 0xac, 0x6c, 0xec,<br />
0x1c, 0x9c, 0x5c, 0xdc, 0x3c, 0xbc, 0x7c, 0xfc,<br />
0x02, 0x82, 0x42, 0xc2, 0x22, 0xa2, 0x62, 0xe2,<br />
0x12, 0x92, 0x52, 0xd2, 0x32, 0xb2, 0x72, 0xf2,<br />
0x0a, 0x8a, 0x4a, 0xca, 0x2a, 0xaa, 0x6a, 0xea,<br />
0x1a, 0x9a, 0x5a, 0xda, 0x3a, 0xba, 0x7a, 0xfa,<br />
0x06, 0x86, 0x46, 0xc6, 0x26, 0xa6, 0x66, 0xe6,<br />
0x16, 0x96, 0x56, 0xd6, 0x36, 0xb6, 0x76, 0xf6,<br />
0x0e, 0x8e, 0x4e, 0xce, 0x2e, 0xae, 0x6e, 0xee,<br />
0x1e, 0x9e, 0x5e, 0xde, 0x3e, 0xbe, 0x7e, 0xfe,<br />
0x01, 0x81, 0x41, 0xc1, 0x21, 0xa1, 0x61, 0xe1,<br />
0x11, 0x91, 0x51, 0xd1, 0x31, 0xb1, 0x71, 0xf1,<br />
0x09, 0x89, 0x49, 0xc9, 0x29, 0xa9, 0x69, 0xe9,<br />
0x19, 0x99, 0x59, 0xd9, 0x39, 0xb9, 0x79, 0xf9,<br />
0x05, 0x85, 0x45, 0xc5, 0x25, 0xa5, 0x65, 0xe5,<br />
0x15, 0x95, 0x55, 0xd5, 0x35, 0xb5, 0x75, 0xf5,<br />
0x0d, 0x8d, 0x4d, 0xcd, 0x2d, 0xad, 0x6d, 0xed,<br />
0x1d, 0x9d, 0x5d, 0xdd, 0x3d, 0xbd, 0x7d, 0xfd,<br />
0x03, 0x83, 0x43, 0xc3, 0x23, 0xa3, 0x63, 0xe3,<br />
0x13, 0x93, 0x53, 0xd3, 0x33, 0xb3, 0x73, 0xf3,<br />
0x0b, 0x8b, 0x4b, 0xcb, 0x2b, 0xab, 0x6b, 0xeb,<br />
0x1b, 0x9b, 0x5b, 0xdb, 0x3b, 0xbb, 0x7b, 0xfb,<br />
0x07, 0x87, 0x47, 0xc7, 0x27, 0xa7, 0x67, 0xe7,<br />
0x17, 0x97, 0x57, 0xd7, 0x37, 0xb7, 0x77, 0xf7,<br />
0x0f, 0x8f, 0x4f, 0xcf, 0x2f, 0xaf, 0x6f, 0xef,<br />
0x1f, 0x9f, 0x5f, 0xdf, 0x3f, 0xbf, 0x7f, 0xff,<br />
};<br />
unsigned char bitrev8(unsigned char byte)<br />
{<br />
return byte_rev_table[byte];<br />
}<br />
unsigned short bitrev16(unsigned short x)<br />
{<br />
return (bitrev8(x & 0xff) > 8);<br />
}<br />
/**<br />
* bitrev32 - <strong>reverse</strong> the order of bits in a unsigned int value<br />
* @x: value <strong>to</strong> be bit-<strong>reverse</strong>d<br />
*/<br />
unsigned int bitrev32(unsigned int x)<br />
{<br />
return (bitrev16(x & 0xffff) > 16);<br />
202
}<br />
9.1.8 Task 1.8<br />
Solution: two 100*200 matrices of double type addition.<br />
C/C++ source code:<br />
#define M 100<br />
#define N 200<br />
void s(double *a, double *b, double *c)<br />
{<br />
<strong>for</strong>(int i=0;i
version 2.1 of the License, or (at your option) any later version.<br />
The GNU C Library is distributed in the hope that it will be useful,<br />
but WITHOUT ANY WARRANTY; without even the implied warranty of<br />
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU<br />
Lesser General Public License <strong>for</strong> more details.<br />
You should have received a copy of the GNU Lesser General Public<br />
License along with the GNU C Library; if not, write <strong>to</strong> the Free<br />
Software Foundation, Inc., 59 Temple Place, Suite 330, Bos<strong>to</strong>n, MA<br />
02111-1307 USA. */<br />
/* If you consider tuning this algorithm, you should consult first:<br />
Engineering a sort function; Jon Bentley and M. Douglas McIlroy;<br />
Software - Practice and Experience; Vol. 23 (11), 1249-1265, 1993. */<br />
#include <br />
#include <br />
#include <br />
#include <br />
typedef int (*__compar_d_fn_t) (__const void *, __const void *, void *);<br />
/* Byte-wise swap two items of size SIZE. */<br />
#define SWAP(a, b, size) \<br />
do \<br />
{ \<br />
register size_t __size = (size); \<br />
register char *__a = (a), *__b = (b); \<br />
do \<br />
{ \<br />
char __tmp = *__a; \<br />
*__a++ = *__b; \<br />
*__b++ = __tmp; \<br />
} while (--__size > 0); \<br />
} while (0)<br />
/* Discontinue quicksort algorithm when partition gets below this size.<br />
This particular magic number was chosen <strong>to</strong> work best on a Sun 4/260. */<br />
#define MAX_THRESH 4<br />
/* Stack node declarations used <strong>to</strong> s<strong>to</strong>re unfulfilled partition obligations. */<br />
typedef struct<br />
{<br />
char *lo;<br />
char *hi;<br />
} stack_node;<br />
/* The next 4 #defines implement a very fast in-line stack abstraction. */<br />
/* The stack needs log (<strong>to</strong>tal_elements) entries (we could even subtract<br />
log(MAX_THRESH)). Since <strong>to</strong>tal_elements has type size_t, we get as<br />
upper bound <strong>for</strong> log (<strong>to</strong>tal_elements):<br />
bits per byte (CHAR_BIT) * sizeof(size_t). */<br />
#define STACK_SIZE (CHAR_BIT * sizeof(size_t))<br />
204
#define PUSH(low, high) ((void) ((<strong>to</strong>p->lo = (low)), (<strong>to</strong>p->hi = (high)), ++<strong>to</strong>p))<br />
#define POP(low, high) ((void) (--<strong>to</strong>p, (low = <strong>to</strong>p->lo), (high = <strong>to</strong>p->hi)))<br />
#define STACK_NOT_EMPTY (stack < <strong>to</strong>p)<br />
/* Order size using quicksort. This implementation incorporates<br />
four optimizations discussed in Sedgewick:<br />
1. Non-recursive, using an explicit stack of pointer that s<strong>to</strong>re the<br />
next array partition <strong>to</strong> sort. To save time, this maximum amount<br />
of space required <strong>to</strong> s<strong>to</strong>re an array of SIZE_MAX is allocated on the<br />
stack. Assuming a 32-bit (64 bit) integer <strong>for</strong> size_t, this needs<br />
only 32 * sizeof(stack_node) == 256 bytes (<strong>for</strong> 64 bit: 1024 bytes).<br />
Pretty cheap, actually.<br />
2. Chose the pivot element using a median-of-three decision tree.<br />
This reduces the probability of selecting a bad pivot value and<br />
eliminates certain extraneous comparisons.<br />
3. Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions, leaving<br />
insertion sort <strong>to</strong> order the MAX_THRESH items within each partition.<br />
This is a big win, since insertion sort is faster <strong>for</strong> small, mostly<br />
sorted array segments.<br />
4. The larger of the two sub-partitions is always pushed on<strong>to</strong> the<br />
stack first, with the algorithm then concentrating on the<br />
smaller partition. This *guarantees* no more than log (<strong>to</strong>tal_elems)<br />
stack size is needed (actually O(1) in this case)! */<br />
void<br />
_quicksort (void *const pbase, size_t <strong>to</strong>tal_elems, size_t size,<br />
__compar_d_fn_t cmp, void *arg)<br />
{<br />
register char *base_ptr = (char *) pbase;<br />
const size_t max_thresh = MAX_THRESH * size;<br />
if (<strong>to</strong>tal_elems == 0)<br />
/* Avoid lossage with unsigned arithmetic below. */<br />
return;<br />
if (<strong>to</strong>tal_elems > MAX_THRESH)<br />
{<br />
char *lo = base_ptr;<br />
char *hi = &lo[size * (<strong>to</strong>tal_elems - 1)];<br />
stack_node stack[STACK_SIZE];<br />
stack_node *<strong>to</strong>p = stack;<br />
PUSH (NULL, NULL);<br />
while (STACK_NOT_EMPTY)<br />
{<br />
char *left_ptr;<br />
char *right_ptr;<br />
205
jump_over:;<br />
/* Select median value from among LO, MID, and HI. Rearrange<br />
LO and HI so the three values are sorted. This lowers the<br />
probability of picking a pathological pivot value and<br />
skips a comparison <strong>for</strong> both the LEFT_PTR and RIGHT_PTR in<br />
the while loops. */<br />
char *mid = lo + size * ((hi - lo) / size >> 1);<br />
if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)<br />
SWAP (mid, lo, size);<br />
if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)<br />
SWAP (mid, hi, size);<br />
else<br />
go<strong>to</strong> jump_over;<br />
if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)<br />
SWAP (mid, lo, size);<br />
left_ptr = lo + size;<br />
right_ptr = hi - size;<br />
/* Here’s the famous ‘‘collapse the walls’’ section of quicksort.<br />
Gotta like those tight inner loops! They are the main reason<br />
that this algorithm runs much faster than others. */<br />
do<br />
{<br />
while ((*cmp) ((void *) left_ptr, (void *) mid, arg) < 0)<br />
left_ptr += size;<br />
while ((*cmp) ((void *) mid, (void *) right_ptr, arg) < 0)<br />
right_ptr -= size;<br />
if (left_ptr < right_ptr)<br />
{<br />
SWAP (left_ptr, right_ptr, size);<br />
if (mid == left_ptr)<br />
mid = right_ptr;<br />
else if (mid == right_ptr)<br />
mid = left_ptr;<br />
left_ptr += size;<br />
right_ptr -= size;<br />
}<br />
else if (left_ptr == right_ptr)<br />
{<br />
left_ptr += size;<br />
right_ptr -= size;<br />
break;<br />
}<br />
}<br />
while (left_ptr
}<br />
}<br />
ignore one or both. Otherwise, push the larger partition’s<br />
bounds on the stack and continue sorting the smaller one. */<br />
if ((size_t) (right_ptr - lo)
}<br />
}<br />
/* Insertion sort, running from left-hand-side up <strong>to</strong> right-hand-side. */<br />
run_ptr = base_ptr + size;<br />
while ((run_ptr += size) = run_ptr)<br />
{<br />
char c = *trav;<br />
char *hi, *lo;<br />
}<br />
<strong>for</strong> (hi = lo = trav; (lo -= size) >= tmp_ptr; hi = lo)<br />
*hi = *lo;<br />
*hi = c;<br />
208