Systems Programming: C programs in (address) space and (run-)time
02. C Programs in Space and Time
Systems Programming, Summer Term 2008
Alexander Holupirek
Database and Information Systems Group, Department of Computer & Information Science, University of Konstanz

Where is my data and why do I have to know?

- C is closely related to the machine. Before talking about pointers, storage allocation etc., some background knowledge about address space, (virtual) memory and its allocation during program execution comes in handy.
- Knowledge about the memory layout of a program is quite helpful when debugging.
- Knowledge about what is happening inside the machine on program execution is fundamental both to debugging programs and, in the first place, to writing clean code.

Outline

- Repetition Computer Architecture
- Storage Classes
- From Source Code To Executable Code
- Construction of an Executable
- Relocation Process

C, assembler, and machine code

C source code:

    int a, b;
    a = b * b;

Intel IA-32 assembler source code (machine instructions, i.e. processor instructions):

    mov  0x403030,%eax
    imul 0x403030,%eax
    mov  %eax,0x403020

Executable binary code (shown in hexadecimal, one byte per address, starting at the address given):

    Address  Content (1 byte each)
    4012ee   a1 30 30 40 00
    4012f3   0f af 05 30 30 40 00
    4012fa   a3 20 30 40 00

C, assembler, and machine code (cont.)

C source code:

    int a=4, b;

    int main(void)
    {
        if (a>5)
            b=1;
        else
            b=0;
    }

Executable binary code with its memory addresses (memory content = machine instruction) and the corresponding assembler source code:

    8048344: 83 3d 94 94 04 08 05           cmpl $0x5,0x8049494
    804834b: 7e 0c                          jle  8048359
    804834d: c7 05 8c 95 04 08 01 00 00 00  movl $0x1,0x804958c
    8048357: eb 0a                          jmp  8048363
    8048359: c7 05 8c 95 04 08 00 00 00 00  movl $0x0,0x804958c
    8048363: c9                             ...

All numeric values in binary and assembler code are hexadecimal. a is located at address 0x8049494, b at address 0x804958c.

Address Space

[Figure: the address space as a column of byte cells, running from address 0, the lowest possible address ("start of memory"), to max., the highest possible address ("end of memory"). Every address holds one byte of memory content (example contents 0x56 and 0xfc; example addresses of single bytes 0x50000000 and 0x50000001). A data block of 16 bytes starts at 0x10000000; its last byte is at 0x1000000f, and 0x10000010 is the address of the first byte after the block.]

Byte Ordering

Data of 4 bytes (d3 d2 d1 d0, where d3 is the MSB and d0 the LSB) is accessed in the program through address n.

    Address  Big-endian system  Little-endian system
    n        d3 (MSB)           d0 (LSB)
    n+1      d2                 d1
    n+2      d1                 d2
    n+3      d0 (LSB)           d3 (MSB)

MSB = most significant byte, LSB = least significant byte.

Alignment Rules

Goal: optimal performance

- Determine the address locations for variables and instructions
- Great impact on compiler, assembler, and linker tools

[Figure: a misaligned long word on a 4-byte-wide data bus. Long-word boundaries in the address space are the addresses divisible by 4 without remainder (0x34, 0x38, ...). A long word stored at 0x35 to 0x38 crosses the boundary at 0x38, so the bus needs two accesses: the first for the long word at 0x34 to 0x37 (address offsets +0 to +3), the second for the long word at 0x38 to 0x3b.]
Alignment Rules (cont.)

For derived types (arrays, functions, pointers, structures, unions; we will discuss them later), which are constructed from the basic types, alignment rules apply to each single component:

    struct artikel {
        char   name[5];   /* alignment(1) */
        int    anzahl;    /* alignment(4) */
        double preis;
    };

Alignment rules may be influenced through compiler directives (-malign-int aligns variables on 32-bit boundaries, producing code that runs somewhat faster on processors with 32-bit busses at the expense of memory).

Storage Classes

- Placement of data in memory depends on storage class
- An object, such as a variable, is a location in storage, and its interpretation depends on two main attributes: its storage class and its type
- The storage class determines the lifetime of the storage associated with the identified object
- The type determines the meaning of the values found in the identified object
- In C we have two storage classes: automatic and static
- Storage class specifiers (auto, extern, register, static), together with the context of an object's declaration, specify its storage class

Automatic Storage Class

Automatic Objects

- auto and register give the declared objects automatic storage class, and may be used only within functions
- They are local to a block (aka "compound statement", such as the body of a function) and are discarded on exit from the block
- Declarations within a block create automatic objects if no storage class specification is mentioned or auto is used
- Initialization of automatic objects is performed each time the block is entered at the top (if a jump into the block is executed, the initializations are not performed)
- Objects declared register are automatic, and are (if possible) stored in fast registers of the machine
- For register objects the address operator '&' is not allowed

Static Storage Class

Static Objects

- May be local to a block or external to all blocks
- In both cases, they retain their values across exit from and reentry to functions and blocks
- Within a block, static objects are declared with static
- Objects declared outside of all blocks (at the same level as function definitions) are always static
- On the outer level, the keyword static makes them local to a particular translation unit (internal linkage)
- They are global to an entire program by omitting an explicit storage class, or by using extern (external linkage)

Storage Class and Sections

Intermediate Summary

- A program being executed does not only use storage for its instructions, but additionally needs space for, e.g., variables
- Variables may be temporary, dynamically allocated, or static (i.e., permanent in terms of storage allocation), initialized or uninitialized, or declared as constant (const) and thus read-only
- Placement of data in memory depends on its storage class
- During the translation process the compiler uses sections to divide the address space into logical units
- Details vary with the operating system and compiler used

Typical Program Organisation

A typical program divides naturally into sections:

Code: machine instructions; should be unmodifiable; size is known after compilation and does not change (.text)

Data:
- static data
  - initialized (.data) / uninitialized (.bss)
  - constant address in memory
  - permanent life time
- dynamic data
  - stack or heap
  - storage space not known
  - volatile life time

Program Sections

[Figure: the program sections in the address space. .text resides in PROM or write-protected RAM, .data in RAM, .bss in RAM. PROM: programmable read-only memory (a memory chip that cannot be written during operation). RAM: random access memory.]

Virtual Memory and Segments

Virtual Memory

- Whenever a process is created, the kernel provides a chunk of physical memory which can be located anywhere
- Through the magic of virtual memory (VM), the process believes it has all the memory on the computer
- Typically the VM space is laid out in a similar manner:
  - Text Segment (.text)
  - Initialized Data Segment (.data)
  - Uninitialized Data Segment (.bss)
  - The Stack
  - The Heap

A Program in Memory

[Figure: the address space from address 0 upward. Static data: code and constants, and initialized data, are loaded from the executable file; uninitialized data is provided at process start and initialized with 0 (cleared). Dynamic data: the heap is provided at process start, serves dynamic memory allocation, and grows towards the stack; the stack is provided at process start and grows towards lower addresses (or towards higher addresses, depending on the processor).]

Different Memory Layouts

[Figure: two address-space layouts, each containing code and constants (at the program start address), initialized data, uninitialized data, heap, and stack. (A) Solution on the PC (IA-32). (B) Stack growing in the opposite direction.]

Memory Segments

Text Segment: The text segment contains the actual code (including constants) to be executed. It's usually sharable, so multiple instances of a program can share the text segment to lower memory requirements. This segment is usually marked read-only so a program can't modify its own instructions.

Initialized Data Segment: This segment contains global variables which are initialized by the programmer.

Uninitialized Data Segment: Also named .bss (block started by symbol), after an operator used by an old assembler.
This segment contains uninitialized global variables. All variables in this segment are initialized to 0 or NULL pointers before the program begins to execute.

Memory Segments (cont.)

The Stack: The stack is a collection of stack frames, which we will discuss later. When a new frame needs to be added (as a result of a newly called function), the stack grows downward.

The Heap: Dynamic memory, where storage can be (de-)allocated via C's malloc(3)/free(3). The C library also gets dynamic memory for its own workspace from the heap. As more memory is requested "on the fly", the heap grows upward.

Variable Placement and Life Time (Code)

    #include <stdlib.h>     /* malloc(3), free(3) */

    int a;                  /* Permanent life time */
    static int b;           /* ditto, but reduced scope */

    void func(void)
    {
        char c;             /* only for the life time of func() */
                            /* but 2x; visible only in func() */
        static int d;       /* I'm unique, exist once at a stable */
                            /* address, visible only in func() */
    }

    int main(void)
    {
        int e;                                  /* life time of main() */
        int *pi = (int *)malloc(sizeof(int));   /* newborn */
        func();
        func();
        free(pi);           /* RIP, pi points to an invalid address */
        return (0);
    }
Variable Placement and Life Time (Diagram)

[Figure: the address space from address 0 to max., divided into code (1st, 2nd, 3rd, 4th instruction, ...), data (a, b, d), heap (the int that pi points to), and stack (c, pi, e). The program counter PC and the stack pointer SP are marked at two points in time. t=0: program execution is started, i.e., the execution environment is already initialized. t=x: an arbitrary point in time during program execution.]

Variable Placement

Variables (outside a function): Globally declared variables go to the Uninitialized Data Segment if they are not initialized, and to the Initialized Data Segment otherwise. The distinction is necessary for the OS to decide whether storage has to be loaded with initialization data from the executable binary.

Variables (inside a function): Without a storage class specifier, auto is assumed implicitly and they go to The Stack. If declared static, see above.

Constants (const): Text Segment

Function Parameters: Are pushed on The Stack or stored in registers. If pointers are passed, the data is elsewhere.

From source code to executable code

Translation Steps (multi-phase compilation)

- Compilation: HLL source code to assembler source code
- Assembly: assembler source code to object code
- Linking: object code to executable code

Compilers and assemblers create object files containing the generated binary code and data for a source file. Linkers combine multiple object files into one; loaders take object files and load them into memory.

Goal: an executable binary file (a.out). From high-level language (HLL) source code to executable code, i.e., concrete processor instructions in combination with data.
Translation steps using gcc(1)

[Figure: the gcc tool chain with its input and output files.
 Preprocessor: C/C++ source code (*.c/*.cc/*.cpp) -> preprocessed C/C++ source code (*.i/*.ii)
 Compiler: preprocessed source code (*.i/*.ii) -> assembler source code (*.s)
 Assembler: assembler source code (*.s) -> object file, unlinked (*.o)
 Linker: object files and library files (*.o/*.a) -> a.out, the executable file (= object file, loadable)]

File suffixes and their meaning

For any given input file, the file name suffix determines what kind of compilation is done (see gcc(1) for more details and suffixes):

    suffix  compilation step
    .c      C source code which must be preprocessed
    .i      C source code which should not be preprocessed
    .h      Header file to be turned into a precompiled header
    .s      Assembler code
    .o      An object file to be fed straight into linking

Creation of an executable file

[Figure: (Filename).c -> compile (gcc) -> (Filename).s -> assemble (gas) -> (Filename).o, together with object/library files -> link (ld) -> a.out. Boxes denote operations and commands, arrows inputs and outputs.]

The C Preprocessor

The C preprocessor performs ...

- Inclusion of named files
- Macro Substitution
- Conditional Compilation

File Inclusion

A control line of the form

    # include <filename>

causes the replacement of that line by the entire contents of the file filename.

Note: The characters in the name filename must not include > or \n, and the effect is undefined if it contains any of ", ', \, or /*.

Location: The named file is searched for in a sequence of implementation-dependent places (often starting in /usr/include).

Macro Substitution

A control line of the form

    # define identifier token-sequence

causes the preprocessor to replace subsequent instances of the identifier with the given sequence of tokens.

Macro Substitution (cont.)

A control line of the form

    # define identifier(identifier-list) token-sequence

where there is no space between the first identifier and the '(', is a macro definition with parameters given by the identifier list.
Example

    #define EXIT_FAILURE 1
    #define EXIT_SUCCESS 0
    #define S_IRWXU 0000700    /* RWX mask for owner */
    #define S_IRUSR 0000400    /* R for owner */
    #define S_IWUSR 0000200    /* W for owner */
    #define S_IXUSR 0000100    /* X for owner */

Example

    #define S_ISDIR(m)  ((m & 0170000) == 0040000)    /* directory */
    #define S_ISCHR(m)  ((m & 0170000) == 0020000)    /* char sp. */
    #define S_ISBLK(m)  ((m & 0170000) == 0060000)    /* block sp. */
    #define S_ISREG(m)  ((m & 0170000) == 0100000)    /* regular */
    #define S_ISFIFO(m) ((m & 0170000) == 0010000)    /* fifo */

Macro Substitution (cont.)

A control line of the form

    # undef identifier

causes the identifier's preprocessor definition to be forgotten. It is not erroneous to apply #undef to an unknown identifier.

Example

    /*
     * Some header files may define an abs macro.
     * If defined, undef it to prevent a syntax error
     * and issue a warning.
     * #warning is a pragma (implementation-dependent action)
     */
    #ifdef abs
    #undef abs
    #warning abs macro collides with abs() prototype, undefining
    #endif

Conditional Inclusion

Parts of a program may be compiled conditionally.

Example

    #ifndef NULL
    #ifdef __GNUG__
    #define NULL __null
    #else
    #define NULL 0L
    #endif
    #endif

Predefined Names

Several identifiers are predefined, and expand to produce special information. They, and also the preprocessor expression operator defined, may not be undefined or redefined.

    __LINE__  A decimal constant containing the current source line number
    __FILE__  A string literal containing the name of the file being compiled
    __DATE__  A string literal containing the date of compilation, in the form "Mmm dd yyyy"
    __TIME__  A string literal containing the time of compilation, in the form "hh:mm:ss"
    __STDC__  The constant 1. It is intended that this identifier be defined to be 1 only in standard-conforming implementations

Compilation

[Figure: HLL source code (text) -> compilation (compiler) -> assembler source code (text). Side outputs: a translation listing with error messages (text) and possibly temporary files.]

Assembly

[Figure: assembler source code (text) -> assembly (assembler) -> binary code or object format (machine code and additional information). Side outputs: a translation listing with error messages and symbol table (text) and possibly temporary files.]

Linking

[Figure: object files and library object files found by library search (machine code and additional information) -> linking (linker) -> absolute code, or relocatable code with additional information. Side outputs: a link map (address space usage) and symbol list (text) and possibly temporary files.]

Program Section In Virtual Memory

[Figure: after compilation, each section (.text: code; .data: initialized data) begins at address 0; sections are "logical address spaces" of the compiler. After linking, all sections are placed "absolutely" in the address space from 0 to 0xffffffff, in the example .text at 0x08048244 and .data at 0x08049370.]

Linking an Executable Binary

[Figure: the input consists of unlinked object files: OBJ1 (.text1, .data1, .bss1), OBJ2 (.text2, .bss2), OBJ3 (.text3, .data3, .bss3). Linking produces OBJtotal, the executable file (linked, relocated), with the sections in the order .text1 .text2 .text3 .data1 .data3 .bss1 .bss2 .bss3. .text: code; .data: initialized variables; .bss: uninitialized variables.]

Relocation Records

- Once the sections have been placed one after another, relocation can start
- Executable code contains embedded addresses
  - Static data, function calls, jump targets
- On relocation those have to be changed inside the code
- Without a relocation table this is not possible
- A relocation record holds the relative address of a symbol (name of a variable, a function etc.)
    RELOCATION RECORDS FOR [.text]:
    OFFSET   TYPE      VALUE
    0000001a R_386_32  b
    00000023 R_386_32  a
    00000029 R_386_32  b

Each object file (compiled separately) starts at address 0. Linking them together involves

- centralization of sections
- relocation of addresses

Source File: compile.c

    int a = 1;          /* Global variable, initialized -> .data */
    int b;              /* Global variable, uninitialized -> .bss */

    int main(void)
    {
        static int c;   /* Local, static variable -> .bss */

        b = 5;
        c = b + a + 16;
        return c;
    }

- Compile a relocatable object file: cc -c compile.c (creates compile.o)
- Linking an executable binary (one-step compilation): cc compile.c -o compile

Analysis of Object Files (compile.o)

    $ file compile.o
    ELF 32-bit LSB relocatable, Intel 80386, version 1, not stripped

    $ objdump -x compile.o
    compile.o:     file format elf32-i386
    compile.o
    architecture: i386, flags 0x00000011:
    HAS_RELOC, HAS_SYMS
    start address 0x00000000

    Sections:
    Idx Name     Size      VMA       LMA       File off  Algn
      0 .text    0000005a  00000000  00000000  00000034  2**2
                 CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      1 .data    00000004  00000000  00000000  00000090  2**2
                 CONTENTS, ALLOC, LOAD, DATA
      2 .bss     00000004  00000000  00000000  00000094  2**2
                 ALLOC
      3 .rodata  00000005  00000000  00000000  00000094  2**0
                 CONTENTS, ALLOC, LOAD, READONLY, DATA

Object File: compile.o (cont.)

    SYMBOL TABLE:
    00000000 l    df *ABS*   00000000 compile.c
    00000000 l    d  .text   00000000
    00000000 l    d  .data   00000000
    00000000 l    d  .bss    00000000
    00000000 l     O .bss    00000004 c.0
    00000000 l    d  .rodata 00000000
    00000000 g     O .data   00000004 a
    00000000 g     F .text   0000005a main
    00000004       O *COM*   00000004 b

    RELOCATION RECORDS FOR [.text]:
    OFFSET   TYPE      VALUE
    0000001a R_386_32  b
    00000023 R_386_32  a
    00000029 R_386_32  b
    00000031 R_386_32  .bss
    00000036 R_386_32  .bss
    0000004c R_386_32  .rodata
    compile.o:     file format elf32-i386

    Disassembly of section .text:

    00000000 <main>:
       0: 55                    push   %ebp
       1: 89 e5                 mov    %esp,%ebp
       3: 83 ec 18              sub    $0x18,%esp
       6: 83 e4 f0              and    $0xfffffff0,%esp
       9: b8 00 00 00 00        mov    $0x0,%eax
       e: 29 c4                 sub    %eax,%esp
      10: a1 00 00 00 00        mov    0x0,%eax
      15: 89 45 e8              mov    %eax,0xffffffe8(%ebp)
      18: c7 05 00 00 00 00 05  movl   $0x5,0x0
      1f: 00 00 00
      22: a1 00 00 00 00        mov    0x0,%eax
      27: 03 05 00 00 00 00     add    0x0,%eax
      2d: 83 c0 10              add    $0x10,%eax
      30: a3 00 00 00 00        mov    %eax,0x0
      35: a1 00 00 00 00        mov    0x0,%eax
      3a: 8b 55 e8              mov    0xffffffe8(%ebp),%edx
      3d: 3b 15 00 00 00 00     cmp    0x0,%edx
      43: 74 13                 je     58 <main+0x58>
      45: 83 ec 08              sub    $0x8,%esp
      48: ff 75 e8              pushl  0xffffffe8(%ebp)
      4b: 68 00 00 00 00        push   $0x0
      50: e8 fc ff ff ff        call   51 <main+0x51>
      55: 83 c4 10              add    $0x10,%esp
      58: c9                    leave
      59: c3                    ret

The same disassembly, interleaved with the source code:

    compile.o:     file format elf32-i386

    Disassembly of section .text:

    00000000 <main>:
    int b;              /* Global variable, uninitialized -> .bss */

    int main(void)
    {
       0: 55                    push   %ebp
      ... 6 more lines ...
      15: 89 45 e8              mov    %eax,0xffffffe8(%ebp)
        static int c;   /* Local, static variable -> .bss */

        b = 5;
      18: c7 05 00 00 00 00 05  movl   $0x5,0x0
      1f: 00 00 00
        c = b + a + 16;
      22: a1 00 00 00 00        mov    0x0,%eax
      27: 03 05 00 00 00 00     add    0x0,%eax
      2d: 83 c0 10              add    $0x10,%eax
      30: a3 00 00 00 00        mov    %eax,0x0
        return c;
      35: a1 00 00 00 00        mov    0x0,%eax
    }
      ... 10 more lines ...

Executable Binary File: compile

    compile:     file format elf32-i386
    compile
    architecture: i386, flags 0x00000112:
    EXEC_P, HAS_SYMS, D_PAGED
    start address 0x1c000408

    Sections:
    Idx Name     Size      VMA       LMA       File off  Algn
    ...
      9 .text    00000214  1c000408  1c000408  00000408  2**2
                 CONTENTS, ALLOC, LOAD, READONLY, CODE
    ...
     12 .data    00000014  3c001008  3c001008  00001008  2**2
                 CONTENTS, ALLOC, LOAD, DATA
    ...
     20 .bss     00000184  3c003100  3c003100  00001100  2**5
                 ALLOC

    SYMBOL TABLE:
    3c003140 l     O .bss   00000004 c.0
    3c003280 g     O .bss   00000004 b
    1c0005c0 g     F .text  0000005a main
    3c001018 g     O .data  00000004 a

    1c0005c0 <main>:
    int b;              /* Global variable, uninitialized -> .bss */

    int main(void)
    {
    1c0005c0: 55                    push   %ebp
    1c0005c1: 89 e5                 mov    %esp,%ebp
    1c0005c3: 83 ec 18              sub    $0x18,%esp
    1c0005c6: 83 e4 f0              and    $0xfffffff0,%esp
    1c0005c9: b8 00 00 00 00        mov    $0x0,%eax
    1c0005ce: 29 c4                 sub    %eax,%esp
    1c0005d0: a1 00 31 00 3c        mov    0x3c003100,%eax
    1c0005d5: 89 45 e8              mov    %eax,0xffffffe8(%ebp)
        static int c;   /* Local, static variable -> .bss */

        b = 5;
    1c0005d8: c7 05 80 32 00 3c 05  movl   $0x5,0x3c003280
    1c0005df: 00 00 00
        c = b + a + 16;
    1c0005e2: a1 18 10 00 3c        mov    0x3c001018,%eax
    1c0005e7: 03 05 80 32 00 3c     add    0x3c003280,%eax
    1c0005ed: 83 c0 10              add    $0x10,%eax
    1c0005f0: a3 40 31 00 3c        mov    %eax,0x3c003140
        return c;
    1c0005f5: a1 40 31 00 3c        mov    0x3c003140,%eax
    }

Relocation Of An Assembler Instruction

During the linking process relocated addresses are injected in the code, for example for the assignment b = 5;

Before relocation (relocatable 'compile.o'):

          18: c7 05 00 00 00 00 05  movl   $0x5,0x0

After relocation (executable 'compile'):

    1c0005d8: c7 05 80 32 00 3c 05  movl   $0x5,0x3c003280

The proper address for b can be found in the symbol table.

    SYMBOL TABLE: (compile)
    3c003280 g     O .bss   00000004 b

The symbol table for compile yields 3c003280 for variable b.

Relocation Of An Assembler Instruction (cont.)

How to find the right places in the machine code to perform the substitutions?

- Linker has relocation record (relative address) of b

      RELOCATION RECORDS FOR [.text]: (compile.o)
      0000001a R_386_32  b

- Linker has absolute addresses from the symbol table

      SYMBOL TABLE: (compile)
      3c003280 g     O .bss   00000004 b
      1c0005c0 g     F .text  0000005a main

Relocation Of An Assembler Instruction (cont.)
Putting it all together:

    RELOCATION RECORDS FOR [.text]: (compile.o)
    0000001a R_386_32  b                     (relative offset)

    SYMBOL TABLE: (compile)
    3c003280 g     O .bss   00000004 b       (abs. address of b)
    1c0005c0 g     F .text  0000005a main    (abs. address of main)

The linker has the absolute address of main from the symbol table. Computing the address where the substitution must be performed:

    1c0005c0 + 0000001a = 1c0005da

          18: c7 05 00 00 00 00 05  movl   $0x5,0x0
    1c0005d8: c7 05 80 32 00 3c 05  movl   $0x5,0x3c003280

Address 1c0005da lies two bytes into the instruction at 1c0005d8. There the four placeholder bytes 00 00 00 00 are replaced by the address of b, 3c003280, stored in little-endian byte order as 80 32 00 3c.