mach_inject is an API written by Jonathan Rentzsch for PPC Macintosh computers that lets one process inject code into another running process and execute it there.
mach_override, on the other hand, lets one override the code of a function within a process.
More information can be found on Jonathan’s original mach_star page.
 
I have ported this code to Intel so it can now run on Core Duo Macs. Please read below if you are interested in understanding the differences between the PPC and IA-32 versions.
 
Latest version: mach_star-1.2-intel-0.3
 
 
Important restriction: In 10.4.4 (Intel build), Apple removed the ability to send low-level Mach commands from one task to another. A task trying to control another one must now run as root or belong to the procmod group. See Enabling cross-task control on intel if you need help adding yourself to procmod.
 
 
News
 
03/10/2006: mach_override now works on intel too. Available in mach_star-1.2-intel-0.3
03/08/2006: mach_inject now works fine when injecting Universal bundles
03/04/2006: first version of mach_inject for IA-32
A port of mach_inject and mach_override to intel
Differences between PPC and IA-32 versions
 
Since I’m not a low-level programmer, I thought I should explain the differences between the PPC and Intel versions so that people more knowledgeable than me can double-check them, find mistakes and propose corrections.
 
IA32 OS X ABI
This was the first, most obvious and easiest part. To execute injected code in the target process, mach_inject remotely starts a Mach thread using thread_create_running. The new_state argument of this call contains the initial values of the CPU registers when the thread starts up.
 
It’s thus up to mach_inject to set up those registers so that as soon as the thread starts up it executes the code we injected in the target process.
 
To achieve this, mach_inject on PPC creates a stack by hand and initializes the registers so as to simulate a call to one of the functions we injected (called INJECT_ENTRY).
 
While there are very few preconditions on the stack contents on PPC (function arguments are passed in registers), the stack has to be carefully filled in on IA32.

Right after a CALL instruction on IA32, the stack (i.e. the memory zone the ESP register points to) must look like this (if all function arguments are 32 bits long):
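Under the usual IA-32 C calling convention, ESP points at the return address right after a CALL, the first argument sits at ESP+4, the second at ESP+8, and so on. The following is only a minimal sketch of filling such a frame by hand, not the code from mach_inject: the function name build_call_frame and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill `frame` so that it looks like the stack right after a CALL to a
 * function taking `argc` 32-bit arguments. frame[0] is the word ESP must
 * point to when the new thread starts.
 * Returns the number of bytes written.
 * (Hypothetical helper, for illustration only.)
 */
size_t build_call_frame(uint32_t *frame, uint32_t return_address,
                        const uint32_t *args, size_t argc)
{
    /* A CALL pushes the return address last, so it ends up at ESP. */
    frame[0] = return_address;

    /* Arguments follow at ESP+4, ESP+8, ... in declaration order. */
    for (size_t i = 0; i < argc; i++) {
        frame[1 + i] = args[i];
    }
    return (1 + argc) * sizeof(uint32_t);
}
```

Since the injected entry point is never supposed to return, the return address can be any recognizable invalid value; the frame is then copied to the remote stack and ESP is set to its first word.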
What about mach_override ?
 
This section is outdated. mach_override now works on intel too.
 
mach_override lets one override a function by replacing the first instruction of its assembly code with a jump to code we control.
 
On PPC, Jonathan Rentzsch explains that this is possible because:
    - All instructions are 32 bits long.
    - A jump instruction exists that is exactly 32 bits long, including a 24-bit operand.
    - Writing 32 bits of memory can be done atomically.
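To make the second point concrete, here is a small sketch of how a PowerPC unconditional branch fits in a single 32-bit word: primary opcode 18, a 24-bit LI field (word-aligned, so it covers 26 bits of address), then the AA and LK bits. The helper name ppc_encode_ba is hypothetical, and the choice of the absolute form (AA=1) is an assumption made for the example, not a statement about what mach_override emits.

```c
#include <stdint.h>

/*
 * Encode "ba target" (branch absolute, no link) into one 32-bit word.
 * Returns 0 on success, -1 if the target cannot be encoded.
 * (Hypothetical helper, for illustration only.)
 */
int ppc_encode_ba(uint32_t target, uint32_t *insn)
{
    /* Instructions are word-aligned: the low two address bits must be 0. */
    if (target & 0x3u) {
        return -1;
    }
    /* LI||00 is sign-extended from 26 bits, so address bits 25..31
       must be all zeros or all ones. */
    uint32_t top = target >> 25;
    if (top != 0x00u && top != 0x7Fu) {
        return -1;
    }
    /* 0x48000000 is opcode 18; bit 1 (value 2) is AA; LK (bit 0) stays 0. */
    *insn = 0x48000002u | (target & 0x03FFFFFCu);
    return 0;
}
```

Because the whole instruction is one aligned 32-bit word, storing it over the first instruction of a function is a single atomic write.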
 
On IA-32:
    - Not all instructions are 32 bits long (instruction length varies).
    - A 32-bit-long jump instruction would only accept a 16-bit operand.
 
The first issue means that rewriting the beginning of a function may partially overwrite an instruction, thus leaving an invalid instruction as the re-entry point of the overridden function. I guess it could be circumvented by only allowing the override of functions that have a well-known prolog. Implementations starting with a prolog similar to:
 
    push   %ebp
    mov    %esp,%ebp
    sub    $0x??,%esp
 
are probably good candidates for overriding. This prolog is 6 bytes long and a long absolute jump instruction is 6 bytes too.
 
Unfortunately, I don’t think there is a way to write 6 bytes atomically, i.e. without taking the risk of temporarily writing invalid code. I don’t know how to solve this, and probably don’t have enough IA-32 knowledge to do so.
Correction (03/06/06): I actually think this is possible thanks to the CMPXCHG8B instruction which, if prefixed with LOCK, would allow an atomic modification of 64 bits of memory at once if the address is 16-bit aligned. Since we have only 6 bytes to write, I think we’re good to go.
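The CMPXCHG8B idea can be sketched as a compare-and-swap loop: read the 8 bytes, replace the first 6 in a local copy, and swap the whole 64-bit word in only if nobody changed it in the meantime. This is a sketch under assumptions, not the code from mach_override: it relies on the GCC/Clang builtin __sync_bool_compare_and_swap, which I assume the compiler turns into LOCK CMPXCHG8B on IA-32, and the name patch6_atomic is hypothetical. It works on a plain data buffer; patching live code would also require making the page writable first.

```c
#include <stdint.h>
#include <string.h>

/*
 * Atomically replace the first 6 bytes of the 8-byte word at `slot`
 * with `patch`, leaving the last 2 bytes as they were.
 * (Hypothetical helper, for illustration only.)
 */
void patch6_atomic(uint64_t *slot, const unsigned char patch[6])
{
    uint64_t old_val, new_val;

    do {
        /* Snapshot the current 8 bytes. If this read is stale or torn,
           the compare-and-swap below fails and we simply retry. */
        old_val = *(volatile uint64_t *)slot;

        /* Build the new word: 6 patched bytes + 2 preserved bytes.
           memcpy works in memory order, so byte order does not matter. */
        new_val = old_val;
        memcpy(&new_val, patch, 6);

        /* Store all 8 bytes in one atomic operation. */
    } while (!__sync_bool_compare_and_swap(slot, old_val, new_val));
}
```

With the 6-byte prolog shown above (55 89 e5 83 ec ??) in the slot, the patch would be the 6 bytes of the jump; the two bytes that follow the prolog are written back unchanged.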
 
Any idea or information is welcome :)
 
Setting up the stack is easy. Setting the Program Counter to point to INJECT_ENTRY is also very easy. Unfortunately, while this is enough on PPC, it is far from sufficient on Intel.
It is enough to call INJECT_ENTRY, but the process crashes as soon as any external function is called.
 
 
DYLD jump tables
As explained briefly in Jonathan Rentzsch’s Mac Hack paper, an implementation block that needs access to functions defined elsewhere contains a vector table whose entries are the actual external function addresses. This vector table is filled at bind time.
 
While this is always true on PPC, things are a bit different on Intel. The stub called when accessing an external function is directly rewritten at bind time into a very simple JMP instruction.
 
For instance, a call to an external function:
    external_func();
 
becomes an assembler call to a stub
    call   0x305f <dyld_stub_external_func>
 
and dyld_stub_external_func is something like
    JMP 0x098ff110
 
This instruction is written at bind time.
 
Unfortunately, this JMP instruction is a relative jump (i.e. its operand is an offset from the address of the following instruction). Thus, as soon as we move the code, this jump no longer points to the external function and generally leads to a crash.
 
As a consequence, when injecting a dyld image into the target task, one must manually offset the JMP instructions so that they still point to the actual external functions. This means the dyld image must be copied, then modified, then injected into the target process.
 
Fortunately, the position of these JMP instructions within a dyld image is easy to get with the dyld API by accessing the ("__IMPORT", "__jump_table") section.
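The fix-up itself can be sketched as follows, assuming each jump table entry is a 5-byte JMP rel32 (opcode E9 followed by a 32-bit little-endian offset) and that the copy of the table has already been located. The function name and its parameters are hypothetical, the section lookup is left out, and the code assumes a little-endian host, which is the case on IA-32.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JMP_ENTRY_SIZE 5 /* assumed layout: E9 + 32-bit relative offset */

/*
 * Adjust every JMP rel32 in a copied jump table so that it still reaches
 * the same absolute target once the image runs at `new_base` instead of
 * `old_base`. (Hypothetical helper, for illustration only.)
 */
void fix_up_jump_table(unsigned char *table, size_t table_size,
                       uint32_t old_base, uint32_t new_base)
{
    /* target = address_of_next_instruction + offset. Moving the code by
       (new_base - old_base) must therefore change each offset by the
       opposite amount. Unsigned arithmetic wraps, as the CPU does. */
    uint32_t delta = old_base - new_base;
    size_t count = table_size / JMP_ENTRY_SIZE;

    for (size_t i = 0; i < count; i++) {
        /* Skip the E9 opcode byte; the operand is not aligned,
           so go through memcpy rather than a pointer cast. */
        unsigned char *operand = table + i * JMP_ENTRY_SIZE + 1;
        uint32_t rel;

        memcpy(&rel, operand, sizeof rel);
        rel += delta;
        memcpy(operand, &rel, sizeof rel);
    }
}
```

The same delta applies to every entry because the whole image moves as one block; only the targets, which live outside the image, stay where they were.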
 
Once this is done, external functions get called correctly. Unfortunately, most of them crash very soon, and even a very simple INJECT_ENTRY function crashes on the mandatory thread_suspend() call.
 
 
pthread structure
This one is a bit tricky to explain.
Many libc functions try to access some data associated with the current pthread (POSIX thread). They generally get this information through a call named pthread_self(), defined in libc.
This call returns a pointer to the data structure associated with the current pthread.
 
They generally behave well if this call returns NULL.
 
We would expect pthread_self() to return NULL in INJECT_ENTRY since we call it by creating a mach thread, and no pthread environment has been set for this thread.
 
Unfortunately, on IA32, if it is called in INJECT_ENTRY without any prior work, it doesn’t return NULL, it crashes!
 
This is due to the way the POSIX thread data structure is accessed on IA-32. On many processors, a dedicated register points to this data, but not on IA-32. On this architecture, a segment register is used instead.
 
Segmented access is a kind of indirection used to access memory zones. For example, segment 0 can be used to access address 0x1000000 and beyond, segment 1 to access address 0xffc000 and beyond, etc.
 
Then, if the %gs register is set to 1, the following instruction:
    movl %gs:8, %eax
would put the content of address 0xffc000+8 into %eax.
 
When a pthread is created, a memory zone is allocated to store the thread data structure.
The address of this memory zone is then passed to a function named pthread_set_self().
 
This function asks the kernel to set up segment number 0x37 so that it points to the newly allocated data structure. It then sets %gs to 0x37.
 
From now on, %gs:OFFSET is a direct access to the thread data structure. Since this structure contains a pointer to itself at offset 0x48, %gs:0x48 contains the address of the current thread data structure.
 
Guess what? This is exactly what pthread_self() uses to get the pointer to the current thread data structure.
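Here is a toy model of that mechanism in plain C, with no real segment register involved. A global pointer stands in for the base address that segment 0x37 maps to, and the two functions mimic the roles of pthread_set_self() and pthread_self(). All names are hypothetical, and storing the self pointer inside the setup function is a simplifying assumption of the model; only the 0x48 offset comes from the text above.

```c
#include <stddef.h>
#include <string.h>

#define SELF_OFFSET 0x48 /* offset of the self pointer, as quoted above */

/* Stand-in for the base address that %gs would give access to.
   NULL models a thread on which the segment was never set up. */
static unsigned char *toy_gs_base = NULL;

/* Model of pthread_set_self(): record the block as the thread data
   structure and store its own address at SELF_OFFSET. */
void toy_pthread_set_self(unsigned char *block)
{
    memcpy(block + SELF_OFFSET, &block, sizeof block);
    toy_gs_base = block;
}

/* Model of pthread_self(): blindly read the pointer at %gs:0x48.
   There is no check, so calling this before toy_pthread_set_self()
   reads through an invalid address, just like the real crash. */
void *toy_pthread_self(void)
{
    void *self;

    memcpy(&self, toy_gs_base + SELF_OFFSET, sizeof self);
    return self;
}
```

After toy_pthread_set_self(block) has been called on a zero-filled block, toy_pthread_self() returns block; calling it first dereferences an address near NULL.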
 
The thing is, in our INJECT_ENTRY function, pthread_set_self() has never been called. The segment 0x37 has not been prepared, and any attempt to access it will certainly crash the process.
 
This is exactly what happens as soon as some function calls pthread_self(), and believe me, many do, including thread_suspend() (via mig_get_reply_port).
 
The solution to this problem is to allocate a fake thread data structure full of zeroes and then call pthread_set_self() at the beginning of INJECT_ENTRY.
 
Since malloc() itself calls pthread_self(), the memory region must be allocated before INJECT_ENTRY is called and passed to it as an argument. I used the higher part of the stack. One might want to allocate a dedicated memory zone instead, but I did not want to change mach_inject too much.