By now, you're probably looking around at device driver code and wondering, "How does the function foo_read() get called?" Or perhaps you're wondering, "When I type cat /proc/cpuinfo, how does the cpuinfo() function get called?"
Once the kernel has finished booting, the control flow changes from a comparatively straightforward "Which function is called next?" to being dependent on system calls, exceptions, and interrupts. Today, we'll talk about system calls.
What is a system call?
In the most literal sense, a system call (also called a "syscall") is an instruction, similar to the "add" instruction or the "jump" instruction. At a higher level, a system call is the way a user level program asks the operating system to do something for it. If you're writing a program, and you need to read from a file, you use a system call to ask the operating system to read the file for you.
System calls in detail
Here's how a system call works. First, the user program sets up the arguments for the system call. One of the arguments is the system call number (more on that later). Note that all this is done automatically by library functions unless you are writing in assembly. After the arguments are all set up, the program executes the "system call" instruction. This instruction causes an exception: an event that causes the processor to jump to a new address and start executing the code there.
The instructions at the new address save your user program's state, figure out which system call you want, call the function in the kernel that implements that system call, restore your user program's state, and return control to the user program. A system call is one way that the functions defined in a device driver end up being called.
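The user-space half of this sequence can be sketched in C. This is a minimal sketch, assuming a Linux system with glibc: wrapped_read() and numbered_read() are hypothetical helper names introduced here for illustration, while read(), syscall(), and SYS_read are the real library interfaces. The first helper lets the library wrapper set up the arguments; the second passes the system call number explicitly, which is roughly what the wrapper does on your behalf.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_read: the system call number for read() */
#include <sys/types.h>     /* ssize_t */
#include <unistd.h>        /* read(), syscall() */

/* Hypothetical helper: the usual route, where the C library wrapper
 * sets up the arguments and executes the system call instruction. */
static ssize_t wrapped_read(int fd, void *buf, size_t count)
{
    return read(fd, buf, count);
}

/* Hypothetical helper: the same request, but with the system call
 * number passed explicitly as the first argument to syscall(). */
static ssize_t numbered_read(int fd, void *buf, size_t count)
{
    return (ssize_t)syscall(SYS_read, fd, buf, count);
}
```

Both helpers should behave the same way on a given file descriptor. The numeric value behind SYS_read depends on the architecture, so the sketch uses the symbolic name.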
That was the whirlwind tour of how a system call works. Next, we'll go into minute detail for those who are curious about exactly how the kernel does all this. Don't worry if you don't quite understand all of the details - just remember that this is one way that a function in the kernel can end up being called, and that no magic is involved. You can trace the control flow all the way through the kernel - with difficulty sometimes, but you can do it.
A system call example
This is a good place to start showing code to go along with the theory. We'll follow the progress of a read() system call, starting from the moment the system call instruction is executed. The PowerPC architecture will be used as an example for the architecture-specific part of the code. On the PowerPC, when you execute a system call, the processor jumps to the address 0xc00. The code at that location is defined in the file:
arch/ppc/kernel/head.S
It looks something like this:
/* System call */
        . = 0xc00
SystemCall:
        EXCEPTION_PROLOG
        EXC_XFER_EE_LITE(0xc00, DoSyscall)

/* Single step - not used on 601 */
        EXCEPTION(0xd00, SingleStep, SingleStepException, EXC_XFER_STD)
        EXCEPTION(0xe00, Trap_0e, UnknownException, EXC_XFER_EE)
What this code does is save some state and call another function called DoSyscall. Here's a more detailed explanation (feel free to skip this part):
EXCEPTION_PROLOG is a macro that handles the switch from user to kernel space, which requires things like saving the register state of the user process. EXC_XFER_EE_LITE is invoked with the address of this routine and the address of the function DoSyscall. Eventually, some state will be saved and DoSyscall will be called. The next two lines set up the exception vectors at the addresses 0xd00 and 0xe00.
EXC_XFER_EE_LITE looks like this:
#define EXC_XFER_EE_LITE(n, hdlr)       \
        EXC_XFER_TEMPLATE(n, hdlr, n+1, COPY_EE, transfer_to_handler, \
                          ret_from_except)
EXC_XFER_TEMPLATE is another macro, and the code looks like this:
#define EXC_XFER_TEMPLATE(n, hdlr, trap, copyee, tfer, ret)     \
        li      r10,trap;                                       \
        stw     r10,TRAP(r11);                                  \
        li      r10,MSR_KERNEL;                                 \
        copyee(r10, r9);                                        \
        bl      tfer;                                           \
i##n:                                                           \
        .long   hdlr;                                           \
        .long   ret
li stands for "load immediate", which means that a constant value known at compile time is loaded into a register. First, trap is loaded into the register r10. On the next line, that value is stored at the address given by TRAP(r11). TRAP(r11) and the next two lines do some hardware-specific bit manipulation. After that we call the tfer function (i.e. the transfer_to_handler function), which does yet more housekeeping, and then transfers control to hdlr (i.e. DoSyscall). Note that transfer_to_handler loads the address of the handler through the link register, which the bl instruction leaves pointing at the .long hdlr word. That is why you see .long DoSyscall instead of bl DoSyscall.
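Two details of these macros are easy to miss: the label i##n is built by the preprocessor's token-pasting operator, and EXC_XFER_EE_LITE passes n+1 as the trap value. The following is a small C sketch of just those two pieces. The names STRINGIFY, LABEL_NAME, TRAP_VALUE, syscall_label, and syscall_trap are hypothetical helpers for this illustration and do not appear in the kernel source.

```c
/* Two-step stringification, so the pasted token can be inspected as text. */
#define STRINGIFY_(x) #x
#define STRINGIFY(x)  STRINGIFY_(x)

/* Hypothetical helpers mirroring two pieces of the kernel macros:
 * LABEL_NAME pastes "i" and n together, as "i##n:" does in
 * EXC_XFER_TEMPLATE; TRAP_VALUE is the "n+1" that EXC_XFER_EE_LITE
 * passes as the trap argument. */
#define LABEL_NAME(n) STRINGIFY(i##n)
#define TRAP_VALUE(n) ((n) + 1)

/* For the system call vector, n is 0xc00. */
static const char *syscall_label = LABEL_NAME(0xc00);  /* text of the label */
static const int syscall_trap = TRAP_VALUE(0xc00);     /* value loaded by li */
```

With n set to 0xc00, the pasted label is i0xc00 and the trap value is 0xc01.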
Now, let's look at DoSyscall. It's in the file:
arch/ppc/kernel/entry.S
Eventually, this function loads up the address of the syscall table and indexes into it using the system call number. The syscall table is what the OS uses to translate from a system call number to a particular system call. The system call table is named sys_call_table and defined in:
arch/ppc/kernel/misc.S
The syscall table contains the addresses of the functions that implement each system call. For example, the read() system call function is named sys_read. The read() system call number is 3, so the address of sys_read() is in the 4th entry of the system call table (since we start numbering the system calls with 0). We read the data from the address sys_call_table + (3 * word_size) and we get the address of sys_read().
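The table lookup can be modeled in ordinary user-space C as an array of function pointers indexed by the system call number. This is only a toy model of the idea, not the kernel's code: every toy_ name below is hypothetical, the three-argument signature is a simplification, and the bounds check that returns -ENOSYS is this sketch's own choice.

```c
#include <errno.h>   /* ENOSYS */

/* In this toy model every system call takes three long arguments. */
typedef long (*syscall_fn)(long a, long b, long c);

/* Hypothetical placeholder for table slots this sketch does not model. */
static long toy_sys_ni(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

/* Hypothetical stand-in for sys_read(): just reports the requested count. */
static long toy_sys_read(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;
}

/* The table holds one function address per system call number. */
static const syscall_fn toy_sys_call_table[] = {
    toy_sys_ni,    /* 0 */
    toy_sys_ni,    /* 1 */
    toy_sys_ni,    /* 2 */
    toy_sys_read,  /* 3: the 4th entry, since numbering starts at 0 */
};

#define TOY_NR_SYSCALLS \
    (sizeof(toy_sys_call_table) / sizeof(toy_sys_call_table[0]))

/* Index into the table with the system call number and call the result. */
static long toy_do_syscall(unsigned long nr, long a, long b, long c)
{
    if (nr >= TOY_NR_SYSCALLS)
        return -ENOSYS;
    return toy_sys_call_table[nr](a, b, c);
}
```

Indexing the array with nr is the C equivalent of reading the word at sys_call_table + (nr * word_size).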
After DoSyscall has looked up the correct system call address, it transfers control to that system call. Let's look at where sys_read() is defined, in the file:
fs/read_write.c
This function finds the file struct associated with the fd number you passed to the read() function. That structure contains a pointer to the function that should be used to read data from that particular kind of file. After doing some checks, it calls that file-specific read function in order to actually read the data from the file, and then returns. This file-specific function is defined somewhere else - the socket code, filesystem code, or device driver code, for example. This is one of the points at which a specific kernel subsystem finally interfaces with the rest of the kernel. After our read function finishes, we return from sys_read() back to DoSyscall, which switches control to ret_from_except, which is defined in:
arch/ppc/kernel/entry.S
This checks for tasks that might need to be done before switching back to user mode. If nothing else needs to be done, we fall through to the restore function, which restores the user process's state and returns control back to the user program. There! Your read() call is done! If you're lucky, you even got your data back.
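The step where sys_read() finds the file struct and calls its file-specific read function can also be modeled in user space. This is a simplified sketch under stated assumptions: all demo_ names are hypothetical, the structures are far smaller than the kernel's, and the specific checks and error codes are choices made for this illustration.

```c
#include <errno.h>    /* EBADF, EINVAL */
#include <stddef.h>   /* size_t, NULL */
#include <string.h>   /* memcpy */

struct demo_file;

/* Each kind of file supplies its own read function. */
struct demo_file_ops {
    long (*read)(struct demo_file *f, char *buf, size_t count);
};

/* Hypothetical, much-reduced analogue of the kernel's file struct. */
struct demo_file {
    const struct demo_file_ops *ops;
    const char *data;   /* backing bytes for this toy "file" */
    size_t len;
    size_t pos;
};

/* A file-specific read: copies bytes out of an in-memory buffer. */
static long demo_mem_read(struct demo_file *f, char *buf, size_t count)
{
    size_t left = f->len - f->pos;

    if (count > left)
        count = left;
    memcpy(buf, f->data + f->pos, count);
    f->pos += count;
    return (long)count;
}

static const struct demo_file_ops demo_mem_ops = { .read = demo_mem_read };

#define DEMO_MAX_FDS 4
static struct demo_file *demo_fd_table[DEMO_MAX_FDS];

/* Stand-in for sys_read(): map fd to a file, check it, then dispatch. */
static long demo_sys_read(unsigned int fd, char *buf, size_t count)
{
    struct demo_file *f;

    if (fd >= DEMO_MAX_FDS || demo_fd_table[fd] == NULL)
        return -EBADF;
    f = demo_fd_table[fd];
    if (f->ops == NULL || f->ops->read == NULL)
        return -EINVAL;
    return f->ops->read(f, buf, count);
}
```

A driver, filesystem, or socket layer would each provide its own read function in place of demo_mem_read, and demo_sys_read() would not need to change.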
You can explore syscalls further by putting printks at strategic places. Be sure to limit the amount of output from these printks. For example, if you add a printk to the sys_read() syscall, you should do something like this:
static int mycount = 0;

if (mycount < 10) {
        printk("sys_read called\n");
        mycount++;
}
Have fun!