07 Anatomy of a Kernel Hack

Reading Val's 06 "Your First printk", reminded me of something that's been bugging me about the bogomips calculation. The kernel does an experiment at boot time to determine how fast the processor is, and it prints out the result, which I can see with dmesg:

% dmesg | grep Bogo
Calibrating delay loop... 1395.91 BogoMIPS

Apparently it only does this once, at boot time. But one of my machines is a laptop with speedstep, which means that when it's running off batteries it runs slower to save power. So if I boot it plugged in, then unplug and run off batteries, the kernel is now wrong in its idea of how fast the processor is. And there's currently no way, as far as I know, to set it right.

Val was hanging out on IRC one afternoon, and I mentioned being curious how hard it would be to write a driver that recalculated the value and printed out the current value. She encouraged me to find out, so I did, and it works! (Well, mostly, but I'll get to that.)

Here's what I did. I'll go through how I figured it out, step by step, in case other kernel hacking newbies like me are curious. Val was on IRC giving me suggestions and moral support throughout.

First, I wondered whether I should make it a device in /dev or something in /proc. Val said /proc would probably be much easier, and I didn't want to deal with making major/minor IDs, so /proc it was. I decided to call my device /proc/bogomips.

Now, I had to figure out how /proc devices are made. So I did

% ls /proc
and I looked for something with a fairly unique name that I could grep for in the kernel source. execdomains looked pretty unique -- I have no idea what it is, but I did a search on that string.

Aside: I have an alias for doing a recursive grep in C source (that's something I need to do all the time at work). It looks like this:

alias       cgr     "find . -name '*.h' -or -name '*.c' \
  -or -name '*.cpp' -or -name '*.cc' | xargs grep \!* /dev/null"

So I went to my kernel source directory and did: cgr execdomains. That pointed me to fs/proc/proc_misc.c. So (as root) I edited that file and searched for "execdomains". I found a routine, execdomains_read_proc, and a line down near the bottom that was pretty clearly the place where the actual /proc node was getting registered.

So I added my new line there, right underneath the execdomains line:

	{"execdomains",	execdomains_read_proc},
	{"bogomips",	bogomips_read_proc},
Then I went back to execdomains_read_proc, made a copy of it and renamed it bogomips_read_proc.

Now I had to figure out what to put in bogomips_read_proc. It was back to Val's lesson 5: she said to put the printk in init/main.c, and that's where the bogomips calculation was, in a function called calibrate_delay(void). It looked like all I had to do was call calibrate_delay(), then check the variable loops_per_jiffy and print it out, and I'd have my driver.

But how was I supposed to return it? execdomains_read_proc() had an int len which it set to something, then it returned:

	int len = get_exec_domain_list(page);
	return proc_calc_metrics(page, start, off, count, eof, len);
Okay, so I didn't know what that was doing, but all the stuff in proc_misc.c seemed to return proc_calc_metrics like that, so I figured I'd better do that too. So I did this:
	int len;
	calibrate_delay();
	len = loops_per_jiffy;
	return proc_calc_metrics(page, start, off, count, eof, len);

Then I tried compiling: make bzImage from the top. (First I tried doing a make in fs/proc, but that didn't work so I guess I have to run it from the top every time.)

The compile failed -- it didn't know about loops_per_jiffy. So it's back to recursive grep to see if there was any .h file which mentioned that variable. (I have another alias, hgr, that does a recursive grep but only looks in .h files.) Sure enough, linux/delay.h declares it, so I added

#include <linux/delay.h>
near the top of proc_misc.c, and ran make bzImage again.

This time it compiled! Cool! So now it's time to try it:

cp arch/i386/bzImage /boot/vmlinuz-test
(edit /etc/lilo.conf to add an entry for vmlinuz-test)
lilo
(gulp, deep breath) reboot!

Cool! The new kernel booted and I was able to log in as myself. I cd'ed to /proc, did an ls -- wow, bogomips is actually there! So (deep breath) cat bogomips.

Yowch! screensful of garbage, beeping, and at the end my prompt was in what looked like vt100 graphics characters. :-) Okay, that wasn't quite it. But it didn't panic the system, and reset got me my normal prompt back.

Back into X, su, xemacs, and look at my routine, and meanwhile talk to Val on IRC. She suggested looking at what cpuinfo did, since /proc/cpuinfo prints lots of information including bogomips. So I looked at that for a while, and found out that /proc/cpuinfo is a lot more complicated than I thought. :-) It's complicated because it implements four operations -- open, read, llseek, and release -- rather than just read, and it does everything via routines named seq_(variousthings). seq_read ultimately called seq_printf off in some other file (more recursive grepping here, I'll spare you), and Val said seq_printf keeps track of the beginning and end of the string that you pass in, and I decided that this was more complicated than what I was after and I needed to find something simple that I did understand.

So ls /proc and stare at the list, and I found something that looked like it should be simple: /proc/loadavg prints a single line and I understand what a loadaverage is, so maybe I can use that as a model. Back to proc_misc.c and find loadavg_read_proc -- aha, it does

	len = sprintf(various stuff);
	return proc_calc_metrics(page, start, off, count, eof, len);
I figure I can do that.

So using init/main.c as a model for how to print out bogomips so they look good, I do this:

static int bogomips_read_proc(char *page, char **start, off_t off,
                              int count, int *eof, void *data)
{
	int len;
	printk("Recalibrating delay loop... Initially %d\n", loops_per_jiffy);
	calibrate_delay();
	printk("Afterward: %d\n", loops_per_jiffy);
        len = sprintf(page, "bogomips: %lu.%02lu\n",
                      loops_per_jiffy/(500000/HZ),
                      (loops_per_jiffy/(5000/HZ)) % 100);
	return proc_calc_metrics(page, start, off, count, eof, len);
}

(I'm hoping HZ is just magically defined in some include file I'm already including, and I got lucky there.) make bzImage tells me that loops_per_jiffy is a long and %d is an int format. I hate code that throws warnings so I change %d to %ld in both places, and rebuild. It builds! I copy to /boot, lilo and reboot.

It boots! I log in as myself and (several deep breaths)

% cat /proc/bogomips

It printed out:

Recalibrating delay loop! Initially 6979584
Calibrating delay loop... 1395.91 BogoMIPS
Afterward: 6979584
bogomips: 1395.91
Recalibrating delay loop! Initially 6979584
Calibrating delay loop... 1395.91 BogoMIPS
Afterward: 6979584

Woohoo! That "Calibrating delay loop..." isn't from me, it's from calibrate_delay() ... so my driver actually worked! For some reason it called calibrate_delay() a second time, which isn't so good, but basically it worked! I started X and ran it again, and this time I just saw:

% cat /proc/bogomips
bogomips: 1395.91
(since all the other stuff was going to the console, not to my terminal).

Now for the real test: this was all done with the laptop plugged in. So I unplugged it (now it's running on batteries and should be slower):

% cat /proc/bogomips
bogomips: 1097.72

It got it right! This is so cool! Plugged back in and it went back up to 1395.91! I went back in and cleaned up -- removed the printk calls, and also the "bogomips: " since Val said it was normal to just print the value.

I still didn't know why it called calibrate_delay() again after it prints the value, so I needed to figure that out. But I was pretty jazzed -- I actually wrote a real (if very simple), useful driver that does something I wanted anyway!


(The next day ...)

I've been thinking about why calibrate_delay_loop gets called twice (that's bugging me). I put in some printk calls and discovered that my read proc is being called twice; the arguments are all the same except for offset, which is 0 the first time and 8 the second time. The length it's returning (the number of characters from the sprintf) is 8, so I'm guessing that cat is trying to read beyond that because it didn't see an eof (i.e., read() didn't return 0).

I discovered that the O'Reilly device drivers book talks a little about /proc, and that I'm supposed to set *eof=1 when I have nothing more to return. But proc_calc_metrics is already doing that (and I tried bypassing proc_calc_metrics and setting *eof=1 myself, which didn't help), but the read proc is still called a second time with offset=8.

It looks like the main difference is that the first time, it sets *start and returns 8, whereas the second time, it returns 0 and doesn't set *start.

Of course, I could hack this by saying if (offset>0) return, or perhaps by returning 0 instead of the length, but those doesn't seem right. The O'Reilly book is pretty clear that I'm supposed to return the number of characters, not zero. I want to understand why setting *eof isn't doing it. But it looks like there are other /proc drivers which have the same issue, so I've added a check for (offset > 0) into the code, so it no longer calls calibrate_delay() twice.

Editor's note: Several other people on the linux-kernel mailing list have noticed and discussed the EOF problem. At this time, no one has implemented a general solution.