Saturday, 27 June 2015

Static code analysis on kernel source

Since 2014 I have been running static code analysis using tools such as cppcheck and smatch against the Linux kernel source on a regular basis to catch bugs that creep into the kernel.   After each cppcheck run I then diff the logs and get a list of deltas on the error and warning messages, and I periodically review these to filter out false positives and I end up with a list of bugs that need some attention.

Bugs such as allocations returning NULL pointers without checks, memory leaks, duplicate memory frees and uninitialized variables are easy to find with static analyzers and generally just require generally one or two line fixes.

So what are the overall trends like?

Warnings and error messages from cppcheck have been dropping over time and "portable warnings" have been steadily increasing.  "Portable warnings" are mainly from arithmetic on void * pointers (which GCC handles has byte sized but is not legal C), and these are slowly increasing over time.   Note that there is some variation in the results as I use the latest versions of cppcheck, and occasionally it finds a lot of false positives and then this gets fixed in later versions of cppcheck.

Comparing it to the growth in kernel size the drop overall warning and error message trends from cppcheck aren't so bad considering the kernel has grown by nearly 11% over the time I have been running the static analysis.

Kernel source growth over time
Since each warning or error reported has to be carefully scrutinized to determine if they are false positives (and this takes a lot of effort and time), I've not yet been able determine the exact false positive rates on these stats.  Compared to the actual lines of code, cppcheck is finding ~1 error per 15K lines of source.

It would be interesting to run this analysis on commercial static analyzers such as Coverity and see how the stats compare.  As it stands, cppcheck is doing it's bit in detecting errors and helping engineers to improve code quality.

Friday, 19 June 2015

Powerstat and thermal zones

Last night I was mulling over an overheating laptop issue that was reported by a user that turned out to be fluff and dust clogging up the fan rather than the intel_pstate driver being broken.

While it is a relief that the kernel driver is not at fault, it still bothered me that this kind of issue should be very simple to diagnose but I overlooked the obvious.   When solving these issues it is very easy to doubt that the complex part of a system is working correctly (e.g. a kernel driver) rather than the simpler part (e.g. the fan not working efficiently).  Normally, I try to apply Occam's Razor which in the simplest form can be phrased as:

"when you have two competing theories that make exactly the same predictions, the simpler one is the better."

..e.g. in this case, the fan is clogged up.

Fortunately, laptops invariably provide Thermal Zone information that can be monitored and hence one can correlate CPU activity with the temperature of various components of a laptop.  So last night I added Thermal Zone sampling to powerstat 0.02.00 which is enabled with the new -t option.
 
powerstat -tfR 0.5
Running for 60.0 seconds (120 samples at 0.5 second intervals).
Power measurements will start in 0 seconds time.

  Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s  Watts x86_pk acpitz  CPU Freq
11:13:15   5.1   0.0   2.1  92.8   0.0    1   7902   1152   7.97  62.00  63.00  1.93 GHz  
11:13:16   3.9   0.0   2.5  93.1   0.5    1   7168    960   7.64  63.00  63.00  2.73 GHz  
11:13:16   1.0   0.0   2.0  96.9   0.0    1   7014    950   7.20  63.00  63.00  2.61 GHz  
11:13:17   2.0   0.0   3.0  94.5   0.5    1   6950    960   6.76  64.00  63.00  2.60 GHz  
11:13:17   3.0   0.0   3.0  93.9   0.0    1   6738    994   6.21  63.00  63.00  1.68 GHz  
11:13:18   3.5   0.0   2.5  93.6   0.5    1   6976    948   7.08  64.00  63.00  2.29 GHz  
....  

..the -t option now shows x86_pk (x86 CPU package temperature) and acpitz (APCI thermal zone) temperature readings in degrees Celsius.

Now this is where the fun begins.  I ran powerstat for 60 seconds at 2 samples per second and then imported the data into LibreOffice.  To easily show corrleations between CPU load, power consumption, temperature and CPU frequency I normalized the data so that the lowest values were 0.0 and the highest were 1.0 and produced the following graph:

One can see that the CPU frequency (green) scales with the the CPU load (blue) and so does the CPU power (orange).   CPU temperature (yellow) jumps up quickly when the CPU is loaded and then steadily increases.  Meanwhile, the ACPI thermal zone (purple) trails the CPU load because it takes time for the machine to warm up and then cool down (it takes time for a fan to pump out the heat from the machine).

So, next time a laptop runs hot, running powerstat will capture the activity and correlating temperature with CPU activity should allow one to see if the overheating is related to a real CPU frequency scaling issue or a clogged up fan (or broken heat pipe!).