28 July, 2009

How to read bug-reports

Well, recently I’ve noticed that there are many people, who have certain difficulties with reading and/or understanding bug-reports. There are many related questions both on support and forums. While it is quite surprising for me, it looks like there is a problem, which need to be solved.

Ok, let’s take few examples of bug-reports and see, how we should understand them. But before: remember, that you can increase usefulness of your bug-reports. And you better do it before bug-reports start arrive ;)

Call stacks

Call stack is a central piece of information in any bug-report. Call stack is a sequence of addresses (mostly with textual description), which leads us to place of the problem (exception or memory problem):
Call Stack Information:      
------------------------------------------------------------------------       
|Address |Module      |Unit        |Class |Procedure/Method      |Line |       
------------------------------------------------------------------------       
|*Exception Thread: ID=8820; Priority=0; Class=; [Main]                |       
|----------------------------------------------------------------------|       
|004D316F|Project8.exe|Unit9.pas   |TForm9|Button1Click          |33[3]|       
|76BCF6A5|user32.dll  |            |      |GetWindowLongW        |     |       
|76BCF6B1|user32.dll  |            |      |GetWindowLongW        |     |       
|76BC8ABF|user32.dll  |            |      |NotifyWinEvent        |     |       
|76BCF6A5|user32.dll  |            |      |GetWindowLongW        |     |       
|76BCF6B1|user32.dll  |            |      |GetWindowLongW        |     |       
|77AA8502|ntdll.dll   |            |      |NtFindAtom            |     |       
|76BF7B8D|user32.dll  |            |      |GetRawInputDeviceInfoW|     |      
|76BD078F|user32.dll  |            |      |GetPropW              |     |       
|76BD0AF5|user32.dll  |            |      |SendMessageW          |     |       
|76BB8C6B|user32.dll  |            |      |CallNextHookEx        |     |       
|76BD0697|user32.dll  |            |      |CallWindowProcW       |     |       
|76BD0681|user32.dll  |            |      |CallWindowProcW       |     |       
|76BB8C6B|user32.dll  |            |      |CallNextHookEx        |     |       
|76BD078F|user32.dll  |            |      |GetPropW              |     |       
|76BBE001|user32.dll  |            |      |GetCapture            |     |       
|76BD005B|user32.dll  |            |      |DispatchMessageW      |     |       
|76BD0051|user32.dll  |            |      |DispatchMessageW      |     |       
|004D499D|Project8.exe|Project8.dpr|      |                      |17[4]|       
|7691490F|kernel32.dll|            |      |BaseThreadInitThunk   |     |       
------------------------------------------------------------------------
EurekaLog’s call stack (raw text)

image
EurekaLog’s call stack (EurekaLog Viewer)

4CC5B7 [Unit15.pas][Unit15][Unit15.B][35]      
4CC5C4 [Unit15.pas][Unit15][Unit15.A][39]       
4CC5DC [Unit15.pas][Unit15][Unit15.TForm15.FormCreate][43]       
4C0CCB [Forms][Forms.TCustomForm.DoCreate]       
4C0913 [Forms][Forms.TCustomForm.AfterConstruction]       
4C08E8 [Forms][Forms.TCustomForm.Create]       
4CA539 [Forms][Forms.TApplication.CreateForm]       
4CD986 [Project14][Project14.Project14][14]       
75B94911 [BaseThreadInitThunk]       
770FE4B6
FastMM’s call stack

(00087A8C){Project1.exe} [00488A8C] Unit1.B (Line 40, "Unit1.pas" + 1) + $5      
(00087ABC){Project1.exe} [00488ABC] Unit1.A (Line 44, "Unit1.pas" + 0) + $0       
(00055523){Project1.exe} [00456523] Controls.TWinControl.WndProc + $513       
(0003AA8C){Project1.exe} [0043BA8C] StdCtrls.TButtonControl.WndProc + $6C       
(00055673){Project1.exe} [00456673] Controls.DoControlMsg + $23       
(00055523){Project1.exe} [00456523] Controls.TWinControl.WndProc + $513       
(00068584){Project1.exe} [00469584] Forms.TCustomForm.WndProc + $594       
(00054C3C){Project1.exe} [00455C3C] Controls.TWinControl.MainWndProc + $2C       
(00025724){Project1.exe} [00426724] Classes.StdWndProc + $14       
(0005561F){Project1.exe} [0045661F] Controls.TWinControl.DefaultHandler + $D7       
(00055523){Project1.exe} [00456523] Controls.TWinControl.WndProc + $513       
(0003AA8C){Project1.exe} [0043BA8C] StdCtrls.TButtonControl.WndProc + $6C       
(00025724){Project1.exe} [00426724] Classes.StdWndProc + $14
JCL’s call stack

image
Delphi’s call stack (View/Debug Windows/Call Stack)

So, as you can see: call stack always contains very similar information, which don’t depend on tool created it (well, there are all different applications in the above examples, so don’t try to compare them ;) ).

First, usually the first element in the call stack’s string is a raw address. Those are numbers like 00488ABC or 488ABC (without leading zeros form). This is $00488ABC in Delphi’s style – i.e. the pointer on memory’s location, namely: on some code. You can use these addresses manually (in case, if there is no other information available): run your application, call “View”/“Goto address” menu item and enter $00488ABC – and IDE will go to this code location and show you either source or assembler code in CPU window (if source is not available). This will happen only in the case, when address doesn’t change at restarting application.

Usually there are also names of modules listed near those addresses. For example, Project8.exe or user32.dll. You can use these names to fast distinguish your code from system’s code.

But enough talking already about these boring numbers – you will use them rarely :) Because there is additional (more human-friendly) information available, if you have some debug information available. For example, names of the unit, class, method or routine and even line number. Sometimes part of information may be missing – that depends on detail level of debug information. We’ve already discussed it here and there.

Well, this information is rather clear: here is your Delphi’s unit, and here is routine and particular line in it, where either the problem happened (for the first line in call stack) or the call of sub-routine occurred (for other lines in call stack). The only piece, that require additional explanation, is line number. Often near absolute line numbers (counting from file’s start) there also are relative offsets in lines (or even in bytes) – which shows offset of the line from start of routine, to which the line belongs. For example, suppose, that we have the following code:
procedure Test;
begin
  DoActions;
  DoActions2;
end;
On the left side you can see absolute numeration, as it is in IDE code editor – starting from file’s start. So, line number information for DoActions2 may appear as “48[2]” (EurekaLog’s style) or “Line 48, "Unit1.pas" + 1” (JCL’s style). You should read this as: “it is a 48th line in the file or second line in Test routine”.

Why we need these relative offsets? Well, they can be quite handy if your sources was modified. For example, suppose that you’ve added a new routine before the Test procedure. This means, that the Test routine itself will be shifted down on, say, 27 lines. It is clear, that the line 48 is no longer belongs to the Test routine. However, you still can find a DoActions2 routine: if you look at routine’s name in the call stack and skip 2 lines from its start – you’ll arrive at DoAction2’s line.

BTW, EurekaLog is capable of looking inside your __history folders (available in recent Delphi’s versions) and extracting most relevant source for viewing (that happens, when you double-clicks on call stack’s line in EurekaLog Viewer).

How to read call stacks

Here are few moments, which you should have in mind, when you are going to read call stacks.

First: one bug-report file may contain several bug-reports, and one bug-report may contain several call stacks (for example: one call stack for each thread). So, before actually reading – make sure, that you’re going to read the required (“the one”) call stack ;)

Usually, call stacks should be read from the bottom to the top: the first called routine is placed at their bottom. The next called routine is above it, on the next line. And so on. A current location (the last called routine) is located at the top of call stack (call stack’s first line). Sometimes current location indicate the location of the problem (but it is not always so – see below), and the rest of the call stack indicate execution path – i.e. the way, how we got to the current place.

First few routines (from bottom) usually are system’s ones or belongs to Delphi’s RTL and, therefore, aren’t very interesting. Often there are only partial information available for them – due to partial or missing debug information. And sometimes the call stack itself has a limited depth (length/capacity) – i.e. it can not contain more than N lines or M characters. In that case the bottom part is simply cut.

On the other side: top part of the call stack may mention some RTL tool routines along with error’s location and your code. That depends on many factors. For example, it depends on stage (level of processing), at which exception was caught and the call stack was created. The examples of such tool routines are RTL’s exception processing helpers, GetMem’s calls, etc, etc.

Some tools also may specially add an additional line to the very top of the call stack. This line represents the exact error’s location, if it is available (for example, it is an code instruction’s address for exceptions). If this is the case, then if call stack already contains this address as its first line, then nothing happens. But if there are lines with RTL’s tool routines at the top – then this address will be duplicated in the call stack. First time it’ll be listed in the very first line (as special added) and the second time it’ll be listed few lines below in the call stack (after RTL’s tool routines).

The next thing: you should be aware of the method used for building call stack. There are two methods available: frame-based tracing and raw tracing methods. First one (frame-based) builds call stack by using sequence of frames, which are added to the stack on call of most routines. Usually this method gives acceptable results (if you don’t have too many very short routines). You can increase it’s effectiveness by enabling “Stack Frames” option. This method is quite precise and fast as it looks only for “registered” calls. It don’t scan the entire stack – just walks by sequence of frames, where every call frame points to another.

Raw tracing method works differently: it just scan the whole stack, trying to find return addresses. Really: the frames may be or may be not present in the stack. But return addresses are always here. The problem is that there is no 100% way to find them. So, raw method takes every integer in stack and tries to guess: does it look like return address? For this reason, the raw tracing methods can be used only with some heuristic algorithm. Created call stacks may differ significantly, depending on quality of the heuristic.

Why am I telling all this stuff? That’s because reading of a call stack may depend on the method, which was used to create the call stack.

So, what’s the difference? Well, you may guess it yourself: frame-based method may miss some calls, if there wasn’t call frame generated (usually, it are a very short routines) or even don’t work entirely, if one of the frames was damaged. On the other side, the raw tracing method is quite more “aggressive” than frame-based: it will not miss the call (if it wasn’t filtered out by heuristic), but it may add wrong elements to the call stack – so called “false-positive” entries.

So (finally :) ), if you’re analyzing the call stack and see strange thing (“Hey? What the…? This routine can not be called here by my code!”) – you should take into account the tracing method and suppose that: either there is call missed (it’s for the frame-based method) or there is unnecessary (false-positive) call (it’s for the raw method). You should also remember about this difference, if you use call stack in your application directly (for example, to get textual description of your caller).

Version 6 of EurekaLog use only raw-method (frame-based support is suggestion for v7). FastMM and JCL supports both methods. You can switch between them in settings. They use frame-based method by default (well, it appears that the latest version of FastMM now uses raw method by default, so you better check your settings).

Searching bug’s location

Actually, the thing, that I want to say here is: the problem may be located not at the place, where your call stack points to! Huge amount of people do not understand that for some reason.

Look, suppose, that you have one of those dreaded memory-corruption bug in your application. Buggy code may run without a visible problem and corrupt some other memory. And the actual exception will be raised later, when fully unrelated code will be executed. Now you have Access Violation and call stack, which points to some innocent code, which have nothing to do with the original problem.

You should just mindfully analyze the situation and do not blame unrelated code.

Even more than that: few types of bug-reports can not point to the problem by definition. Of course, I didn’t mean exceptions here (hey, we always have the exact address for exception), but rather talk about memory problems here (both leaks and corruption). It is quite obvious, that memory manager (I talk about EurekaLog, FastMM or any other memory tool here) can not scan the entire memory pool for problems at every machine instruction in your application (“hey, is this CPU command going to corrupt my precious memory?”). What’s more: it is not possible do scan entire pool on every request to MM (GetMem/FreeMem) either. As there is a lot of used memory in every application. There are even more control structures. So, if you’ll scan all memory at every call to MM, then your performance will be near zero.

That’s why memory manager in debug mode usually checks only blocks, which are directly related to the current operation. For example, when we ask to free memory, then MM will check only this block for corruption. Note: it’ll check only this block, not all allocated blocks. Next example: we’re asking for memory. So, MM will go through available “free” blocks and pick suitable one. Before returning it to us, MM’ll scan it to check, that no one have wrote into that block, while it was free.

That is why problems with memory will be raised later, and not in the moment, when they actually occurred.

Examples? Well, let’s take FastMM here. It has more advanced features than EurekaLog, so it is more interesting as example (but the example also applies to EurekaLog or any other tool). Every FastMM’s error message starts with “FastMM has detected an error during a ” and a name of current operation (GetMem, FreeMem, ReallocMem, or scanning). The next is the error itself. Here are few examples:
  • “The block header has been corrupted”
  • “The block footer has been corrupted”
  • “FastMM detected that a block has been modified after being freed”
  • “An attempt has been made to free/reallocate an unallocated block”
(Well, there are other, more rare error messages too). Look: actually all these error messages talk about one (the same) particular problem: unexpected change in memory (i.e. memory corruption). But it says “FastMM has detected”. Nowhere it says that “The error occurred”. Which means that FastMM only detected a problem in the current moment. The problem itself occurred earlier.

To help you to find bug’s source, FastMM may add to the bug report not only the call stack to the current moment (which is quite unrelated to the problem, as we saw above), but also call stacks for last successful operations with the same block. So, that’ll means, that problem happened somewhere between two moments. For example, if FastMM is telling you, that somebody has changed free block, than this means that:
  1. You’ve allocated a block (FastMM provides a call stack for this).
  2. You’ve free the block (FastMM provides a call stack for this).
  3. Some buggy code writes to this freed block by mistake.
  4. You call GetMem. Now FastMM is going to return this memory block and detect changes in it (FastMM provides a call stack for this).
There will be three (!) call stacks in one bug-report’s entry and none of them points to actual error’s location! Actual error sits somewhere between second and last call stack. Also note, that there can be multiply actions between any of the two points above. Yes, including calls to memory manager.

To put it short: just use your head and carefully read error messages (not just jump to analyze the code in the call stack) and it’s going to be ok ;)

BTW, I’ll talk about these kinds of memory problems in the next time (yes, I know that I already promised to do it, but I’m really busy recently – sorry).

Other information in the report

Apart from call stack there may be other information available in the bug reports. You can extract hints to the problem from this auxiliary information. The most important are information about CPU register and memory dumps. Sometimes you may extract information about variables from these pieces. Of course, it requires assembler knowledge, but it can be very handy sometimes.

Actually, the memory dumps can be useful even for un-experienced programmer. For example, the dump of memory leak can tell you, what data was there at run-time. Of, you can see, say, a string’s data instead of object’s layout and so you can trace where did this string go and find the source of memory corruption.

Unfortunaly, it is really difficult to give some general advices here – almost every case require individual approach. When you suggest the situation at run-time in your application, then the additional information in the report can help you to check these assumptions.

Well, an example. The very strange access violation at, seemingly, usual place at function’s call. You may notice that your application was launched in Windows 2000 (that is field in the bug report). And you do know, that you didn’t tested your application on this system. So, you dig a little deeper and discover, that used function do not exists in Windows 2000, and so this is the reason for the problem (say, you’ve imported the function via GetProcAddress, but didn’t check for errors).