30 March, 2010

Memory problems in Delphi apps - final article

I've already published several articles covering different aspects of memory issues. I'll start with a short review of those articles and then finish the series with this one. If you haven't read the previous articles, or if you're a newbie, I recommend reading them from first to last before you start this one.

Okay, back to our series. Here are its

Contents

1. An introductory article about pointers (and here is a separate article about strings). These are not my articles, but they are great and well written, so I'll just link to them. They discuss pointers, reference types and many other things that you should know about before you start debugging your memory problems.

2. Preparations. Tools for diagnosing memory issues require a certain environment and certain conditions. You can improve (or worsen) their results by adjusting the options of your projects. That's why, before going any further, this article explains the various debugging options and their effect on your application. It also lists option sets for typical use cases (note that these are only recommendations). This article is optional if you already have a memory issue and your bug report (assuming you have one at all) is detailed enough.

3. EAccessViolation exception. This exception (probably the most common one in Delphi) may have various causes, but all of them have something in common: they are memory problems. The article explains what this exception is, examines examples of its causes, and shows how you can diagnose and fix it, either by debugging manually or by using debugging tools. As becomes clear by the end of the article, the EAccessViolation exception is actually a blessing, because it lets you spot (and react to) the problem immediately. The article provides an example of how the same code can lead (depending on external conditions) either to an exception ("good"), to an application crash or data corruption ("bad"), or even to normal operation although the code still has a bug ("very bad"). The article discusses only the first case, an obvious exception, and leaves the other cases for later (this article).

4. Memory leaks. Though I talk about memory leaks in this article, my real goal is to introduce memory managers and their debug mode. I mentioned this at the beginning of the article, namely that searching for leaks is closely related to debugging memory corruption. I concentrated only on leaks in that article, saving the other interesting questions for later (this article again).

5. (off-topic 1) How to read bug reports. This article is not part of the series about memory issues, but I think it's worth mentioning. Apparently, many beginners have difficulty reading and understanding the bug reports generated by the debugging tools mentioned in the previous articles. So this article tries to explain how you should read reports and discusses typical mistakes in interpreting them.

6. (off-topic 2) Add-on to my article about memory leaks. It looks like many people used my first article about memory leaks for its most obvious purpose: catching mem-leaks ;) Apparently, the original article wasn't good enough for this purpose, because people asked many additional questions about it. Maybe that is not so surprising, if you recall that its original purpose was to introduce debugging memory managers. That's why I decided to write an add-on that covers the missed topics. The add-on didn't get many comments, but we now receive far fewer questions about mem-leaks, so I think it hit the spot.

7. This article. In the previous articles I only mentioned the scariest problems, dealing with simple cases first and leaving the hard cases for later. Now it's time to deal with them and so finish the series. In short: if you have strange issues with crashes, empty reports (apart from the most obvious case of debug info with too little detail) or hangs, this article is for you (note: sometimes a hang can be caused by memory corruption too: example).

Now that I've reviewed the series, we can get to this article itself:

When EurekaLog can not help you (or: what we will talk about)

Note: when I say "EurekaLog" - I actually mean any similar tool, like JclHookExcept, madExcept, WER, etc.

EurekaLog is a tool for debugging your application. It is not a "silver bullet" that magically solves any of your problems. Like every tool, EurekaLog has a purpose, a scope and limitations. That's why there are cases when this great tool cannot help you.

The problem is that each tool requires some basics to function. If you break this basis, you break the tool, and there is no way a tool can escape that. Usually this is a hard thing to do. But it is possible, and it may even be not so hard, if you misuse pointers and memory. I've already given examples of simple code that can crash your application without anything being able to intervene (see, for example, the end of the "Looking for the Access Violation’s reason by analyzing the code" block in this article).

So all those scary problems that make your debugging tools useless are memory issues. Namely: memory corruption. Any tool needs to store its data somewhere. If a bug in your application damages or even erases this data, the game is over. Even more: the code can trash not only the tool's data, but the application's critical data too, like return addresses or saved pointers to exception handlers. You may have already noticed that most hard cases involve stack corruption. That's because the critical data (the part that you can affect from user mode at all) is stored on the application's stack.

Some memory corruption problems (apparently, not the hardest ones) can even be detected by EurekaLog. In that case EurekaLog tries to create a simplified version of the bug report and to display the following message:


As you can see, EurekaLog thinks that this is its own fault, though in the most common case it's somebody else's fault. EurekaLog says so because it has encountered a problem while handling an exception. But this problem may be caused either by a bug in EurekaLog or by a bug in other code (maybe yours, maybe not).

EurekaLog suggests that you send this report to EurekaLog's developers. If you agree, it will open your default e-mail client with a report like this (example):
Version   : 6.0.23
Date      : Sat, 27 Mar 2010 13:09:00 +0300
OS        : Microsoft Windows 7 (64 bit)
RAD       : BDS 7.0
Dump      : $89 $10 $8B $45 $94 $8B $40 $08 $48 $85 $C0 $7C $7B $40 $89 $85 $30 $FF $FF $FF $C7 $85 $38 $FF $FF $FF $00 $00 $00 $00 $8B $95
Section   : 16
LastExcept: Exception
Address   : $004A7FE6 - [00400000] Project38.exe - ExceptionLog.pas -  - InternalExceptNotify - 14540[484]
Exception : EAccessViolation
Message   : Access violation at address 004A7FE6 in module 'Project38.exe'. Write of address 00000000
Call Stack: 00 $00419534 - [00400000] Project38.exe - SysUtils.pas - Exception - Create - 17419[0]
            01 $00515E63 - [00400000] Project38.exe - Unit39.pas - TForm39 - Button1Click - 29[1]
            02 $004A6E6C - [00400000] Project38.exe - ExceptionLog.pas -  - InternalExceptNotify - 14056[0]
            03 $004A879A - [00400000] Project38.exe - ExceptionLog.pas - TExceptionThread - Execute - 14689[2]
            04 $0043E7DA - [00400000] Project38.exe - Classes.pas -  - ThreadProc - 11018[8]
            05 $0040667C - [00400000] Project38.exe - System.pas -  - ThreadWrapper - 13579[33]
            06 $76EF3675 - [76EE0000] kernel32.dll
            07 $77C89D70 - [77C50000] ntdll.dll
            08 $77C89D4B - [77C50000] ntdll.dll
            09 $77C89D40 - [77C50000] ntdll.dll
This is a short version of the bug report, used for fatal errors. Sure, you should send us such reports. However, if you see that the problem in question is an access violation or an invalid pointer exception, this means there is a memory corruption problem in your application. So you had better check your application very carefully, as in most cases such reports indicate a problem in your code or in 3rd party code and not in EurekaLog. Either way, if you think that this is definitely a problem with EurekaLog, be prepared for a long conversation with the developers. That's because a memory corruption problem can't be fixed from one report. A report can only indicate it. Fixing it requires additional work.

It's worth noting that EurekaLog integrates itself into Delphi's IDE and hooks all IDE exceptions too. I.e. EurekaLog catches not only errors in your applications, but errors in Delphi too. This is a useful feature for old Delphi versions, which have nothing similar. New Delphi versions come with their own tool for automatic reporting, so you may want to disable this EurekaLog behaviour. In any case, you can turn this behaviour on or off in the "EurekaLog"/"EurekaLog IDE Options" menu:

EurekaLog IDE Options
The "IDE Integration" check-box enables exception hooking inside the IDE. Here is an example of an IDE problem.

Why am I telling you this here? Well, apparently, not everybody realizes that EurekaLog hooks the IDE too. Sometimes people confuse errors in their applications with errors in the IDE. If you're in doubt, turn off this option.

Okay, let's get back to our apps. A report about an internal problem is just one way in which memory corruption can manifest itself. Other (harder) cases include an application crash or hang. You will not get any message or report in those cases.

How to diagnose and fix memory problem

If you have a memory corruption issue and you got a report for it, the report is merely an indication that you have a problem. You won't be able to fix the problem from this report alone. Why? Because any such report only notes that the problem occurred somewhere, some time ago. It's somewhat similar to memory leaks, which we've already discussed. The problem is that nobody can inspect each CPU instruction and ask: "is this command going to corrupt my memory?" That's why all checks are performed from time to time, at certain checkpoints. Besides, only special data can be validated automatically. For example, in the mem-leaks case the checkpoints are calls to the memory manager's routines, and the verified data are its internal structures and freed memory. But even in that simplest case the memory manager does not scan the entire memory pool on each request; it limits the check to the one memory block in question. This is the usual trade-off between speed and functionality.

Okay, so, having a report, you know that there is a problem. But you don't know where it is. You have a chance to locate it in the case of memory leaks, but not in the case of memory corruption. That's because a leak report contains references to the relevant code, while the code references in a corruption report point to unrelated code. The real culprit can sit a million instructions away, in space and time, from the code that crashed because of it, and there are no references to it at all. That's why the very first thing you should do (whether you have a report or just a crash/hang) is to try to reproduce the problem. Sometimes you can do it easily; sometimes it is possible, but hard; and often it is just not possible at all.

If you've managed to reproduce the problem, then it is a very simple case. Just debug your application as much as you want. I suspect that the most useful tools here will be memory breakpoints. The general strategy is simple: you need to find a moment when the memory is already allocated, but not corrupted yet. You place a breakpoint on the memory (yes, Delphi's debugger can do it; I won't discuss it here, so please refer to other resources or Delphi's help) and you just run your application. As soon as this breakpoint fires, you have found the culprit for the memory corruption. Make yourself at home and take your time: analyze the call stack, variables, etc. The situation is under your control.

So, in short: the main task here is to locate the problem (assuming you can reproduce it at all). Below I'll discuss different methods that you can use to locate it. Some of them you can always use, both in debug and release versions. Some of them are only applicable to the debug version.

If you aren't able to solve the problem (either you can't reproduce it, or you can reproduce it but can't locate it), then the only option is to use passive methods. I.e. things that aren't aimed at your particular issue, but rather help you to improve your code. After the improvements you may be able to diagnose the problem, or the problem may go away without you doing anything specific. For example, if your code is a chaotic mix of totally unrelated routine calls without the slightest sign of logic (okay, I'm just joking, I don't think that you're that bad :-D ), you can spend half a year looking for the cause (and still not find it). Or you can spend a few months refactoring and improving your code, and then hunt down and fix not only this problem, but also other issues that you've spotted because your code has become much clearer.

I also think that we may publish examples of debugging particular problems in our blog. For example, I want to publish a demo of how to debug a hang in your application. But that is for next time. Just be sure to check back and read us periodically ;)

Locating the problem (active methods)

First of all, you should analyze what your problem can be. There are two main cases here: dynamic memory (the heap) or the stack. Depending on the answer, you use methods for the heap or for the stack. For example, a debugging memory manager can help you with memory corruption in the heap, but it can do nothing about stack corruption. If you aren't sure, just use all the methods ;)

1. Using a debugging memory manager (heap). A debugging memory manager is any memory manager that provides additional features for debugging memory problems. We met some examples (like EurekaLog or FastMM) in the articles about memory leaks. That's because searching for memory leaks and searching for memory corruption bugs use very similar approaches. EurekaLog's case: additional checks are enabled with the "Catch leaks exceptions" option. FastMM's case: see the "CheckHeapForCorruption" and "CatchUseOfFreedInterfaces" options (together with "FullDebugMode"). Other options may affect the results too, but these are the primary options for memory corruption checks. We've already discussed the installation and use of debugging memory managers in previous articles, so I won't go into details here: it's the same for memory corruption problems as it is for mem-leaks. Just don't forget to enable the additional options, and run your application until the memory manager catches a problem.

Apart from the memory managers already discussed, I want to mention the SafeMM memory manager. This is a debugging memory manager too, but it's a bit different from the ones already discussed. You can download it here. And here is an example of its use (a video): the first part is about profiling and the second part (starting from the 22nd minute) is about memory problems.

2. Enabling debugging options (stack and heap). We mentioned this before too. The main option here is "Range check errors", which allows you to catch out-of-range errors in array-based data structures (note that this option has a bug in old Delphi versions). Besides this option, you may want to disable inlining and optimization (to simplify debugging and to avoid bugs like this). Unfortunately, Delphi's compiler does not have a more generic option for checking the stack's state, like other compilers have. I've created a request for it, so feel free to vote for it. If such a feature existed, it would greatly simplify hunting for stack corruption problems. Without compiler support, we have only manual workarounds.
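As a small illustration of what "Range check errors" buys you, here is a minimal console sketch. The helper name TryStore and the array are made up for the demo; the only assumption about Delphi itself is that {$R+} turns range checking on for the file and that an out-of-range index then raises ERangeError.

```pascal
program RangeCheckSketch;
{$APPTYPE CONSOLE}
{$R+} // enable range checking for this file, like the project option does

uses
  SysUtils;

var
  Buffer: array[0..3] of Integer;

// Hypothetical helper: returns True if the store went through,
// False if the range check rejected the index.
function TryStore(Index, Value: Integer): Boolean;
begin
  try
    Buffer[Index] := Value; // with {$R-} a bad index silently writes past the array
    Result := True;
  except
    on ERangeError do
      Result := False;
  end;
end;

begin
  Writeln('Index 3 stored: ', TryStore(3, 1));
  Writeln('Index 4 stored: ', TryStore(4, 42));
end.
```

Without the check, the second call would overwrite whatever happens to follow the array, which is exactly the kind of silent corruption this article is about.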

3. Forced checkpoints (stack and heap). As I've already said, any report about a memory problem tells you only about the moment of detection, not about the problem itself. You must locate the problem. But how can you do it? Obviously, you need to find a point where the problem has not occurred yet (memory is not corrupted) and a point where memory is already corrupted. The problem then sits somewhere between those two points. Each of these moments is a checkpoint. By moving (or creating) checkpoints, you can narrow down the area of code that contains the problem until you locate it. Sometimes those checkpoints are created automatically. For example, a debugging memory manager validates a memory block each time one of its routines is called for this block. For the stack, it can be leaving a routine: if you successfully leave the routine, the return address wasn't damaged, so there was no stack corruption (at least not of that type). There may be other examples, but that's not important now. I just want to say that if those checkpoints aren't sufficient to locate the problem (or they aren't created at all), then you need to create them manually.

A very good way to do this for the stack would be the option proposed in the previous item. We don't have it, so there are no good checkpoints for the stack and you need to do everything by hand, validating local variables manually from time to time (see the next item).

For the heap, we do have an option to force a check manually. You can call the CheckMemoryOverrun routine in EurekaLog's case and the ScanMemoryPoolForCorruptions routine in FastMM's case. By calling one of these routines you force the memory manager to scan the entire memory pool for corruption (obviously, only the consistency of internal info/headers can be validated, not the data inside memory blocks). By putting calls to these functions around your code, you set explicit checkpoints. Start by calling them periodically. Once you have found a problem between two calls, move them closer to each other until you locate the problem. Additionally, FastMM allows you to enable full scanning on each memory operation. To do that, set the global variable FullDebugModeScanMemoryPoolBeforeEveryOperation to True. However, you need to understand that doing so will bring your application's performance to its knees. Never leave it on for long. First, narrow the problem down with calls to ScanMemoryPoolForCorruptions as far as you can. Only if you can't get any further, enable FullDebugModeScanMemoryPoolBeforeEveryOperation, and only for the code between two calls to ScanMemoryPoolForCorruptions. In many cases you will probably prefer to use SafeMM to track the problem.
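The FastMM variant of this bisection might look roughly like the sketch below. It assumes that FastMM4 is the first unit in the project and was built with FullDebugMode defined (with its debug support DLL available); LoadDocument and ApplyFilters are hypothetical stand-ins for your own suspect code.

```pascal
program HeapCheckpointSketch;
{$APPTYPE CONSOLE}

uses
  FastMM4,   // assumption: first unit in the project, FullDebugMode defined
  SysUtils;

// Hypothetical stand-ins for the code under suspicion.
procedure LoadDocument;
begin
end;

procedure ApplyFilters;
begin
end;

begin
  ScanMemoryPoolForCorruptions; // checkpoint 1: heap is expected to be intact here
  LoadDocument;
  ScanMemoryPoolForCorruptions; // checkpoint 2: a failure here points at LoadDocument

  // Last resort, and only for the narrowed-down region: scan on every operation.
  FullDebugModeScanMemoryPoolBeforeEveryOperation := True;
  try
    ApplyFilters;
  finally
    FullDebugModeScanMemoryPoolBeforeEveryOperation := False;
  end;
  ScanMemoryPoolForCorruptions; // checkpoint 3
end.
```

The try/finally is just a precaution, so that the expensive per-operation scanning cannot stay enabled if the suspect code raises an exception.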

4. Debug checks (stack and heap). It's not always possible to use or set checkpoints as discussed in the previous item. For example, no one can check the consistency of your information: automated tools can check only their own info, not yours. That's why you may need to validate your info manually. It's simple: just place as many checks as you want around your code. Put Assert calls everywhere. Check everything that you're able to check. Once you have found a problem between two Assert calls, move them closer to each other, just like in the checkpoint case. As soon as you have narrowed the gap enough to get the address of the corrupted memory, you're nearly done. Just run your application up to the moment before the problem and put a memory breakpoint on this address (see also below).
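One way to give Assert something to check is to surround your own data with known guard values. The following is only a sketch: the record layout, the guard constant and the helper names are all invented for the demo, and the "bug" is a deliberate overrun that stays inside the record so the demo itself is harmless.

```pascal
program AssertGuardSketch;
{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  SysUtils;

const
  GuardValue = $5AFEC0DE; // arbitrary marker, hypothetical

type
  // packed: Data starts right after HeadGuard and TailGuard right after Data
  TGuardedBuffer = packed record
    HeadGuard: Integer;
    Data: array[0..7] of Byte;
    TailGuard: Integer;
  end;

procedure InitBuffer(var B: TGuardedBuffer);
begin
  B.HeadGuard := GuardValue;
  FillChar(B.Data, SizeOf(B.Data), 0);
  B.TailGuard := GuardValue;
end;

function IsIntact(const B: TGuardedBuffer): Boolean;
begin
  Result := (B.HeadGuard = GuardValue) and (B.TailGuard = GuardValue);
end;

// A manual checkpoint: call it before and after suspicious code.
procedure CheckBuffer(const B: TGuardedBuffer; const Where: string);
begin
  Assert(IsIntact(B), 'Guard value damaged, detected ' + Where);
end;

var
  Buf: TGuardedBuffer;
begin
  InitBuffer(Buf);
  CheckBuffer(Buf, 'after init');
  // Deliberate bug for the demo: 4 bytes too many, which lands on TailGuard.
  FillChar(Buf.Data, SizeOf(Buf.Data) + SizeOf(Integer), $FF);
  try
    CheckBuffer(Buf, 'after fill');
  except
    on E: EAssertionFailed do
      Writeln(E.Message);
  end;
end.
```

Once such a check fails between two calls, the address of the damaged guard field is a natural target for a memory breakpoint.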

5. Avoiding local variables (stack). Since we don't have many tools for the stack, you can move the problem elsewhere by avoiding local variables: try to use global variables (just for the test, of course) or (better yet) put all local variables into a record that you allocate dynamically on the heap. This moves the problem to another area, where we have some handy tools (your favourite debugging memory manager).
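Here is a minimal sketch of the "record on the heap" variant. The routine and its fields are hypothetical; the point is only the shape of the transformation: former locals become fields of a record that is allocated with New and released with Dispose, so it passes through the memory manager.

```pascal
program HeapLocalsSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // All former local variables of the suspect routine, gathered in one record.
  PWorkLocals = ^TWorkLocals;
  TWorkLocals = record
    Count: Integer;
    Total: Integer;
    Buffer: array[0..15] of Byte; // the suspect buffer goes last in the block
  end;

function SumFirstBytes(ACount: Integer): Integer;
var
  L: PWorkLocals; // pointer to the heap block that replaces the stack frame
  I: Integer;     // a for-loop counter has to remain a plain local
begin
  New(L); // allocated through the memory manager, so its debug checks apply
  try
    L^.Count := ACount;
    L^.Total := 0;
    for I := 0 to L^.Count - 1 do
    begin
      L^.Buffer[I] := I;
      Inc(L^.Total, L^.Buffer[I]);
    end;
    Result := L^.Total;
  finally
    Dispose(L); // a debugging memory manager can validate the block here
  end;
end;

begin
  Writeln(SumFirstBytes(16));
end.
```

Keep in mind that an overrun which stays inside the record still goes unnoticed by the memory manager; only writes that leave the block can be caught. That is why the buffer is placed last in this sketch.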

6. Problems with threads (heap). Multi-threading usually does not affect stacks, but it can be the cause of many hard-to-detect problems with global or heap data (well, not multi-threading by itself, but rather synchronization errors). Debugging multi-threaded applications is a large and complex topic, so I won't go into details here. See other resources.

7. Memory breakpoints (stack and heap). If you have found the specific address of the memory that was corrupted, things become much easier. All you need to do now is to use memory breakpoints. Memory breakpoints are a handy feature of Delphi's debugger that lets you put a breakpoint on memory, just like you do for code. A memory breakpoint triggers when some code accesses the memory. I won't go into details on how to use them, see Delphi's help. Okay, back to our problem. So, you have a memory address. Run your application up to the moment when this memory becomes available (allocated). It should be in a valid state at this moment. Place a memory breakpoint on it and continue running your application. Each time the breakpoint fires, check the code that caused it. You'll find the culprit eventually.

So, if you weren't able to solve your problem with the methods above, then the only thing left is:

Prevention of problems with memory (passive methods)

1. Avoid low-level code. It's simple: scan all your code, looking for calls to low-level routines (which aren't type-safe and therefore have a high chance of corrupting memory). Double-check every use. Replace low-level code with a high-level counterpart if you can. It's better to do it slowly and correctly than quickly but incorrectly.

2. Use wrappers. I've mentioned this before, when we talked about WinAPI calls and "other resources" in the add-on to the mem-leaks article. Move all such code into a separate unit/class, which you can validate as a single entity. By placing suspicious or potentially troublesome code in one place, you reduce the search area and simplify the rest of the code.

3. Code review by another developer. It's a well-known fact that your eyes see only the things that your brain wants to see. That's why it's a good idea to give your code to a colleague: sometimes he/she can spot an obvious problem that you couldn't solve for hours or days.

4. Actually, this section is endless. There are many books that tell you how to write good quality code, and they do it in more detail than I can here. That's why I won't list anything further and will just give you some advice: read "smart" books. Consider the text above only as a short example. You can improve yourself and your code by reading books and blogs. Many problems will become easier to spot, or may disappear altogether.
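To make items 1 and 2 a bit more concrete, here is a small sketch of a wrapper around the low-level Move routine. SafeCopy is a hypothetical name, not a library routine; it simply refuses to copy more bytes than either array can hold, so that raw Move is called from one easily reviewed place.

```pascal
program SafeCopySketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical wrapper: the single place where a raw Move is allowed.
procedure SafeCopy(const Source: array of Byte; var Dest: array of Byte;
  Count: Integer);
begin
  if (Count < 0) or (Count > Length(Source)) or (Count > Length(Dest)) then
    raise ERangeError.CreateFmt(
      'SafeCopy: Count=%d does not fit (source %d, destination %d bytes)',
      [Count, Length(Source), Length(Dest)]);
  if Count > 0 then
    Move(Source[0], Dest[0], Count);
end;

var
  Src: array[0..7] of Byte;
  Dst: array[0..3] of Byte;
begin
  FillChar(Src, SizeOf(Src), $AA);
  SafeCopy(Src, Dst, 4); // fits into both arrays
  try
    SafeCopy(Src, Dst, 8); // a raw Move would overrun Dst here
  except
    on E: ERangeError do
      Writeln(E.Message);
  end;
end.
```

The wrapper is slower than a bare Move, but a failed copy now shows up as an exception at the call site instead of as damaged memory somewhere else.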

To put it short: read, read and read again (including us :) ).

Well, that's it. The end of the series. I hope you'll find it useful.