Sunday, February 27, 2011

Windows Hang and Crash Dump Analysis Webcast Review

Recently, I spent an exciting Saturday watching and documenting a Sysinternals webcast by Windows expert Mark Russinovich, entitled Windows Hang and Crash Dump Analysis. Mark takes you step by step into the internals of Windows to help you better understand how system crashes happen, what caused a crash, and how to fix it.


These are the notes I took during the webcast. On a separate afternoon, while writing this blog post, I tested WinDbg. This documentation also includes notes on how I created a crash dump with NotMyFault, a program Mark wrote, and how I tested WinDbg. I hope the hours I spent watching and documenting the webcast will one day be useful to me and to my blog readers. The entire process has been a gratifying learning experience.

And now, the Webcast....




Mark says,
"Basic crash dump analysis is pretty straight forward."
More advanced crash dump analysis requires lots of experience and knowledge of advanced internals, compilers, and CPUs.

More often than not,
"The victim is not the culprit."
For example, a driver corrupts the operating system, and Windows crashes later.

Why does Windows crash? 
There is something wrong in kernel-mode.
"Windows only crashes when something in kernel-mode goes wrong. User-mode code, because of the protection mechanisms built into the operating system, cannot cause a problem that results in the operating system turning over. Kernel-mode is a trusted environment in the Windows operating system. Any drivers in kernel-mode can access anything they want to. They can access data buffers sitting in the file system cache about to go out to disk. They can access user-mode code, if they want to, and data. If a component suspects something is wrong, it has one primary responsibility, the preservation of your data."
Corruption could have happened already. If the system ignored it, kernel-mode code could overwrite data about to go to disk. The system wants to prevent corruption, and if corruption has already started, it wants to stop it from spreading. So it halts and presents the Blue Screen of Death instead.

Microsoft's analysis of crash root causes indicates that most crashes are caused by third-party driver code.

When a condition is detected that requires a crash, the KeBugCheckEx function is called. It turns off interrupts, tells other CPUs to stop, paints the blue screen, notifies registered drivers of the crash, and, if a dump is configured and it is safe to do so, writes the dump to disk.

Bugcheck Codes
Bugcheck codes are shared by many system components and drivers. Common ones are:
  •  IRQL_NOT_LESS_OR_EQUAL (0x0A) and DRIVER_IRQL_NOT_LESS_OR_EQUAL (0xD1), usually an invalid memory access
  •  UNEXPECTED_KERNEL_MODE_TRAP (0x7F) and KMODE_EXCEPTION_NOT_HANDLED (0x1E), generated by executing garbage instructions, usually caused when a stack is trashed
Most bugcheck codes are documented in the Debugging Tools help file and the Microsoft Knowledge Base.
"Often bugcheck code and parameters are not enough to solve the crash."
Many times you could be having crashes you don't know about because Windows automatically reboots after a crash. Mark recommends auditing the Event logs for system crashes.
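As a quick reference, the stop codes mentioned in this section can be kept in a small lookup table. This is only a sketch in Python: the helper name describe_bugcheck is my own invention, and the table covers just the codes discussed here, not the full list in the Debugging Tools help file.

```python
# Stop codes discussed in this post (the full list is in the Debugging Tools help file).
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def describe_bugcheck(code: int) -> str:
    """Return 'NAME (0x000000XX)' for a known stop code, or mark it as unknown."""
    name = BUGCHECK_NAMES.get(code, "UNKNOWN_BUGCHECK")
    return f"{name} (0x{code:08X})"


if __name__ == "__main__":
    for code in (0x0A, 0x1E, 0x7F, 0xD1):
        print(describe_bugcheck(code))
```

Remember that the code alone rarely solves the crash; it only tells you which page of the help file to read first.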

To analyze a crash, you should generate a crash dump to analyze memory. To configure for a crash dump file:

  • Right click My Computer
  • Select Properties
  • Go to Advanced
  • Click on Startup and Recovery
  • Look at System failure area
  • Check off the boxes and indicate what type of crash dump you want
    Crash Dump Types:
    1. (none)
    2. Complete memory dump. The full state of the system at the time of the crash. The disadvantage is the dump is huge. You won't be able to generate it on some systems because of space reasons.
    3. Small memory dump (triage or mini-dump).  The advantage is the dump is so small you can send it as an email attachment. The disadvantage is there is not much information in the small memory dump. Unless the cause of the crash is in that small dump, you will not be able to determine the cause. Mini-dumps are kept as unique file names, by default, and are kept forever until you delete them, so you have a complete history of your system dumps.
    4. Kernel memory dump. The Kernel memory dump is a copy of the physical memory that is owned by the OS and drivers. User-mode code and data are excluded; as Mark explains, user-mode code cannot cause kernel-mode crashes.
    If you are looking to analyze a crash, all of the information you need is in kernel-mode memory, including all of the data structures that are needed, like the active processes and drivers loaded onto the machine. 
    Mark recommends configuring all of your systems for kernel memory dumps. You get the mini-dump for free when you configure the kernel memory dump. Windows generates the mini-dump in preparation for you sending it off to Microsoft for analysis.
      According to Mark, the
      "Kernel memory dump is a nice compromise".
      Writing a Crash Dump
      Crash dumps are written to the paging file on the boot volume.

      Paradoxically, the boot volume is where the Windows directory is. The system volume is where boot.ini and the other boot files are located.

      Writing to another file would have required file system drivers, and drivers cannot be relied on to be stable at the time of a crash. Another reason for writing to the boot volume is that the boot volume is where the Windows directory is located, and the boot volume cannot be defragged and cannot be a striped volume. It is safe. When the system boots, it checks:
      HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl

      Relevant components are checksummed: the boot disk miniport driver, the crash I/O functions, and the page file map. On a crash, if the checksum does not match, a dump is not written.
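The dump type chosen in the Startup and Recovery dialog is stored in this CrashControl key as a DWORD named CrashDumpEnabled (0 = none, 1 = complete, 2 = kernel, 3 = small). Here is a hedged sketch that builds, but does not run, a reg.exe command for setting it. The function name crash_dump_command is hypothetical, and you would still need an elevated prompt and a reboot for the setting to take effect.

```python
# CrashDumpEnabled values under the CrashControl key.
DUMP_TYPES = {
    "none": 0,
    "complete": 1,
    "kernel": 2,
    "small": 3,
}

CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def crash_dump_command(dump_type: str) -> str:
    """Build a reg.exe command line that selects the given dump type."""
    value = DUMP_TYPES[dump_type]  # raises KeyError for an unknown type
    return (
        f'reg add "{CRASH_CONTROL_KEY}" '
        f"/v CrashDumpEnabled /t REG_DWORD /d {value} /f"
    )


if __name__ == "__main__":
    # Kernel memory dump, the setting Mark recommends.
    print(crash_dump_command("kernel"))
```

I only print the command here so the sketch is safe to run anywhere; the dialog box remains the simpler route on a single machine.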

      Why Would You Not Get a Dump?
      • Crash occurred before the paging file was open. For example, during driver initialization
      • Crash corrupted components involved in the dump process
      • Spontaneous reboot
      • Paging file on the boot volume is too small
      • Not enough free space
      • Hung system
      What is a spontaneous reboot? A spontaneous reboot happens when you have a triple fault on an x86 processor and the system gives up on the OS and won't even create a dump.

      note: Make sure your paging file is as big as physical memory when you want to generate a dump.
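The sizing advice can be turned into a small pre-flight check. This is a rough sketch built on the rule of thumb in the note (paging file at least as big as physical memory). The free-space test is my own conservative assumption, and the function name paging_file_problems is hypothetical.

```python
def paging_file_problems(ram_mb: int, pagefile_mb: int, free_disk_mb: int) -> list:
    """Return the reasons (if any) a dump might not be written or saved.

    Rule of thumb from the webcast notes: the paging file on the boot volume
    should be at least as big as physical memory. The free-space check is an
    assumption: it sizes for the worst case, a complete memory dump.
    """
    problems = []
    if pagefile_mb < ram_mb:
        problems.append("paging file on the boot volume is smaller than physical memory")
    if free_disk_mb < ram_mb:
        problems.append("not enough free disk space to save the dump after reboot")
    return problems


if __name__ == "__main__":
    # Example: 4 GB of RAM, a 2 GB paging file, and 1 GB of free disk space.
    for problem in paging_file_problems(4096, 2048, 1024):
        print("warning:", problem)
```

A kernel memory dump is usually much smaller than RAM, so treat any warning here as a prompt to check, not as proof that the dump will fail.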

       At the Reboot:
      •  Session Manager (User mode) initializes paging, determines there is a crash dump in the paging file and marks that area of the paging file as off-limits for use
      •  NtCreatePagingFile (Kernel-mode)
      •  WinLogon (User mode) looks to see if there is a crash in the paging file. If there is, it spins off the SaveDump process.
      •  SaveDump (User mode) reads the data from the crash file and writes it out to the target location you specified
      •  Paging File (Kernel-mode)
      •  Memory.dmp (User mode)
      Online Crash Analysis (OCA)
      By default, after a reboot, Windows prompts you to send the dump to http://watson.microsoft.com/.
      It can be configured under Computer Properties/Advanced/Error Reporting.
      It can be customized with Group Policies.

      What gets sent to OCA?
      1. XML description of system version, drivers present, loaded PnP drivers
      2. Minidump file
      What does OCA do?
      OCA (Online Crash Analysis) looks up the crash in Microsoft's crash resolution database. Sometimes OCA will point you at Knowledge Base articles that may tell you to use Windows Update to get newer drivers or to install a hotfix or service pack. Many times OCA will say a driver caused a problem. This happens when OCA was unable to determine the exact cause but suspects a driver is the problem, which it normally is.

      Analyzing a Crash Dump Yourself

      Mark pointed to two kernel-level debuggers that can open crash dump files:
      1. WinDbg (Windows GUI program). WinDbg is a user-mode and kernel-mode debugger with a graphical interface. I downloaded WinDbg from CNET.com, located at Debugging Tools for Windows 6.11.1.402, onto my Vista laptop. If you do this, you can access the Debugging Tools Help File from here. It installs in your Programs directory on the C:\ drive.
      2. Kd (command-line program) Kernel Debugger
        Both of these programs provide the same kernel debugger analysis commands. These programs must first be configured to point to symbols. It is easiest to use the Microsoft Symbol server. I downloaded the symbols from Download Windows Symbol Packages for Vista (the one that says "Most customers want this package"). It installs in the C:\Windows\Symbols directory.
















        In WinDbg: click on File, Symbol File Path and enter:
        srv*c:\symbols*http://msdl.microsoft.com/download/symbols
         (c:\symbols is where your symbols are stored. I entered c:\windows\symbols.)
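The symbol path string has a simple shape: the keyword srv, a local cache directory, and the symbol server URL, separated by asterisks. Here is a minimal sketch that assembles it; the function name symbol_path is my own, and the default URL is the Microsoft symbol server quoted in this post.

```python
MS_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir: str, server: str = MS_SYMBOL_SERVER) -> str:
    """Build a symbol-server path of the form srv*<local cache>*<server URL>."""
    return f"srv*{cache_dir}*{server}"


if __name__ == "__main__":
    print(symbol_path(r"c:\symbols"))
    print(symbol_path(r"c:\windows\symbols"))
```

The debuggers also read the same string from the _NT_SYMBOL_PATH environment variable, which saves retyping it in every session.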




        If this is a minidump, you must also configure the image path to point to the location of images (File, Image File Path). Use the same string as for the symbol server (Windows XP and later).
         
        To open a crash dump:

         Right click WinDbg and Run As Administrator: go to File, Open Crash Dump. If you do not Run As Administrator for WinDbg, you will have trouble opening your crash dump file, Memory.dmp, with WinDbg, even though you are the administrator.



        IRQLs
        IRQL stands for Interrupt Request Level. Each CPU maintains its IRQL independently. Software and hardware interrupts map to IRQLs. Mark went on to talk about advanced system internals such as Passive- and Dispatch-level interrupts. He noted that user-mode code always executes at PASSIVE_LEVEL and that kernel-mode code executes at PASSIVE_LEVEL most of the time. DISPATCH_LEVEL is the highest software interrupt level; hardware device interrupts run at higher IRQLs.
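The rule behind the IRQL_NOT_LESS_OR_EQUAL stop code is easier to remember as a tiny model. This is a toy illustration, not Windows code: the enum lists only the three lowest x86 IRQLs, and access_is_safe is a hypothetical helper that encodes the "no pageable memory at DISPATCH_LEVEL or above" rule.

```python
from enum import IntEnum


class Irql(IntEnum):
    """The lowest x86 IRQLs; hardware device IRQLs sit above these."""
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2


def access_is_safe(current_irql: int, memory_is_pageable: bool) -> bool:
    """Pageable memory may only be touched below DISPATCH_LEVEL.

    At DISPATCH_LEVEL or above the system cannot service a page fault,
    so touching memory that might be paged out is a bug.
    """
    if not memory_is_pageable:
        return True
    return current_irql < Irql.DISPATCH_LEVEL


if __name__ == "__main__":
    print(access_is_safe(Irql.PASSIVE_LEVEL, memory_is_pageable=True))
    print(access_is_safe(Irql.DISPATCH_LEVEL, memory_is_pageable=True))
```

A driver that breaks this rule only crashes when the page really is paged out, which is why such bugs can hide for a long time.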
          

        Stacks
        A stack is an area of temporary storage. Each thread has two of these, a user-mode and a kernel-mode stack. The user-mode stack is usually 1 MB on X86. The kernel-mode stack is typically 12 KB (20 KB for GUI threads) on X86 systems.

        Stacks allow for nested function execution. When one function calls another function, a stack frame is created. Parameters passed to a function are placed on the stack. The stack also stores return addresses and serves as storage for frame pointers and local variables.


        Calling Conventions
        Mark says stacks are easy to interpret if functions use standard calling conventions. Other calling conventions make the stack hard to figure out:

        • No frame pointer
        • Register arguments (fast calls)
        A debugger requires symbol information to parse non-standard stack frames. Non-standard stack frames make accurate analysis of crashes involving third-party drivers difficult. Third-party driver developers don't normally make their symbol files available, so the debugger has to make guesses in this situation.

        Analyze an "easy" crash
        Mark used a program he wrote, called Notmyfault.exe. The program loads a driver called myfault.sys that crashes the system on request, which produces a crash dump. See How to generate a kernel or a complete memory dump file in Windows Server 2008 for more information about how to create a crash dump.

        I downloaded Notmyfault.zip to my laptop, unzipped it, and clicked on the Notmyfault application, located inside the Release folder of Notmyfault.


        I left NotMyFault defaulted to High IRQL fault (kernelmode) and hit the Do Bug button. You can see in the photo, below, my desktop background, generated by another one of my favorite Sysinternals utilities, BGInfo.

        Here is my BSOD:



        Right click WinDbg and Run As Administrator: go to File, Open Crash Dump.
        It generates some verbiage that tells you which driver it thinks is the problem. To analyze the crash dump, type !analyze -v in the command box at the bottom, or click on !analyze -v in the dump output. Stop codes are documented in the Debugging Tools help file.




        Look at STACK_TEXT: in the dump to see which functions were called (the culprit is usually a third-party driver). The stack frames at the time of the crash are a primary resource for determining what caused the problem. Mark explains that when you look at STACK_TEXT:, you see the nt! functions. These are Windows kernel functions and can be trusted. Then, what do you know, you see Myfault, the third-party driver, called from Notmyfault.
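Reading the stack this way boils down to "skip the Windows modules and stop at the first outsider". Here is a small sketch of that idea. The frames in the example are hypothetical (they are not copied from my dump), the helper name first_suspect_module is my own, and treating nt and hal as the only trusted modules is a simplifying assumption.

```python
def first_suspect_module(frames, trusted=("nt", "hal")):
    """Return the first module on the stack that is not a trusted Windows module.

    Each frame is a symbol such as 'nt!KeBugCheckEx' or 'myfault+0x5ab';
    the module name is the part before '!' or '+'.
    """
    for frame in frames:
        module = frame.replace("+", "!").split("!")[0].lower()
        if module not in trusted:
            return module
    return None


if __name__ == "__main__":
    # Hypothetical frames, most recent call first; offsets are made up.
    sample = ["nt!KeBugCheckEx", "nt!KiTrap0E", "myfault+0x5ab", "nt!IofCallDriver"]
    print(first_suspect_module(sample))
```

When the function returns None, every frame belongs to Windows itself, and that is the hint that you are looking at the victim rather than the culprit.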

        Crash Transformation
        Many crashes cannot be analyzed. The analyzer may point to Ntoskrnl.exe or Win32K.sys or other Windows components. This is a case of
        "The victim crashed the system, not the criminal".
        You may get many different crash dumps all pointing at different causes. Your goal is not to analyze impossible crashes. It is to try to make an unanalyzable crash into one that can be analyzed.
        If the system points to a core system component as being at fault, you can bet the system did not get it right. It is very unlikely a core system component caused the crash. It is likely that it is an unanalyzable crash. 
        The tool to use for an unanalyzable crash is the Driver Verifier, verifier.exe (not in Start menu).

        Using the Driver Verifier
        The tool for crash transformation is Driver Verifier, an awesome crash analysis tool, according to Mark. Driver Verifier was introduced in Windows 2000. Its task is to improve the quality of third-party drivers, and therefore the quality of Windows. It helps developers test their drivers and helps systems administrators identify faulty drivers.

        Run Verifier.exe
           Choose Create Custom Settings
           Choose Select Individual Settings from a List
           Enable all options except Low Resource Simulation
        A problem is that drivers act in a privileged mode where it is assumed they know what they are doing, but a lot of drivers have bugs in them.
        If you think a driver might have a bug in it, you can tell Verifier to watch that driver and double-check what it is doing in the system.

        To run Driver Verifier, go to the Run menu and type verifier to launch the GUI wizard. Select the Create custom settings (for code developers) option. Press Next. Press Select Individual Settings from a full list. Press Next. You will see the full list of options:
        • Special Pool. When you run Verifier, the verified driver's memory is allocated from a special pool.
        • Pool Tracking. Not especially useful for crash analysis.
        • Force IRQL checking. Very powerful. A driver vendor can put its driver through all sorts of stress testing while it violates the dispatch-level rule and never gets caught, because the paged memory it touches happens to be resident. Eventually, after the driver is released commercially, the bug shows up. Force IRQL checking marks paged memory as not valid whenever the driver runs at dispatch level or higher, so any such access causes a page fault and the driver is caught.
        • I/O verification
        • Enhanced I/O verification
        • Deadlock detection
        • DMA checking
        • Low resource simulation. Not used in crash analysis. It is used to stress drivers.
        • Disk integrity checking. Not used in 32-bit XP. Introduced in Server 2003 and 64-bit.
        Select all of the above except for the Low resource simulation.


        On the next page select the drivers to verify.
        •    Automatically select unsigned drivers. You might use this.
        •    Automatically select drivers built for older versions of Windows. Do not select this.
        •    Automatically select all drivers installed on this computer. Do not select this.
        •    Select driver names from a list. You might use this. It takes you to another page where you can select the drivers individually.





        Crash Transformation Recipe
        The Recipe: (for using Verifier)
        1. Try suspicious drivers, ones recently updated or known to be problematic
        2. Try enabling verification on all third-party drivers and/or all unsigned drivers
        3. Enable verification on groups of 10 to 20 drivers at a time
        4. Run Windows Memory Diagnostic to see if the problem is actually a memory issue
        When you run Verifier, it allocates memory from a special pool (The App Verifier also uses Special Pool).
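Step 3 of the recipe is just batching. Here is a minimal sketch that splits a driver list into rounds of 10 to 20. The function name driver_batches and the driver names in the example are placeholders, not real drivers from my system.

```python
def driver_batches(drivers, batch_size=15):
    """Split a list of driver names into groups for one verification round each."""
    if not 10 <= batch_size <= 20:
        raise ValueError("the recipe suggests 10 to 20 drivers per round")
    return [drivers[i:i + batch_size] for i in range(0, len(drivers), batch_size)]


if __name__ == "__main__":
    # Placeholder driver names, for illustration only.
    drivers = [f"driver{n:02d}.sys" for n in range(35)]
    for round_number, batch in enumerate(driver_batches(drivers), start=1):
        print(f"round {round_number}: {len(batch)} drivers")
```

Each round means selecting that batch in Verifier, rebooting, and using the system until it either crashes with a useful dump or proves that batch innocent.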

        What is Special Pool?

        Special Pool is a kernel buffer area where buffers are sandwiched with invalid pages. Every other page is invalid memory. Special Pool is a limited resource. When it runs out, verified drivers allocate from standard pool. This is why you only want to verify a small set of drivers at a time.

        note: I can attest to the above, as I ran Verifier against all of my drivers at once and received BSODs.

        Buffer Overruns
        Buffer Overruns result when a driver goes past the end or the beginning of a buffer. This is usually detected when the overwritten data is referenced. There can be a very long delay between corruption and detection.
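Special Pool shortens that delay to zero by putting an invalid page right after each buffer. The toy model below shows the idea. It is a teaching sketch, not how the Windows memory manager is implemented: ToySpecialPool and its methods are hypothetical names, and even-numbered pages are valid while odd-numbered pages act as the invalid guards.

```python
PAGE_SIZE = 4096


class ToySpecialPool:
    """Toy model: each buffer ends at a page boundary, followed by an invalid page."""

    def __init__(self):
        self.valid_pages = set()  # page numbers that may be touched
        self.memory = {}          # address -> byte value

    def allocate(self, size: int) -> int:
        """Return the start address of a buffer that ends exactly at a page end."""
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("toy model handles at most one page per buffer")
        page = 2 * len(self.valid_pages)  # even pages valid, odd pages are guards
        self.valid_pages.add(page)
        return (page + 1) * PAGE_SIZE - size

    def write(self, addr: int, value: int) -> None:
        """Write one byte; touching a guard page fails immediately."""
        if addr // PAGE_SIZE not in self.valid_pages:
            raise MemoryError(f"access to invalid page at address {addr:#x}")
        self.memory[addr] = value


if __name__ == "__main__":
    pool = ToySpecialPool()
    buf = pool.allocate(16)
    pool.write(buf + 15, 0xFF)      # last byte of the buffer: fine
    try:
        pool.write(buf + 16, 0xFF)  # one byte past the end: caught at once
    except MemoryError as error:
        print("overrun caught:", error)
```

Notice that every small buffer costs two whole pages in this scheme, which is why Special Pool runs out quickly when too many drivers are verified at once.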

        Code Overwrites
        Code Overwrites are caused when a bug results in a wild pointer. A wild pointer that points at invalid memory is easily detected. A wild pointer that points at data is similar to a buffer overrun. It might not cause a problem for a long time. The crash makes it look like it is caused by something other than what really caused it.

        A code overwrite can go undetected on a Windows 2000 system with more than 127 MB of memory, or on a Windows XP or Windows Server 2003 system with more than 255 MB of memory.

        System code write protection catches code overwrites, but on those systems it is not turned on, for performance reasons. To turn system code write protection on, turn on Driver Verifier and point it at any driver, or set it manually in the registry.

        Manual Analysis
        Sometimes !analyze is not enough and it does not tell you anything useful. You want to know what was happening at the time of the crash.

        Useful commands:
        • List loaded drivers: lm kv (make sure drivers are recognized and up to date)
        • Look at memory usage: !vm (make sure memory pools are not full; if full, use !poolused - requires pool tagging to be on)
        • Examine current thread: !thread (may or may not be related to crash)
        • List all processes: !process 0 0 (use especially on server systems - make sure you understand what was running on the system. You should understand the purpose of every process and driver on the server)
        • If a verifier detected deadlock: !deadlock
        • Additional commands: !help
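If you run the same commands on every dump, you can hand them to the command-line debugger in one go. The sketch below only builds the argument list; it does not launch anything. It assumes kd.exe accepts -z (dump file), -y (symbol path), and -c (initial commands) as I remember them from the Debugging Tools help file, so check the help file before relying on it. The function name kd_arguments is my own.

```python
ANALYSIS_COMMANDS = ["!analyze -v", "lm kv", "!vm", "!thread", "!process 0 0"]


def kd_arguments(dump_file, symbol_path, commands=ANALYSIS_COMMANDS):
    """Build an argument list for kd.exe that opens a dump, runs commands, and quits."""
    script = "; ".join(list(commands) + ["q"])  # 'q' ends the debugging session
    return ["kd", "-z", dump_file, "-y", symbol_path, "-c", script]


if __name__ == "__main__":
    args = kd_arguments(
        r"C:\Windows\MEMORY.DMP",
        r"srv*c:\symbols*http://msdl.microsoft.com/download/symbols",
    )
    print(args)
```

The list form is what subprocess.run expects, so the output of kd could be captured into a text file for each dump you collect.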
        Stack Trashing
        Stack trashing is an example of a crash requiring manual analysis. Stack trashes have several possible causes:
        • A driver pushing things on the stack causes a stack overflow
        • A driver overruns a stack-allocated buffer
        This usually results in garbage code being executed, which produces a KMODE_EXCEPTION_NOT_HANDLED bugcheck.
        Driver Verifier cannot determine the cause. Since the stack is corrupted, analysis is especially hard.

        Troubleshooting crashes that do not generate crash dumps
        If you are getting crashes with no resulting dump, or spontaneous reboots, you need to boot in debugging mode:
        • Press F8 during the boot and choose "Debugging mode" (warning! do not do it this way - it assumes you are debugging over a serial port and will be excruciatingly slow)
        • Edit the target's boot.ini file to configure it with switches and use a higher baud rate or an alternate connection like IEEE 1394 or USB 2.0: (do it this way)
        /debugport=comX /baudrate=XXX (note: default baud rate in Debugging Mode is 19200)


        Windows XP and Windows 2003 support 1394
        Windows Vista supports USB 2.0


        In either case, this loads the kernel debugger at boot time. It does not affect performance.
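For the serial case, the two boot.ini switches can be generated and sanity-checked before you edit the file. This is a small sketch: debug_switches is a hypothetical helper, and the list of accepted baud rates is my own assumption of common serial speeds, not something from the webcast.

```python
# Common serial speeds (an assumption; use whatever your hardware supports).
STANDARD_BAUD_RATES = (9600, 19200, 38400, 57600, 115200)


def debug_switches(com_port: int, baud_rate: int = 115200) -> str:
    """Build the boot.ini switches for a serial kernel-debugger connection."""
    if com_port < 1:
        raise ValueError("COM ports are numbered from 1")
    if baud_rate not in STANDARD_BAUD_RATES:
        raise ValueError(f"unexpected baud rate: {baud_rate}")
    return f"/debugport=COM{com_port} /baudrate={baud_rate}"


if __name__ == "__main__":
    # Append the result to the operating system line in boot.ini.
    print(debug_switches(1))
```

Whatever values you choose, the debugger on the host machine must be configured with the same port type and baud rate.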

        On a crash, the system will wait indefinitely for the debugger connection! It waits for you to connect another computer over the communications port you specified. This is when you attach WinDbg and debug it. When you connect, you are looking at a crash dump that hasn't been saved to disk yet.

        Connecting to a Crashed System
        When a system crashes, attach a kernel debugger and analyze.
           In WinDbg: choose File, Kernel Debug. Configure baud rate and COM port. Click OK.
           Debugger should connect and display the bugcheck code.
           Type !analyze -v, and if necessary perform additional analysis commands as described earlier.

        To save a complete memory dump for offline analysis,
           use ".dump" (or ".dump /f" to capture a full dump)
        note: this will be slow over a serial cable

        Hung Systems
        Sometimes the system becomes unresponsive. The keyboard and mouse freeze.

        Two types of hang:
              Instant lockup
                   Kernel synchronization deadlock
                   Infinite loop at high IRQL or very high priority thread
              Slow grinding to a halt
                   Storage stack resource deadlock

        To analyze a hung system, you need to take some steps ahead of time.
        There are two options: boot  the machine in debugging mode or configure it so you can crash the machine manually.
          Initiating a Manual Crash
          Crash from the keyboard. This requires a PS/2 keyboard and the right Ctrl key.
          Hold down the right Ctrl key and press Scroll Lock twice.
          This must be configured in the registry:

               HKLM\SYSTEM\CurrentControlSet\services\i8042prt\Parameters\CrashOnCtrlScroll (DWORD) - set to a value of 1
              Documented in Debugging Tools Help file.
              Keyboard interrupts must run for this to work.

          Use a hardware “dump switch”

              Some servers come with an NMI (Non-Maskable Interrupt) button.
              You can also make one: http://www.microsoft.com/whdc/system/CEC/dmpsw.mspx.
              Must be configured in the Registry:

                    HKLM\System\CurrentControlSet\Control\CrashControl\NMICrashDump (DWORD) - set to a value of 1
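Both manual-crash settings are plain DWORD values, so they can be written out as a .reg file and merged on each machine. The sketch below only produces the text. It assumes a value of 1 enables each option, and the function name reg_file_text is my own.

```python
MANUAL_CRASH_SETTINGS = [
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\i8042prt\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl",
     "NMICrashDump", 1),
]


def reg_file_text(settings=MANUAL_CRASH_SETTINGS) -> str:
    """Render DWORD settings in the text format used by .reg files."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, name, value in settings:
        lines.append(f"[{key}]")
        lines.append(f'"{name}"=dword:{value:08x}')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(reg_file_text())
```

Save the output as a file ending in .reg, review it, and merge it by double-clicking; a reboot is needed before the keyboard or NMI crash will work.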

          Breaking into a Hung System
          Instead of crashing you can boot in debugging mode and break in when the system hangs.

          After the hang, connect the host debugger system to the target,
              Run WinDbg (or KD).
              Press Ctrl+C (or click Debug, Break) – this breaks into the target system.

          Sometimes you can't even break in when the system is hung.

          Then attempt to determine the reason for the hang. (This is the hard part.)
              Use !thread to see what's running – check the stack.
                    Check each CPU by using the ~ command, for example, ~0, ~1.
              Use !locks to look at possible deadlocks.
              Use !irql to see the previous IRQL (Windows Server 2003 and later).

          If you can't figure it out but want to save it for later analysis:
              use .crash to force a crash.
              Or .dump to save the current state of the system in a dump file.

          This can also be done with LiveKD (free from Sysinternals) on a live system

          Analyzing a “Sick” System
          Sometimes a system is still responsive, but you know that something is wrong with it
              You want to look at its kernel state, but...
              You don't want to take it offline by crashing it or connecting a debugger to it

          You can get a “dump” of a live system with LiveKd (free download from Sysinternals.com)
              Use it to run Windbg or Kd
              Use .dump to snapshot live system

          About Driver Verifier
          Tools for Verifying Drivers

