Presentation is loading. Please wait.

Presentation is loading. Please wait.

Developing For The Windows Hardware Error Architecture John Strange Software Design Engineer Windows Kernel Microsoft Corporation.

Similar presentations


Presentation on theme: "Developing For The Windows Hardware Error Architecture John Strange Software Design Engineer Windows Kernel Microsoft Corporation."— Presentation transcript:

1 Developing For The Windows Hardware Error Architecture John Strange Software Design Engineer Windows Kernel Microsoft Corporation

2 Outline Windows Hardware Error Architecture (WHEA) Overview What does it mean to develop for WHEA? Implement Required Functionality Extending WHEA with Platform Specific Hardware Error Driver (PSHED) Plug-ins Implementing Firmware Error Handlers Baseboard Management Controller (BMC) Interaction Status and Roadmap Call to Action

3 Terminology BERT – Boot Error Record Table ERST – Error Record Serialization Table Error Packet – Structure describing the error information extracted from a specific error source Error Record – Full description of an error event Error Source – Hardware resource that notifies software of errors ETW – Event Tracing For Windows HEST – Hardware Error Source Table LLHEH – Low-level Hardware Error Handler PSHED – Platform-Specific Hardware Error Driver WHEA – Windows Hardware Error Architecture WMI – Windows Management Interface

4 WHEA – In A Nutshell Common error record format Management applications benefit Pre-boot and out-of-band applications Error source discovery Fine-grained control of error sources Common error handling flow All hardware errors processed by same code path Hardware error abstractions become operating system first-class citizens Enables error source management

5 Hardware/Firmware HAL WHEA – Overview PCI.SYS Kernel Management/Reporting Applications Platform-Specific Hardware Error Driver Plug-in LLHEH WMI Management Interface ETW Error Notifications Provided by: Microsoft ISV/IHV Code Gen

6 WHEA Error Flow Worker Thread Build Error Record Contained Persist Error Record Send Notifications Bugcheck WheaReportHwError WHEA_ERROR_PACKET PSHED Plug-in Schedule Work Item End Corrected Errors Process Error Recovered End

7 Developing For WHEA Implementing required functionality Error Source Enumeration Error Record Persistence Error Injection Adding value by extending WHEA with PSHED Plug-ins Firmware Error Handlers Error Reporting/Management Applications

8 x86/x64 Default Error Source Support Machine Check Exceptions Corrected Machine Checks Non-Maskable Interrupt PCI Express AER BOOT Error

9 Itanium Default Error Source Support Machine Check Exceptions Corrected Machine Checks Corrected Platform Errors PCI Express AER INIT

10 Default Error Source Support For complete details on the default error source support, including default parameters, implemented by the PSHED, see the PSHED Plug-in Developer’s Guide

11 Required Functionality Error source enumeration PSHED implements support for default error sources Configured with default parameters Platform overrides default error sources only if necessary

12 Required Functionality Error source enumeration If default error source list or control parameters need to be augmented, there are two complementary ways to achieve this Implement an ACPI HEST Implement Error Source Discovery support in a PSHED Plug-In

13 Required Functionality Error record persistence Platform must implement support for error record persistence Read/write/clear error records from persistent store Single unrecoverable error record written prior to bugchecking system WHEA reads, processes, and then clears any existing error records during subsequent boot Pre-boot and OOB applications can also consume error records WHEA requires that platform implement support for writing at least one error record x86/x64 requires at least 1 KB Consider possible limitations on 1 KB Itanium requires at least 100 KB

14 Required Functionality Error record persistence x86/x64 Preferred solution: Implement support for PSHED’s hardware persistence interface Implement ACPI ERST Implement PSHED Plug-In Itanium Preferred solution: Implement error record serialization support in Get/SetVariable Modify the EFI GetVariable/SetVariable routines so they can support error record persistence Implement PSHED Plug-In WHEA error record persistence model coexists with Itanium SAL error record persistence

15 Hardware Serialization

16 Hardware Serialization Error log address range Range of memory specifically designated to be used for error record serialization INT 15H E820H on PCAT Identified as AddressRangeMemory w/ new extended attribute AddressRangeErrorLog (0x03) GetMemoryMap() boot services function on EFI-based systems Identified via an Extended Address Space Descriptor with new resource type specific attribute called ACPI_MEMORY_LOG (0x0000000000002000) Range must large enough to hold at least one error record WHEA ACPI constructs are expected to be part of standard, but they are currently provisional

17 Hardware Serialization ERST Describes platform’s error record serialization (i.e. persistence) interface to the operating system Serialization instruction entries describe how to execute the required serialization actions Field Byte Length Byte Offset Description ACPI Standard Header Header Signature 40x0‘ERST’ Serialization Header Serialization Header Size 40x24 Length in bytes of the serialization header Reserved40x28 Must be zero Instruction Entry Count 40x2c The number of Serialization Instruction Entries in the Serialization Action Table Serialization Action Table Serialization Instruction Entries 32* Count 0x30 A series of error logging instruction entries

18 Hardware Serialization Serialization action entry Field Byte Length Byte Offset Description Serialization Action 1N The serialization action that this serialization instruction is a part of Instruction1N+0x1 Identifies the instruction to execute. See table 5 for a list of valid instructions Flags1N+0x2 Flags that qualify the instruction Reserved1N+0x3 Register Region 12N+0x4 Generic address structure as defined in section 5.2.3.1 of the ACPI specification to describe the address and bit Value8N+0x10 Value used with READ_REGISTER_VALUE and WRITE_REGISTER_VALUE instructions Mask8N+0x18 The bit mask required to obtain the bits corresponding to the serialization instruction in a given bit range defined by the register region

19 Hardware Serialization Sample action entries Command Register: 0xFEA0 (IO) Status Register: 0x00000000AAFF0000 (Memory Mapped) { 0x00, // BEGIN_WRITE_OPERATION 0x03, // WRITE_REGISTER_VALUE 0x00, // Flags 0x00, // Reserved 0x00, 0x01, 0x00, 0x03, 0x00000000AAFF0000, 0x0000000000000080,0x00000000000000FF}{ 0x05, // EXECUTE_OPERATION 0x03, // WRITE_REGISTER_VALUE 0x00, // Flags 0x00, // Reserved 0x01, 0x01, 0x00, 0x03, 0x000000000000FEA0, 0x0000000000000040,0x00000000000000FF}

20 Required Functionality Error injection WHEA requires that platforms implement support for injecting at least one corrected and one uncorrected error This interface is used for end-to-end error handling flow validation (operating system code, system firmware, plug-ins) Itanium implements built-in error injection support x86/x64 requires a PSHED Plug-In for error injection This Plug-In should not ship to customers

21 Extending WHEA With PSHED Plug-Ins Error Source Enumeration Error Source Control Error Record Persistence Error Information Retrieval Error Recovery Error Injection

22 PSHED Plug-Ins Error source enumeration GetAllErrorSources Allows Plug-In to supply error source information PSHED Kernel Plug-In Error Source Table GetAllErrorSources 1. Kernel calls PSHED 2. PSHED creates error source table 3. PSHED calls Plug-In Plug-In augments the error source table as necessary PshedGetAllErrorSources

23 PSHED Plug-Ins Error source enumeration GetErrorSourceInfo Allows Plug-In to supply info about a specific error source PSHED PCI Bus Driver Plug-In PshedGetErrorSourceInfo GetErrorSourceInfo 1. PCI Bus Driver creates descriptor 2. PCI Bus Driver calls PSHED 3. PSHED calls Plug-In Plug-In supplies error source info If no Plug-In, default error info is returned to the caller If Plug-In does not supply information for the specified error source, it returns error and PSHED try to fulfill the request Error Source Descriptor

24 PSHED Plug-Ins Error source control Start/StopErrorSource Allows Plug-In carry out the operations associated with starting/Stopping a given error source PSHED Kernel Plug-In PshedStart/StopErrorSource Start/StopErrorSource 1. Kernel calls PSHED to start/stop a given error source 2. PSHED delegates start/stop operation to the Plug-In Plug-In interacts with error source as necessary to start/stop it If Plug-In cannot start/stop the specified error source, it returns error and PSHED will try to start/stop it Error Source Descriptor

25 PSHED Plug-Ins Error source control ErrorSourceControl Allows Plug-In carry out control operations associated for a given error source (set error thresholds, set error severity levels, etc.) PSHED Kernel Plug-In PshedErrorSourceControl ErrorSourceControl 1. Management invokes WMI method 3. PSHED delegates control operation to the Plug-In Plug-In interacts with error source as necessary to carry out control operation If Plug-In cannot perform control operation on the specified error source, it returns error and PSHED will try Management Application WMI Interface 2. Kernel calls PSHED Control Pkt Error Source Descriptor

26 PSHED Plug-Ins Error record persistence WriteErrorRecord Plug-In writes error record to persistent store PSHED Kernel Plug-In PshedWriteErrorRecord WriteErrorRecord 1. Kernel calls PSHED 3. Plug-In writes the error record to non-volatile store Plug-In may invoke firmware or execute hardware commands 2. PSHED calls Plug-In Persistent Error Record Error Record

27 PSHED Kernel Plug-In PshedReadErrorRecord ReadErrorRecord Persistent Error Record Buffer PSHED Plug-Ins Error record persistence ReadErrorRecord Plug-In reads error record to persistent store 1. Kernel calls PSHED 3. Plug-In reads the specified error record from non-volatile store and copies the record into the supplied buffer Plug-In may invoke firmware or execute hardware commands 2. PSHED calls Plug-In

28 PSHED Plug-Ins Error record persistence ClearErrorRecord Plug-In clears error record from persistent store PSHED Kernel Plug-In PshedClearErrorRecord ClearClearRecord 1. Kernel calls PSHED 3. Plug-In clears the specified error record from non-volatile store Plug-In may invoke firmware or execute hardware commands 2. PSHED calls plug-in Persistent Error Record

29 PSHED Plug-Ins Error information retrieval RetrieveErrorInfo Plug-In retrieves error information and populates error packet PSHED LLHEH Plug-In PshedRetrieveErrorInfo RetrieveErrorInfo 1. LLHEH extracts error data and creates an error packet 3. PSHED calls Plug-In Plug-In augments/extends the error packet as necessary using its knowledge of the hardware 2. LLHEH calls PSHED Error Packet

30 PSHED Plug-Ins Error information retrieval FinalizeErrorRecord Plug-In can add additional error sections to the error record PSHED Kernel Plug-In PshedFinalizeErrorRecord FinalizeErrorRecord 1. Kernel creates error record using info in the error packet 3. PSHED calls Plug-In Plug-In adds additional sections to error record as necessary to fully describe the error condition 2. Kernel calls PSHED Error Record Section

31 PSHED Plug-Ins Error information retrieval ClearErrorStatus Allow Plug-In to clear private error status PSHED Kernel Plug-In PSHEDClearErrorStatus ClearErrorStatus 2. PSHED calls Plug-In Plug-In clears any private error status that it is using and/or responsible for 1. Kernel calls PSHED

32 PSHED Plug-Ins Error recovery AttemptErrorRecovery Allow Plug-In to perform any platform-level operations required to recover from the specified error condition PSHED Kernel Plug-In PSHEDAttemptErrorRecovery AttemptErrorRecovery 3. PSHED calls Plug-In Plug-In attempts recovery and returns success/failure to PSHED 1. Kernel calls PSHED Error Record 2. PSHED may attempt recovery 4. If recovery was successful, PSHED updates error record to inform kernel and prevent bugcheck

33 PSHED Plug-Ins Error injection GetErrorInjectionCaps/ InjectError Plug-In is responsible for carrying out this operations PSHED Kernel Plug-In PshedGetInjectionCapabilities/ PshedInjectError GetInjectionCapabilities/ InjectError 1. Management application invokes WMI method 3. PSHED delegates control operation to the Plug-In Plug-In returns the error injection capabilities of the platform Plug-In interacts with error source as necessary to cause the specified error to occur Management Application WMI Interface 2. Kernel calls PSHED

34 Firmware Error Handlers Some errors may be handled in firmware Errata management Error containment If platform must handle any of the required error sources in firmware, it must implement a mechanism for notifying operating system of the error subsequent to processing It must implement an alternative error source for the operating system Operating system can be notified via an interrupt or by configuring a polling mechanism This may require an accompanying PSHED Plug-In to interact with this error source

35 Error Reporting And Management Interface Error events are reported to applications via Event Tracing for Windows (ETW) Microsoft will implement a logging service that listens for error events and writes entries to the system event log WHEA Whitepaper ties together reporting and management scenarios Two WMI Classes defined for WHEA management Error Injection Interface Error Source Interface

36 WMI Management Interface WHEAErrorInjectionMethodsGetErrorInjectionCapabilitiesInjectErrorWHEAErrorSourceMethodsGetAllErrorSourcesGetErrorSourceInfoSetErrorSourceInfo

37 BMC Interaction WHEA co-exists with BMC-based error handling/reporting solutions Duplicate error events may show up in the event log WHEA-aware management applications will consume WHEA events Existing management applications that depend on BMC-originated events can continue to consume their respective events Management applications can be consumers of both WHEA events and BMC-originated events Tighter and better integration moving forward

38 Status And Roadmap Core of Windows Vista hardware error handling is based on WHEA Implements native support PCI Express AER Error log entries are common error record format ETW-based logging replaces WMI-based error log entries Windows Server codenamed “Longhorn” Core operating system implementation nearing completion Validating hardware vendor error record persistence and error source discovery implementations Validating hardware vendor PSHED Plug-Ins Future Standardize error record format Standardize ACPI mechanisms Working with processor vendors on better error reporting architectures Working with hardware vendors on better OS/firmware integration architectures Extend WHEA to endpoint device stacks

39 Call To Action Work with us to get builds and resources needed for developing PSHED Plug-Ins Work with us to validate WHEA platform support Provide BIOS with WHEA ACPI support Provide BIOS with error record persistence support Work with us to develop WHEA-based management applications Powerful health monitoring and error recovery features are possible with common error record format and ETW Error source control applications allow for very fine grained control over error source operational parameters Pre-boot and out-of-band error record processing applications

40 Additional Resources Specifications WHEA Whitepaper PSHED Plug-In Developer’s Guide PSHED Interface Specification WHEA Error Record Persistence Specification WHEA BOOT Error Record Specification WHEA ACPI Specification Longhorn Server WHEA Logo Requirements Related WinHEC Presentations CPA070 – PCI Express in Depth for Windows Vista and Beyond SER121 – Windows Server Platform Directions Send feedback/comments/questions to our feedback alias wheafb @ microsoft.com

41 © 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

42


Download ppt "Developing For The Windows Hardware Error Architecture John Strange Software Design Engineer Windows Kernel Microsoft Corporation."

Similar presentations


Ads by Google