Advanced Active Directory Design and Troubleshooting

Ed Whittington Principal Software Engineer HP Business Critical Call Center Oct. 06, 2002

Topics Troubleshooting Basics Troubleshooting Tools
DNS Troubleshooting Troubleshooting Replication Troubleshooting DCPromo Troubleshooting FRS Replication and DFS Troubleshooting Group Policy Troubleshooting in .NET

Troubleshooting Basics

Basic Troubleshooting Steps
Define the problem (make sure there is one). What’s failing? Client authentication and security. Group policy application. Replication. Name resolution. Errors and warnings in event logs. FRS/DFS Application. How is the problem reproduced? One or multiple machines? Narrow the variables. The first step in troubleshooting Active Directory is defining the problem. Make sure you understand exactly what the problem is – what isn’t working? The bullets in the slide identify the major areas the problem will fall into. How can the problem be reproduced? This is crucial – if it can’t be reproduced, it can’t be fixed. Is it one or multiple users, machines, etc.? Does the problem move with the user or the machine? Are other users experiencing the problem? Is it occurring on multiple DCs, servers or clients? Did it ever work? If so, when did it break and what happened at the time of the break?

MPSReports_DS (from HP or Microsoft) Get the Log files Event logs %windir%\debug\usermode\Userenv.log %windir%\debug\DCPromo*.log Turn on Verbose Logging Run NetDiag, DCDiag (verbose) Get status report from Replication Monitor. The best troubleshooting tool is Microsoft’s MPSReports. There are various versions – DS, Cluster, Network, DataCenter. You have to get it from Microsoft or HP. It runs a variety of utilities and gives a good snapshot of the enterprise: dcdiag, netdiag, repadmin /showreps, event logs, etc. Examine the event logs (all of them). Clear them and reproduce the problem to isolate the relevant events. Userenv.log covers usermode errors (authentication, group policy, security). See Q to enable verbose logging. Non-verbose is useless. DCPromo.log and DCPromoui.log log DCpromo errors. Turn on verbose logging (how to do this is covered later in the presentation). Netdiag /v and DCdiag /v, from Support Tools, give a great snapshot of error conditions and test all facets of the environment. Replmon has a Status Report that tests all facets of replication health.

Check DNS. Resolver on ALL computers. Name Server Properties (forwarding, etc.). Monitoring tab – test name resolution. Nslookup, ping to test name resolution. Ping SRV records. Check Replication. Force replication. Identify who isn’t replicating to whom. Outbound vs. inbound. DNS is critical. Incorrect resolver values (TCP/IP properties) on clients or Name Servers will cause authentication, replication, etc. to fail. Make sure you read the DNS best practice papers from Microsoft.

If all else fails, try demoting. Really cleans up a lot of problems… If problem is isolated to one DC. If replication isn’t working, demotion won’t work. Reinstall to remove the AD, then clean up AD. Ntdsutil to remove server object. Delete server object from Sites & Services. Delete FRS server object from System container. Can manually demote a DC. Demoting a DC and then repromoting it is a reasonable alternative to finding the real problem. It cleans up a lot of problems if the problem is only on one DC. Manual Demotion is a great repair technique. Unfortunately Microsoft won’t make it public, so I can’t tell you. However, HP has permission to give this info to our customers if needed. This preserves data, etc. and saves time over a reinstall. There is supposedly a tool to manually demote a DC in .NET.

Manual Demotion of a DC Change from LanManNT to ServerNT
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ProductOptions ProductType = ServerNT (when the computer is a Member Server) or LanManNT (when the computer is a Domain Controller). Change from LanManNT to ServerNT. It’s now a “dirty” member server. Clean server objects from the AD (Ntdsutil). Clean up the disk and Registry: Create new Forward Lookup Zone – Bogus.com. Run DCpromo – create new forest for Bogus.com. Demote and eliminate Bogus.com. Wait for Replication. Promote back into domain – use same name if desired. Tool in Windows .NET. Manual Demotion of a DC is a valuable tool in many cases. If a problem such as replication, Service Principal Name (SPN) errors, Global Catalog errors, etc. is isolated to a single DC and cannot be resolved, that DC would normally have to be reinstalled. However, if that DC is also a file/print or application server, as is the case with many remote DCs, reinstalling has serious consequences. Manual Demotion allows the DC to be “demoted” from a DC to a member server without reinstalling the operating system. The steps include changing the registry value shown in the slide from LanManNT to ServerNT. This turns off all DC functionality, and the machine is now a member server. If you run DCpromo, the first screen in the wizard will say that it will install Active Directory (this verifies that it is a member server). Before running DCpromo, you must remove all the old objects for that DC from Active Directory, and you must clean up the sysvol tree and registry on the computer itself. Microsoft KB article Q describes how to clean up the AD. To clean up the computer, do the following: Install DNS on this computer and create a forward lookup zone with dynamic updates turned on. Create it for a new domain, like bogus.com. Set this computer’s TCP/IP properties to point to itself for DNS. Promote this machine to create the new forest and domain bogus.com. After the promotion is successful, demote it. This will remove all the registry entries and the sysvol tree. After replication has completed in the forest (replicating the AD cleanup), you can promote this machine back into the domain with the same (old) name it had before the demotion. This is offered in a GUI in Windows .NET. Note: this procedure has not been made public by Microsoft, but a KB article is rumored to have been released recently. The registry key is actually shown on a public web site. After that you just have to do the clean up.
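The ProductType check described above can be expressed as a small lookup. This is a minimal Python sketch, not part of the original procedure: the function names are hypothetical, and only the two values named on the slide (ServerNT and LanManNT) are mapped. Reading the value from the registry is left out so the sketch stays platform-neutral.

```python
# Role mapping taken from the slide; any other value is reported as "unknown".
PRODUCT_TYPE_ROLES = {
    "servernt": "member server",
    "lanmannt": "domain controller",
}


def role_from_product_type(value: str) -> str:
    """Return the role implied by a ProductType registry value (hypothetical helper)."""
    return PRODUCT_TYPE_ROLES.get(value.strip().lower(), "unknown")


def demotion_edit_needed(value: str) -> bool:
    """True when the machine still reports itself as a domain controller."""
    return role_from_product_type(value) == "domain controller"
```

For example, role_from_product_type("LanManNT") reports a domain controller, which means the manual demotion edit has not yet been made on that machine.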

Troubleshooting Tools
Gathering Information

Netdiag.exe NETDIAG.EXE /v - verbose – always turn this on.
/l - log – writes netdiag.log to default directory. /d:domain controller – finds DC in domain. /test: - runs only specified tests. /skip: - skips specified tests. Can’t execute remotely. C:\>netdiag /v /l Netdiag should always be run with the verbose switch. Unfortunately it can’t be run remotely. Individual tests can be run independently with the /test: switch. With just the /v switch, all tests are run. It can be scripted.
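Since netdiag can be scripted, a wrapper might assemble the command line from the switches listed on the slide. The following Python sketch is an illustration only: the function name and parameters are hypothetical, the value given to /d: simply follows the slide's description, and the test name used in the usage note is an assumed example. It builds the argument list and does not execute anything.

```python
from typing import List, Optional, Sequence


def build_netdiag_command(log: bool = True,
                          d_value: Optional[str] = None,
                          tests: Sequence[str] = (),
                          skip: Sequence[str] = ()) -> List[str]:
    """Assemble a netdiag argument list; /v is always included, as the notes advise."""
    cmd = ["netdiag", "/v"]
    if log:
        cmd.append("/l")                  # write netdiag.log
    if d_value:
        cmd.append(f"/d:{d_value}")       # value for the /d: switch
    cmd.extend(f"/test:{name}" for name in tests)
    cmd.extend(f"/skip:{name}" for name in skip)
    return cmd
```

Calling build_netdiag_command() yields the same switches as the slide's example, and build_netdiag_command(tests=["dns"]) adds a /test: switch for one (assumed) test name. The resulting list could be passed to subprocess.run on the machine being diagnosed, since netdiag cannot run remotely.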

Netdiag.exe Domain Controller Discovery
Bindings, IP address, Default Gateway tests DNS tests NBTstat and WINS ping Netstat Route Trust Kerberos Netdiag tests all phases of network health. The bullets here show the sections covered by netdiag.

Dcdiag.exe DCdiag /v Domain controller functions of netdiag
More domain-specific FSMO roles Connectivity Replications Domain controller locator Intersite “health” Topology integrity DCdiag, in the Support Tools set, is an excellent tool to validate the health of the domain and domain controllers. The bullets on the slide identify the sections of the diagnosis. DCdiag should always be run with the /v (verbose) switch to get the most information returned.

Nltest.exe /server:servername Sets default server
/dsgetdc:domainname Dsgetdcname API [ /gc /timeserv /ldap ] /dclist:domainname Lists DCs in domain /parentdomain Lists parent domain /dsgetsite Lists site of server /dsgetsitecov Lists DC “covering” site /dcname:domainname Lists PDC for domain /dcpromo Tests potential success of DCPromo /whowill:domain user Returns name of DC that will authenticate user NLtest doesn’t yield a report – rather it allows you to test various API functions. For instance you can issue NLtest /dsgetdc:hp.com to view the information retrieved by Netlogon when it issues the dsgetdcname command. This is an excellent way to get a view of the current environment – i.e. the current site, list of DCs, etc. nltest /server:<servername> should be issued first to define the server that the commands will be executed on. No need to set this again until you change servers. nltest /dsgetsitecov Lists the coverage of the site for a domain where there is no DC for that domain in the site. It will tell which DC in which Site is covering the empty site for authentication, etc. The /dcpromo switch was added in the SP2 version of Support Tools. Although it is supposed to test if Dcpromo will be successful, I’ve seen cases where it reports a failure but dcpromo succeeds. It’s not perfect

Netdom.exe /join /add /reset /resetpwd /query FSMO /trust
Netdom is a utility in Support Tools that is widely used for resetting secure channel information, analyzing trust health and revealing FSMO role holders. It is also used to join machines to a domain. This is an improvement on the NT4 version. Netdom on workstations has different switches than on a domain controller/server. Windows .NET added the computername switch to Netdom for renaming a domain controller.

NTDSUtil Built-in utility. Directly accesses Active Directory.
Authoritative Restore. Can restore an older version of the AD and force it on all DCs to correct variety of problems. Entire AD or single tree. Can’t restore the schema. FSMO Roles. List, Transfer, Seize roles. Better than UI – can manipulate all roles in forest and all domains from one utility. NTDSUtil is a powerful tool. Some features are: Authoritative Restore – force all or part of the AD to be restored to a previous date. Compress the ntds.dit (AD) file. List, Transfer, Seize FSMO roles in one spot (using the UI requires two different UIs).

NTDSUtil Metadata Cleanup Useful tool for listing contents of the AD
Delete orphaned objects. Servers Domains The UI can and will lie to you! Don’t trust it. Useful tool for listing contents of the AD Sites, domains, servers, FSMO role holders. Domains in site. Servers in domain, servers in site. Q216364, Q216498, Q230306 AD Cleanup – orphaned objects, DCs that are not demoted gracefully, etc. Useful for listing all sites, domains, servers, etc. the KB articles noted are excellent for step by step instructions to using ntdsutil

Gpresult.exe Run on client Returns: Security group membership
User and Computer policy info GPOs applied to each Registry settings set in the GPO Client-side extensions set Scripts applied Remember Policy is cached – reboot / login to clear Note who authenticating server is Environmental Variable “logon server” Much Improved in .NET! GPresult is a tool in the Windows 2000 Resource Kit. It is an excellent tool for determining group policy application on a client. Run with the /v (verbose) switch, it will list all registry settings applied from the policy, names of security groups the user or computer is a member of, GPOs that are applied to user and computer, and client side extensions that are applied. This tells us not only whether the GPO applied but also which settings were applied. Remember that policy is cached. When debugging group policy, it’s best to reboot (for machine based policy) or logoff/logon (for user based policy) to ensure the policy is refreshed. Secedit /refreshpolicy will work, but reboot/re-login is absolute.

GPOtool.exe Run on domain controller. Returns:
Analysis of all GPOs in domain. GUID and friendly name of all GPOs. DS and Sysvol versions. Errors encountered. Good group policy troubleshooting tool. May take a long time to process (#GPOs) GPOtool, in the Resource Kit, tests the validity and consistency of the GPO itself on domain controllers. This should be run on a DC to test the AD and Sysvol versions of the GPO – they should be the same after FRS replication finishes. One side benefit is it lists the GPOs by guid and then lists their friendly name.
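The DS-versus-Sysvol comparison that GPOtool reports can be illustrated with a short Python sketch. It assumes the version pairs have already been collected into a dictionary by some other means; the function name and the sample GPO names are hypothetical, and this does not parse GPOtool's actual output.

```python
from typing import Dict, List, Tuple


def find_version_mismatches(gpos: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return GPO names whose DS and Sysvol versions differ.

    gpos maps a GPO friendly name to (DS version, Sysvol version).
    """
    return sorted(name for name, (ds, sysvol) in gpos.items() if ds != sysvol)


# Hypothetical sample data for illustration only.
sample = {
    "Default Domain Policy": (12, 12),
    "Branch Logon Scripts": (7, 6),
}
mismatched = find_version_mismatches(sample)
```

A GPO that stays in the mismatch list after FRS replication should have finished is a candidate for closer inspection.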

ADSIedit.exe GUI much like Users & Computers snap-in /Advanced features. Graphical view of AD. Like LDP.exe but: Easier to browse. Can modify attribute values Don’t confuse with Users & Computers! ADSIedit and LDP allow you to see the actual contents of the AD. Other snapins like Users and Computers don’t always show the absolute data and don’t show metadata. ADSIedit is an easy to use GUI and lists all the attributes of objects. This is a great way to just see the attributes that belong to the object. ADSIedit also is an easy tool to use to modify attributes and values. It looks very much like the AD Users and Computers Snap-in.

LDP.exe Takes time to set up: Connect Bind View – Tree
Enter DN to start (blank for default) Exposes attributes quickly, easy to see. Faster than ADSIedit – no GUI to traverse. LDAP searches. Can delete and modify, but not as easy as ADSIedit. Can execute remotely. LDP, listed in Support Tools menu as Active Directory Administration Tool, has a GUI that is not as easy to read as ADSIedit, and requires a setup procedure to view the AD tree. You must connect to a DC (can connect to any DC so this is a good thing), Bind with an account and view any AD tree or sub tree. LDAP searches and attribute modification is possible. LDP’s value is that it allows you to click on an object in the left pane and it displays all of the attributes that have been defined in the right pane. This makes it easy to search for an attribute. In ADSIedit, you would have to browse each object, look at properties, then search thru the list. LDP lets you see them all at once immediately.

DCPromo.log, DCPromoui.log
Located in %systemroot%\debug. Logged every time dcpromo runs. DCPromo.log: Shorter. Appended (read bottom up). DCPromoUI.log and DCPromoUI.xxxx.log: Results of what is seen in the UI – longer. Find: Results of dsgetdcname, DNS query, Time service sync, authentication, replication, Site info. Error (0x0) = success – no error. Error reporting different – read both logs. These files are located in the WINNT\debug folder and are generated when DCPromo runs. DCPromo.log isn’t as verbose as dcpromoui.log but it is easier and faster to read. Every time dcpromo runs, it appends information for that promotion at the end of dcpromo.log. DCPromoui.log contains the information dumped from the DCpromo GUI. When DCpromo runs the second time on a computer, the existing dcpromoui.log will be renamed to DCpromoui001.log and the new information will be put in DCpromoui.log. The next time, it will save the dcpromoui.log file to dcpromoui002.log, etc., so dcpromoui.log always has the latest information. Just look for errors and failures. Note that in the dcpromoui.log there are lines that say Error (0x0), or have some numbers in ( ). 0x0 means success. Numbers other than 0 indicate a failure – read the description. You will learn a lot about the dcpromo operation by taking time to read these logs – esp. the dcpromoui.log.
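The "0x0 means success, anything else is a failure" rule lends itself to a quick filter. The Python sketch below is an assumption-laden illustration: it treats any parenthesized hex or decimal number on a line as a result code, which is a guess at the log layout rather than a documented format, and the sample lines in the usage note are invented.

```python
import re
from typing import Iterable, List, Tuple

# Matches a parenthesized hex (0x...) or decimal number, e.g. "(0x0)" or "(1722)".
CODE_PATTERN = re.compile(r"\((0x[0-9a-fA-F]+|\d+)\)")


def parse_code(token: str) -> int:
    """Convert a hex or decimal token to an integer."""
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token, 10)


def find_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines carrying a non-zero code."""
    failures = []
    for number, line in enumerate(lines, start=1):
        codes = [parse_code(tok) for tok in CODE_PATTERN.findall(line)]
        if any(code != 0 for code in codes):
            failures.append((number, line.strip()))
    return failures
```

Feeding it the lines of dcpromoui.log gives a short list of places to start reading; each hit still needs the surrounding description to be read, as the notes say.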

Userenv.log Located: %systemroot%\debug\usermode
User environment info: Group policy (registry) Client side extensions Scripts Security Increase verbose logging (Q221833) Take time – read and study and you may be surprised at what you can find! This log is created automatically in the Winnt\debug\usermode directory. Use KB Q to set verbose logging. Rename the existing log and then reproduce the problem. A new userenv.log file will be created. This helps isolate the problem. It is excellent for producing additional information about group policy failures or failures of client side extensions. It isn’t very cryptic, and taking time to read and study it will yield a lot of information. Gotcha: while there is a time stamp for the events in the log, there is no “date” stamp. Read the log bottom to top, as the most recent info is at the bottom.

Additional User Mode Logs
Client-side extensions Registry see Q216357 HKLM\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\GPExtensions Errors created in %windir%\debug\usermode Named after the .dll Scripts = Gptext.dll = gptext.log Folder Redirection = fdeploy.dll = fdeploy.log Security = scecli.dll = winlogon.log Q245422 Produced automatically on error (except winlogon.log) Check User Mode directory for these files Invaluable in debugging. Use them! The important point here is that the client side extensions (CSE) such as folder redirection, security, scripts, EFS, Explorer Branding, and disk quotas operate as “extensions” to the GPO and are separate entities, at least when troubleshooting. For example, if a logon script deployed thru a GPO isn’t working, you might test whether the GPO applies by editing the GPO and disabling the run command. When you log in, you see that the run command is disabled but the script isn’t running. Just because an Admin template setting in the GPO works doesn’t mean the script (CSE) will work. They have to be diagnosed separately. Group Policy may be successful but the CSE attached to it may fail. If the CSE fails, it will often write a log in the winnt\debug\usermode directory. The userenv.log and the client side extension logs in that directory are good sources for debugging. The name of the log tells what CSE it belongs to. They are named after the dll for the extension. For instance, the folder redirection extension uses fdeploy.dll. Its error log is called fdeploy.log. The Scripts extension uses gptext.dll, so the log it outputs in an error case is gptext.log. Look in the registry under the key shown on the slide and drill down to the GPExtensions key. There will be several keys with GUID names. Expand these and look in the right pane to find the dll name and the friendly name of the extension. You can’t force them to be generated and can’t set verbose logging.
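The naming rule for these logs (log named after the extension's dll, with security as the exception) can be written down as a small helper. This is a Python sketch using only the three pairs given on the slide; the function name is hypothetical, and extensions not listed on the slide are simply assumed to follow the dll-name rule.

```python
import ntpath

# Exception listed on the slide: the security extension logs to winlogon.log.
LOG_NAME_OVERRIDES = {"scecli.dll": "winlogon.log"}


def cse_log_name(dll_name: str) -> str:
    """Return the usermode log file name expected for a client side extension dll."""
    base = ntpath.basename(dll_name).lower()
    if base in LOG_NAME_OVERRIDES:
        return LOG_NAME_OVERRIDES[base]
    stem, _extension = ntpath.splitext(base)
    return stem + ".log"
```

Given the dll name read from a GUID key in the registry, cse_log_name tells you which file to look for in the usermode debug directory.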

Client Side Extensions (registry)
This shows the scripts extension in the registry Note that under the GPExtensions key, there are keys (GUID names) for each client side extension. In the example here, we are viewing the contents of the Scripts extension (first line on the right pane) and we can see that the gptext.dll is the dll used. Thus, as noted on the previous slide, if there is a “severe” scripts error, the gptext.log file will be created in the %windir%\debug\usermode directory.

Windows .NET Troubleshooting Tools
This section describes some new and some improved troubleshooting tools available in Windows .NET

Remote Desktop Resource Redirection
Client Resources Available when using Terminal Services Remote Desktop File System – Local drives and Network drives on Local Machine available on Remote machine Audio – Audio streams such as .wav and .mp3 files can be played through the client sound system. Port – Applications have access to the serial and parallel ports Printer – The default local or network printer on the client becomes the default-printing device for the Remote Desktop. Clipboard – The Remote Desktop and client computer share a clipboard Terminal Services Virtual Channel Application Programming Interfaces (APIs) are provided to extend client resource redirection for custom applications. Remote Desktop replaced Terminal Services Administration mode in Windows Remote Desktop is available for Windows XP and .NET server family products. It is located in the administration tools menu. Note: There is no Terminal Services Administration mode in Windows .NET – if you install Terminal Services component in .NET, you will only be able to install the application mode. Cool features include the ability to: Cut and paste between the local desktop and the remote computer Copy files from the remote computer to local drives without mapping shares

WMI Computer management Active Directory
Provider: MicrosoftActiveDirectory Classes: Replication - See replprov.mof %windir%\system32 Trust health Provider: MicrosoftHealthMonitor Classes: see system32\wbem\trusthm.mof DNS Provider: MicrosoftDNS Classes: system32\wbem\dnsprov.mof Cluster MSCluster Also look in CIM Studio in MSDN. Windows Management Instrumentation (WMI) is an interface for extracting data from sources ranging from the local hardware to Active Directory and DNS. It was available in Windows 2000, but there was no convenient interface for the user other than programming. In Windows .NET and Windows XP, the WMIC interface allows execution of commands to extract data from various sources. Type WMIC at the command line of a Windows XP or Windows .NET server and it will install the interface. The trick is formatting the WMIC command correctly. Typing /? at the WMIC: prompt returns help text showing other command options as well as some quick commands to get local system information. For instance, typing QFE at the WMIC: prompt will return a list of installed hotfixes. This can be seen easily on a Windows XP client – try it on your laptop!

WMIC Sample Commands Look in %windir%\system32\wbem *.mof files for names of providers, classes, etc. Active Directory Provider: MicrosoftActiveDirectory wmic:/namespace:\\root\microsoftactivedirectory PATH msad_replneighbor (shows replication partners) wmic:/namespace:\\root\rsop\user path RSOP_GPO (lists GPOs with User settings) We need to know values for the Provider (namespace), classes, etc. This information is found to a large degree in the .MOF files in %windir%\system32\wbem. For instance, on a domain controller, there will be a XXX.mof file. In it you will find that the “Provider” or namespace is MicrosoftActiveDirectory. Locating class designations allows us to put those values in the PATH option. One such class is “msad_replneighbor”, which returns a list of replication partners. We could then formulate the command: WMIC: /namespace:\\root\MicrosoftActiveDirectory PATH msad_replneighbor To list the GPOs with User settings, use the following command: WMIC: /namespace:\\root\rsop\user PATH rsop_gpo
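Because the difficulty is in formatting the command, a helper that assembles the namespace and class into one string may be useful. The Python sketch below is illustrative: the function name is hypothetical, it produces the non-interactive form of the command (wmic followed by the switches), and it only builds the string without running it.

```python
def build_wmic_path_command(namespace: str, wmi_class: str) -> str:
    """Build a 'wmic /namespace:\\\\root\\... PATH class' command string."""
    ns = namespace.strip("\\")
    lowered = ns.lower()
    # Accept either "MicrosoftActiveDirectory" or "root\MicrosoftActiveDirectory".
    if lowered != "root" and not lowered.startswith("root\\"):
        ns = "root\\" + ns
    return f"wmic /namespace:\\\\{ns} PATH {wmi_class}"


repl_cmd = build_wmic_path_command("MicrosoftActiveDirectory", "msad_replneighbor")
gpo_cmd = build_wmic_path_command("\\\\root\\rsop\\user", "rsop_gpo")
```

Here repl_cmd and gpo_cmd correspond to the two sample commands on the slide: replication partners from the MicrosoftActiveDirectory namespace, and GPOs with user settings from the rsop\user namespace.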

Admin Tool Improvements
Users and Computers snap-in Drag and drop. Multi-select and edit user objects. Heavily revised object picker. Users and Computers, Sites and Services, DNS Snap-ins Saved queries. Viewing Saved DS, DNS, FRS eventlogs on non-DCs! .NET Adminpak (only on XP) In the Users and Computers snap-in, you can now drag and drop objects, which was not available in W2K. You can also select multiple objects. There is also a Saved Queries folder. You can create searches and save them. For instance, if you are looking for a list of users that are locked out, you can create the query, save it, and then execute it later. The Users and Computers, Sites and Services and DNS snap-ins contain the Directory Services, FRS and DNS eventlogs, so you can search for events without having to go to the event viewer separately. There is a .NET adminpak just like the Win2k version. The problem is that right now it will only work on an XP client – it won’t work on a .NET server!

Command Line Tools GPresult Enhanced reporting DCDiag
dcdiag /test:DCPromo Repadmin – enhanced reporting Netdom – computername for DCrename Others Shipped on Service Pack 2 CD (install manually) .NET Server, AdvSvr CD Several new features for old tools: GPresult is a built-in tool (was in Win2k Reskit). It provides advanced functions including listing of ACL filters. DCdiag /test:dcpromo – run on a member server to see if DCpromo to a domain would be successful. Netdom contains a “computername” option used in Domain Controller rename operations. These tools are available in the latest service pack for Windows. Note that installing the SP doesn’t install the support tools. You have to install them separately from the Service Pack CD or from Microsoft’s web site.

Windows .NET Improvement to NTDSUtil
Change Offline, DS Repair Mode Password While Online! NTDSUtil Set DSRM Password (main menu) Increases server up-time limited by password change interval in Win2K. (Had to reboot to DS Repair mode to change.) Q (Win2K limit) Cool error message! Setting password failed. WIN32 Error Code: 0x6ba Error Message: The RPC server is unavailable. See Microsoft Knowledge Base article Q at for more information. Win2k required the Directory Service Repair Mode (DSRM) password to be reset (if you forgot it) by rebooting into DSRM. Windows .NET allows the password to be reset from NTDSutil online (without rebooting). See Q223301 In NTDSutil main menu go to “Reset DSRM Password” Interesting note – the error on the slide occurred while attempting to change the DSRM password. The interesting thing is that besides saying “RPC Server is unavailable” they point you to a KB article and a web site to find the article.

Errors in Windows .NET Kinder, Gentler and Report to Microsoft
Besides being a bit humorous (we are sorry for the inconvenience – if you were in the middle of something, you’re hosed), it is a valuable way to help Microsoft to help us. If you click the Send Error Report button, it will generate a trouble report and send it to Microsoft. If you are paranoid about Microsoft being “big brother” and snooping, you can view the log before you send it. Microsoft reportedly collects this info in a database and uses data mining techniques to find common occurrences of problems.

Active Directory Load Balancing Tool
Does the job of branch office deployment. KCC chooses BHS for connection objects – choose the same one. Tool allows you to spread the load to other DCs in the site (that have that NC). ADLB tool modifies the Hub DC’s replication schedules to spread it out over time. Generates a log – like replmon’s status log. For Deployments with hundreds of branch offices all replicating to a single hub. Tool = no benefit to sites with only one DC per domain. In Windows 2000, there was a problem with replication load balancing. In a single domain, the KCC picks only one bridgehead server (BHS) per site to handle replication for that domain. The problem is that if you have a “hub” site serving many remote sites in a hub and spoke configuration, one BHS at the hub serves all those remote sites. Microsoft determined that a single BHS can only effectively serve about sites for replication. You then have to create manual connection objects from other DCs in the Hub to groups of remote site BHS to “load balance” the BHS. They published a whitepaper, Branch Office Deployment, that includes a set of instructions and scripts for doing this, since you have to manage all the replication stuff manually (turn off the KCC – or schedule it). Windows .NET includes the Active Directory Load Balancing Tool (ADLB), which is a GUI based tool to help configure load balancing – all the things the whitepaper and scripts helped you to do manually.
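The idea of spreading branch sites across several hub DCs, instead of leaving them all on one bridgehead, can be illustrated with a simple round-robin assignment. This Python sketch is not how ADLB or the whitepaper scripts work internally; it only shows the distribution idea, and the function name and the server and site names in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence


def spread_branches(hub_dcs: Sequence[str],
                    branch_sites: Sequence[str]) -> Dict[str, List[str]]:
    """Assign branch sites to hub DCs in round-robin order."""
    if not hub_dcs:
        raise ValueError("at least one hub DC is required")
    assignment: Dict[str, List[str]] = {dc: [] for dc in hub_dcs}
    for index, site in enumerate(branch_sites):
        assignment[hub_dcs[index % len(hub_dcs)]].append(site)
    return assignment
```

With hub DCs HUBDC1 and HUBDC2 and five branch sites, the first DC receives three sites and the second receives two. Each resulting group would correspond to a set of manual connection objects from that hub DC to the branch bridgeheads.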

Future: Graphical Replication Monitoring Tool
Very much like ‘Age of Directories’ Ability to make configuration changes Not in .NET - maybe Longhorn or Blackcomb? For the future, Microsoft is developing a graphical tool to monitor replication. This will supposedly be similar to HP’s unofficial, unsupported “Age of Directories” tool (see elsewhere in this presentation). It will allow you to view the replication topology in 3D graphics and

Troubleshooting DNS

DNS Resolver Configuration
Win2K clients, servers point to Win2K DNS Name Server that is SOA for their zone. Don’t point to ISP, other Internal NS (even as “additional”). Keep it simple. Win2K Name Servers forward to ISP or internal name server hosting registered domain. DNS is fairly simple, but the resolver on every computer must be configured properly or it will fail. In the TCP/IP properties on every computer, the “Preferred DNS Server” must point to the correct DNS server. The correct DNS server is the one that is SOA for its domain. Many make the mistake of pointing to their ISP or an Internal NS for internet access. This will break DNS. The client should only point to Win2K DNS servers in its domain. Internet access is provided by the Win2K name servers forwarding to the ISP or Internal NS that is registered on the internet. Internet access is a separate issue from Win2K name resolution.

DNS Name Server Configuration Basics
Dynamic updates = Yes. Active Directory Integrated Zone Select one “Primary” All other ADI Primary NS point to it for DNS Win2k Name Servers can: Forward to ISP or Internal NS. Use root hints (or modify root hints). Reverse Lookup Zones NOT required Needed only for tools - NSLookup Name server configuration: Set Dynamic updates to yes (zone property) ADI zones – Microsoft changed their position on this – used to say have each NS point to itself for DNS and other NS as “additional” DNS in TCP/IP properties. This created islands of name resolution – several “sources” for registration that had gaps due to replication. Now it’s recommended to select one of the NS as “primary”. Only that one Name Server points to itself for DNS. Other NS in the zone point to that “primary” as “Preferred DNS” in TCP/IP properties and to themselves as “additional dns servers”. This has made incredible performance improvements for customers. Documented in DNS Best Practices on Forward to ISP or Internal registered on Internet Internet access is separate issue from Win2k DNS Root hints will get you to the internet without forwarding
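The recommended resolver layout (one "primary" name server pointing to itself, every other name server preferring the primary and listing itself as additional) can be generated mechanically. The following Python sketch uses a hypothetical function name and made-up addresses in the usage note; it only produces a plan and changes no settings.

```python
from typing import Dict, List, Sequence


def plan_resolvers(primary: str,
                   name_servers: Sequence[str]) -> Dict[str, Dict[str, object]]:
    """Return preferred/additional DNS settings for each ADI name server."""
    if primary not in name_servers:
        raise ValueError("primary must be one of the name servers")
    plan: Dict[str, Dict[str, object]] = {}
    for server in name_servers:
        additional: List[str] = [] if server == primary else [server]
        plan[server] = {"preferred": primary, "additional": additional}
    return plan
```

For name servers 10.0.0.1, 10.0.0.2 and 10.0.0.3 with 10.0.0.1 chosen as primary, every server prefers 10.0.0.1, and the other two each list their own address as the additional DNS server.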

ADI Primary and Standard Secondary mixed zone
Only a DC can host an ADI primary zone. Member Servers can host Secondary zone. Synch off of an ADI Primary. [Diagram labels: ADI Primary, Secondary, Secondary, ADI Primary, ADI Primary]

DNS Case Study Forwarding corp.net na.corp.net sa.corp.net eu.corp.net
In this actual case study, the customer has 4 domains – a “root” and 3 geographic. Each domain has name servers that are SOA for its zone and forward to its local ISP. The geographic NS all forward to the root server, where secondaries of the geographic zones are located. Thus this is all dependent on zone transfers. [Diagram: corp.net, na.corp.net, sa.corp.net, eu.corp.net; labels: Forwarding, Zone xfers, Secondary zones]

DNS Case Study corp.net na.corp.net sa.corp.net eu.corp.net
The problem arises when a client in one domain wants to resolve a name in another. A client in the EU domain wants to find a resource in NA. The request is forwarded to the root name server, which finds the name in its secondary na.corp.net zone, resolves it and returns the answer to the client. [Diagram: corp.net, na.corp.net, sa.corp.net, eu.corp.net; a client in eu.corp.net issues “find na.corp.net”]

With Conditional Forwarding Feature In Windows .NET Server…
Windows .NET allows Conditional Forwarding. That is, in the forwarders tab on the DNS server, you can associate a domain with a Name Server. In this case we configure the forwarder entry for the NA.corp.net domain to point to a NA.corp.net name server, etc. This eliminates the secondary zones. When a client looks for a resource in the NA.corp.net domain, the query is sent directly to a NS in the na.corp.net domain and the answer is returned to the client. This is much more efficient and faster than the previous method. [Diagram: corp.net, na.corp.net, sa.corp.net, eu.corp.net; query “find na.corp.net”]
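The forwarding decision can be modeled as matching the queried name against the configured domains. The Python sketch below assumes the most specific (longest) matching domain wins and that unmatched names go to a default forwarder; the function name and all addresses are hypothetical, and this is a model of the idea rather than the DNS server's actual code.

```python
from typing import Dict, Optional


def pick_forwarder(query_name: str,
                   conditional: Dict[str, str],
                   default: Optional[str] = None) -> Optional[str]:
    """Return the forwarder for the longest domain suffix matching query_name."""
    name = query_name.rstrip(".").lower()
    best_domain = ""
    best_target = default
    for domain, target in conditional.items():
        suffix = domain.rstrip(".").lower()
        matches = name == suffix or name.endswith("." + suffix)
        if matches and len(suffix) > len(best_domain):
            best_domain, best_target = suffix, target
    return best_target


# Hypothetical forwarder table for the case study's domains.
forwarders = {"na.corp.net": "10.1.0.10", "sa.corp.net": "10.2.0.10", "corp.net": "10.0.0.10"}
```

With this table, a query for a host in na.corp.net is sent to 10.1.0.10, a query for a host directly under corp.net goes to 10.0.0.10, and anything else falls through to the default forwarder.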

Problem: SRV records only in Root domain
Location of SRV: PDC GC Cname. In this case, there are two domain trees, w2k.net and corp.com. Internet access is thru corp.com only. A secondary zone is loaded on the NS in the w2k.net zone. All domains have a NS who is SOA for their zones. NA and EU have delegations from w2k.net. The problem is that when a zone transfer fails, replication and authentication fail, since the PDC, GC and Cname records are stored only in the corp.com zone. [Diagram: corp.com, w2k.net, NA.w2k.net, EU.w2k.net; legend: Zone Xfer, Forwarder]

Solution: Delegate _msdcs zone
Location of SRV records: PDC, GC, CNAME, held under the _msdcs, _tcp, _sites and _udp subdomains of corp.com, with _msdcs also available at w2k.net. The solution is to make those records available in the other zones by delegating the _msdcs.corp.com subdomain to name servers in the w2k.net domain. This makes the PDC, GC and CNAME records more readily available to w2k.net and its child domains. In some cases it may make sense to delegate the _msdcs zone to all three domains, depending on the connectivity and reliability of the network. [Diagram: corp.com and w2k.net with child domains NA.w2k.net and EU.w2k.net; the legend distinguishes delegations from forwarders.]

DNS Hotfix Symptom: Replication breaks
Configuration: standard secondary zones for the root _msdcs zone at child domains. Problem: the serial number of the secondary zone is higher than that of the primary, so zone transfers stop. Hotfix: Q304653, “The Serial Number Is Decremented in DNS When You Reboot.” Solved in .NET. Windows 2000 DNS currently has a bug that affects standard secondary zones. The symptom is that everything depending on the zone breaks: authentication, AD replication, and so on. Comparing the serial numbers shows that the secondary zone’s serial number is higher than the primary’s, which stops zone transfers. Cause: apparently, rebooting the primary name server for the domain causes its serial number to decrement. Fix: a post-SP3 hotfix that provides an updated DNS.exe. If you use standard secondary zones, apply this hotfix.
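
To see why a lower serial number on the primary stops transfers, here is a minimal sketch of the serial comparison a secondary makes, using the serial number arithmetic from RFC 1982. The function name is hypothetical, and this is not the Windows DNS code.

```python
HALF = 2 ** 31  # half of the 32-bit serial number space


def primary_is_newer(primary_serial, secondary_serial):
    """True if the secondary would treat the primary's serial as newer."""
    if primary_serial == secondary_serial:
        return False
    # Distance from the secondary's serial forward to the primary's,
    # wrapping around at 2**32. RFC 1982 leaves a distance of exactly
    # 2**31 undefined; this sketch treats it as "not newer".
    diff = (primary_serial - secondary_serial) % (2 ** 32)
    return diff < HALF


print(primary_is_newer(105, 100))  # primary ahead: transfer happens
print(primary_is_newer(100, 105))  # primary decremented: no transfer
```

When the primary's serial has gone backwards, the secondary sees its own copy as current and never asks for a transfer, which matches the symptom on the slide.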

DNS Troubleshooting Basics
Check the DNS event log (and the others). Check the location of DNS servers; you usually want a name server in remote sites. Check the population of SRV records under _msdcs, _tcp, _udp and _sites: each DC needs Kerberos and LDAP records with the correct address, and you can delete the records and repopulate them by restarting Netlogon. Check delegations for correct names and IP addresses. Check for DNS lookup errors. These appear in the description of events in the Directory Service log and sometimes in the System log, not just the DNS log. Clients in remote sites with name server issues may need a local DNS server, even if it is a caching-only one. Make sure any DC that is having problems has its SRV records registered properly in DNS. If there is any question whether the records are correct, delete them from DNS and restart Netlogon on the DC to re-register them. Note: multi-master replication sometimes makes this hard! If there are delegations, make sure the name servers that host the delegated zone are listed with the right name and IP address.
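
As an aid for the SRV check above, the sketch below lists the main record names Netlogon registers for a DC, so each can be looked up in turn. The function name is hypothetical and the list is not exhaustive; the PDC record is registered only by the PDC emulator and the GC record only by Global Catalog servers.

```python
def expected_srv_names(domain, site, forest_root=None):
    """Build the main SRV owner names to verify for a DC in `domain`."""
    forest_root = forest_root or domain
    return [
        f"_ldap._tcp.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_kerberos._udp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.{site}._sites.{domain}",
        f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}",
        f"_ldap._tcp.pdc._msdcs.{domain}",       # PDC emulator only
        f"_ldap._tcp.gc._msdcs.{forest_root}",   # Global Catalogs only
    ]


# Domain names from the case study; the site name is an assumption.
for name in expected_srv_names("na.corp.net", "Default-First-Site-Name", "corp.net"):
    print(f"nslookup -type=SRV {name}")
```

Each printed line is an nslookup query; a name that fails to resolve, or resolves to a stale address, points at the records to delete and re-register.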

DNS Troubleshooting Basics
Use of Active Directory Integrated (ADI) zones: put standard secondary zones on member servers; you can clear problems by switching to Standard Primary. Ping a DC by its SRV record: ping <guid>.site._msdcs.compaq.com. Clear the server cache to rule out negative caching problems. Test with the Monitoring tab in the server properties, and by pinging names and using NSLookup. The best way to solve ADI name resolution problems is to stop DNS on all ADI primary name servers and delete the zone (don’t delete it from the AD!), leaving it on ONE name server only. Delete any secondary zones as well. On the remaining ADI primary name server, change the zone to Standard Primary. Create secondary zones on one or two other name servers, depending on the network configuration. This forces all name servers to have the same information. After the problem is resolved, you can delete the secondaries, convert the standard primary back to ADI primary and add additional ADI primaries. Remember the earlier note about selecting a single ADI name server as the “primary” and having all other name servers in the domain point to it for DNS in their TCP/IP properties. You can ping SRV records and resolve them with NSLookup to see whether they can be resolved. In the DNS snap-in, go to the server properties; on the Monitoring tab you can run a simple query test and a recursive query test. Both should pass. Ping domain names, FQDNs of DCs, and so on, to be sure name resolution works.

Troubleshooting AD Replication

Replication Troubleshooting Tools
Event logs: Directory Services, System. Sites and Services snap-in. Age of Directories (AOD), from HP. Replication Monitor. Aelita Event Admin. NetPro Directory Analyzer. Command line (Support Tools and Resource Kit): DCdiag, Netdiag, Repadmin.exe. These are some of the best tools for troubleshooting AD replication. Aelita can be found at (URL not preserved). NetPro can be found at (URL not preserved).

Event Logs for Replication Troubleshooting
Directory Services log. Subnets not mapped: this will break the client’s “site awareness.” Not enough connectivity: serious; a connectivity or traffic issue. Sites with DCs and no site links: the site topology is incorrectly defined. DNS lookup failure. 1722, RPC server is unavailable: physical connectivity or DNS. These are some common events seen in the Directory Services (DS) event log. Note that “DNS lookup failure” can occur in the description of an event rather than as a particular event ID, and may appear in the System and Directory Service event logs in addition to the DNS log.

Event Logs for Replication Troubleshooting
System log. Netlogon errors: authentication, trusts, secure channel. W32Time errors: Kerberos authentication is required for replication, and DCs must be no more than five minutes out of sync. Watch time zones! Netlogon errors can be debugged with the NLTest and Netdom utilities, which test trusts, secure channels and authentication. These errors can be due to corruption, although they are more likely due to an inability to contact the target machine. Look for 1311 and 1722 events. Authentication and replication errors can also be caused by the system time being more than five minutes out of sync, which breaks Kerberos. You can use w32tm or net time to sync to a common time server. Syncing to an external source is not required by Windows 2000 but may be required by your company. Be careful when you move a DC between time zones!
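
The time zone warning is easier to see with numbers. Here is a minimal sketch that compares two DC clocks on the UTC timeline against the five-minute tolerance; the function name, offsets and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # Kerberos tolerance cited on the slide


def within_kerberos_skew(time_a, time_b, max_skew=MAX_SKEW):
    """Compare two timezone-aware timestamps; the wall-clock zone is irrelevant."""
    if time_a.tzinfo is None or time_b.tzinfo is None:
        raise ValueError("use timezone-aware datetimes; naive times hide zone errors")
    return abs(time_a - time_b) <= max_skew


eastern = timezone(timedelta(hours=-5))
central_europe = timezone(timedelta(hours=1))

dc1 = datetime(2002, 10, 6, 9, 0, tzinfo=eastern)          # 14:00 UTC
dc2 = datetime(2002, 10, 6, 15, 3, tzinfo=central_europe)  # 14:03 UTC
dc3 = datetime(2002, 10, 6, 9, 3, tzinfo=central_europe)   # 08:03 UTC

print(within_kerberos_skew(dc1, dc2))  # True: three minutes apart in UTC
print(within_kerberos_skew(dc1, dc3))  # False: same wall clock, wrong zone
```

The third DC shows the classic mistake: the displayed time looks right after a move, but the zone was never changed, so its UTC time is hours off.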

Sites and Services Snap-in
Check for duplicate connection objects, where the KCC has generated more than one connection between the same two DCs. Delete all connections and select the “check replication topology” option to regenerate them. If the duplicates come back, find out why; it is usually a DNS problem. Duplicates break FRS and AD replication. If duplicate connection objects continue to reappear, the most likely cause is a DNS resolution problem. It is not a bad idea to clean up connection objects by deleting them and letting the KCC regenerate them, either by forcing it or by waiting for normal replication. It is not uncommon for stale connection objects to be left behind by the KCC, for example when a DC is down and the KCC temporarily reroutes around it. Keep an eye on these!
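
If you export the connection list, a few lines of script will flag duplicates. A minimal sketch, assuming the connections are available as (source DC, destination DC) pairs; the function name and sample data are hypothetical.

```python
from collections import Counter


def duplicate_connections(connections):
    """Return {(source, destination): count} for pairs that appear more than once."""
    counts = Counter((src.lower(), dst.lower()) for src, dst in connections)
    return {pair: n for pair, n in counts.items() if n > 1}


# Connections are one-way (inbound), so DC1->DC2 and DC2->DC1 are
# two different connections, not duplicates of each other.
sample = [("DC1", "DC2"), ("DC2", "DC1"), ("dc1", "DC2"), ("DC3", "DC1")]
print(duplicate_connections(sample))
```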

Check for sites with no DCs. It is OK to have a site with no servers if you plan it that way. If there should be a server in that site, find it and move it there. Make sure all subnets are mapped to the correct sites, and keep up with IP addressing changes. Mapping IP subnets to the appropriate sites is critical for clients to find local DCs for authentication and local DFS servers, and for each DC to be placed in the correct site. The mapping is not enforced, but if you skip it these lookups break and replication will not be efficient.
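
A quick way to audit subnet mappings is to run a list of client or DC addresses through the same most-specific-subnet lookup the directory uses. This is a sketch only; the function name, subnets and site names are hypothetical.

```python
import ipaddress


def site_for_address(address, subnet_to_site):
    """Return the site of the most specific subnet containing the address, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for cidr, site in subnet_to_site.items():
        net = ipaddress.ip_network(cidr)
        # Prefer the longest prefix when several subnets contain the address.
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, site)
    return best[1] if best else None


SUBNETS = {
    "10.1.0.0/16": "Atlanta",
    "10.1.5.0/24": "Atlanta-Lab",
    "10.2.0.0/16": "London",
}

for addr in ("10.1.5.20", "10.1.9.9", "10.9.0.1"):
    print(addr, site_for_address(addr, SUBNETS))
```

An address that maps to None is the “subnets not mapped” condition: that client has no site and may authenticate against any DC.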

Make sure site links are correct: link the correct sites per the design (you need a drawing), and check cost, schedule and replication frequency. Force replication between DCs: all connections are inbound, so force replication between replication partners in both directions, on DC1 from DC2 and on DC2 from DC1. Use “check replication topology.” Create a new site and a new user named for the DC; this checks the Configuration NC and the Domain NC. One of the best tests for replication is to force replication between DCs in the Sites and Services snap-in. You can create manual connections between DCs for which the KCC did not generate one. Create a user and a site on both a good DC and a bad DC, naming them for the DC (e.g. DC1), and let replication occur. See whether all objects replicate to all DCs. If they do not, the missing objects show whether inbound or outbound replication is broken.

Validate inbound and outbound replication on all DCs. Create a new site and a new user named for each DC; this checks the Configuration NC and the Domain NC. Wait for replication (don’t force it), then check each DC for copies of these users and sites. For example, create users DC1, DC2 and DC3 on DC1, DC2 and DC3 respectively, and create sites in the same manner. Let replication happen (don’t force it) and see which DCs have which objects. In the example we see that there is no inbound replication from DC3 to DC1 for the Domain NC (no user DC3 on DC1), no inbound replication from DC1 to DC2 for either the Domain or the Configuration NC, and no inbound replication from DC2 to DC3 for either NC.
Objects found on each DC:
DC1: users DC1, DC2; sites DC1, DC2, DC3
DC2: users DC2, DC3; sites DC2, DC3
DC3: users DC1, DC3; sites DC1, DC3
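
Reading the results of this test is mechanical, so it can be scripted. The sketch below takes, for each DC, the set of DCs whose test object is visible there, and lists the (origin, target) pairs that did not arrive. The function name is hypothetical, and the data is my reading of the slide's example.

```python
def missing_inbound(observed, dcs):
    """observed[dc] = set of origin DCs whose test object is visible on dc."""
    gaps = []
    for target in dcs:
        for origin in dcs:
            if origin != target and origin not in observed.get(target, set()):
                gaps.append((origin, target))
    return gaps


DCS = ["DC1", "DC2", "DC3"]
# Test users check the Domain NC; test sites check the Configuration NC.
users_seen = {"DC1": {"DC1", "DC2"}, "DC2": {"DC2", "DC3"}, "DC3": {"DC1", "DC3"}}
sites_seen = {"DC1": {"DC1", "DC2", "DC3"}, "DC2": {"DC2", "DC3"}, "DC3": {"DC1", "DC3"}}

print("Domain NC gaps:", missing_inbound(users_seen, DCS))
print("Configuration NC gaps:", missing_inbound(sites_seen, DCS))
```

A gap only shows that the object never arrived; with more than three DCs the break may be anywhere along the replication path, so follow up with the direct partners of the target DC.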

Check Cname DNS Records
In the root _msdcs zone (only), an alias (CNAME) record ties each DC’s server GUID to its FQDN. There should be only one record per DC; delete duplicates. Match the GUID in the alias record to the GUID reported by Repadmin /showreps. If in doubt, delete the DC’s alias record(s) and restart Netlogon on the broken DC to re-register it. CNAME records exist only in the root zone of a multi-domain configuration. They provide name resolution that is required for every replication operation.

Age Of Directories Tool - Demo
If interested, contact me. Age of Directories (AOD) is an unsupported internal tool written by HP that gives a 3D view of the replication topology, including metadata information and errors. Contact me if you would like a copy.

Replication Monitor Status report (replication health report)
List of all GCs, bridgehead servers and trusts. List of all replication errors on all DCs in the domain. Changes not replicated. Replication partners. Force push/pull replication. Metadata. Group Policy Object status. FSMO validation. Inbound connections (including the reason). Replication Monitor (ReplMon) provides a wealth of tools for debugging replication problems. Add a server to the configuration in ReplMon and right-click it to display its properties. Select Generate Status Report for a detailed listing of replication functions, including DNS. Take time to browse the options in all the menus and experiment with them. All output can be sent to a text file with the Save As button on each screen. Some things ReplMon does that no other tool does: it shows object metadata in a clear, easy-to-read table; it forces a push replication operation; it not only tells you which machines are FSMO role holders but also whether they can be contacted; it lists the replication events (only) from every DC in a selected domain in one place, so you don’t have to sort through all those event viewers; it lists GPO status, that is, whether a DC has received a GPO update; it gives a graphical view of DCs showing replication partners (direct or transitive), GCs, and which NCs they replicate; and it lists all GCs in the domain or enterprise.

Replication Monitor

Command-Line Utilities
RepAdmin, in the Support Tools, is perhaps the most useful tool for troubleshooting replication. /showreps lists inbound and outbound connections (it is the only command that lists outbound connections), the server GUID (used for replication), successful replication messages, replication errors, and the replication partner used for every naming context, inbound and outbound. RepAdmin is one of the best tools for quickly troubleshooting replication, particularly the /showreps switch. This switch lists the server GUID (used for validating the CNAME DNS record) and the success or failure of replication for each partition, identifies direct replication partners, and is the only command that lists outbound connections, though it doesn’t report outbound success or errors.

NTDS Diagnostic Logging
Registry key: HKLM\System\CurrentControlSet\Services\NTDS\Diagnostics. Set each value to a level from 0 to 5, where 0 is off and 5 is most verbose; start with 3. Output is reported in the event log. Important values:
1 Knowledge Consistency Checker
5 Replication Events
8 Directory Access
9 Internal Processing
13 Name Resolution
18 Global Catalog
This logging is invaluable for diagnosing problems. Careful: these settings fill up the event logs fast! Set them back to zero when you have finished testing. For replication problems, use Replication Events, Knowledge Consistency Checker and Name Resolution. Internal Processing can be set to dump additional information about internal errors.
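
Because these levels need to be raised for testing and then set back to zero, it can help to generate both changes as .reg text that can be reviewed before import. A minimal sketch: the function name is hypothetical, and it assumes the slide's "CCS" abbreviation stands for CurrentControlSet and that each value name includes its leading number, as listed on the slide.

```python
NTDS_DIAG_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics"

# The three values the notes recommend for replication problems.
REPLICATION_VALUES = [
    "1 Knowledge Consistency Checker",
    "5 Replication Events",
    "13 Name Resolution",
]


def diagnostics_reg(level, value_names=REPLICATION_VALUES):
    """Return .reg file text that sets each named value to `level` (0-5)."""
    if not 0 <= level <= 5:
        raise ValueError("level must be between 0 (off) and 5 (most verbose)")
    lines = ["Windows Registry Editor Version 5.00", "", f"[{NTDS_DIAG_KEY}]"]
    lines += [f'"{name}"=dword:{level:08x}' for name in value_names]
    return "\r\n".join(lines) + "\r\n"


print(diagnostics_reg(3))  # turn logging on for a test
print(diagnostics_reg(0))  # turn it back off afterwards
```

Generating the "off" file at the same time as the "on" file makes it harder to forget the cleanup step.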

Things that break Replication (or indicate that it’s broken)
Duplicate connection objects. Orphaned objects, especially DC objects, caused by a DC being removed from the domain without a successful DCPromo. Garbage collection initiated manually before all DCs and GCs are fully replicated. These are reported in the event logs. There should never be duplicate connection objects, that is, two connection objects between the same two computers. Orphaned objects are left by DCs that are removed from the network without a graceful demotion; this will break replication to other DCs. Forcing garbage collection too soon may delete a parent while a child object still exists; this causes a lot of problems until it is cleaned up, and it is hard to find.

DC unavailable: the DC is down, name resolution fails, or there is a network problem. DNS misconfigured: TCP/IP addresses change; delegation; client resolver configuration (including name servers); DHCP scope configuration for DNS registration; failure to contact a DNS server (for SRV records). If a DC is unavailable, whether due to network problems, CPU or resource problems, DNS name resolution or anything else, replication can’t take place. The KCC will ultimately route around it, but that DC will be out of date. If the TCP/IP address of a DC or DNS server changes, everything that refers to it must be updated: the TCP/IP properties on affected machines, DNS delegations and forwarders on the DNS servers, and so on. Failing to do so will cause replication failures, because the alias (CNAME) records or SRV records can no longer be resolved. That in turn causes client authentication failures and application failures, for example Exchange being unable to find a Global Catalog server.

The KCC doesn’t do its job. It routes around inaccessible DCs by creating duplicate connection objects. When the DCs come back online, the KCC should clean up the duplicate connection objects. It usually doesn’t. This causes replication errors and events in the DS log, and the objects need to be cleaned up manually. When a DC is unavailable, the KCC will, after a couple of cycles, create a replication topology around it. These are temporary connection objects that, in theory, are deleted when the DC is available again. However, the KCC doesn’t do a great job of cleaning them up, so the administrator must monitor them and keep them cleaned up. Note that this behavior is much better in Windows .NET.

Lingering Object Behavior
Basics. Scenarios.

Object Deletions
Deleted objects turn into tombstones. Tombstones are replicated to other DCs; this is how replication partners learn that an object was deleted. Tombstones are purged from the local database after the tombstone lifetime has expired (AD: 60 days, adjustable, 2 days minimum; SYSVOL: 60 days). If a tombstone does not replicate to a DC, the object deletion is not replicated: the object is not deleted on that DC and is now a lingering object. This can happen on a DC or a GC. Rule: tombstone lifetime = maximum time a DC can be disconnected = maximum lifetime of a backup tape.
Before we cover lingering objects, let’s review how objects are deleted in Active Directory and how deletions are replicated. When the administrator deletes an object in the directory, the object is not removed from the database immediately; it is turned into a tombstone. The tombstone is then replicated to all the other domain controllers, and they learn that the object was deleted when they replicate the tombstone in. After the tombstone lifetime, which is 60 days by default, the tombstone is purged from the database; this is a local operation on each DC. So 60 days after the deletion, the tombstone is gone. In Active Directory you can change the tombstone lifetime to a higher or lower value, although Microsoft doesn’t recommend a lower value. You can go higher if you think you need to, and that is recommended in some disconnected environments. SYSVOL also uses tombstones, but you can’t adjust its tombstone lifetime; it is fixed at 60 days. The issue arises when a DC is disconnected from the rest of the environment for more than 60 days. Suppose an object is deleted on another DC while it is away. That DC replicates the tombstone to all the replication partners it can reach, and after 60 days the tombstone is purged everywhere.
If you reconnect the disconnected DC after, say, 100 days, the tombstone will never be replicated to it, so this DC will never learn that the object was deleted. For a while the object simply exists on this DC, but when you change an attribute on it, the attribute change is replicated out to all the other domain controllers. The other domain controllers no longer have a copy of the object, so they request a full copy, and this is how these objects come back into the environment as lingering objects. Microsoft has another term for them: zombies. Zombies are something that you shoot, and they disappear for a while, and then they come back and just hang around or do weird stuff to the system. So the rule of thumb is that the tombstone lifetime defines the maximum time a DC can be disconnected and also the maximum lifetime of a backup tape.

Lingering Objects – Scenarios
A deleted object reappears on all domain controllers in a domain and on all GCs. A deleted account does not disappear from the Exchange GAL. An object was moved between domains and a disconnected GC is brought online. A replication error occurs on a GC when a new object is created, because a lingering object still holds an attribute where uniqueness is enforced (samAccountName). Exchange cannot create a mailbox because the object already exists. What are some simple scenarios where you could run into lingering objects? For example, a deleted object reappears on all the DCs in your domain and on the Global Catalog servers, say a user who was fired and whose object comes back. Another scenario: you delete an account but it doesn’t disappear from the Exchange GAL, which means it is still present on one of the Global Catalog servers, probably one that was disconnected for a long time. Or an object was moved between domains, and a Global Catalog that was disconnected during the move is brought online again; this can create problems because the lingering object is still in that Global Catalog. Or you get replication errors on the Global Catalog server when you try to create a new object, because a lingering object (zombie) still holds an attribute that must be unique within the domain or forest. These are some symptoms of lingering objects.

Why does this happen? DCs disconnected for more than the tombstone lifetime: left in a storage room for a long time. Replication failures: for example, bridgehead servers overloaded with no monitoring in place, or WAN connections down for a long time. Tombstone lifetime abuse: “somebody” changed the time on a DC to garbage collect an object, or the tombstone lifetime was changed to garbage collect objects on single servers. Can this be avoided? YES: monitor the KCC topology and replication; do not set the tombstone lifetime to less than 60 days; DCs that are offline longer than the tombstone lifetime must be re-promoted. The first reason this can happen is that a domain controller was disconnected for more than the tombstone lifetime because it was left in a storage room for a long time. Microsoft has seen this in a branch office deployment where someone built 1200 domain controllers, put them in a storage room, and wanted to send them out three or four months later. This is not a good idea; you are looking for trouble. The second reason is replication failures: domain controllers don’t replicate for a long time and nobody notices because there is no monitoring in place. The third is that the tombstone lifetime was changed to 2 days, after which all kinds of things can happen, or someone changed the time on a DC and now weird things are happening; of course, you can never determine who that somebody was after the event. Can you avoid these situations? Yes, definitely. 1) Get monitoring in place so that you learn when domain controllers have not replicated for a long time. 2) Don’t set the tombstone lifetime to less than 60 days, not even to reclaim space in the database or to purge tombstones. This is not a good idea because the system doesn’t like it. One other situation: if a DC has been offline for more than the tombstone lifetime, you need to re-promote it. (I’ll cover this more in the Best Practices section.)

Lingering Objects Strict vs. Loose Replication Behavior
Defines how a DC reacts if an update for an object is replicated in and the object does not exist on the DC. Loose behavior: the DC requests a full copy from the replication source and logs event ID 1388. Strict behavior: the DC stops replication from the offending replication source and logs error code 8240 (ERROR_DS_NO_SUCH_OBJECT) embedded in event ID 1084; this requires logging level 1. The behavior can be set via the registry value HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NTDS\Parameters\Strict Replication Consistency. Introduced in Q314282 and SP3. In SP3, changes were made to detect these lingering objects, or zombies. Next I will talk about replication behavior, which can be loose or strict. In loose behavior, you change an attribute on the object and replicate it out; the other DCs don’t have the object, so they request a full copy. You can see that this has happened by looking in the event logs for event ID 1388. Of course, you may not be able to look at the event logs on all systems, so hopefully you have MOM in place to send these messages to an administrative console. There is a very good KB article that describes this in more detail: “Lingering objects prevent Active Directory replication from occurring.” In strict behavior, when an attribute is replicated in for an object that doesn’t exist locally, in other words from a lingering object or zombie, we isolate the offending DC from the rest of the environment: we stop replicating in from it. In strict behavior, event ID 1084 is logged with error code 8240 in the event logs. After you install SP3, you can set a registry value on the DC to toggle between loose and strict behavior. What should you do when you find a lingering object? How do you get rid of it?
If you find a lingering object on a DC under loose behavior, just delete the object with your favorite admin UI, and the deletion will be replicated around. Under strict behavior it is a bit more complicated: there are procedures in two KB articles on how to delete the objects on all the domain controllers that still have lingering objects. These two articles are the most useful ones when you are dealing with a lingering object problem.

Deleting Lingering Objects
If found on a DC. In loose behavior: delete the object via Users and Computers. In strict behavior: follow the procedures outlined in Q314282. On a GC (in a read-only NC), the object cannot be changed or deleted on the GC. Solution 1: delete the object on a writeable replica (if possible). Solution 2: use ldp to delete the object on the GC; support for removing lingering objects from a GC was added in Q314282, so follow the procedures outlined there. You might have to set loose behavior temporarily. On a Global Catalog server it is a bit harder, because the Global Catalog is not a writeable copy of the Active Directory database. You can’t delete an object on the Global Catalog server unless it is in the naming context for which the GC is also a DC; then, of course, you can. So what should you do if you find a lingering object on a Global Catalog? Solution 1: if the lingering object also exists on a DC in the domain where the object belongs, delete it there. But sometimes the lingering object exists only on the Global Catalog servers and not in the domain. In those scenarios you need to use a tool like ldp, and the KB article has procedures that tell you how to delete the object from the Global Catalog right away. To make this easier, you might want to switch to loose behavior temporarily.

Best Practice Recommendations
DC has not replicated for more than 60 days. With the default tombstone lifetime (60 days): do not replicate; re-install the OS. With the tombstone lifetime adjusted to more than 60 days: if 60 days < time DC disconnected < tombstone lifetime, re-connect the DC and restore SYSVOL; the remaining case is time DC disconnected > tombstone lifetime. If you have to disconnect a DC, make sure that it replicates successfully before you take it offline. New deployments: add the registry key to enforce strict replication behavior at DC OS installation time. Now for existing deployments, where you want to be careful. You don’t want to make a change to the whole environment and then see what happens. If you have a lot of lingering objects, strict mode means that domain controllers stop replicating, so proceed carefully. Even after installing SP3 and the fix from the KB article, the default behavior is still loose replication, so you should get to strict mode as soon as possible.
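
The reconnect rules on this slide can be written as a small decision function. This is a sketch of the guidance as I read it, not an official procedure: the function name is hypothetical, the action for a DC disconnected longer than the tombstone lifetime combines this slide with the earlier “must be re-promoted” rule, and the under-60-days branch is an assumption the slide does not state explicitly.

```python
SYSVOL_TOMBSTONE_DAYS = 60  # fixed for SYSVOL, per the Object Deletions slide


def reconnect_action(days_disconnected, tombstone_lifetime_days=60):
    """Suggest what to do with a DC that has been disconnected for some days."""
    if days_disconnected > tombstone_lifetime_days:
        # AD tombstones may already be purged: lingering objects are possible.
        return "do not replicate; re-install or re-promote the DC"
    if days_disconnected > SYSVOL_TOMBSTONE_DAYS:
        # AD is still safe, but the fixed SYSVOL lifetime has been exceeded.
        return "re-connect the DC and restore SYSVOL"
    # Assumption: within both lifetimes, normal replication catches up.
    return "re-connect the DC and let it replicate"


print(reconnect_action(100))       # default 60-day lifetime
print(reconnect_action(90, 180))   # lifetime raised to 180 days
print(reconnect_action(30))
```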

More Best Practice Recommendations
Existing deployments. Default setting: loose replication (even on SP3). Goal: get to strict mode as soon as possible. Set the registry key to strict mode on all DCs and watch the event logs on the DCs. If you get many replication errors on a single DC, re-promote that DC. For a small number of replication errors, clean up the DC: delete lingering objects if necessary, following the procedures outlined in Q314282. If you were monitoring, then don’t worry; you won’t see any replication errors. Don’t lower the tombstone lifetime to less than 60 days. Monitor! So what you should do is set the registry key to strict mode on all domain controllers, or on selected domain controllers, and then watch the event logs on those domain controllers for replication errors. If you don’t get replication errors, you are fine. If you do, follow the article on deleting lingering objects, then go back to strict mode and run in strict mode from then on; you won’t see these problems anymore, or you will be able to react to them right away. If you have been monitoring your environment, don’t worry: you won’t have any lingering objects, so you can go to strict behavior right away. Also, don’t use a tombstone lifetime of less than 60 days and, in case I didn’t mention it, always monitor your environment.

Lingering Object Fix Q317097 (good instructions)
Registry key: HKLM\System\CurrentControlSet\Services\NTDS\Parameters… Add value name: Correct Missing Object. Data type: REG_DWORD. Value: 1 (tight) or 0 (loose). This allows or restricts AD replication when lingering objects are discovered. Use tight when you want to know about them; use loose to inventory and remove the objects. The registry value that controls this is shown on the slide: 0 is loose and 1 is tight. The KB article has good instructions on how to do this. Note also that you can use this value to repopulate objects deleted by accident if they still exist in the AD (see KB Q317097).

Value Level Replication
Windows NT: object replication; a change to an attribute or value replicates the whole object. Windows 2000: attribute-level replication, which is more efficient than NT; a change to an attribute replicates the attribute, and a change to a value also replicates the whole attribute. Problem: multi-valued attributes. In a group, the membership is the attribute and each member is a value, so changing one member replicates the attribute with all members. This impacts network traffic and led to a limit (per Microsoft) of 5,000 users per group. .NET: value-level replication replicates values, not attributes, and eliminates the 5,000 users per group limit. In Windows NT we replicated objects, so if we changed the address attribute on a user object we had to replicate the whole object. This caused undue use of network bandwidth. In Windows 2000 we replicate attributes, so we can replicate the address attribute without the rest of the user object. The problem is multi-valued attributes, which are attributes that contain more than one value. A good example is a group: the membership is a single attribute, and the group members are values in that attribute. If a group has 500 members and we remove one, we have to replicate the entire membership attribute with its 499 remaining values. This led to a recommendation by Microsoft that a group should not have more than 5,000 members, because of replication efficiency. Admins got around this by using nested groups. In Windows .NET we can replicate individual values, so we can add or remove a group member and replicate just that one member, not the whole group. This eliminates restrictions such as the 5,000 users per group limit.
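
The difference is easy to quantify with a toy model that counts member values sent to each replication partner for a single membership change. The function name is hypothetical and the model ignores per-value metadata and packet overhead.

```python
def values_sent(members_after_change, values_changed, value_level):
    """Member values sent to each replication partner for one group change."""
    if value_level:
        # Windows .NET: only the added or removed values replicate.
        return values_changed
    # Windows 2000: the whole multi-valued attribute replicates.
    return members_after_change


for size in (500, 5000):
    remaining = size - 1  # one member removed
    print(size,
          values_sent(remaining, 1, value_level=False),
          values_sent(remaining, 1, value_level=True))
```

For the 500-member example in the notes, attribute-level replication sends 499 values for a one-member change; value-level replication sends one.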

Domain Limit
There is a limit of about 800 child domains under a single parent. Child domains are stored in an unlinked, multi-valued attribute, the crossref attribute of the domain object, and the Jet database limits the data that can be stored there. There is no way to patch this; Jet itself must change. It “might” be improved in Longhorn (not Whistler). Microsoft recently discovered a hard limit on the number of child domains that can be created under a single parent. Because child domains are stored in the unlinked, multi-valued attribute CrossRef, there is a limit on how much data can be stored in the database cell that the attribute references. It turns out to be a limit of about 800 domains. Since this is a limit in the Jet database, it isn’t an easy fix. It is not improved in Windows .NET but “may” be improved in the Longhorn or Blackcomb versions.

Domain Limit
One customer got to 900 domains. Replication failed, authentication failed, and a mission-critical application failed. Temporary repair: demote domains in reverse order of creation to return to 800; this fixed replication. Solution: redesign and redeploy to a single domain. One customer had created 900 domains under a single parent. Replication stopped, authentication at some sites failed, mission-critical apps failed and there were other problems. The first step was to demote the last 100 domains in the reverse order in which they were created. It has to be the last 100 (to get back to the 800 limit); you can’t just take out ANY 100 domains. That fixes replication but leaves those 100 domains out of the forest. The real solution was to redesign the AD. They moved to a single domain with a number of OUs, but had to completely tear down the AD and migrate users to the new structure.
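
The ordering requirement in the temporary repair can be sketched as follows, assuming you have the child domains listed in creation order. The function name and the generated domain names are hypothetical.

```python
def domains_to_demote(domains_in_creation_order, limit=800):
    """Return the excess domains, newest first, to get back to `limit`."""
    excess = len(domains_in_creation_order) - limit
    if excess <= 0:
        return []
    # Take the most recently created domains and reverse them, so the
    # newest domain is demoted first.
    return list(reversed(domains_in_creation_order[-excess:]))


children = [f"child{n:03d}" for n in range(1, 901)]  # 900 made-up names
plan = domains_to_demote(children)
print(len(plan), plan[0], plan[-1])
```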

DCPromo Troubleshooting

DCPromo Basics
DCPromo is the first test of: DNS registration and resolution; LDAP query and response; Kerberos authentication; Active Directory replication; FRS replication; application of group policy. Validation and Flow … Chapter 2, Active Directory Data Storage, in the Windows 2000 Resource Kit. DCPromo is the first test on a server of DNS registration, LDAP query, Kerberos authentication, replication and group policy application. Thus there is considerable opportunity for failure. See chapter 2 of the Resource Kit for an excellent discussion of DCPromo.

DCPromo Logs
Location: %windir%\debug. Files: Dcpromo.log, Dcpromoui.log, Dcpromoui.xxx.log. Set verbosity on dcpromoui.log under HKLM\Software\Microsoft\Windows\CurrentVersion\AdminDebug. Values: DCpromo and DCPromoui. Data = Default 0xFF003 – full file and debugger logging output 0xFF001 – maximum detail to DCPromoui.log. Very helpful logs are created in %windir%\debug. Dcpromo.log is created the first time dcpromo runs on a computer and is appended to on each successive run, so the most recent entries are at the bottom. Dcpromoui.log and dcpromoui.xxx.log contain more verbose information generated by the dcpromo wizard, including all the messages seen in the GUI during dcpromo. The second time dcpromo runs, the existing dcpromoui.log is renamed to dcpromoui.001.log and a new dcpromoui.log is created, so dcpromoui.log is always the most recent. The slide describes how to set verbosity on dcpromoui.log, which is recommended if dcpromo fails.

DCPromo Phases Initialization UI Input - DNS Name resolution
LDAP Query/resp - Kerberos Authentication AD Replication FRS Replication Wrap Up Apply policy - Upgrade Trusts Publish new DC in the DS The major steps in dcpromo are: UI input, DNS name resolution, LDAP query/response, Kerberos authentication, and AD replication (inbound from a DC to put a copy of the AD on this machine). After the reboot, dcpromo creates an outbound connection, copies the sysvol tree and group policy templates from a DC, upgrades trusts, and publishes the new DC in the Active Directory.

Initialization Phase Authorization error
Enterprise Admin required to create new domain (or to remove the last one). Domain Admin required to add replica DC (or demote a replica). Can’t find DNS with Dynamic Updates. Prompt to let DCPromo configure DNS. Creating domain. Answer NO! Replicas, Child – must find DNS server to locate a “sourcing DC.” In the initialization phase, credentials are checked: you must be an Enterprise Admin (to create a new domain) or a Domain Admin (to create a replica). If dcpromo can’t find a DNS server that allows dynamic updates, it will fail and ask if you want dcpromo to install and configure DNS. Select the option that states that you will do it later. If you let dcpromo do it, it will create a “.” (dot) zone and you won’t be able to get above this level (to the internet or higher domains in the tree). Note that Windows .NET does not do this anymore. For replicas and child domains, you must find a DNS server that will return the SRV record of a DC that will partner with you to get the AD.

Errors Creating the Computer Account
Need privileges to create the account. First creates the account, puts it in domain/computers container. Then puts it in domain controller’s OU. Source DC identified in DCPromo logs. These are steps dcpromo uses to create the computer account. If this step fails, make sure your acct has privs and that you can reach a DC. This information is contained in the dcpromo.log and dcpromoui.log files.

DCPromo Initialization Checklist
Privileges required Enterprise Admin if creating new domain. Domain Admin if creating a replica. System time configured properly Kerberos requires sync within five minutes. All parent, child domain DCs. Sufficient free disk space. ~850 MB Domain Naming Master FSMO required if creating new domain. These are requirements for dcpromo to succeed.
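
The two numeric items on this checklist (clock sync within five minutes, roughly 850 MB of free disk) are easy to check before running dcpromo. The following is a minimal Python sketch, not a Microsoft tool; the function name `preflight_problems` and its inputs are hypothetical, and how you obtain the DC's time and the free disk figure is left to you.

```python
from datetime import datetime, timedelta

# Thresholds taken from the checklist above.
MAX_KERBEROS_SKEW = timedelta(minutes=5)
MIN_FREE_DISK_MB = 850  # approximate figure from the slide


def preflight_problems(local_time, dc_time, free_disk_mb):
    """Return a list of checklist violations (empty list means both checks pass)."""
    problems = []
    skew = abs(local_time - dc_time)
    if skew > MAX_KERBEROS_SKEW:
        problems.append(f"clock skew of {skew} exceeds five minutes")
    if free_disk_mb < MIN_FREE_DISK_MB:
        problems.append(f"only {free_disk_mb} MB free, need about {MIN_FREE_DISK_MB} MB")
    return problems


if __name__ == "__main__":
    dc_clock = datetime(2002, 10, 6, 12, 0, 0)
    print(preflight_problems(datetime(2002, 10, 6, 12, 7, 0), dc_clock, 500))
```

Privileges and FSMO availability are not covered by this sketch; those still have to be checked by hand.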

Everyone or Enterprise DC group has “Access this computer from network” Enterprise DC group rights: Manage Replication Topology. Replicating Directory Changes. Replication Synchronization. Sourcing DC Security policy applied. Enable Computer and user account to be trusted for delegation. More requirements for dcpromo: ensure all of these exist (they are normally the default, but strange things can happen).

Target DC has valid Kerberos tickets. Kerbtray.exe utility from Resource Kit. GC must be contacted. Nltest /dsgetdc:compaq.com /GC Able to contact a functional existing DC. Uses UDP (watch for firewall issues). Can use TCP but it’s a Microsoft Secret! Use Ping, NLTest, Nslookup to find a DC. Kerberos is required for DCPromo to succeed. Kerbtray.exe will list Kerberos tickets. If the list is empty, Kerberos isn’t working, which means either that the machine can’t find a DNS server with an appropriate SRV record for a Kerberos KDC, or that it can’t connect to the DC returned by DNS. For native mode domains, a GC must be reachable. Use NLTest to see if you can find one that way. You may need to make a local GC if connecting over a WAN. DCPromo uses UDP to find a DC; if UDP is blocked at the firewall, dcpromo will fail. Microsoft can tell you how to force it to TCP, but it is terribly slow and you don’t want to leave it like this.

If Source DC not Reachable...
See if one responds. Ping FQDN of domain (Ping compaq.com). NLTest /dsgetdc:compaq.com /ds Other: /gc /pdc /timeserv Check Site mapping for this computer. Nltest /server:<name> /dsgetsite Check Dcpromoui.log to see source. Force DCPromo to use a specific source Q224390 Turn off Netlogon on other DCs. Join the Server to the domain then DCPromo. Use Ping and NSLookup to see if you can resolve a DC by name. Check for errors in dcpromo.log and dcpromoui.log. You can use an unattended answer file to force dcpromo to source from a particular (local) DC if that’s the problem. Try joining the server to the domain as a member server, then promoting it to a DC. This proves that DNS, etc. is working and gives DCPromo a boost, since it no longer has to join the domain and create the computer account.

Info to Collect for Debug
Netdiag /v Problem DC Source DC (see dcpromo.log) DCDiag /v Source DC Replication working? (other DC in site) Look at Netdiag and DCdiag output for errors

AD & FRS Replication Phases
Initially inbound connection created to replicate from source DC. Machine acct (DC1$) moved to DC OU. UserAccountControl Attribute set 4096 (1000 hex) = Workstation/Server (82000 hex) = DC Account is moved. Error: DC1$ not found, access denied, etc. Credentials of account running Dcpromo Source must have computer object. Source must have security policy applied to itself. Q250874 During the first phase of Dcpromo (before the reboot), an inbound connection is made from a good DC, the machine account is created and moved to the Domain Controllers OU, and the UserAccountControl attribute on the computer object is set to 82000 hex. If you get an error saying “DC1$” is not found, access denied, etc. (where DC1 is the name of the computer), check the credentials of the account used for dcpromo, and make sure the source DC is healthy (replicating to others), has a computer object in the Domain Controllers OU, and has security policy applied to itself (check the source DC’s event logs). See Q250874.
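
When checking the UserAccountControl value in LDP or a script, it helps to decode it rather than compare raw numbers. The sketch below is illustrative only. The two values from the slide are 4096 (0x1000) and 0x82000; the flag names and the split of 0x82000 into 0x2000 plus 0x80000 come from the standard userAccountControl flag table, not from this slide, and the function name `describe_uac` is hypothetical.

```python
# Standard userAccountControl bit flags (names from Microsoft's flag table).
WORKSTATION_TRUST_ACCOUNT = 0x1000   # 4096: member workstation or server
SERVER_TRUST_ACCOUNT = 0x2000        # domain controller account
TRUSTED_FOR_DELEGATION = 0x80000     # account trusted for delegation


def describe_uac(value):
    """Classify a computer account by its UserAccountControl value."""
    if value & SERVER_TRUST_ACCOUNT:
        return "domain controller"
    if value & WORKSTATION_TRUST_ACCOUNT:
        return "workstation/server"
    return "other"


if __name__ == "__main__":
    for uac in (4096, 0x82000):
        print(hex(uac), uac, describe_uac(uac))
```

If the new DC's account still decodes as "workstation/server" on the source DC after the reboot, the attribute change has not replicated back.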

AD & FRS Replication Phases
After first reboot… Outbound connection created. AD changes for new DC replicated to source. Including UserAccountControl attribute. Server (Replication) object. Replicated to other DCs. Sysvol is populated (policies copied to new DC). Sysvol and Netlogon Shares created. After the reboot, an outbound connection is created from the new DC to the source, and the new DC’s info is replicated to the source, which then replicates it through the forest and domain. The sysvol tree, including group policies, is copied to the new DC, and the Sysvol and Netlogon shares are created.

Troubleshooting Missing Sysvol, Netlogon Shares
Outbound connection failed Look in Sites and Services or Repadmin UserAccountControl still 4096 on source [Q257338] – Good but … Build manual “outbound” connection Force KCC to “Check Replication Topology” Check UDP traffic if in a remote site. Dcpromo may proceed without errors up to the reboot, yet fail in the post-reboot phase without reporting errors. Check whether the Netlogon and Sysvol shares were created (C:> net share). Check whether the UserAccountControl attribute is correct for a DC (previous slide), and make sure good DCs see the correct value for the new DC. You can create a manual replication link with the repadmin /add command to repair this.

Missing Sysvol and Netlogon Shares
Create replication “links” manually then force replication: Repadmin /add (adds outbound link) Repadmin /sync (forces replication) Can’t create them manually. When Replication is fixed, they’ll get created. you can try creating manual replication links from the new failed DC to existing good DCs and try forcing replication across it. This works surprisingly well! You can’t create the netlogon and sysvol shares manually to fix the problem. It’s caused by replication failure. If replication works, the shares will be created.

Tracking Down a GUID Problem: GUID referenced in event log. What is it? Solution: (Q216359) LDP – search for the GUID Search.vbs in Support tools Orphaned Object (will kill replication) Turn up NTDS diagnostic logging Internal processing Replication Find object (GUID) in event logs Delete it via LDP Many times, an object is referenced in event logs by its GUID. The AD can be searched with a tool like LDP to find out what object it is. The KB article tells how. Example: if an orphaned object exists, it will kill replication, at least partially. You can turn up NTDS diagnostic logging for internal processing and replication (to 3) and usually get the GUID of the object. You can then delete the object by its GUID in LDP. Note that sometimes you can’t find an object by searching for it, but you can still delete it!

DCPromo Improvements in Windows .NET

Install From Media (IFM)
Source Replica AD from Media in DCPromo GCs or DCs (Replica only). No initial replication from a DC. Faster (no searching for a DC). Less network impact (No full sync on the WAN). Easy branch office installation. After initial load, replicates changes. Network connectivity still required. Unattended Answer File Support: ReplicateFromMedia ReplicationSourcePath Windows .NET allows you to back up the system state of a domain controller, restore the backup to a local disk on a member server, then run DCPromo from a command line with the /adv switch. It will then ask whether you want to get the AD from the network (the default, and how W2K worked) or use the restored backup files. If you select the latter, you can specify the location of the restored files. DCPromo will then use those files rather than the network to get its copy of the AD on that machine. This is a big deal in deploying remote DCs. In W2K you had two choices: ship a member server with the W2K CD to the remote site and replicate over the WAN, or install it as a DC and ship it, in which case it might arrive outside the tombstone lifetime period and you have the lingering object problem. With .NET you can restore the system state of a DC to a CD, ship the CD with the server to the remote site (and run it with an unattended answer file), and avoid the WAN. Of course the CD may be out of date if they delay installing it, but you would just have to ship them another CD or have them download the files from a share, web site or FTP site. You still have to contact a DC, because at the end it will replicate the changes made since the backup was taken.

Install From Media (IFM)
Unattended Answer File Support ReplicateFromMedia ReplicationSourcePath Media must be local drive. Media useful life < 60 days. How?Use Backup Files/Media Create first DC in domain. Back up DC. Restore to Media (local disk, CD, …). C:>dcpromo /adv. Wizard produces an additional screen…

If you run from the command prompt:
Dcpromo /adv you will get this additional screen, on which you can select the option to use the restored backup as a source and point to its location.

DCPromo Answer File See Q223757

[Unattended]
Unattendmode=fullunattended

[DCINSTALL]
UserName=administrator
Password=Password3
UserDomain=corp.net
DatabasePath=c:\windows\ntds
LogPath=c:\windows\ntds
SYSVOLPath=c:\windows\sysvol
SafeModeAdminPassword=Password2
CriticalReplicationOnly
SiteName=Seattle
ReplicaOrNewDomain=Replica
ReplicaDomainDNSName=corp.net
ReplicationSourceDC= ! Leave this blank for IFM
ReplicateFromMedia=yes
ReplicationSourcePath=e:\DSrestore
RebootOnSuccess=yes

The answer file can be used to run DCPromo unattended. You will be prompted for any options that are not defined, just like an unattended install. For instance, if you enter all the information (username, password, etc.) as it is here, you will not be prompted at all during DCPromo. Note that the site, Seattle, must already exist. In this case, using the restored backup files as a source, we have to set ReplicateFromMedia to yes and leave the ReplicationSourceDC value blank. Also set ReplicationSourcePath to the directory that holds the restored backup (it must be on a local HDD or CD drive).
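
Before shipping an answer file to a branch office, the three IFM-related settings described above can be sanity-checked with a short script. This is a sketch under assumptions: it treats the answer file as plain INI text, the function name `check_ifm_answer_file` is hypothetical, and it checks only the three settings named in the notes (not passwords, site names, or path existence).

```python
import configparser


def check_ifm_answer_file(text):
    """Return a list of problems in the [DCINSTALL] section for an IFM install."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # tolerate bare keys such as CriticalReplicationOnly
        delimiters=("=",),     # do not split on the colon in paths like e:\DSrestore
        interpolation=None,
    )
    parser.read_string(text)
    section = next(
        (parser[name] for name in parser.sections() if name.lower() == "dcinstall"),
        None,
    )
    if section is None:
        return ["no [DCINSTALL] section"]

    def value(key):
        # Missing keys and bare keys both come back as an empty string.
        return (section.get(key) or "").strip()

    problems = []
    if value("ReplicateFromMedia").lower() != "yes":
        problems.append("ReplicateFromMedia is not set to yes")
    if not value("ReplicationSourcePath"):
        problems.append("ReplicationSourcePath is empty")
    if value("ReplicationSourceDC"):
        problems.append("ReplicationSourceDC should be blank for IFM")
    return problems


SAMPLE = r"""
[Unattended]
Unattendmode=fullunattended

[DCINSTALL]
ReplicaOrNewDomain=Replica
ReplicaDomainDNSName=corp.net
CriticalReplicationOnly
ReplicationSourceDC=
ReplicateFromMedia=yes
ReplicationSourcePath=e:\DSrestore
"""

if __name__ == "__main__":
    print(check_ifm_answer_file(SAMPLE))
```

The sample is a trimmed copy of the slide's file; any slide annotation left on the ReplicationSourceDC line would be read as a value, so remove it from a real file.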

File Replication Service (FRS) Basics

FRS Background File Replication Service
Replicates file system portion of policy Optional replication engine for DFS Concepts Challenges Journal wraps Staging File backlog Reconciliation / Morphed Directories Moving on to FRS it’s a big call and labor generator for us. 1st let me give you some background and concepts and then we will drill down to some specific problems

Concepts Objects in DS Members, Subscribers, Conn. objects, filters
Depends on AD replication Determines partners and schedule NTFS USN Journal Used by FRS to track changes to NTFS volumes Staging File and Directory Rename safe Compression support Database Record of incoming, outgoing & existing files When you add a machine and make it a DC, or when you add a machine to an FRS replica set, a set of objects is created in the Active Directory. These include member objects that define which replica set you are a member of, subscriber objects that tell you who to replicate from, and connection objects that move the data between machines. All of this depends on Active Directory replication, so if Active Directory replication isn’t occurring, these objects don’t get passed between Domain Controllers and nothing works. We’ve also had some problems with people improperly deleting objects or references to attributes in FRS replication, so this is obviously important. The next thing is the way FRS uses the USN journal. Windows 2000 added a journal that reports all changes to NTFS formatted partitions. FRS hooks into this journal to learn about changes to a replica set, which is also why DCPromo requires that sysvol be located on an NTFS formatted partition. Replication sounds simple at first glance but, as you know, it’s very complicated, particularly when directories are being renamed and moved in and out of the tree and change orders are arriving in various orders from different Domain Controllers. To accommodate moving a file into place, there is the concept of a staging directory: a place where a file is created, moved between partners, and then renamed into the target directory. This is also where compression can be enabled for these files, so you get the benefit of less utilization on the wire. Finally, there is a database that keeps track of all files in the tree and of the incoming and outgoing changes that are replicated between members.

File Replication Service (FRS)
Replaces NT 3.X\4.0 LMREPL service Replicates SYSTEM Policy, Group Policy, DFS Group policy templates Ntconfig.pol & logon scripts for down-level clients NETLOGON Share DFS share contents Multi-threaded replication engine Replicate different files to different computers simultaneously. FRS replaces the old LMRepl service and replicates NT4 System policy (the Netlogon share), Group Policy and DFS shares. FRS is multi-master and, while it actually runs on top of AD replication, it has its own FRS replication partners, which may or may not be the replication partners AD uses.

Terminology Computer A and B replicate DFS+SYSVOL
B is computer A’s outbound partner. A is B’s inbound partner. A is B’s “upstream” partner. Changes flow “downstream” to B. In this example, replication is inbound to B from A. All connection objects are unidirectional. In this case, note the inbound/outbound associations and that A is “Upstream” from B. [Diagram: Computer A (upstream, B’s inbound partner) replicates to Computer B (downstream, A’s outbound partner).]
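
The inbound/outbound terminology is easier to keep straight if each connection object is treated as a one-way (source, destination) pair. The following Python sketch is only an illustration of that bookkeeping; the function name `partners` and the tuple representation are assumptions, not how FRS stores connection objects.

```python
def partners(connections):
    """Map each computer to its inbound and outbound partners.

    connections: iterable of (source, destination) pairs, one per
    unidirectional connection object. Changes flow source -> destination.
    """
    inbound = {}   # computer -> set of upstream partners it pulls from
    outbound = {}  # computer -> set of downstream partners it feeds
    for source, destination in connections:
        outbound.setdefault(source, set()).add(destination)
        inbound.setdefault(destination, set()).add(source)
    return inbound, outbound


if __name__ == "__main__":
    inbound, outbound = partners([("A", "B")])
    print("B's inbound partners:", inbound.get("B"))
    print("A's outbound partners:", outbound.get("A"))
```

With only the A-to-B connection present, A has no inbound partner, so changes made on B never reach A; two-way replication needs a second connection object in the other direction.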

Basic Operation
[Diagram: steps 1 to 4 between DC1 and DC2.] 1 GPO change created on DC1. 2 Temp file moved to staging directory. 3 Notify replication partners (replicas) of changes. 4 Partners pull changes from DC1. FRS operation can be viewed through an example: A GPO is modified and saved on DC1. DC1 creates a temporary copy of the GPO and moves it into the staging directory. FRS then notifies its FRS replication partners that there are changes that need to be replicated. Partners pull the changes. When all partners have pulled the changes, the temporary file is deleted.

File and Folder Filters
Excluded from FRS Replication: Computer specific EFS files/folders File names beginning with ~ Files with .bak or .tmp extensions NTFS Mount Points Reparse points Configurable for DFS shares Some files are excluded from FRS replication as shown on the slide You cannot configure this for FRS (i.e. add or delete the excluded files) You can configure excluded files for DFS – so you can tell it to not replicate *.exe files.
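
The name-based exclusions on this slide (names beginning with ~, and .bak or .tmp extensions) can be expressed as simple wildcard patterns. The sketch below is a Python illustration only; the pattern tuple and the function name `is_excluded` are assumptions, and the other exclusions (EFS files, mount points, reparse points) depend on file attributes that this sketch does not inspect.

```python
import fnmatch

# Name-based exclusions from the slide, written as wildcard patterns.
EXCLUDED_PATTERNS = ("~*", "*.bak", "*.tmp")


def is_excluded(filename):
    """Return True if the file name matches one of the name-based exclusions."""
    name = filename.lower()  # Windows file names are case-insensitive
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in EXCLUDED_PATTERNS)


if __name__ == "__main__":
    for candidate in ("~WRL0001.doc", "Registry.BAK", "gpt.ini", "logon.bat"):
        print(candidate, is_excluded(candidate))
```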

The Replication Process
AD Object version updated. [Diagram: a GPO on DC1 in \winnt\sysvol\sysvol\compaq.com\policies is copied to \winnt\sysvol\staging\domain and \winnt\sysvol\staging areas\compaq.com, then partners are notified.] This is similar to the previous flow chart, but this one shows the directories involved. The GPO is actually in \winnt\sysvol\sysvol\<domain name>\policies. The temporary file is copied to \winnt\sysvol\staging\domain (hidden) and to \winnt\sysvol\staging areas\<domain name>. The temporary files are deleted when replicated to all partners. If these staging directories contain large numbers of files, or if files stay there for more than a few hours, there is a problem.

The Replication Process
DC2 Pull. Sysvol version of GPO updated (GPT.ini). The replication partner pulls the changed file into its \winnt\sysvol\sysvol\DO_NOT_REMOVE_NTFRS_preInstall_Domain folder, then puts it into its \winnt\sysvol\sysvol\<domain name>\policies folder. [Diagram: DC1 to DC2, showing \winnt\sysvol\sysvol\DO_NOT_REMOVE_ntfrs_PreInstall_Domain and \winnt\sysvol\sysvol\compaq.com\policies.]

FRS Replication Observe File Replication Process
Edit a group policy – modify and save it. Copy of changed file goes to staging and staging areas directories. Copied to staging/staging areas directories on other DCs.. Moved to sysvol\sysvol directory on the DC. Group policy file is updated. You can actually see this happen. Turn off the FRS service. Open up explorer windows pointing to \winnt\sysvol\staging\domain and \winnt\sysvol\staging areas\<domain name> (where <domain name> is the name of your domain). Edit a group policy and save it. Watch the two windows – within a few minutes you should see a long, funny-looking name with _NTFRS_xxxxxxx (where xxxxxxx is a number) in the file name. re-start FRS service and in a minute or two those files should disappear (depending on how many DCs you have, etc.)

Distributed File System (DFS)

DFS Basics Domain-based (Win2K) vs Standalone (NT) Root
Must be on a DC. Contains PKT. DFS service. Replica PKT from DC, stored locally. DC or Member Server. FRS Replicates Data between DCs Member servers DFS replicate data to share via DFS service. Site Aware (clients locate “closest” DFS Replica) DFS is a way to share files and provide redundancy. One set of files is copied to two or more servers. Users connect to the share, but they really don’t know which server is providing the files. Thus if one server goes down, they connect to the share via another server, transparently to the user. The root must be on a DC (as FRS is used to replicate the files). The root contains the PKT (configuration info). There is a DFS service. Replicas can be a DC or a member server. The PKT is replicated from the root and stored locally. If on a member server, replication takes place via the DFS service since FRS only runs on DCs. DFS is site aware, so DFS servers know the site they are in and clients can find the DFS server that is “closest” to them by site.

The DFS Replication Process
DC1 - Root Data This shows that the root can replicate to a DC or a member server Replica. You can also have replicas of the root. DFS service FRS SVR1 Replica SVR2 Replica DC2 Replica Data Data

DFS Troubleshooting Symptom: Shared folders not in sync.
Make Sure DFS service is started on all servers and DCs. Make sure AD Replication is working. Make sure FRS is working. DFSUtil.exe. Watch for applications that keep files open. Anti-virus. Defragmenters. Biggest DFS problem issue is that Shared folders are not in sync. The slide notes troubleshooting tips. Antivirus and disk defrag programs should be configured to exclude the \winnt\sysvol directory as these programs will keep files open and prevent DFS from copying them

FRS Troubleshooting Techniques

Basics Remember… You MUST install latest service pack and hot fix.
Post SP2 (SP3) Hot fix Q307319 Don’t go any further until this is installed. “Multi Master” characteristics replicates changes (and problems) quickly. Turn off the FRS Service to get control. FRS depends on AD Replication, which depends on DNS. Be sure to install SP3, or SP2 plus the post-SP2 hotfix listed above, on all DCs. Use Netdiag to find the list of hotfixes installed on a DC.

Diagnostic Tools Event Viewer: FRS log, DS Log NTFRSutl.exe
/outlog – outbound logs /inlog – inbound logs /ds – directory service NTFRSxxx.log in \winnt\debug NTFRS Health Check utility HP, Microsoft Netdiag, DCDiag AD replication tools The FRS and Directory Service (DS) logs are good sources for FRS troubleshooting. NTFRSutl in the Resource Kit is great for gathering information, using /outlog (outbound replication), /inlog (inbound replication) and /ds (directory service information). It is difficult, if not impossible, to extract data from this output because of its formatting, but we have some tools from Microsoft that help us (we can’t give them out, unfortunately); they just reformat the data from the ntfrsutl output files. NTFRSxxx.log in the winnt\debug directory is another source of information for problems.

FRS Replication What happens if it breaks?
Changes not replicated to all DCs, resulting in inconsistent AD Group policy gets out of sync and may not get applied. GPOTool: Version mismatch Logon scripts don’t get applied. DFS shares out of sync. FRS is responsible for replicating group policy changes to other DCs (which includes security changes), as well as login scripts and DFS files. If FRS fails, changes aren’t propagated to all DCs, resulting in inconsistency and in “version mismatch” on the GPOs, which will cause group policy not to be applied.

FRS Replication How to tell if it’s broken Events in FRS log
Event 1000, 1001 in app log every five minutes. Files backed up in staging areas Get size of staging directories (MB). Get date of oldest file (how long it has been broken). Group Policy not applied (new changes) Look for Events 1000 and 1001 recurring every 5 minutes (in the application log, per the slide), a sure sign FRS is broken. Then look in %windir%\sysvol\staging\domain and %windir%\sysvol\staging areas\<domain name>. Normally there should be no files there, or only files that are less than about 3 hours old. If older files are piling up, FRS is broken. Antivirus and defragmenter programs can cause this; they must be excluded from touching the \sysvol directory structure.
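
The two staging measurements on the slide (total size in MB, age of the oldest file) can be collected with a few lines of script. This is a minimal Python sketch, not an HP or Microsoft utility; the function name `staging_summary` is hypothetical, the 3-hour threshold is the rule of thumb from the notes, and the path in the usage example is only a typical location.

```python
import os
import time

MAX_AGE_HOURS = 3  # rule of thumb from the notes above


def staging_summary(path, now=None):
    """Report file count, total size and oldest file age for a staging directory."""
    now = time.time() if now is None else now
    total_bytes = 0
    file_count = 0
    oldest_mtime = None
    for root, _dirs, files in os.walk(path):
        for name in files:
            info = os.stat(os.path.join(root, name))
            total_bytes += info.st_size
            file_count += 1
            if oldest_mtime is None or info.st_mtime < oldest_mtime:
                oldest_mtime = info.st_mtime
    oldest_age_hours = 0.0 if oldest_mtime is None else (now - oldest_mtime) / 3600
    return {
        "file_count": file_count,
        "size_mb": total_bytes / (1024 * 1024),
        "oldest_age_hours": oldest_age_hours,
        "suspect": oldest_age_hours > MAX_AGE_HOURS,
    }


if __name__ == "__main__":
    # Typical location on a Windows 2000 DC; adjust for your system.
    print(staging_summary(r"C:\winnt\sysvol\staging\domain"))
```

The oldest file age gives a rough estimate of how long replication has been stuck.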

Replication Problems Ensure DNS is working.
DNS Lookup Failures in events (description). Ping, Nslookup to resolve names. Domain name DC, Server names Ensure AD Replication is working. Create New Objects and see if they replicate. Repadmin /showreps and /showconn DS Event Log DCDiag Look for events with “DNS Lookup Failure” in the description; this always means DNS isn’t working right. Resolve DNS names via Ping or Nslookup. Look for errors in the DS event log, and run DCDiag and Repadmin /showreps. Create a new user object on a suspect DC and see if it gets replicated to other DCs. Create a user on another DC and see if it gets replicated to the problem DC.

Replication Problems Staging Areas should have no files
Common FRS problem. Check size of dir, date of files. Ensure FRS is working. Create text file on each DC, named for the DC. Put it in \winnt\sysvol\sysvol\<domain name>. All DCs should have copy of all DCs’ text files. Make sure Staging directories are clear To make sure FRS is working, create a text file on each DC and name it with the DC’s name (e.g. DC1.txt). Put it in the %windir%\sysvol\sysvol\<domain name> directory. Do this on all DCs. When replication finishes, every DC should have a text file from every other DC. So if there are 6 DCs, each DC should end up with DC1.txt, DC2.txt, … DC6.txt in its %windir%\sysvol\sysvol\<domain name> directory. If any of these files fails to show up on a DC, FRS replication is broken between those two DCs.
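
With more than a handful of DCs, checking the text files by eye gets tedious. Once you have a directory listing from each DC, the comparison can be sketched as below. This is an illustrative Python sketch; the function name `missing_canaries` and the dictionary input are assumptions, and gathering the listings from each DC is up to you.

```python
def missing_canaries(present):
    """Find which DC's text file is missing on which DC.

    present maps a DC name to the set of file names seen in that DC's
    sysvol domain folder. Returns sorted (source_dc, missing_on_dc) pairs.
    """
    expected = {dc.lower() + ".txt": dc for dc in present}
    gaps = []
    for dc, files in present.items():
        seen = {name.lower() for name in files}  # file names are case-insensitive
        for filename, source_dc in expected.items():
            if filename not in seen:
                gaps.append((source_dc, dc))
    return sorted(gaps)


if __name__ == "__main__":
    listing = {
        "DC1": {"DC1.txt", "DC2.txt", "DC3.txt"},
        "DC2": {"DC2.txt", "DC3.txt"},
        "DC3": {"DC1.txt", "DC2.txt", "DC3.txt"},
    }
    for source, target in missing_canaries(listing):
        print(f"{source}.txt has not reached {target}")
```

Each pair points at a source and a destination to investigate; the break may be on an intermediate partner rather than on a direct connection between the two.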

Replication Problems FRS Event Log 13508 – Normal…but watch them
13509 – success after having 13508s 13514 – When Sysvol share not created “FRS preventing computer from becoming a DC” 13553,13554 – FRS successfully added computer to replica set (DCPromo successful) 13557 – Duplicate Connection Objects 13522 – Staging area full Q264822 Lots of KB Articles: Search for “FRS and Event” In the FRS event log: 13508 says FRS replication isn’t working (but this may be normal). If it is eventually followed by a 13509 (now working), there is no problem. 13514: you get these events when the Sysvol share is not created during DCPromo; fix AD replication. 13553, 13554: notice that FRS was configured following the DCPromo reboot. 13557: duplicate connection objects. Look in Sites and Services on the NTDS Settings object on this computer to see if there are duplicate connections to the same computer. Delete them all and let the KCC regenerate them. This will break replication. If they keep coming back, it’s probably a DNS problem. 13522: staging area full. There is a limit to the size of the staging areas of about 600 MB, and hitting it will cause FRS to stop. There is a KB article (Q264822, on the slide) that tells how to increase the size of the staging area. However, there should not normally be this many files generated by FRS before purging. If you have this problem, find out why you are getting so many files. Possible causes: FRS is not working (or is turned off) on one or more DCs, or something (antivirus, defragmenter, etc.) is modifying large numbers of files, which causes them to be replicated by FRS.
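
The 13508/13509 rule ("normal, but watch them") amounts to checking whether the most recent 13508 was ever followed by a 13509. A minimal Python sketch of that check follows. The event descriptions are paraphrased from the slide; the function name `unresolved_13508` and the plain list-of-IDs input are assumptions, and exporting the IDs from the FRS event log in chronological order is left to you.

```python
# Event IDs and meanings as listed on the slide (paraphrased).
FRS_EVENTS = {
    13508: "replication not working yet (may be normal)",
    13509: "replication working after earlier 13508s",
    13514: "Sysvol share not created; FRS preventing computer from becoming a DC",
    13522: "staging area full",
    13553: "FRS added computer to replica set",
    13554: "FRS added computer to replica set",
    13557: "duplicate connection objects",
}


def unresolved_13508(event_ids):
    """Return True if the latest 13508 has no later 13509.

    event_ids must be in chronological order, oldest first.
    """
    waiting = False
    for event_id in event_ids:
        if event_id == 13508:
            waiting = True
        elif event_id == 13509:
            waiting = False
    return waiting


if __name__ == "__main__":
    history = [13553, 13554, 13508, 13509, 13508]
    print("needs attention:", unresolved_13508(history))
```

How long to wait for the 13509 before calling it a failure is a judgment call that depends on your replication schedule.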

Interpreting the Logs NTFRS_000x.log
\WINNT\DEBUG Identify errors, warning messages and milestone events in the log files Very difficult to interpret The best way to read these logs, located in \winnt\debug, is to search for “error”, “warning”, “failure”, etc. Keep trying; eventually you’ll be able to do it.
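
The keyword search described above is easy to script so that it can be repeated across several log files. The sketch below is a Python illustration; the function name `interesting_lines` is hypothetical, the keyword list is the one from the notes, and no assumption is made about the internal layout of NTFRS log lines.

```python
import re

# Keywords suggested in the notes; matched case-insensitively.
KEYWORDS = re.compile(r"error|warning|failure", re.IGNORECASE)


def interesting_lines(lines):
    """Return (line_number, text) pairs for lines containing a keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if KEYWORDS.search(line):
            hits.append((number, line.rstrip("\r\n")))
    return hits


if __name__ == "__main__":
    # Example path only; the file name follows the NTFRS_000x.log pattern.
    with open(r"C:\winnt\debug\NTFRS_0001.log", errors="replace") as log:
        for number, text in interesting_lines(log):
            print(f"{number}: {text}")
```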

NTFRSutl.exe Ntfrsutl inlog = Lists inbound log
Ntfrsutl outlog = Lists outbound log Ntfrsutl sets = Lists replica sets Ntfrsutl DS = FRS’s view of the DS Can execute remotely: Ntfrsutl sets DC1 The raw data from these logs is very difficult to decipher. Microsoft gave us some tools to view them. (presentation note: Put FRS demo logs and utilities in C:\frs. Must have Perl installed (laptop is fine) and put the list.exe in it. NTFRSutl \inlog output (ntfrs_inlog.txt) has been massaged by iologsum.cmd (perl script) and output to iolog-inlog.txt and iolog-outlog.txt. Show audience the raw output (ntfrs-inlog.txt) then run the following from a command line in the C:\frs directory – list iolog-inlog.txt. These utilities format the data. In the raw ntfrs_inlog.txt (which is produced by mpsreports utility as well), you see snapshots of data – showing files that were changed at a particular timestamp – shows the originating guid of the DC, the guid and name of the file, etc. List.exe and iologsum.cmd format the data so it puts all the info for each time stamp in a single row under columns – each row shows the entries for a single file showing the time it was changed and who changed it (orig guid). Use the Arrow keys and page up/down to scroll. You can see a bunch of rows of data – each a single file. viewed in this way you can see a number of files were modified at the same time (or within a minute or so). Problems would show up when you see hundreds of files modified at the same time or perhaps at the same time every day. This is a clue that some process is modifying sysvol files and forcing FRS replication. The point is, without these tools and viewing data that way, it would be nearly impossible to see that from the raw output.
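
The pattern described above, hundreds of files changing within the same minute, can be spotted by bucketing change records by minute. The following Python sketch shows the idea only. It does not parse ntfrsutl output: the input is assumed to be (timestamp, file name) pairs that you have already extracted, and the function names and the threshold of 100 are hypothetical.

```python
from collections import Counter
from datetime import datetime


def changes_per_minute(records):
    """Count change records per minute.

    records: iterable of (datetime, file_name) pairs, already extracted
    from the inbound or outbound log by some other means.
    """
    return Counter(
        stamp.replace(second=0, microsecond=0) for stamp, _name in records
    )


def bursts(records, threshold=100):
    """Return the minutes in which at least `threshold` files changed."""
    counts = changes_per_minute(records)
    return sorted(minute for minute, count in counts.items() if count >= threshold)


if __name__ == "__main__":
    start = datetime(2002, 10, 6, 2, 0, 0)
    demo = [(start.replace(second=i % 60), f"file{i}.pol") for i in range(150)]
    demo.append((datetime(2002, 10, 6, 9, 30, 12), "gpt.ini"))
    print(bursts(demo))
```

A burst that recurs at the same time every day points toward a scheduled job such as a virus scan or defragmenter touching sysvol.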

Group Policy Troubleshooting

Group Policy Troubleshooting Basics
Policy isn’t getting applied Set something easy – Admin Templates User Settings: Log off/on Computer Settings: Reboot Client-side extensions act as separate policies – debug separately from Admin Templates Folder Redirection Scripts Disk Quotas Security IE Branding EFS Recovery IPSec Application Management If policy isn’t applied, you can test it by setting something in the admin templates, like changing the wallpaper. If it’s a user setting, log off and on to see if the policy takes effect (you get the wallpaper). If it’s a computer setting, reboot to get the change. Remember that client-side extensions act as separate policies. Thus if you have a logon script defined in a GPO called “Scripts GPO” and it isn’t working, and you set the wallpaper setting, log off and on, and the wallpaper shows up but the logon script doesn’t run, that’s because scripts is an extension and may not work even if the policy is applied. Use gpresult /v to see if the script is being applied. The Userenv.log (set to verbose mode) will also give good clues. Remember that there are special error logs for client-side extensions written to %windir%\debug\usermode (per the earlier slide).

Group Policy Troubleshooting Basics
Policy applied, but settings not effective. Userenv.log (verbose) Q221833 Set Diagnostic logging Q186454 HKLM\Software\Microsoft\Windows NT\CurrentVersion\Diagnostics Value: RunDiagnosticLoggingGroupPolicy Value Type: REG_DWORD Value Data: 3 (value =off) Change One setting in GPO Logoff/on or reboot Verbose info in Application log Lists all registry settings applied to user Turn it off afterward – fills the event log fast! One good way to debug group policy is to set diagnostic logging for the userenv.log, as noted previously in the slides. Another is to set verbose logging per the KB article on the slide (Q186454). Set this to 3 (0-5 is valid; the higher the number, the more verbose). This dumps verbose logging to the application event log. It is easier to read than userenv.log, but you have to read through a lot of events. We suggest exporting the event log to .txt form. To use this, you must change a setting in the GPO to trigger replication, then log off and on (user setting) or reboot (computer setting). You will have to change a GPO setting every time you want new output in the event log.

Gpresult.exe Resource Kit command-line utility.
Reports applied policy for user, computer. DN Security groups Verbose mode – gpresult /v Registry settings Computer: Client-side extensions. WATCH: Logon server. Cached policy on client may mask solution. Refresh Policy – make sure it’s applied. GPresult is a client utility, in the Resource Kit for Win2K but built in to the OS in Windows .NET (demo gpresult output). Always use the /v (verbose) switch. It reports the DN of the user and computer, so you know who really logged in and what machine they are on. It lists security groups, lists the GPOs applied, and lists user settings vs. computer settings, showing all registry and client-side extension settings (logon scripts, etc.) that are applied at that logon or startup. Watch who the logon server is. Remember that you modify GPOs on the PDC emulator and it replicates to other DCs. If the logon server for the client (env. variable) is a DC that hasn’t been updated, you may not see the results of the GPO change for a while. If possible, do testing on a client in a site with only one DC, and force replication from the PDC to that DC. Group policy is cached. If it appears your changes aren’t taking effect, reboot to make sure the cache is cleared. Make sure you aren’t seeing stale policy settings.

GPOtool Resource Kit command-line utility. Run on DC only.
Version Comparison: AD vs. Sysvol. AD version set immediately on change. Sysvol version set after FRS Replication. Friendly name /GUID association Policy {08FAB D5-B5A8-37A0F98D7E43} Policy OK Details: DC: Qtest-DC2.qtest.cpqcorp.net Friendly name: Folder Redirection Policy GPOtool is in the reskit and is run on a DC to verify the consistency of the GPOs. Shows the version of the AD object vs the version of the Sysvol template (they must be the same) Version mismatch means they don’t match.

Solving Version Mismatch
Small mismatch is normal. After change until FRS Replication completes. Be patient – see if it resolves. Big mismatch is bad. Prevents application of policy. Unreplicated changes. Manually set FRS version = AD version. %windir%\sysvol\sysvol\<domain>\policies\{guid}\gpt.ini Will lose changes. Version mismatch is normal until a GPO change is replicated, since the AD object is updated as soon as you save the GPO change but it takes time to update Sysvol on all DCs. It shouldn’t stay that way for longer than the replication latency of the network, and the difference shouldn’t be large. You can fix a version mismatch by editing the gpt.ini file and changing that value (the sysvol version) to the AD version (seen in the GPOTool output). However, this will prevent current changes in the queue from being replicated. Unresolved version mismatch can cause GPO replication to fail.
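
To see how far apart the two versions are before deciding whether to wait or to edit gpt.ini, the Sysvol version can be read from the file and compared with the AD version reported by GPOTool. This Python sketch assumes gpt.ini has the usual INI layout with a Version entry under a [General] section; the function names are hypothetical, and the AD version is supplied by hand from the GPOTool output.

```python
import configparser


def sysvol_version(gpt_ini_text):
    """Read the Version value from the text of a gpt.ini file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gpt_ini_text)
    for section in parser.sections():
        if section.lower() == "general":
            return int(parser[section]["Version"])
    raise ValueError("no [General] section with a Version entry")


def version_gap(ad_version, gpt_ini_text):
    """Return AD version minus Sysvol version (0 means they match)."""
    return ad_version - sysvol_version(gpt_ini_text)


if __name__ == "__main__":
    sample = "[General]\nVersion=5\n"
    print("gap:", version_gap(7, sample))
```

A small gap right after an edit is the normal case described above; a gap that persists past normal replication latency is the one to chase.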

Resetting Default Domain Policy or Default DC Policy
These policies always have the same GUID. Default Domain: {31B2F D-11D2-945F-00C04FB984F9} Default DC: {6AC1786C-016F-11D2-945F-00C04FB984F9} Changes are a mess – need to restore default. To restore security defaults only, import the BasicDC.inf template (Q258595). If settings are hosed, copy an original copy of the policy to winnt\sysvol\sysvol\<domain>\policies. Copying policies is only supported for these two cases. Others will have different GUIDs. Can’t copy other policies from one forest to another for debug. The recommendation is to never modify settings in the Default Domain or Default Domain Controllers policy; create new policies instead. If you modified the Default Domain policy for security and want to go back to the original version, you can restore the default security template, BasicDC.inf (see Q258595). If you mess up the Default Domain policy or Default Domain Controllers policy, it is possible to restore the original. You cannot copy any other group policy from one domain to another: apart from these two, the GUIDs are not predictable, so the AD in the target domain won’t know about them.

How to copy the Default Domain and Default DC policy
Get a copy of a clean, default policy folder. Restore the policy folder (GUID) from backup. Create new domain and copy the GUID folder from that machine. Don’t zip it. Delete existing policy. Wait for replication. Copy new policy folder to winnt\sysvol\sysvol\<domain>\policies. Run GPOtool to make sure it shows up on all DCs. Since these two policies are the same on every W2K domain in the world, you can copy them between domains. Get a member server and promote it to a new domain. Go to %windir%\sysvol\sysvol\<domain name>\policies on this new machine and there will be two folders with GUID names – {31B2F…} and {6AC17…}. Copy those folders to the same directory on the PDC of the real domain (delete the existing ones first).

Unable to Edit Group Policy
Group policy changed on PDC by default. If PDC is not available. Dialog: Change on any DC, current DC or not. Error: Unable to contact Domain (no DC). Solution: Transfer or seize the PDC role to another DC. Can set policy to NOT use PDC …. Don’t! Group policy is changed on the PDC emulator. If it is unavailable, you will get the option to save the change to another DC. This isn’t recommended, since it would allow another admin to change the same policy on a different DC; because “last writer wins”, the changes by the first admin might be lost. You can set policy to use any DC rather than the PDC, but it’s not recommended.

Using Userenv.log to solve Group Policy problems
Turn on Verbose Logging Q221833 interpreting group policy information in userenv.log The Q article tells how to turn on verbose logging. Set that registry key, then go to %windir%\debug\usermode and delete or rename userenv.log. Repro the problem (refresh group policy, etc.) and examine the userenv.log file. Note: there are timestamps in userenv.log but no dates, so deleting or renaming the log first forces a new one to be created that contains only the repro. The newest information is at the bottom – read bottom up! Take time to study it; it really does make sense if you take some time.
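Reading the log bottom up can be approximated with a small helper. The sketch below is illustrative only: it makes no assumption about the exact layout of userenv.log lines and simply returns lines containing any of the given keywords, newest first. The function name and the default limit are invented for the example.

```python
def newest_matches(log_lines, keywords, limit=20):
    """Return up to `limit` lines containing any keyword, newest (bottom) first.

    Matching is case-insensitive; `log_lines` is the file content as a list
    of lines in their original top-to-bottom order.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for line in reversed(log_lines):
        text = line.lower()
        if any(k in text for k in wanted):
            hits.append(line.rstrip("\n"))
            if len(hits) >= limit:
                break
    return hits
```

Typical use would be to read the freshly created log after a repro and call `newest_matches(lines, ["script", "denied"])` to see the most recent script-related entries first.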

Debugging Logon Scripts (script doesn’t apply)
Configure it via group policy snap-in. Make sure policy is applied. Set a desktop setting. Use Gpresult /v. Enable verbose logging for Userenv.log. Turn on “Run logon scripts visible.” Create simple logon script as a .bat file to make sure it’s not the script failing. Example: Using Userenv.log to find script errors. If a logon script doesn’t apply, make sure you configured it correctly via group policy (Demo). Change a desktop setting (to trigger policy). Set the policy to “Run logon scripts visible” (causes a cmd window to display during script execution). Run gpresult /v and see if the logon script is listed. If it is listed, the script is being processed OK, so check whether the script itself works: try creating a simple .bat file that creates a text file called logon.txt in c:\ and see if that works. If it does, then there is a problem with the original script. Look at userenv.log and search on “script” to see if you get errors (access denied, etc.). See if there is a gptext.log in %windir%\debug\usermode; it contains info about script failures.

Can’t find FSMO Role Holder
Problem: Operation trying to contact a FSMO role holder – PDC Emulator or…? Can ping by name – seems to be ok Operation can’t find it Solution: Find out who has that role: netdom query fsmo (returns a quick list) Transfer the role to a local DC If you get an error that a FSMO role holder can’t be found, just moving the role to another DC will fix the problem.

Group Policy Refresh Anomaly
Users complain of a 5-25 second “hang” intermittently in any application – Outlook, Word, 3rd party apps. Keystrokes are buffered and they can continue to work Noticed direct correlation between the 1704 events (GP Refresh) and the “hang”. Change refresh interval via group policy and the frequency of the “hang” changed. Note also: Group Policy is refreshed when a change takes place, or every 16 hours if there are no changes. This is the Group Policy refresh interval, and it can be changed via group policy. Note that the refresh is every 5 minutes for Domain controllers. In this actual case, users noticed a “hang” on their workstations where whatever application they were in would be suspended for about 5-25 seconds and then wake back up. All keystrokes were buffered so they didn’t lose anything, but it was very annoying. The admin noticed that the hang occurred whenever the 1704 event was logged in the Application Log. They changed the refresh interval and the hang followed the change.

Group Policy Refresh Anomaly
Cause: SceCli applies group policy every 16 hrs (default) if no gpo changes have occurred. (DCs are every 5 minutes) Broadcasts WM_settingschanged to all top level windows Wakes up sleeping processes causing massive paging in/out of memory – causing hangs More pronounced on “slower” computers Solution: Configure Policy Refresh Interval in Group Policy so refresh occurs every 12 hrs at midnight/noon so users don’t notice it.

Account Lockout Background Finding locked out user accounts
Client Bugs and Fixes Server Bugs and Fixes Resolution and Futures Account lockouts are a common call generator. We will cover tools, best practices and ways to troubleshoot this problem.

Lockout Reasons & Options
Prevent spoofing or hijacking account Optional event logging in Audit Policy Account Lockout Options Timed lockout Account enabled after admin defined time Hard lockout Account disabled until reset by admin Lockout policy defined in group policy Single lockout and password policy per domain Location: default domain policy First, why you might enable account lockout. The primary reason is to prevent spoofing or hijacking of accounts, and to enable logging when your accounts are being attacked. Account lockout is a single policy for the entire domain, defined in the default domain policy for your domain. When you enable lockouts you have 2 options. With a soft or timed lockout, the account is enabled again after a certain amount of time defined by the administrator. With a hard lockout, the account stays disabled until it is reset by the administrator.

Account Lockout on DC’s
Each DC records # of bad password attempts BDC check PDC for latest password All Bad password attempts seen by PDC PDC always 1st to lock out account PDC urgently replicates lockout when threshold reached Bad password attempts not replicated by DC BadPasswordCount reset to 0 on 1st good password Let’s talk about how lockouts pass through the system. There are 3 account lockout attributes stored on each DC. The first is the bad password count; each DC in the domain maintains a local version of this attribute. A BDC that receives a bad password attempt forwards the request to the PDC – we call this PDC chaining – so the PDC is authoritative for all bad password attempts logged in the domain and is also the 1st DC to lock out an account. When an account is locked out, the PDC urgently notifies all Domain Controllers that the account has been locked out. Bad password attempts themselves are not replicated between domain controllers. The bad password count is reset to 0 any time a user logs in with a good password.

PDC chaining operations
If BDC fails authentication with: STATUS_WRONG_PASSWORD STATUS_PASSWORD_EXPIRED STATUS_PASSWORD_MUST_CHANGE STATUS_ACCOUNT_LOCKED_OUT Referred to as “BadPasswordStatus” BDC chains authentication to PDC Return status from PDC if status = success or listed above Otherwise, ignore PDC status and use local status Exception to PDC chaining AvoidPDCOnWan enabled and PDC in remote site (Q225511) 10 “BadPasswordStatus” events logged in 10 minutes NegativeCache enhancement Q263821 Cache reset after good password entered I mentioned this PDC chaining operation. There are 4 statuses that, if a BDC receives one, cause it to chain up to the PDC to see if the PDC has a newer version of the password. The PDC takes a crack at the authentication because it’s authoritative for the latest updates, then forwards its status back to the authenticating BDC, and the BDC returns this status to the client. There are 2 exceptions where we don’t do this PDC chaining. The first is if you have the AvoidPDCOnWan registry key enabled. The second is negative password caching, a feature added in a post-SP2 hotfix. Its goal is to prevent PDC overload: if a service account on a client is making repetitive bad password attempts, the BDC avoids PDC chaining for a period of 10 minutes or 10 bad password attempts, and once that interval has expired the PDC is checked again.
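The chaining rules on this slide can be written down as a small decision sketch. It is a model of the slide text only, not of Netlogon’s actual code: the function names and boolean parameters are invented for the example, and the negative cache is reduced to a single flag.

```python
# The four statuses the slide groups together as "BadPasswordStatus".
BAD_PASSWORD_STATUSES = {
    "STATUS_WRONG_PASSWORD",
    "STATUS_PASSWORD_EXPIRED",
    "STATUS_PASSWORD_MUST_CHANGE",
    "STATUS_ACCOUNT_LOCKED_OUT",
}


def should_chain_to_pdc(local_status, avoid_pdc_on_wan=False,
                        pdc_in_remote_site=False, negative_cache_active=False):
    """Decide whether a BDC forwards a failed authentication to the PDC."""
    if local_status not in BAD_PASSWORD_STATUSES:
        return False  # only bad-password style failures are chained
    if avoid_pdc_on_wan and pdc_in_remote_site:
        return False  # exception 1: AvoidPDCOnWan and the PDC is in a remote site
    if negative_cache_active:
        return False  # exception 2: negative cache is suppressing chaining
    return True


def resolve_status(local_status, pdc_status):
    """Pick the status returned to the client after chaining."""
    if pdc_status == "STATUS_SUCCESS" or pdc_status in BAD_PASSWORD_STATUSES:
        return pdc_status  # the PDC is authoritative for these answers
    return local_status    # otherwise ignore the PDC and use the local status
```

Writing it this way makes the two exceptions explicit: both simply short-circuit the chaining decision, and neither changes how the PDC’s answer is interpreted once chaining happens.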

Troubleshooting account lockouts
Your goal: Answer the 4 W’s Who, Where, When and Why Environment setup Enable Auditing in domain policy Account Logon Events – Failure Account Management – Success Logon Events – Failure Security Event log on DC’s: 10K events + over-write Enable netlogon logging (ntlm clients) NLTEST /DBFLAG:2080FFFF (no reboot) Enable Kerberos Logging Q262177: Kerberos logging (kerb clients) OK, let’s say we now have an account lockout; there is a process to work through it. Your goal is to answer 4 simple questions about this account lockout: Whose account is locked out, Where the account was locked out, When the account was locked out and Why it happened. To answer a couple of these questions we need to get the right environment in place on the Domain Controllers, and we do this by enabling logging, either in policy or in some cases in the Netlogon log. Here’s a recipe for how you want to configure the DCs.

Account Lockout – Where
DC Resources NTLM Clients Search DC & CLIENT NETLOGON.LOG for lockouts 0xC000006A = bad passwords 0xC = account lockout NTLM + Kerberos Clients Search DS Event Logs Q230254, Q299475, Q and Q for description 644: NTLM + Kerberos Lockout Event 675: Kerberos bad password 681: NTLM bad password 529: Failed logon 531: Account disabled Tools EVENTCOMB AL.EXE NETMON.EXE We have 2 types of authentication that can take place in a domain: NTLM authentication and Kerberos authentication. NTLM authentications can be logged in the Netlogon log of each authenticating DC in the domain. There are 2 codes you are looking for in the Netlogon log: the 6A status, which is a bad password, and the 234 status, which is the account lockout. NTLM clients are Win9X and NT4 clients in any domain, or Win2000 and XP clients that have secure channels to a down-level (NT4) DC. Kerberos clients are Win2000 and XP clients that are in a Win2000 or .NET domain, and there is a unique set of events that are logged in the event log for those machines. Each DC in the domain logs these account lockouts, so we now need a way to find them across every DC in the domain or the Enterprise. 3 tools are helpful for this: EVENTCOMB, AL.EXE and NETMON.EXE, listed here.
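Searching a Netlogon log for these two codes is easy to script. The sketch below is a hedged illustration rather than a supported tool: it assumes the log has been read into a list of text lines and that status codes appear as 8-digit hex values such as 0xC000006A (STATUS_WRONG_PASSWORD) and 0xC0000234 (STATUS_ACCOUNT_LOCKED_OUT). The function name is invented, and nothing else about the line layout is assumed.

```python
import re
from collections import Counter

# NTSTATUS values of interest, keyed in upper-case hex for comparison.
STATUS_NAMES = {
    "0xC000006A": "bad password",
    "0xC0000234": "account locked out",
}

# Any 8-digit hex status code written with a lower-case 0x prefix.
CODE_RE = re.compile(r"0x[0-9A-Fa-f]{8}\b")


def count_lockout_codes(lines):
    """Count occurrences of the bad-password and lockout codes in log lines."""
    counts = Counter()
    for line in lines:
        for code in CODE_RE.findall(line):
            key = "0x" + code[2:].upper()
            if key in STATUS_NAMES:
                counts[STATUS_NAMES[key]] += 1
    return counts
```

Running this over the Netlogon log from each DC gives a quick per-DC tally, which helps answer the "where" question before drilling into individual lines.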

EVENTCOMB
Here’s a screen shot of EVENTCOMB, which is available on the web. You may also want to review Chapter 6 of the Security Operations Guide for Windows 2000 Server; the URL is listed at the end. What EVENTCOMB can do is search all event logs on all Domain Controllers in some defined scope, including all Domain Controllers in the domain, and find events of interest for you. In this case we have a precanned search in the built-in searches pull-down menu that looks for events 529, 539 & 644 across all Domain Controllers in the domain. This lets you quickly find out which accounts are getting locked out, or how frequently bad password attempts are being logged.

AL.EXE
The 2nd utility is called AL.EXE. It queries all Domain Controllers and shows the last time the account was logged in, which is useful when you have bad password problems. You have to request the utility through Microsoft.

Account Lockout: Why Attack, “Pilot Error” or Bug
Wrong Password entered, mis-configured Service Account Scenario Account type: user, computer or service account Lockout trigger? (logon, drive access, following p/w change) Drill Down: Look at TOD, pattern & frequency Process related lockouts Structured pattern Logged when users not present Look for: common services, applications, client configuration User related lockouts Random pattern, Fewer events logged Look at: shortcuts, mapped drives, logon scripts, applications Your next question is WHY the account lockout happened. There are 3 possibilities. It could be a legitimate attack against the account, it could be pilot error where the user fat-fingered the account or service account password, or it could be a bug on the client or server. You want to understand the scenario in which the lockout is occurring. Is this a user account, and does the lockout happen sometime after a password change? Did it happen at logon, when accessing a drive, when accessing the home directory, etc.? Then we start our drill-down: look at the time of day the lockouts are happening, and the pattern and frequency of the events. When you look at the event logs and the Netlogon logs you are looking for this pattern. If you see a structured pattern, then it is likely some kind of process or service account generating the lockouts. If you see lockouts occurring randomly, or during normal working hours such as 8 to 5, there’s a good chance it’s a user.
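One rough way to tell a structured pattern from a random one is to look at how regular the gaps between events are. The sketch below is a heuristic illustration only: the function name, the use of the coefficient of variation, and the 0.1 threshold are all assumptions made for the example, not values from the presentation.

```python
from statistics import mean, pstdev


def classify_lockout_pattern(timestamps, max_cv=0.1):
    """Classify event timing as 'structured' or 'random'.

    `timestamps` are event times in seconds, sorted ascending. Evenly spaced
    events (low variation between gaps) suggest a process or service account;
    irregular gaps suggest a person.
    """
    if len(timestamps) < 3:
        return "insufficient data"
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    average = mean(gaps)
    if average == 0:
        return "structured"  # all events share one timestamp
    variation = pstdev(gaps) / average  # coefficient of variation of the gaps
    return "structured" if variation <= max_cv else "random"
```

Events logged every 600 seconds come out as structured, while a handful of failures scattered through the day come out as random. The time-of-day check from the slide (events logged when users are not present) would still need to be done separately.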

Account Lockout – Client
Win9X Q278558: Access denied to a mapped drive after disconnect Q272594: Client can't log on after log off w/o reboot Q293793: VREDIR loses file tracking structures Q271496: One unsuccessful logon attempt triggers lockout (1:3) Net use + dsgetdc + logon attempt. Q266772: Logon fails if Unicode string password to NTLM SSPI DS Client on Win95, Windows 98, 98 Second Ed DSCLIENT *MUST* be installed before any hotfixes! Q301344, Q283261 DS Client lets WIN98 account lockout fixes work on Win95 Win2K Q275508: User locked when accessing home dir after changing p/w Hotfix or SP2 Windows XP None I mentioned some bugs, and we’ll talk about those now. There is a set of bugs, primarily on Win9x clients, mainly where immediately after a password change you access some kind of network resource and the previous version of the password is used instead of the new one. Here is a set of KB articles that list those issues and the fixes for them. Neither NT4 nor WinXP has any fixes or problems Microsoft is aware of. One thing to note on the Win9X clients: there’s a strict installation order, with the DS client installed before the lockout fixes.

Account Lockout: Server Fixes
Read server side KB articles Q287639: Win9x Clients Locked Out after unlock MSV1 package does password check against BDC with old password during 2nd phase of logon Q278299: Bad p/w count not reset to 0 (ntlm) Original hotfix had regression. Confirm latest version deployed. Q263821: Bad p/w count not reset to 0 (kerb) Q292573: DSA.MSC and ADSI may not use same DC to WinSERaid:16662 (post SP2 hotfix) Resolution Windows 2000 DC’s: Install SP2 + Q314282 Same QFE as lingering object and other good DC fixes Service Pack 3 On the server side we have a little history of fixes, but an easy way to solve all these problems is to install Windows 2000 SP2 with the single hotfix listed on the slide. This hotfix also prepares your Windows 2000 domain controllers for upgrade to .NET. It is available in SP3 as well. So SP2 plus this hotfix is recommended. Take an inventory of your clients, find out which versions you have and get the relevant hotfixes installed on them. As an alternative to a hard account lockout, we recommend you investigate a timed lockout. You want to configure logging for account lockouts in your domain, and definitely monitor your environment. Also, practicing finding lockouts in a lab will familiarize you with the tools.

PDC FSMO Load Reduction
Windows 2000 domains are much larger than their NT 4 predecessors i.e. > 50,000 clients NT 4 and WIN9X clients still deployed and target PDC only for updates Windows 2000 / XP clients use Windows 2000 DCs in mixed mode domains (Q284937) Older applications select PDC only rather than any DC Applications may enumerate whole domain ( NT 4 usrmgr, srvmgr ) Result: PDC gets more load Next topic is PDC overload scenarios. One of the things Compaq/HP has observed in early Win2000 deployments is a tendency for the PDC to be overloaded in certain scenarios. When will this occur and how do you deal with it? It tends to occur in very large domains: as we moved to Windows 2000 we consolidated many NT4 account domains into one larger Active Directory domain. This larger domain means that more operations are concentrated at the PDC FSMO, particularly where there are a lot of down-level clients. So if you have a large base of Win9x and NT4 clients consolidated into a large domain of more than 50,000 clients, there are techniques we will cover to reduce the load at the PDC FSMO that will be useful to you. Why does this happen? Domains are larger and have more down-level client activity concentrated at a single server, and the activity that occurs on Win2000 in a mixed environment was not accounted for in some of the earlier deployments. If you deploy your Win2000 and Windows XP clients 1st and then upgrade to Active Directory, those clients will use only Windows 2000 Domain controllers for authentication. So if you upgrade the whole company to the Windows 2000 desktop 1st and deploy Active Directory later, you may not have enough Domain controllers in place during the early migration stages. Some of these operations can be fairly intensive, such as enumerating all users and groups in the domain. Older applications such as UserMgr in NT4 and computer manager load all the objects into the tool when you 1st run them. If you have 80,000 users in your domain and a helpdesk that has migrated but is still using the down-level tools, that will put a load on the PDC as it enumerates the domain for this operation. So you will want to move people to the newer tools. The result of all these factors is that the PDC is overloaded.

Symptoms of Overload High CPU utilization for long period
Greater than 70% High average disk queue Disk queue > number spindles Timeout of requests Password changes What are the symptoms of overload? High CPU utilization is one of them, with a threshold of about 70% utilization for very long periods of time. Domain controllers are going to have peaks of utilization and that’s normal, but if you see high utilization over long periods (many minutes) then this PDC may be getting overloaded. The other type of overload we have seen is on the disk: the system needs a very fast disk, and depending on how fast the disk is and how many activities occur at the PDC FSMO, you may see a very high disk queue. If the disk queue is higher than the number of spindles in the disk array holding the Active Directory, then you probably have a bottleneck on the disk controller. We will talk about how we handle that. Another symptom is timeouts. Maybe some of the PDC-only operations are timing out, or password changes are failing for your down-level clients. So this gives you some idea of what an overloaded PDC may look like. What can we do to optimize it?
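The two measurable symptoms above can be turned into a simple check. This is a sketch under stated assumptions: the function name is invented, the CPU samples are percentages collected over the period of interest (for example from Performance Monitor), and "sustained" is taken to mean that every sample in the window exceeds the threshold.

```python
def overload_symptoms(cpu_samples, disk_queue_avg, spindles, cpu_threshold=70.0):
    """Return a list of overload symptoms found in the supplied measurements.

    cpu_samples: CPU utilization percentages covering a long window.
    disk_queue_avg: average disk queue length for the array holding the AD.
    spindles: number of physical disks in that array.
    """
    symptoms = []
    # Short peaks are normal; only flag CPU if every sample is above threshold.
    if cpu_samples and min(cpu_samples) > cpu_threshold:
        symptoms.append("sustained high CPU")
    if disk_queue_avg > spindles:
        symptoms.append("disk queue exceeds spindle count")
    return symptoms
```

Timeouts of PDC-only operations, such as failing password changes, are not covered here; they have to be spotted from client reports and event logs.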

Steps to Optimize PDC Optimize hardware and software
Hide PDC from DNS clients Implement WINS optimizations Block down-level enumeration PDC in dummy site There are several main categories of techniques. The 1st is to optimize the hardware and software to make sure the PDC is handling the most it can for a given configuration. The next set is hiding techniques: the PDC is hidden so it is not found by clients that can do their operations against any DC in the domain. This has to do with DNS. Then we will look at the settings we can use to optimize WINS, so that when WINS hands out DC lists to clients the PDC isn’t favored. We will talk about blocking down-level enumeration: if you have a large company and you want to control who can enumerate all the users in the domain through older, inefficient APIs, you need the ability to block and control that. Then we will look at moving the PDC to a dummy site.

Optimize Hardware & Software
Run Windows 2000 Advanced Server with /3gb switch Enables ESE cache of 1.5 gb 4 Processor Server is optimal 2 Gb RAM Disk RAID 1 set for OS and Page File RAID 1 set for Log Files RAID 0+1 for NTDS.DIT and sysvol Run only core DC services The 1st technique is optimizing the hardware and software configuration, assuming the software is at SP2 and we are optimizing beyond that. What we often see is that people have adequate hardware for Active Directory but are running Win2000 Server. If you run Win2000 Advanced Server, you can allocate more virtual memory to each process through the /3gb switch. So if you have a Win2000 domain controller and you would like to use the maximum amount of memory, so that Active Directory can cache info in RAM and reduce the load on your disk, you need to run Windows 2000 Advanced Server with the /3gb switch. This will enable up to 1.5 GB of RAM for the Active Directory cache. If you don’t set the /3gb switch on Advanced Server, or you are running Server, you are limited to 512 MB of cache. In terms of processing power, 4 CPUs is an optimal configuration for large machines. Going beyond that doesn’t have a great advantage on Win2000; there will be a greater advantage with Windows .NET. 4 CPUs is a very powerful configuration for the money. In terms of RAM, 2 GB on a dedicated domain controller is the maximum amount of memory that is useful. We can only use up to 1.5 GB for the ESE engine for Active Directory, so going beyond 2 GB really doesn’t buy you much unless you are running other applications on the server, and in these configurations we recommend not running other applications on your domain controllers. The next area to optimize is the disk subsystem. We would like to see a RAID 1 mirror, a set of mirrored drives, dedicated to the operating system and the pagefile, and then the same for the log files, on a mirrored pair of their own.

Disk RAID 1 set for OS and Page File RAID 1 set for Log Files
RAID 0+1 for NTDS.DIT and sysvol Run only core DC services For optimum throughput, RAID 0+1 gives the highest throughput for Active Directory operations: it is fast for reads and even fast for writes in that configuration. That will give us the most I/Os per second that are possible on a disk configuration. So if you need to get the most out of the PDC FSMO, make sure it’s one of the biggest boxes, with the proper RAM configuration and the proper operating system configuration, and make sure it’s running only the core Domain Controller services. Don’t install WINS, DHCP, DNS and all this other stuff; leave all the other services off, and just optimize it to handle the down-level clients.

Hiding Techniques (DNS)
Lower PDC SRV Priority Reduce chance of DS aware clients selecting PDC before other DCs HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\LdapSrvPriority=1000 Data type: Reg_DWORD PDC only Site Clients will use it only as last resort Create a site-link to real site Disable AutoSite Coverage on PDC HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\AutoSiteCoverage=0 The next technique is hiding the PDC from clients through DNS publication. The 1st thing you should do is adjust the SRV record priority. SRV records have priority values, and when the Domain Controller locator service on a client requests a DC, records are returned and sorted based on these priorities. By default, all Domain controllers are at the same priority, which is zero. By putting in a larger priority value, giving the PDC FSMO a lower priority, we make clients not use the PDC unless other responses have failed. Changing the setting to something like 1000 through the registry will help clients avoid the PDC. With the LdapSrvPriority registry value set, when Netlogon registers its SRV records in dynamic DNS they will be at a lower priority for that box, and that will take some of the client load off for non-PDC operations. The next technique is to put the PDC in its own site. If the 1st technique wasn’t enough, isolating the PDC in its own site helps hide it even more from clients that are looking for a Domain controller based on site. One other technique is to disable auto site coverage. Often the PDC is located in a key hub, and the key hub may be covering for other sites in the corporation that don’t have a domain controller. We have seen customers’ PDC FSMOs actually project themselves into other sites, so we can turn off auto site coverage. By following these techniques we can have Active Directory aware clients not use the PDC for operations that can be handled by other Domain controllers.
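The effect of the priority change can be shown with a toy model of how a client orders the SRV answers it receives. This is a simplified sketch: the host names are hypothetical, weights are ignored, and the function name is invented. It relies only on the SRV rule that records with the lowest priority value are tried first.

```python
def order_srv_candidates(records):
    """Order (host, priority) SRV answers the way a client would try them.

    Lower priority values are tried first; hosts with equal priority keep
    their original relative order in this simplified model.
    """
    return [host for host, _priority in sorted(records, key=lambda r: r[1])]


# Hypothetical answers: two ordinary DCs at the default priority of 0 and
# the PDC registered with LdapSrvPriority set to 1000.
answers = [("pdc.example.com", 1000), ("dc1.example.com", 0), ("dc2.example.com", 0)]
candidates = order_srv_candidates(answers)
```

With these inputs the PDC ends up last in the list, so a client would reach it only after the other DCs have failed to respond.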

Hiding Techniques (WINS)
Down-level clients locate DCs through 1C queries WINS always adds PDC first in 1C list Remove PDC from top of list (SP2) Q269424 HKLM\System\CCS\Services\WINS\Parameters Value name: Add1Bto1CQueries Data type: Reg_DWORD Value data: 0 = disabled, 1 = Enabled (default) Randomize 1C list for general load balancing Value name: Randomize1cList Value data: 0 = disabled, 1 = Enabled Q (NT4 SP4 and later) Now let’s optimize things for the down-level clients. Down-level clients use WINS: they query the WINS server for the list of 1C records. The list of 1C records is a list of up to 25 domain controllers, usually somewhat near that machine, but not necessarily, depending on the WINS topology. One of the behaviors that worked well in smaller environments in the past was to put the PDC 1st in the list, so WINS always takes the PDC and prepends it to the list. That didn’t hurt in the NT 4 world, but when you think about it, it is naturally going to direct more traffic to the PDC than to any other machine because it’s always 1st in the list. As of SP2 there’s a new registry value, Add1Bto1CQueries; by disabling that feature in WINS we can make sure that WINS doesn’t put the PDC 1st in the list. That will reduce the load from down-level clients looking for any domain controller through WINS. There’s also a 2nd registry value which allows you to randomize the WINS 1C list on every response from the WINS server. By default, WINS orders the records based on time of last update and whether they were registered locally or remotely. You can change that to randomize, to get a more even distribution across the servers if that makes sense in your environment. So this technique will take some of the down-level client load away from the PDC.

Block Enumeration
Old (non DS enabled) applications often call SAM APIs to enumerate entire domain Hard to control Block unauthorized users from seeing more than 100 objects per call New access control right determines access HKLM\System\CCS\Control\Lsa\SamDoExtendedEnumerationAccessCheck=1 Q268339 The next technique, which is useful in some environments and was very useful at Microsoft, is blocking enumeration through down-level APIs. Older applications like UserMgr, Server Mgr, and even older versions of SMS and SQL Server enumerate all the users or all the groups in the domain, and they often do it when they don’t even need to. They need some small piece of information and they ignore the rest of the response. In a large company it’s often hard to control and isolate who these people are, so a feature was added that uses the access control model to restrict who can do these less efficient operations. That lets you protect your PDC FSMO or any other DC, but the PDC FSMO is where it’s most important. What we do is block the enumeration of the entire domain. If you aren’t authorized through extended rights in the Active Directory, then when you ask for the users in the domain you get back a successful query, but no more than 100 objects. To the application it looks like a small domain of 100 users, 100 groups, 100 computers or whatever you’re looking for in that API, so the application won’t fail and won’t cause stress on the DC. Then, by using the extended enumeration right, you can control which user accounts have permission to enumerate, for the ones you are willing to manage. That’s one way to take the load off a PDC FSMO. The process is described in the KB article listed on the slide.

Misc. – Server Applications
Server based applications can create frequent changes in the directory Agent based systems Create and delete accounts Grant accounts rights in the domain Changes create replication AD replication for frequent group changes FRS changes for policy changes Apply SMS hot fixes Q311127, Q278345 Read articles, configuration necessary The other thing to think about is miscellaneous server applications. Servers frequently use the directory service, and if they aren’t written efficiently they may create and delete accounts, sometimes extensively. One of the areas to look at is SMS, which creates and manipulates accounts quite extensively. There are some hotfixes to reduce the amount of manipulation SMS does in the environment, so you should be aware of the 2 hotfixes listed here and apply them. Then you won’t be putting stress from group enumeration, user account creation and policy manipulation on the PDC FSMO and some of your other domain controllers. It’s important to look for applications that may cause excessive changes in the directory, or excessive or inefficient queries of the directory. Try to isolate those and educate people or apply fixes as necessary.

Distributed Link Tracking
Purpose Used to track moves of linked files across volumes and servers (shell shortcuts) Uses AD objects to track files and volumes Objects stored in DS linkTrackVolEntry object for each NTFS volume in the domain linkTrackOMTEntry created for each linked item that is moved Clients query service when a shell shortcut or OLE link can’t be resolved Clients refresh links every 30 days DCs scavenge objects older than 90 days This is a mysterious little service: most people have no idea what it is, what it’s doing on their domain controllers or why it exists. Microsoft’s new philosophy is to turn off services that are considered non-essential. They would like customers to turn this service off now, and it’s going to be turned off by default in Windows .NET. What this service does is track file moves. The concept it was designed for is this: say I have a link or a shortcut to a file, but that file has been moved somewhere else, and I still want my shortcut to work. If we publish information in the Active Directory about where files reside and where they are after they’re moved, I can find a file no matter what system it’s located on in the domain. This has a cost: it puts objects in the Active Directory, increases your database size and increases your replication traffic, and you may not even want this feature turned on. So what do we see? We see a number of objects published in the directory. An object is added to the Active Directory for each NTFS disk volume that exists in the domain. And for every file that has been linked to, meaning there’s a shortcut pointing to it, and that has then been moved, an object gets created representing that move in the Active Directory. Over time this can build up to quite a few objects. There is an automatic process to refresh these links and delete them, but that process may not be efficient, and you may find more of these objects in the environment than you expect.

Distributed Link Tracking
DLT is an optional service. Enabled by default. Typically not included in DS capacity planning. Best practices: disable on all DCs (reduces AD replication traffic and AD database size); use Group Policy to disable the DLT server service on DCs; remove objects from the DS using a staggered approach (Q312403).
By default this service is enabled, and Microsoft recommends disabling it and cleaning up the objects that already exist. Microsoft cleaned up well over a million objects from their own directory; they took up a lot of database space for a feature they weren’t taking advantage of. They have seen tens of thousands that were not being used. What’s the best-practice way of doing this? First, use Group Policy to disable the service on all DCs: the Distributed Link Tracking Server service, not the client service. We can disable it in the Default Domain Controllers policy, which reduces replication and database size. If you find you have a very large number of objects to delete, delete them over time in a staggered process so you don’t impact replication.
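
The staggered cleanup described above can be sketched as a batching loop. This is only an illustration in Python: the batch size, the pause, and the `delete_one` callback are hypothetical placeholders (the notes give no numbers), and the actual deletion mechanism is whatever Q312403 prescribes.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def staggered_delete(
    dns: Iterable[str],
    delete_one: Callable[[str], None],   # hypothetical: removes one object by DN
    batch_size: int = 1000,              # assumption, tune for your environment
    pause_seconds: float = 3600.0,       # assumption, lets replication catch up
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete objects batch by batch, pausing between batches."""
    deleted = 0
    first = True
    for batch in batches(dns, batch_size):
        if not first:
            sleep(pause_seconds)         # wait before starting the next batch
        first = False
        for dn in batch:
            delete_one(dn)
            deleted += 1
    return deleted
```

Passing `sleep` as a parameter keeps the sketch easy to dry-run: hand it a function that only records the pause and a `delete_one` that only logs the DN.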

DC/GC Promotion Consideration
DC promotion / demotion. Process to clean up after a failed promotion. GC promotion. GC demotion.
The next topic is DC promotion and demotion. I’ll quickly cover some things to consider in these processes that we’ve seen in actual deployments.

DC Promotion / Demotion
Create proper sites beforehand. Failed promotion or removing a server: manually clean out metadata from any failed attempt, both when replacing a failed DC and when a DCPROMO has failed. To clean metadata: use NTDSUTIL; remove FRS member / subscriber objects; remove the machine account in the domain. Allow replication to all DCs before promoting again.
First, when you are deploying a new DC in Active Directory, it’s really important to plan your site topology and get your subnet information into the directory. Unfortunately, companies still deploy without worrying about replication topology and treat it as something to deal with later. Cleaning up later is more work. If you create the proper sites and subnets first, new domain controllers automatically land in the proper site when they are added to the directory. The next thing to think about is manual cleanup after failed promotions, and removal of metadata for domain controllers that have been taken out of service. I’ve seen companies with a domain controller that isn’t used anymore: they turn it off, never bring it back, and leave its data in the directory. This causes inefficiencies. The other DCs think it’s still there and try to replicate with it, and FRS thinks it’s still there. So it’s not good practice to leave this stale data in Active Directory; it’s important to clean it up. How do we clean up old DC information? We use NTDSUTIL to clean up the core metadata: the NTDS Settings object and some of the other information about the DC. It’s also important to manually delete the computer object from the domain and delete the FRS member objects; subscriber objects are deleted when you delete the account. This needs to be done before repromoting. We have seen this in two cases: an old DC that was removed without cleanup, and a failed DCPromo that someone wanted to retry.
When you clean this up, allow time for the changes to replicate. You don’t want DCPromo to use one DC where you’ve cleaned up and then another DC where the cleanup hasn’t replicated yet. That produces collisions on these objects and creates other kinds of failures that you don’t want to deal with.

GC Promotion
First GC in site may go online before all partitions are replicated. Default: GC will advertise after all partitions in its site replicate. Exchange may use the GC before it is ready; mail may bounce. Best practice: stop Netlogon; mark the DC as GC; use repadmin to monitor success; start Netlogon once all NCs have replicated. SP3 will wait for all partitions to replicate before advertising.
On the global catalog side, the first thing to consider is the promotion process. This matters most in Exchange 2000 environments, because Exchange 2000 relies heavily on the global catalog as its global address list. By default in Windows 2000, a global catalog is put online once it has replicated all partitions that exist in its local site. The issue is that not all partitions may be available in your site; some may exist only in other domains and sites. In that scenario the global catalog comes online before it has a full copy of the information in the entire forest. Exchange doesn’t like that: mail will start bouncing if Exchange discovers the global catalog before replication from the other partitions and sites has completed. So for now the best practice is to stop the Netlogon service on the new global catalog so that it doesn’t answer queries and clients won’t discover it. Then mark the DC as a global catalog, let global catalog replication complete, and monitor for completion with repadmin to see whether all the partitions have been replicated. Then start the Netlogon service again. This procedure is not convenient, and it has been overlooked in many environments, but it will get you by for now. The good news is that in SP3 and Windows .NET the default has changed: the global catalog must have all partitions in the forest before it will advertise. Once you apply SP3 you won’t have to do this procedure.
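
The "start Netlogon once all NCs have replicated" check comes down to a set comparison. The sketch below assumes you have already collected two lists of partition DNs by some means (for example, by reading repadmin output by hand); the function names and example DNs are hypothetical, and nothing here calls repadmin or touches a DC.

```python
from typing import Iterable, List


def missing_partitions(forest_partitions: Iterable[str],
                       replicated: Iterable[str]) -> List[str]:
    """Return forest partitions the new GC has not replicated yet.

    DNs are compared case-insensitively.
    """
    done = {p.lower() for p in replicated}
    return sorted(p for p in set(forest_partitions) if p.lower() not in done)


def safe_to_start_netlogon(forest_partitions: Iterable[str],
                           replicated: Iterable[str]) -> bool:
    """True only when every partition in the forest is present on the GC."""
    return not missing_partitions(forest_partitions, replicated)
```

For example, with forest partitions `DC=corp,DC=example,DC=com` and `DC=emea,DC=corp,DC=example,DC=com` and only the first one replicated, `missing_partitions` returns the second DN and Netlogon should stay stopped.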

GC Demotion
GC removal requires time for object removal. The KCC removes 500 objects per default 15-minute cycle. Best practice: monitor for event 1069 to record progress. Forced GC removal when needed (Q297935): remove each partition with repadmin, e.g. repadmin /delete DC=globalit,DC=unity,DC=com %destgc% /nosource
This slide covers removing a global catalog. It takes time to remove data from a global catalog, just as creating new objects takes time. When you demote a global catalog to a plain DC, only 500 objects are removed every 15 minutes so as not to overload the server. One thing to consider: if you want to promote that DC back to a global catalog quickly, the partitions that are still there will cause the promotion to fail. So if you have to demote and re-promote rapidly, you may want to remove the partitions manually with repadmin, using the /delete option to forcibly delete each partition. The KB article includes a VB script that shows how to do this for all the partitions in the forest.
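
To see why a quick re-promotion can run into leftover partitions, it helps to estimate how long the default removal takes. A minimal sketch in Python, using only the figures from the slide (500 objects per 15-minute cycle); the function names are made up for this example, and the command builder just repeats the slide’s repadmin syntax verbatim, so check it against Q297935 before relying on it.

```python
import math
from typing import Iterable, List


def gc_removal_minutes(object_count: int,
                       objects_per_cycle: int = 500,
                       cycle_minutes: int = 15) -> int:
    """Estimate minutes needed to remove `object_count` objects after GC demotion."""
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    cycles = math.ceil(object_count / objects_per_cycle)
    return cycles * cycle_minutes


def forced_removal_commands(partitions: Iterable[str], dest_gc: str) -> List[str]:
    """Build one command line per partition, in the form shown on the slide."""
    return ["repadmin /delete %s %s /nosource" % (nc, dest_gc) for nc in partitions]
```

By this estimate a partition with 100,000 objects needs 200 cycles, or 3,000 minutes (about 50 hours), which is why forced removal is worth knowing about.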

Container Inheritable ACE’s
An ACE that applies to either all objects or objects of a specific class in a container. Example: delegate the right to reset user passwords in one OU. Security descriptor propagation copies the ACE to all objects: this makes the access check very fast, because all information is on the directory object. Class-specific ACEs are also copied to all objects. Example: an ACE used to delegate the right to reset user passwords is also copied to computer and container objects. This increases object size and therefore database size. The increase is proportional to the size of the subtree: highest impact if set on the domain root; lower impact if set on an OU (depends on the number of objects in the OU); low impact if set on the schema or configuration container. SD propagation is asynchronous and takes time (e.g., 3 hours in a 50,000-user domain).
Next slide: container inheritable ACEs. In Active Directory, every time you access an object the directory service checks whether you have read or write access to it. To do this authorization we store a security descriptor on the object, and the security descriptor holds all the information about which users can read or write the object. Sometimes you want a change to cover more than a single object. For example, you want to delegate the right to reset user passwords to a specific group on a specific organizational unit, and have this right apply to all users in this OU, all the child OUs, and so on. The way to do this is to use container inheritable ACEs. You set the right on the OU as a container inheritable ACE, and then a process called the security descriptor propagator takes the new ACE and puts it in the security descriptor of all the objects that reside within the OU and all the child OUs. So we are basically touching the security descriptor on every single object.
In Active Directory you can also set an ACE so that it applies only to a specific type of object. Take the same example: you only want to delegate the right to reset passwords on user objects, so the ACE is tied to the user object class. However, the security descriptor propagator still adds this ACE to the security descriptor of all objects in the OU, including workstations, groups, and everything else. It doesn’t distinguish; the class restriction just means the ACE only takes effect on user objects. So there is a trade-off. The advantage is very fast access to the security descriptor information, because it is available on the object; we never have to walk a tree to get more information. The disadvantage is that we have to touch every single object when we add a container inheritable ACE, which grows the security descriptor on each object and eventually the size of the database file. If you do this the right way it’s not a big issue; if you do it the wrong way it can create problems for the size of your DIT file. When you set a container inheritable ACE on the root of the domain, you touch every single object in your domain. If you set it on an OU, you only touch the objects within the OU, so the scope is very important.

Container Inheritable ACEs Best Practices
Don’t add container inheritable ACEs to the domain root; add them on OUs as appropriate. Best-practice documentation recommends OUs for users, groups, and computers; container inheritable ACEs on these OUs have only a small impact. Watch SD propagator events: SD propagation running, 1257 (level 2); SD propagation report (objects touched), 1258 (level 2); SD propagation terminated abnormally, 1262 (level 0). Always leave sufficient disk space on the database partition: 20% of database size, at least 500 MB. Monitor! Test ACL changes in a lab or pilot domain to bracket the size increase.
So what are some best-practice recommendations? First, you should not set a container inheritable ACE on the domain root. You have OUs for a reason, and the reason is delegation, so if you delegate, do it at the per-OU level. Microsoft’s best-practice recommendations call for specific OUs for users, groups, and computers. If you delegate rights on these sub-OUs, the impact of security propagation stays small. When the security descriptor propagator runs, you may want to detect when it starts so you can see how long propagation takes. There are a couple of events you can monitor for this, including one that reports the objects the propagator touched. Another point: always leave sufficient disk space on your database partition. As a minimum recommendation, keep free space equal to at least 20% of your database size on the partition where the database file resides, and never less than 500 megabytes. Even if your database is small this gives you headroom, and disks are cheap anyway. You should also monitor your DIT file so that you build up history and can tell whether it is growing over time.
If you have very complicated changes to your ACL infrastructure, for example changing the whole structure of auditing, you may need to make them on the domain root. In that case, test the change in a lab before you do it in your production environment. That gives you a good feeling for how much your DIT file will grow.
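
The disk-space rule and the "bracket the size increase" advice are both simple arithmetic. A rough sketch in Python; the function names are invented here, and the per-object ACE size is a value you would measure in your lab, not a number from this presentation.

```python
MB = 1024 * 1024


def required_free_bytes(db_size_bytes: int) -> int:
    """Free space to keep on the database partition: 20% of the DIT, at least 500 MB."""
    return max(db_size_bytes * 20 // 100, 500 * MB)


def has_enough_headroom(db_size_bytes: int, free_bytes: int) -> bool:
    """True if the partition meets the minimum free-space recommendation."""
    return free_bytes >= required_free_bytes(db_size_bytes)


def estimated_growth_bytes(objects_in_subtree: int, ace_bytes: int) -> int:
    """Upper-bound growth if one inheritable ACE is copied to every object.

    `ace_bytes` is a hypothetical per-object cost measured in a lab or pilot.
    """
    return objects_in_subtree * ace_bytes
```

For a 1,000 MB database the 20% figure is only 200 MB, so the 500 MB floor applies; for a 10,000 MB database the rule asks for 2,000 MB free.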

Container Inheritable ACEs The Future
Windows .NET will have a single-instance store for security descriptors: objects have links to security descriptors; if a container inheritable ACE changes, only one SD changes; no impact on disk size. Does not require a .NET-only forest: SD propagation happens on the local DC, is transparent to other DCs, and the feature is available immediately. Monitor SD propagator events after upgrading a DC: the SD propagator builds the single-instance store after the domain controller boots .NET for the first time. The database will shrink after the OS upgrade; you need to offline defrag the database to see the change.
Windows .NET Server is going to have a single-instance store for security descriptors. For all objects that have an identical security descriptor, only one instance of the descriptor is stored, and each object just holds a pointer to it. Going back to our initial example: if you delegate the right to reset passwords on an organizational unit and this rolls down to all the objects within the OU, we only manipulate a single security descriptor, and all the objects link to it, so the impact is minimal. Since this is an operation in the local database, the feature is available as soon as you upgrade a domain controller to .NET. The security descriptor propagator is the process that creates the single-instance store, so you can monitor it after you upgrade the domain controller to see when it is done. Once the single-instance store is created, the amount of data in your database file will be much smaller.
To see the benefit on disk you need to do an offline defrag of the database; then you can see how much the database shrinks just from upgrading this DC to .NET. Microsoft’s internal deployments have seen the database shrink by around 30-40%. It’s a big improvement.
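
The idea behind the single-instance store can be shown with a toy model: keep each distinct security descriptor once and give every object a link to it. This Python sketch only illustrates the concept; it is not how the directory database is implemented, and the object names and byte strings are made up.

```python
from typing import Dict, Tuple


def per_object_bytes(objects: Dict[str, bytes]) -> int:
    """Storage used when every object carries its own copy of its descriptor."""
    return sum(len(sd) for sd in objects.values())


def build_single_instance_store(
    objects: Dict[str, bytes],
) -> Tuple[Dict[int, bytes], Dict[str, int]]:
    """Return (store, links): unique descriptors by id, and object -> descriptor id."""
    ids: Dict[bytes, int] = {}
    links: Dict[str, int] = {}
    for name, sd in objects.items():
        # Reuse the id of an identical descriptor, or assign the next free id.
        links[name] = ids.setdefault(sd, len(ids))
    store = {sd_id: sd for sd, sd_id in ids.items()}
    return store, links


def single_instance_bytes(store: Dict[int, bytes]) -> int:
    """Storage used by the descriptors themselves once duplicates are removed."""
    return sum(len(sd) for sd in store.values())
```

With two users sharing one 100-byte descriptor and a computer with its own 80-byte descriptor, per-object storage is 280 bytes, while the single-instance store holds 180 bytes plus three small links.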

Forest Recovery Imagine the unthinkable
All domain controllers crash and won’t reboot. Data corruption replicates through the forest. Schema becomes unavailable. Somebody made changes to the schema that prevent standard applications from installing. Malicious administrator performs irreparable damage to the schema that replicates through the forest. You lose your root domain. You win the lottery. So far, this has never happened. But you want to be prepared.
Let’s switch gears and talk about forest recovery. Imagine the unthinkable happens: you are responsible for the Active Directory deployment and its operation, and then a couple of bad things happen. For example, all domain controllers crash at the same time and won’t reboot. Data corruption replicates through the forest, your databases become corrupted, and the DCs don’t work anymore. The schema becomes unavailable and you can’t create objects anymore. Somebody makes changes to the schema that prevent standard applications from installing. Or a malicious administrator does irreparable damage to the schema that replicates through the forest. You lose your root domain. What all these scenarios have in common is that they have not happened so far. But you want to be prepared; you want a plan B in your pocket in case one of these things happens. The malicious administrator can always happen; the others are hopefully unthinkable, but you still need a plan. So what is forest recovery? It’s rolling back in time. Time goes on, you make changes to your directory service, and hopefully you make backups from time to time. Then the catastrophic event happens and the DCs all crash. What is important is to figure out what happened: what was the root cause, and when did it happen?
If we know that, we can restore the DCs. We know we will lose some changes, but maybe we can backfill them by exporting them first, before we roll back. Knowing when the root cause occurred matters because you need to identify the right moment, and the right backup tape, to roll back to.

Forest Recovery Rolling back in time
[Figure: timeline with periodic backups along the time axis, marking ongoing changes, the identified root cause, the catastrophic event, and the restore point; changes made after the restored backup are lost.]
We call this forest business recovery. The high-level steps to recover from an event like this: shut down all domain controllers in the forest; in each domain, restore one DC from a backup tape; then reinstall the OS on all the other machines. You start with the root domain. We will go through some of these steps in more detail.

Forest Business Recovery High Level Steps
Shut down all domain controllers in the forest. In each domain: restore one DC from a good backup tape; re-install the OS on all other domain controllers; re-promote all other domain controllers. Start with the root domain first.
Let’s say we have a forest, it’s working, and it’s fine. Then bad things start happening. What are we going to do? 1) Shut down all the domain controllers. 2) Restore one DC per domain, off the network. It is better if this is just a DC and not a global catalog server; if it is a global catalog, disable the global catalog service to get the other domains’ objects off this machine. Next we break replication to all domain controllers except the ones we restored: we delete the replication metadata for all the other DCs and we delete their computer accounts. We need to make sure that even if one of the bad domain controllers comes back online, or is still online, we aren’t going to replicate from it anymore, because we want clean data in our environment. Then we seize all the FSMO roles on the good domain controllers and increase the RID pool by 100,000 just to make sure. Then we bring the restored domain controllers back on the network, start replicating between them, and enable the global catalog on at least one DC in the root domain.

Forest Recovery
Shut down all DCs. Restore one DC per domain (off-network). Disable GC service. Break replication. Seize FSMO roles. Increase RID pool by 100,000. Bring restored DCs back on the network. Enable GC on at least one root DC.
As the next step we reinstall the OS on all the other domain controllers, the ones that were bad. Once these machines are back online, we promote them so that they are domain controllers again. Then we enable the global catalog on them as needed and move the FSMO roles around as needed.

Forest Recovery
Re-install OS on all other DCs. Promote all other DCs. Enable GC service as needed. Move FSMOs as needed.
If this sounds scary: Microsoft is working on a whitepaper right now, to be published on microsoft.com, with step-by-step instructions for this forest recovery. Microsoft has tested the procedure in the internal forest of the Windows development group, and this isn’t a small forest; it has more than 5,000 users. They know it will work. Expect the whitepaper soon.

Forest Recovery
Detailed steps available very soon in a white paper on microsoft.com: Best Practice for Recovering your Active Directory Forest.
Now for some administrative things that have changed. Perhaps some of you upgraded from Windows 2000 to Windows XP and found your administration tools no longer work. Microsoft recently released a .NET Admin Pak on the web; it can be installed on your XP client to administer Windows 2000 domains. These tools enable some new features, including drag and drop in the Users and Computers snap-in and some enhancements in the DFS GUI. Another enhancement is the ability to view and save the DNS, FRS, and DS event logs on a non-domain controller. You do this by starting Event Viewer with the /auxsource switch pointing to the name of a Windows 2000 DC in your domain. To do this you need admin rights on the domain controller you are pointing to. If the account you logged on to the workstation with does not already have them, there are two ways to get them: net use to the DC with admin credentials, or create duplicate accounts that give you admin rights.

FRS Concepts revisited
Objects in DS: members, subscribers, connection objects, filters. Depends on AD replication; determines partners and schedule. NTFS USN journal: used by FRS to track changes to NTFS volumes. Staging file and directory: rename safe; compression support. Database: record of incoming, outgoing, and existing files.
When you make a machine a DC, or add a machine to an FRS replica set, we create a set of objects in Active Directory. These include member objects that define which replica set you are a member of, subscriber objects that tell you who to replicate from, and connection objects over which the data moves between machines. All of this depends on Active Directory replication: if Active Directory replication isn’t occurring, these objects don’t reach the other domain controllers and nothing works. We’ve also had problems with people improperly deleting FRS objects or the attributes that reference them, so this is obviously important. The next thing is the way FRS uses the USN journal. Windows 2000 added a journal that records all changes to NTFS-formatted partitions. FRS hooks into it, and this is how it learns of changes to a replica set; it is also why DCPromo requires SYSVOL to be located on an NTFS-formatted partition. Replication sounds simple at first glance, but as you know it’s very complicated, particularly when directories are being renamed and moved in and out of the tree and change orders are arriving in various orders from different domain controllers. To accommodate moving a file into place, we have the concept of a staging directory: a place where we create the file that moves between partners before it gets renamed into the target directory. This is also where we can enable compression for these files, so you get the benefit of less utilization on the wire. Finally, we have a database that keeps track of all files in the tree and of the incoming and outgoing changes replicated between members.

FRS Replication Operation
[Diagram: file flow between two members, each with an NTFS drive. Upstream member: create / modify file; FRS learns of file changes from the NTFS “USN change journal”; filter out unwanted files; aging cache waits 3 s; write OB (outbound) log; write entry in FRS ID table; build staging file; send change order to partner. Downstream member: request change; replica copies file to staging dir; copy file into pre-install area; rename + move file to final location; write to inbound log and ID table; write to OB log for other replicas.]
Here is a quick slide that shows how files move between members. If we create a file in a replicated directory, it hits the drive and the USN journal records it. FRS learns about the change from the USN journal and applies the filters, both those defined by the operating system and those administrators define, to throw out unwanted changes. Then we wait about 3 seconds, write to the outbound log, and add the file to the ID table, which tracks all files in the tree. We build a staging file and send the change order to the downstream partners, that is, the machines that have inbound connections from this machine. The downstream partner requests the change and receives the staging file into its own staging directory; we copy it into a pre-install directory and finally rename it into its final location. And we are done.

Journal Wraps / Staging backlog
NTFS USN journal is a fixed-size log of file changes. The FRS service must run to keep up with these changes. Last ∆ in the FRS DB must exist in the NTFS journal; if not, FRS cannot know all changes. This is called a ‘journal wrap’. Resolution: keep the service running (especially during bulk modifications); increase the size of the USN journal (automatic in the SP3 rollup). Staging file backlog: before SP3, staging files were stored until all direct partners received the staged files; backlogs are associated with connections. Common causes of backlogs: offline downstream partners; full syncs by administrators or applications (antivirus, disk optimizers, file system policy); sharing violations / move-in problems.
Now let’s drill down into one specific piece: the way we use the USN journal. The USN journal keeps track of all changes to the tree, and for this reason it’s critical that the FRS service always be kept running to learn of these changes. If you stop the service for a long period and make enough changes that the journal wraps, the last change recorded in the FRS database no longer exists in the USN journal. At that point the ID table is no longer authoritative about what changes took place in the tree; we are out of sync and have to resync from scratch. This is a bad thing. We call it a journal wrap. There are a couple of solutions: keep the service running at all times, don’t make large-scale deletions or additions while it’s turned off, and increase the size of the USN journal, which is as simple as installing the latest hotfix Microsoft has for FRS. The next problem is staging file backlogs. If a succession of change orders took place while downstream partners were off for long enough, the staging files backed up and we never recovered. So a common cause of backlogs was a downstream partner being off for long periods, or full syncs occurring on all the data in the tree. There were three major causes of those full syncs.
Antivirus solutions were modifying security descriptors on files, disk optimizers were doing a similar thing, and file system policies were being applied against the tree. There are some solutions for these, which I’ll show you in a moment.
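
The journal-wrap condition can be modelled with a fixed-size queue of change numbers. The Python sketch below is a toy model of the rule on the slide (the last change in the FRS database must still exist in the journal); the class and function names are made up, and real USNs and journal sizing work differently in detail.

```python
from collections import deque
from typing import Deque, Optional


class FixedJournal:
    """Toy fixed-size change journal: the oldest records fall off when it is full."""

    def __init__(self, capacity: int) -> None:
        self._records: Deque[int] = deque(maxlen=capacity)
        self._next_usn = 1

    def record_change(self) -> int:
        """Append one change and return its sequence number."""
        usn = self._next_usn
        self._records.append(usn)
        self._next_usn += 1
        return usn

    def oldest_usn(self) -> Optional[int]:
        """Sequence number of the oldest record still held, or None if empty."""
        return self._records[0] if self._records else None


def journal_wrapped(last_usn_in_frs_db: int, journal: FixedJournal) -> bool:
    """True if the last change FRS processed is no longer in the journal."""
    oldest = journal.oldest_usn()
    if oldest is None:
        return False
    return last_usn_in_frs_db < oldest
```

In this model a larger `capacity` plays the role of the bigger USN journal in the SP3 rollup: the service can be stopped through more changes before the last processed record falls off the end.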

Reconciliation & Morphed Directories
Files: last writer wins. All change orders have event times (UTC). Event time of the CO is compared to the ID table: event times more than 30 minutes apart, last writer wins; less than 30 minutes apart, highest version wins. Folders: last writer wins; the conflicting change gets a morphed name, which preserves the files associated with the directory; first writer wins for name conflicts of folders. Causes: BURFLAGS abuse; conflicting creates on replication failure.
Another problem is morphed directories. Let’s cover the reconciliation method FRS uses for files and directories. For files, FRS uses a last-writer-wins algorithm: every change order carries an event time, and the later change wins. If the conflicting changes fall within a certain period of each other (30 minutes), we instead take the version with the highest version number. The version number is stored with the replicated data on each target. For folders it’s a whole other matter. When you move a folder in, it may have files underneath it, so if we threw away the change order for the conflicting folder we would also lose the files that were replicated with it. For this reason we morph (rename) the conflicting directory; the intention is for the administrator to reconcile the contents of the normal-looking directory and the morphed one and decide what should happen. There are two primary causes of morphed directories. The first is what we call BurFlags abuse. BurFlags is a registry value; the name stands for backup/restore flags, and it defines whether a machine is authoritative for the data in its replica tree. One setting is D4, which marks the machine as authoritative. If you mark a machine as D4 without stopping the service on all other members of the set, and then start bringing those members back into the set with a D2, you cause morphed directories, and it’s not fun to recover from. The second cause is duplicate creates on multiple partners in the set.
A common cause of this is when replication halts and users go back to their regular SMB share, instead of the DFS namespace, when they are creating directories.
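
The file reconciliation rule on the slide can be written as a small decision function. This is a sketch of the rule as stated there, not FRS source code: the `FileState` type and function name are invented, and the tie-break for equal version numbers is an assumption added so the function always returns an answer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # threshold from the slide


@dataclass(frozen=True)
class FileState:
    event_time: datetime  # UTC event time of the change
    version: int          # version number carried with the file


def change_order_wins(change_order: FileState, current: FileState) -> bool:
    """Decide whether an incoming change order replaces the ID table entry."""
    delta = change_order.event_time - current.event_time
    if abs(delta) > WINDOW:
        # Event times far apart: the last writer wins.
        return delta > timedelta(0)
    if change_order.version != current.version:
        # Event times close together: the highest version wins.
        return change_order.version > current.version
    # Assumption (not from the slide): equal versions fall back to event time.
    return delta > timedelta(0)
```

So a change made an hour later wins even with a lower version number, while a change made ten minutes later loses to an entry with a higher version number.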

FRS Enhancements (Q319473) QFE roll-up of coming Service Pack 3 changes
Increases NTFS USN journal: 128 MB. Dynamic staging file relocation. LRU staging files deleted: 60 / 90 rule. Staging files for offline partners deleted. SYSTEM = Full Control / NTFS bug. Duplicate changes not sent on wire + event. Office XP (Excel) data deletion fix.
Here are some things that have been done in the SP3 hotfix version of FRS, which is available today. To avoid journal wraps, the default size of the USN journal has been increased to 128 megabytes. Staging directories used to fill up, and administrators would want to move them from one logical drive to another. This was very difficult in the old release; you basically had to resync all the data. We can now change the attribute that defines the staging directory location, and events get logged telling you to stop the service and drag and drop the directory from the old place to the new. To solve the staging file backlog problem, once the staging directory is 90% full we delete the least recently used files down to a 60% threshold, which keeps the directory from filling up and blocking replication. Staging files for offline partners are now deleted as well: if a downstream partner is offline for more than a week, we delete all change orders in the staging directory associated with that member. This latest version of FRS exposed a bug in NTFS: if the SYSTEM account doesn’t have Full Control of the tree, FRS can’t move files into the directory. So there is a new NTFS fix that you will want to download with this: Q (changes to the FRS service). Alternatively, give the SYSTEM account Full Control of the tree; either one will do. I mentioned antivirus and Diskeeper-type programs causing excessive replication. We now have a way of detecting duplicate changes to files: if the staging file of a new change order has the same MD5 checksum as the one in the ID table, we don’t send the change order across the wire, and we keep from propagating this data. We also log an event telling you that duplicate changes are going on and that you need to identify the ultimate cause. Microsoft had a rare, timing-specific problem where Office files, including Excel and Word documents, got deleted; there is now a fix for this in the SP3 release. So this is really a non-optional release, and Microsoft thinks you will find it more reliable.
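
The 60 / 90 rule for staging files is a high-water / low-water cleanup over least recently used files. A minimal Python sketch of that rule; the data layout (file name mapped to size and last-use time) and the function name are assumptions made for this example, not FRS internals.

```python
from typing import Dict, List, Tuple


def files_to_evict(
    staging: Dict[str, Tuple[int, float]],  # name -> (size in bytes, last-used timestamp)
    limit_bytes: int,
    high: float = 0.90,   # start deleting once the directory is 90% full
    low: float = 0.60,    # stop deleting at 60% of the limit
) -> List[str]:
    """Return staging files to delete, least recently used first."""
    used = sum(size for size, _ in staging.values())
    if used < high * limit_bytes:
        return []
    evict: List[str] = []
    # Oldest last-used timestamp first.
    for name, (size, _) in sorted(staging.items(), key=lambda kv: kv[1][1]):
        if used <= low * limit_bytes:
            break
        evict.append(name)
        used -= size
    return evict
```

With a limit of 100 bytes and files of 30, 30, and 35 bytes, usage is 95%, so the two least recently used files are selected and usage drops to 35%.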

Topology Enhancements
DFSGUI from .NET Server: runs on XP clients in Windows 2000 domains; available on microsoft.com now (Q304718). New topology options: full mesh, ring, simple hub & spoke, custom topologies. Connection tuning: enable / disable individual connections; change orders are associated with connections; disabling a connection deletes the associated backlog. Connection priority (may pull this): a bit on the options attribute of the connection object; defines partners used during initial / recovery sync. High: “must” source all connections in class. Medium: source from at least 1 connection in class. Low: “best effort” sync.
Topology is a key factor in the scalability of FRS, and the Windows 2000 DFS GUI always built a full-mesh topology. This is bad because downstream partners have to discard duplicate change orders arriving over alternate paths, and it also means you have more servers to monitor. So in the new .NET Server GUI Microsoft adds new topology options: the ability to define a ring, a simple hub and spoke, or a custom topology. With custom topologies you can do things like a multi-tier hub and spoke, or a redundant hub and spoke with staggered schedules. There has also been an unused attribute on connection objects called the options attribute. In it you can define the priority with which a downstream partner replicates from its upstream partners when it has multiple inbound connections. This is ideal when you add a machine to a replica set with a mature topology and it has to do its initial sync, or when it hits an error state it has to recover from. This hasn’t been exposed yet, but I want to make you aware of it; look for it in future versions of the DFS GUI.
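
To get a feel for why full mesh scales poorly, compare how many connections each topology needs. The Python sketch below models connections as one-way (upstream, downstream) pairs and builds two-way links as two pairs; that modelling and the member names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Connection = Tuple[str, str]  # (upstream, downstream)


def full_mesh(members: Sequence[str]) -> List[Connection]:
    """Every member replicates directly with every other member."""
    return [(a, b) for a in members for b in members if a != b]


def ring(members: Sequence[str]) -> List[Connection]:
    """Each member replicates with its two neighbours, in both directions."""
    n = len(members)
    if n < 2:
        return []
    conns = set()
    for i, member in enumerate(members):
        neighbour = members[(i + 1) % n]
        conns.add((member, neighbour))
        conns.add((neighbour, member))
    return sorted(conns)


def hub_and_spoke(hub: str, spokes: Sequence[str]) -> List[Connection]:
    """Every spoke replicates only with the hub, in both directions."""
    conns: List[Connection] = []
    for spoke in spokes:
        conns.append((hub, spoke))
        conns.append((spoke, hub))
    return conns
```

For five members, full mesh needs 20 one-way connections, a ring needs 10, and a hub with four spokes needs 8; at 50 members the full mesh is already 2,450.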

FRS best practices Run Q307319 + new NTFS.SYS Keep service running
Avoids journal wraps Join empty replica sets Don’t place DFS targets on OS partition DFS: enable replication on child links Targets can be taken offline Incremental sourcing & advertisement of data Replica set specific burflags Properly size staging dir 128 largest files + 50% or 650 MB minimum Don’t delete files from staging directory Change orders, # of VV joins, file size Here are some best practices, starting with deploying this special SP3 hotfix and getting the new NTFS.sys, certainly on all your domain controllers and for sure on machines that replicate large amounts of DFS data. We talked about the risk of journal wraps, and there is some relief there, so keep the service running. There are some optimizations you get when you join machines to empty replica sets, which matter more when you’re replicating multiple gigabytes of DFS data. We recommend you not place DFS targets on the same partition as the operating system. And when you do enable replication in DFS, do so on child links only. There are a couple of advantages here. First, the targets for links can be taken offline, so if the data is inconsistent you don’t advertise it from machines in some kind of error state. Second, you can incrementally source data in smaller chunks at the link level, versus one size fits all at the root. Finally, it’s not advertised, but Microsoft has a way to reinitialize individual replica sets on a machine rather than reinitializing all the replica sets it participates in. They just haven’t made it public yet. For staging directory size, Microsoft recommends you take the largest 128 files in your replica tree and add some amount of overhead (the slide says 50%), or set it to the 650 MB minimum, whichever is larger. Microsoft has seen customers and administrators delete files from the staging directory, and this is a bad thing. Microsoft thinks that with this SP3 release you should have to do this less, and as a matter of fact they haven’t seen anyone have to do it in this release so far.
But as an alternative to deleting files in the staging directory, just delete or disable the connections to the downstream partners you are holding the files for. The DFS GUI has a way to do that. There is also an enabled (TRUE/FALSE) attribute on each connection, and if you toggle it, the change orders backlogged for that connection go away.
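The staging directory rule of thumb above (128 largest files + 50%, or 650 MB minimum, whichever is larger) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a Microsoft tool: the function names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import heapq
import os

MB = 1024 * 1024               # assumption: 1 MB = 1024 * 1024 bytes
MINIMUM_STAGING = 650 * MB     # "650 MB minimum" from the slide
LARGEST_FILE_COUNT = 128       # "128 largest files" from the slide


def staging_size_bytes(file_sizes):
    """Return the suggested staging size for the given file sizes (bytes).

    Sums the 128 largest files, adds 50% overhead, and never goes
    below the 650 MB minimum.
    """
    largest = heapq.nlargest(LARGEST_FILE_COUNT, file_sizes)
    with_overhead = sum(largest) * 3 // 2   # +50%, integer arithmetic
    return max(MINIMUM_STAGING, with_overhead)


def file_sizes_under(root):
    """Yield the size of every readable file under a replica tree root."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            try:
                yield os.path.getsize(os.path.join(folder, name))
            except OSError:
                continue   # skip files that vanish or cannot be read


if __name__ == "__main__":
    # Hypothetical replica tree path; substitute your own.
    suggested = staging_size_bytes(file_sizes_under(r"D:\DfsData"))
    print(f"Suggested staging size: {suggested / MB:.0f} MB")
```

Run against a small tree, the result is simply the 650 MB floor; the 128-file sum only takes over once the largest files together exceed roughly 433 MB.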

FRS best practices Topology management No full mesh
SYSVOL: requires 1 in / outbound CO Forceful deletion of FRS members Delete member and subscriber objects For topology, get away from this full mesh business and recognize what domain controllers actually need: one inbound and one outbound connection in the domain. There have been problems where domain controllers get wedged. There is a process to forcibly demote the DC; when you do this and run the replication metadata cleanup, remember to get rid of the DNS entries and also the FRS member and subscriber objects covered earlier.
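The "one inbound and one outbound connection" requirement is easy to check once you have an inventory of the connections. The Python sketch below assumes you have already exported that inventory by some other means into (upstream, downstream) pairs; the function name, the data shape, and the DC names are all hypothetical.

```python
def members_missing_connections(members, connections):
    """Report members lacking an inbound or an outbound connection.

    connections is an iterable of (upstream, downstream) pairs, meaning
    the downstream member pulls changes from the upstream member.
    Returns {member: [missing directions]} for problem members only.
    """
    has_inbound = {downstream for _upstream, downstream in connections}
    has_outbound = {upstream for upstream, _downstream in connections}

    problems = {}
    for member in members:
        missing = []
        if member not in has_inbound:
            missing.append("inbound")
        if member not in has_outbound:
            missing.append("outbound")
        if missing:
            problems[member] = missing
    return problems


if __name__ == "__main__":
    # Hypothetical domain with three DCs; DC3 has no outbound connection.
    dcs = ["DC1", "DC2", "DC3"]
    conns = [("DC1", "DC2"), ("DC2", "DC1"), ("DC1", "DC3")]
    print(members_missing_connections(dcs, conns))
```

A member that was forcibly removed but still appears in the member list with no connections at all would show up here with both directions missing, which is a hint that its member and subscriber objects still need cleaning up.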

Tools NTFRSUTL NTFRSUTL DS NTFRSUTL SETS
Repadmin /showconn for FRS DS Object inventory + topology review NTFRSUTL SETS Repadmin /showreps for FRS Status of downstream partner sync status NTFRSUTL INLOG | OUTLOG | IDTABLE Inbound + outbound changes + tree inventory Debug Logs: %systemroot%\debug\ntfrs_*.log Two way conversation between partners As far as tools to monitor all this, you’re probably familiar with NTFRSUTL, so I’ll talk about it and draw some analogies to the Active Directory replication tools. NTFRSUTL is kind of the Swiss army knife of FRS troubleshooting. There are two switches that are important. NTFRSUTL DS is analogous to Repadmin /showconn for FRS. It is through this that we can inventory member objects, subscriber objects, and connection objects, and review the schedules between domain controllers or member servers in the set. NTFRSUTL SETS is the Repadmin /showreps for FRS. There are also ways to dump the inbound log, the outbound log, the ID table, and so forth, alongside the debug logs, which record the two-way conversation between partners.

Summary All deployments should run SP2 Deploy SP3 when available
Q provides roll-up fix for many issues Lingering objects Account lockouts PDC overload situations Monitor Active Directory

New Documentation Available on microsoft.com
Best Practices for Active Directory Delegation Coming soon Active Directory Monitoring Guidelines and Key Indicators Active Directory Forest Recovery Eventcomb
