Thursday, July 7, 2011

Windows 2008 R2 Cluster Restart Issue

We had two of our Windows 2008 R2 clusters, each hosting two SQL 2008 R2 Enterprise Edition instances, fail over one of the SQL resource groups in each cluster at about the same time (within seconds).  The event log contained Event ID 1135 on one cluster and Event ID 4201 on the other as the pertinent system events just prior to failover.  How do you analyze these, and is there a common issue?

To analyze Event ID 1135, see the TechNet article here.  Essentially, I ran the "Validate this Cluster..." configuration function and found that a NIC intended for backups only was being used by the cluster as a network resource.  In our environment, we specifically configure the backup NICs so that they cannot reach peer nodes (only backup devices), so this is most likely the cause.  I also found that the backup NIC driver on one of the nodes needed to be reinstalled.  I'm guessing the NIC drivers were updated and, as a side effect, the backup NICs were picked up as cluster network resources.

To analyze Event ID 4201, the TechNet article was of no use: it describes the event as a NIC start message, while the message accompanying our event stated that the NIC could not be started.  I'll have to report that one.  Further searching turned up similar messages in which a NIC was causing problems.  Looking at all the NICs on the servers in the cluster, I saw no problems.  The message contained a GUID that referenced the NIC.  Searching the registry for that GUID, I found that the problem NIC was the Microsoft Failover Cluster Virtual Adapter.
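
The GUID-to-adapter lookup can also be scripted rather than searched by hand. The following is a minimal Python sketch, not the method used above: it assumes the event message embeds the adapter GUID in braces, and it walks the network adapter class key in the registry, comparing each entry's NetCfgInstanceId to the GUID and returning its DriverDesc. The function names extract_guid and adapter_description are hypothetical, and the registry part only runs on Windows.

```python
import re

# Matches a brace-wrapped GUID such as {01234567-89AB-CDEF-0123-456789ABCDEF}.
GUID_RE = re.compile(r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}")

# Device class key for network adapters; each numbered subkey describes one adapter.
NET_CLASS_KEY = (
    r"SYSTEM\CurrentControlSet\Control\Class"
    r"\{4D36E972-E325-11CE-BFC1-08002BE10318}"
)


def extract_guid(message):
    """Return the first GUID found in an event message (upper-cased), or None."""
    match = GUID_RE.search(message)
    return match.group(0).upper() if match else None


def adapter_description(guid):
    """Return the DriverDesc of the adapter whose NetCfgInstanceId equals guid.

    Windows only: winreg is imported here so the rest of the module
    can still be loaded on other platforms.
    """
    import winreg

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NET_CLASS_KEY) as class_key:
        index = 0
        while True:
            try:
                sub_name = winreg.EnumKey(class_key, index)
            except OSError:
                return None  # no more subkeys, nothing matched
            index += 1
            try:
                with winreg.OpenKey(class_key, sub_name) as sub_key:
                    instance_id, _ = winreg.QueryValueEx(sub_key, "NetCfgInstanceId")
                    if instance_id.upper() == guid.upper():
                        description, _ = winreg.QueryValueEx(sub_key, "DriverDesc")
                        return description
            except OSError:
                continue  # unreadable subkey or missing value, skip it
```

On a cluster node, passing the GUID from the event message to adapter_description should return the adapter's description string, which is how a hidden adapter like the cluster virtual adapter can be identified even though it does not show up among the physical NICs.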

I am still investigating this one.

In the meantime, I've disabled cluster use of all backup NICs and validated that all clustered resources function and fail over properly to and from each node.

<Update>The network crew came clean and stated that they restarted a network appliance (a firewall), which caused the SQL instances on both clusters to restart.  Mystery solved.
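
Failovers on two separate clusters within seconds of each other point toward a shared external cause, and that kind of coincidence can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the events have already been exported from each cluster's System log as (timestamp, event ID) pairs, and correlate_events and the 10-second default window are hypothetical choices for illustration, not anything taken from the clusters above.

```python
from datetime import datetime, timedelta


def correlate_events(events_a, events_b, window_seconds=10):
    """Pair up events from two clusters that occurred within window_seconds.

    Each input is a list of (timestamp, event_id) tuples. Returns a list of
    (event_id_a, event_id_b, gap_in_seconds) tuples.
    """
    window = timedelta(seconds=window_seconds)
    pairs = []
    for time_a, id_a in events_a:
        for time_b, id_b in events_b:
            gap = abs(time_a - time_b)
            if gap <= window:
                pairs.append((id_a, id_b, gap.total_seconds()))
    return pairs


# Hypothetical timestamps, for illustration only (not the real log entries).
cluster_one = [(datetime(2011, 7, 7, 10, 0, 0), 1135)]
cluster_two = [
    (datetime(2011, 7, 7, 10, 0, 4), 4201),
    (datetime(2011, 7, 7, 11, 30, 0), 4201),
]

matches = correlate_events(cluster_one, cluster_two)
```

Any pair that comes back is a reason to look outside the clusters themselves, for example at switches, firewalls, or storage that both depend on.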
