|
|
Having a problem with one of my Windows 2008 clusters, CL1. We have an identical cluster, CL2, up and running, configured and working perfectly; it now has Exchange Mailbox Cluster Service installed. CL1 is the one giving us fits. Here is what is going on:
• Cluster servers are IBM blades
• Node one is P11, Node two is P12
• Using a SAN for shared storage
• The NIC configuration on the nodes is as follows:
  o Two Broadcom NICs each with their own MAC address
  o Broadcom NICs are teamed to a third virtual MAC address
  o 2 VLAN's sit on top of the teamed NICs, one for Public 192.168.45.x, one for Private 10.168.45.x (Heartbeat)
  o IPv4 Connectivity only
  o Windows cluster validation complains about the duplicate MAC address but Layer 3 switching should not have a problem with this
The issue that we are seeing on CL1 is that we can form the cluster successfully on P11 and then add P12 into the cluster. However, after that, intermittently P12 gets kicked out of the cluster and shuts down the cluster service generating an 1177 error in the event log. Basically, this means that it lost quorum due to losing connectivity with the cluster nodes. We can generally force this to occur by failing over a disk from P11 to P12.
The problem does NOT look like it is isolated to P12 however. If we make P12 the cluster owner, then P11 exhibits the problem, if P11 is the owner, then P12 exhibits the problem. Using Node and Disk Majority for quorum setting.
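For readers less familiar with Node and Disk Majority, here is a minimal sketch of the vote arithmetic for a two-node cluster with a disk witness. The function names are made up for illustration and this is not how the cluster service is implemented; it only shows why a node that can reach neither its partner nor the witness disk ends up with a minority of votes.

```python
# Illustrative vote arithmetic only; names are hypothetical, not a cluster API.
TOTAL_VOTES = 2 + 1  # two nodes plus one disk witness


def votes_held(reachable_nodes, holds_witness_disk):
    """Votes in this partition: one per reachable node (itself included),
    plus one if this partition owns the witness disk."""
    return reachable_nodes + (1 if holds_witness_disk else 0)


def has_quorum(votes, total_votes=TOTAL_VOTES):
    """Quorum needs a strict majority of all configured votes."""
    return votes > total_votes // 2


# Node that lost the heartbeat but still owns the witness disk: 2 of 3 votes.
print(has_quorum(votes_held(1, True)))
# Node that lost the heartbeat and does not own the witness disk: 1 of 3 votes.
print(has_quorum(votes_held(1, False)))
```

With three votes in total, a node needs two. The node that owns the witness disk keeps quorum through a lost heartbeat and the other node does not, which would be consistent with the non-owning node being the one that drops out.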
Steps we have taken to troubleshoot:
• Verified all Windows network configuration settings and bindings are the same on the working cluster as the non-working cluster
• Verified that the Broadcom teaming configuration settings are the same on the working cluster as the non-working cluster
• Verified that the NIC drivers and software versions are the same on the working cluster as the non-working cluster
• Attempted to create the cluster on P12 first and then add P11. This completely blew up when trying to add P11
• We have evicted P12 and re-added it to the cluster
• We have destroyed the cluster, uninstalled the Windows Failover Clustering feature on both boxes, rebooted, reinstalled the Failover Clustering feature and reformed the cluster
• We broke the teaming on the Broadcom NICs, destroyed the cluster, etc. Reformed the cluster and got the same exact issue, so teaming is not the problem and, as I indicated, we have one up and running without problems with the teaming.
It should be noted that this machine, CL1, was originally installed with teamed NICs and I raised a concern that this configuration might not work based upon the errors generated in cluster validation. We broke the teaming and formed the cluster successfully. Everything appeared to be fine although I cannot say that we would have noticed this problem originally. Then, to appease the network folks, we tried forming a cluster on CL2, which still had teamed NICs. This worked beautifully so we destroyed the cluster on CL1, reteamed the NICs and tried reforming the cluster.
I am currently at a loss to figure out what is going on. My current suspects are:
• Something configured incorrectly with base network settings?
• Something configured incorrectly with NIC teaming?
• Something configured incorrectly in Layer 3 switching?
• Something configured incorrectly with the SAN?
• Some kind of issue caused by destroying the cluster and reforming it, maybe something was not cleaned up properly in AD or on the machines?
• Some kind of hardware issue?
Anyone seen anything like this? We are about 3 days into troubleshooting this thing. It looks for all the world like the nodes are losing network connectivity, but we have run "ping -t" from both boxes to the other node's public and private IPs and never lost connectivity during failures of the non-owning cluster node.
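A clean "ping -t" only shows that ICMP gets through. As far as I know, the cluster heartbeat runs over UDP port 3343, so a UDP-level probe between the nodes may be a closer test. Below is a rough sketch, assuming Python is available on both nodes. The function names are hypothetical, and the probe should use a free test port, not 3343, since the cluster service already holds that one.

```python
import socket
import time


def run_echo_responder(sock, stop_event):
    """Echo every datagram back to its sender until stop_event is set.
    `sock` is a UDP socket already bound to the address under test."""
    sock.settimeout(0.2)
    while not stop_event.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except OSError:  # timeout or transient socket error: keep waiting
            continue
        sock.sendto(data, addr)


def probe(host, port, count=5, interval=0.05, timeout=0.5):
    """Send numbered UDP datagrams and return the sequence numbers
    that were not echoed back within `timeout` seconds."""
    lost = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            payload = str(seq).encode()
            try:
                s.sendto(payload, (host, port))
                data, _ = s.recvfrom(1024)
                if data != payload:  # a late echo of an older probe
                    lost.append(seq)
            except OSError:  # timeout or port unreachable
                lost.append(seq)
            time.sleep(interval)
    return lost
```

Run the responder on one node with a socket bound to its private IP, and call probe() from the other node against that IP and port. A non-empty result lists the datagrams that were never echoed, which can then be lined up against the time of the 1177 event.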
Any help is greatly appreciated.
|
|
Does Validate run without any errors?
"Seth Moupre" <SethMoupre[ at ]discussions.microsoft.com> wrote in message news:C086D541-B2F7-4DB1-899A-A0FF03F55183[ at ]microsoft.com...
> Having a problem with one of my Windows 2008 clusters, cluster CL1. [...]
|
|
Yes, validation runs without any errors or warnings. I wouldn't bother to post if it didn't.
We are currently thinking it may be a SAN multipath issue so we are going to take the nodes down to single path and see what happens.
"Edwin vMierlo [MVP]" wrote:
> Does Validate run without any errors?
|
|
"Seth Moupre" <SethMoupre[ at ]discussions.microsoft.com> wrote in message news:03895074-C067-4EE1-A0C6-8EC19CFF4709[ at ]microsoft.com...
> Yes, validation runs without any errors or warnings. I wouldn't bother to post if it didn't.
It was just a question... many people ignore the Validate, which is not a good thing. I am glad to hear that all is well with your validate.
> We are currently thinking it may be a SAN multipath issue so we are going to take the nodes down to single path and see what happens.
Usually the behaviour you describe is in the networking realm.
First I would try to break any and all teaming, and retest.
Rgds, Edwin.
|
|
I agree with you regarding the networking, and breaking the teaming was the very first thing we did. Then we went through all the network settings with a fine-tooth comb, verified the network drivers, swapped out blades to try different NICs, tried different blade slots to try different ports in the internal blade switch, swapped the public and private networks, and ran sniffer traces of the traffic during failure. We also disabled TCP off-loading and completely reset TCP using netsh.
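The post does not say exactly which commands were run. For anyone following along, these are the netsh commands commonly used on Windows Server 2008 to check and disable the global TCP offload features and to reset the IP stack, wrapped in a small dry-run script so nothing changes by accident. The helper name and the log file name are placeholders. Offloads set in the NIC driver itself (checksum offload, large send offload) are changed in the adapter's advanced properties and are not covered here.

```python
import subprocess

# Assumed command set; adjust to what was actually run on the nodes.
COMMANDS = [
    ["netsh", "int", "tcp", "show", "global"],                     # current settings
    ["netsh", "int", "tcp", "set", "global", "chimney=disabled"],  # TCP Chimney offload
    ["netsh", "int", "tcp", "set", "global", "rss=disabled"],      # Receive Side Scaling
    ["netsh", "int", "ip", "reset", "resetlog.txt"],               # stack reset, reboot afterwards
]


def run_commands(commands, dry_run=True):
    """Return the command lines; execute them only when dry_run is False."""
    lines = []
    for cmd in commands:
        lines.append(" ".join(cmd))
        if not dry_run:
            # Needs an elevated prompt on the node itself.
            subprocess.run(cmd, check=True)
    return lines


for line in run_commands(COMMANDS):
    print(line)
```

Running the same script on a node of the working cluster and comparing the output of the "show global" step is a quick way to confirm both clusters really have the same offload state.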
It was about this time that I posted to this list in order to see if anyone had any ideas other than the blatantly obvious things to try or the things that come back by a well-formed Google search.
Now, having run out of troubleshooting items for the server configuration and networking configuration, we have moved on to see if it is a disk configuration issue. But we are open to any novel ideas with regards to what the heck is going on. This isn't that complicated of a task.
"Edwin vMierlo [MVP]" wrote:
> Usually the behaviour you describe is in the networking realm.
> First I would try to break any and all teaming, and retest.
|
|
Did you try disabling any TCP/IP offload settings in your NICs? I've seen these cause this type of behavior.
Have you tried replacing the Broadcom NICs with another brand?
Regards, John
Visit my blog: http://msmvps.com/blogs/jtoner
"Seth Moupre" <SethMoupre[ at ]discussions.microsoft.com> wrote in message news:6468079C-AB1D-49DA-B243-B76035F65325[ at ]microsoft.com...
> I agree with you regarding the networking and that was the very first thing that we did was to break the teaming. [...]
|
|
Offloading. Yes, that was in the note you responded to..."We also disabled TCP off-loading..."
Can't really do much about the Broadcom NICs. Apparently they are integrated into the blade motherboard. However, we have an identical system with identical NICs installed and working perfectly. Also, we swapped the HBAs into different blades and that didn't seem to help, so it is probably not the brand or bad NICs.
However, by disabling multipath, we suddenly increased cluster stability by a factor of 12. In the previous week, the cluster had not stayed stable for more than about 5 minutes and now it stays up for over an hour before exhibiting the issue. We also now get an event log message saying that the quorum disk failed on the node that fails.
So, we are thinking the evidence points to some kind of storage issue. Our current plan is to replace the HBAs with different HBAs and tie their World Wide Names to the existing LUNs.
I guess I will keep everyone in the loop here and post repeated information as necessary.
"John Toner [MVP]" wrote:
> Did you try disabling any TCP/IP offload settings in your NICs? I've seen these cause this type of behavior.
> Have you tried replacing the Broadcom NICs with another brand?
|
|
Reading is obviously not our strong point in this group ;-)
Have any snippets of the cluster log during the failure that you'd care to share?
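To pull a manageable snippet out of a large cluster log, something like the following could help. It is only a sketch: it assumes the log has been generated as a text file (on Windows Server 2008, "cluster log /g" should do that) and that each line carries a timestamp of the form 2008/05/14-19:33:51.656 followed by a level such as INFO, WARN or ERR. The function name and the 60 second window are arbitrary choices, and cluster log timestamps are normally in GMT, so the failure time has to be converted first.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "<pid>.<tid>::<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(
    r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+(.*)"
)


def filter_log(lines, failure_time, window_seconds=60, levels=("WARN", "ERR")):
    """Return WARN/ERR lines logged within window_seconds of failure_time."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # header or continuation line without a timestamp
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        if match.group(2) in levels and abs(stamp - failure_time) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

Calling filter_log(open(path), datetime(...)) with the time of the 1177 event should give a snippet short enough to post here.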
Regards, John
Visit my blog: http://msmvps.com/blogs/jtoner
"Seth Moupre" <SethMoupre[ at ]discussions.microsoft.com> wrote in message news:02D9CD8E-3CE0-4CC0-81F8-A62D79BF463B[ at ]microsoft.com...
> Offloading. Yes, that was in the note you responded to..."We also disabled TCP off-loading..." [...]
|
|
What version of the Broadcom NIC drivers? Before 4.x, all the TCP offload features are enabled and that causes all sorts of issues. Update to 4.x and all the TCP offload features will be disabled and things actually work.
"Seth Moupre" <SethMoupre[ at ]discussions.microsoft.com> wrote in message news:02D9CD8E-3CE0-4CC0-81F8-A62D79BF463B[ at ]microsoft.com...
> Can't really do much about the Broadcom NIC's. Apparently they are integrated into the blade motherboard. [...]
|
|
Was there any resolution to this issue? We are experiencing the same problem.
Onboard Broadcom drivers are 4.4.16. We also have 2 Intel adapters.
Please pass on any information you might have. Thanks.
Dale Kiefer
"Seth Moupre" wrote:
[Quoted Text]
> Having a problem with one of my Windows 2008 clusters, cluster CL1. We have an identical cluster system up and running, configured and working perfectly, CL2. It now has Exchange Mailbox Cluster Service installed. The second cluster is giving us fits. Here is what is going on:
>
> • Cluster servers are IBM blades
> • Node one is P11, Node two is P12
> • Using a SAN for shared storage
> • The NIC configuration on the nodes is as follows:
>   o Two Broadcom NICS each with their own MAC address
>   o Broadcom NICs are teamed to a third virtual MAC address
>   o 2 VLAN’s sit on top of the teamed NICs, one for Public 192.168.45.x, one for Private 10.168.45.x (Heartbeat)
>   o IPv4 Connectivity only
>   o Windows cluster validation complains about the duplicate MAC address but Layer 3 switching should not have a problem with this
>
> The issue that we are seeing on CL1 is that we can form the cluster successfully on P11 and then add P12 into the cluster. However, after that, intermittently P12 gets kicked out of the cluster and shuts down the cluster service generating an 1177 error in the event log. Basically, this means that it lost quorum due to losing connectivity with the cluster nodes. We can generally force this to occur by failing over a disk from P11 to P12.
>
> The problem does NOT look like it is isolated to P12 however. If we make P12 the cluster owner, then P11 exhibits the problem, if P11 is the owner, then P12 exhibits the problem. Using Node and Disk Majority for quorum setting.
>
> Steps we have taken to troubleshoot:
> • Verified all Windows network configuration settings and bindings are the same on the working cluster as the non-working cluster
> • Verified that the Broadcom teaming configuration settings are the same on the working cluster as the non-working cluster
> • Verified the the NIC drivers and software versions are the same on the working cluster as the non-working cluster
> • Attempted to create the cluster on P12 first and then add P11. This completely blew up when trying to add P11
> • We have evicted P12 and re-added it to the cluster
> • We have destroyed the cluster, uninstalled the Windows Failover Clustering feature on both boxes, rebooted, reinstalled the Failover clustering feature and reformed the cluster
> • We broke the teaming on the Broadcom NICs, destroyed the cluster, etc. Reformed the cluster and get the same exact issue, so teaming is not the problem and, as I indicated, we have one up and running without problems with the teaming.
>
> It should be noted that this machine, CL1, was originally installed with teamed NICs and I raised a concern that this configuration might not work based upon the errors generated in cluster validation. We broke the teaming and formed the cluster successfully. Everything appeared to be fine although I cannot say that we would have noticed this problem originally. Then, to appease the network folks, we tried forming a cluster on CL2, which still had teamed NICs. This worked beautifully so we destroyed the cluster on CL1, reteamed the NICs and tried reforming the cluster.
>
> I am currently at a loss to figure out what is going on. My current suspects are:
> • Something configured incorrectly with base network settings?
> • Something configured incorrectly with NIC teaming?
> • Something configured incorrectly in Layer 3 switching?
> • Something configured incorrectly with the SAN?
> • Some kind of issue caused by destroying the cluster and reforming it, maybe something was not cleaned up properly in AD or on the machines?
> • Some kind of hardware issue?
>
> Anyone seen anything like this? We are working on about 3 days of troubleshooting this thing. It looks for all the world like the nodes are losing network connectivity but we have run “ping -t” from both boxes to the other node’s public and private IP’s and never lost connectivity during failures of the non-owning cluster node.
>
> Any help is greatly appreciated.
|
|
Hi Guys,
Has there been any solution to this problem? I am facing the same issue with my clusters, running Windows 2008 x64 on IBM Blades HS-21, and the 2nd node keeps getting kicked out.
I would be very grateful if someone can help me with a solution.
|
|
We don't have a resolution yet. We are still actively working on it. I will post a solution if/when we find one.
Just curious: are you running Exchange on your cluster or do you have an IBM SAN attached for storage?
|
|
I would again suggest pushing away from looking at the SAN. IMO, this is a network related issue and this is where you should focus your troubleshooting.
Have you guys reproduced this issue using a single NIC? I realize that this is not the recommended network configuration for a cluster, but it would be very telling if the issue occurs with the network configuration simplified.
Regards, John
Visit my blog: http://msmvps.com/blogs/jtoner
|
|
Thanks Guys,
Yes. The 2-node Exchange cluster (IBM Blades HS-21) is connected to an IBM DS-4800 storage system.
I had the SAN vendors in yesterday, and they upgraded both the firmware and the drivers to the latest levels. But the problem still exists.
I had also disabled the heartbeat NICs and am currently running on only a single NIC (mixed mode), but the problem still persists.
Any other clues? What have you guys done about the problem?
|
|
Hey guys,
I can understand your pain and frustration on this, as we were experiencing the same issue. We tried absolutely everything, including phone calls to Microsoft, and everyone always wants you to upgrade either the drivers or the firmware to solve your problem. What seemed to stabilize our environment was FLOW CONTROL. Here is how our environment is set up and what you need to do:
We have 2 nodes in the cluster. Each node is an IBM blade, and they are located in independent chassis. Each node has 6 NICs, 5 of which we are using. We team 2 on the public side, use 2 NICs on the iSCSI side (SAN) with MPIO, and 1 NIC is on the private net for a heartbeat. We use EqualLogic for our SAN storage. Inside the IBM chassis we are using Cisco switches, and our core switch stack is composed of 3750s.
Here is how we came to our resolution (we did experience an issue with our cluster yesterday, but it doesn't look like it is related). On our chassis we noticed informational log entries about duplicate routes existing. So we decided to call our SAN vendor, and they immediately told us that, since we didn't have flow control enabled on the iSCSI NICs or on the switch ports, packets were being dropped. This would explain the disconnects from the quorum drive, which is located on the SAN.
To solve this issue you need to make sure of the following:
- Enable flow control on the virtual ports on the internal chassis switches that your iSCSI NICs are connected to. (You can do this for the entire iSCSI VLAN.)
- Enable flow control on the ports of your core switch that your blade switches plug into.
- Within your iSCSI NIC settings, make sure that Flow Control is set to RX & TX Enabled, Checksum Offload is set to None, and Large Send Offload is set to Disable.
- How you set up flow control on the SAN side depends on your SAN. Since we are using EqualLogic with the latest firmware, the arrays automatically adjust to the network settings of the switch port.
- We use the iSCSI initiator. We made sure that we had 2 target portals set up under the Discovery tab, 1 for each iSCSI NIC, going to the group address of our SAN. Then, for each of the targets listed in the Targets tab, we made 2 connections.
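The NIC-side part of the checklist above is easy to get wrong on one node out of two, so here is a rough sketch of how it could be checked. The property names and values are taken from the wording of this post; the names a given driver actually exposes may differ. The function `find_mismatches` and the sample input are purely illustrative, and how you read the actual settings off the adapter is left out.

```python
# Desired iSCSI NIC settings as described in the post above.
# Assumption: the driver exposes properties under these names and values;
# adjust the keys and values to match what your adapter actually shows.
DESIRED_ISCSI_NIC_SETTINGS = {
    "Flow Control": "RX & TX Enabled",
    "Checksum Offload": "None",
    "Large Send Offload": "Disable",
}


def find_mismatches(actual, desired=DESIRED_ISCSI_NIC_SETTINGS):
    """Return {setting: (actual_value, desired_value)} for each differing setting.

    A setting that is missing from `actual` is reported with an actual value of None.
    The comparison ignores case and surrounding whitespace.
    """
    mismatches = {}
    for name, want in desired.items():
        have = actual.get(name)
        if have is None or have.strip().lower() != want.lower():
            mismatches[name] = (have, want)
    return mismatches


if __name__ == "__main__":
    # Hypothetical values copied by hand from one node's adapter properties.
    node_settings = {
        "Flow Control": "Disabled",
        "Checksum Offload": "None",
        "Large Send Offload": "Enable",
    }
    for name, (have, want) in find_mismatches(node_settings).items():
        print(f"{name}: is {have!r}, should be {want!r}")
```

Running the same check against both nodes (and both iSCSI NICs on each) is a quick way to confirm they really match.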
This is what helped us, and I would like to thank Jill Mansfield of Dell for providing us with this fix. She saved us. Hopefully this will save you. As an FYI, you do not need to enable flow control on your public-facing NICs or virtual switch ports.
Merry Christmas
|
|
|