Group:  English: Windows Server » microsoft.public.windows.server.clustering
Thread: Server 2008 Cluster Issues

Server 2008 Cluster Issues
Dale Kiefer 11/21/2008 7:39:02 PM
There was an identical incident posted at: http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.windows.server.clustering&p=1&tid=c086d541-b2f7-4db1-899a-a0ff03f55183

However, I thought I would pose it as a separate question in the hope of
bringing additional attention to this problem. I am basically copying and
editing the previous poster's information as it is so close to our issue.

We have a problem with a Windows 2008 SP1 cluster. It has Exchange 2007 SP1
with Cumulative Rollup 4 installed. The second node is continually going
down. Here is what is going on:

• Cluster servers are IBM xSeries 3650s
• Using an IBM DS4800 SAN for shared storage
• The NIC configuration on the nodes is as follows:
    o Onboard Broadcom adapter - v4.4.16.0
    o 2 Intel PCI-X adapters
    o 3 network connections set up:
        public - 10.2.105.x, Intel, switch VLAN 1
        private - 10.2.109.x, Intel, switch VLAN 2
        private - 192.168.1.x, Broadcom, crossover cable
    o We set up the 3 network connections to help eliminate the
      network as the issue.
    o IPv4 connectivity only, no teaming
    o Windows cluster validation does not report any issues.

The issue that we are seeing is that intermittently Node 2 gets kicked out
of the cluster and shuts down the cluster service generating an 1177 error in
the event log. Basically, this means that it lost quorum due to losing
connectivity with the cluster nodes. This sometimes happens 3 times an hour,
but might not happen for a few hours. The cluster service will always
automatically restart and everything is fine again for a period of time.

The problem is NOT isolated to Node 2, however. If we make Node 2 the cluster
owner, then Node 1 exhibits the problem; if Node 1 is the owner, then Node 2
exhibits the problem. We are using Node and Disk Majority for the quorum setting.

It looks like the nodes are losing network connectivity to each other based
on the cluster logs indicating the routes as down, but we now have 3 network
connections between the 2 nodes using 3 different adapters from 2 different
vendors. So I doubt this is the issue.

MS believes the issue to be storage related due to "error 170" appearances
in the cluster logs and indicates these are related to persistent reservation
problems. We have installed the latest MPIO from IBM which supposedly
resolves some of these types of issues. However, the problem continues. IBM
is also looking into this, but we await a solution.

Has anyone else run into this problem? Suggestions? Any help is greatly
appreciated.
Re: Server 2008 Cluster Issues
"John Toner [MVP]" <jtoner[ at ]DIE.SPAM.DIE.mvps.org> 11/21/2008 9:38:29 PM
Dale,

This is not an issue with your Quorum disk. Your cluster node SHOULD fail
with a 170 error on the disk (The requested resource is in use) because the
resource is in use by the other cluster node. This is called quorum
arbitration and is supposed to happen when network connectivity fails
between the nodes.

If it were a disk issue, depending on your quorum model, either both nodes
would fail or just the disk resource would fail. Quorum disk failure
wouldn't cause just one node to fail in any quorum model. I'm a little
surprised that Microsoft PSS would even suggest this...you mustn't be
dealing with a cluster person. I'd request escalation to speak with someone
that is familiar with clustering so you have a chance at working towards a
resolution.
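
To make the vote counting concrete, here is a minimal Python sketch of majority quorum. It assumes the usual Node and Disk Majority accounting, in which each node and the disk witness carry one vote; `has_quorum` and the variable names are illustrative only and are not part of any cluster API.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_held > total_votes // 2


# Assumed accounting for a 2-node Node and Disk Majority cluster:
# one vote per node plus one vote for the disk witness.
TOTAL_VOTES = 2 + 1

# The network between the nodes fails. The node that already owns the
# witness disk keeps it, so it holds its own vote plus the disk vote.
owner_votes = 1 + 1
# The other node's arbitration attempt fails (error 170, resource in use),
# so it is left with only its own vote.
challenger_votes = 1

print(has_quorum(owner_votes, TOTAL_VOTES))       # True
print(has_quorum(challenger_votes, TOTAL_VOTES))  # False
```

Under these assumptions, the node that does not own the witness disk is the one that drops out, which would be consistent with the problem following whichever node is not the cluster owner.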

I would say that this is some sort of network issue. Have you disabled any
NIC offloading that your network adapters might be doing? Disable this on
all 6 of your NICs.

Are all of your network connections going thru a common switch/router/hub?
If so, you might want to consider spreading these connections across
switches/routers to see if you can isolate a specific piece of hardware.

Make sure your NIC drivers are running the latest versions.

I'd also recommend dropping down to a single (Intel) NIC and see if the
issue persists. If your issue goes away, this is definitely a NIC driver
issue. You might then try adding back NICs until the issue reoccurs. I've
seen instances, specifically with Broadcom, where having a second network
connection caused the links to constantly bounce...that issue was corrected
in an updated NIC driver.

Hope this helps.

Regards,
John

Visit my blog: http://msmvps.com/blogs/jtoner
Re: Server 2008 Cluster Issues
Elden 11/23/2008 7:07:00 AM
Also look for any applications which may be temporarily blocking all network
traffic... such as a Firewall... or IPSec policies getting refreshed.

If you have a group policy which is causing the IPSec policy to get
refreshed, when IPSec updates its policies it closes all network traffic...
and since clustering is watching the health of the network, when it can't
communicate with the node it considers it a failure.
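
As a rough illustration of why a brief network block can look like a node failure, the Python sketch below models missed heartbeats. The 1-second interval and 5-miss threshold are assumptions chosen for illustration (the real values are cluster settings and may differ on your system), and `node_declared_down` is a hypothetical helper, not a cluster API.

```python
def node_declared_down(outage_seconds: float,
                       heartbeat_interval: float = 1.0,
                       missed_threshold: int = 5) -> bool:
    """Return True if an outage spans enough consecutive heartbeats."""
    # Number of whole heartbeat intervals that fit inside the outage.
    missed = int(outage_seconds // heartbeat_interval)
    return missed >= missed_threshold


# Outage durations in seconds; these are example values only.
for outage in (2.0, 4.9, 5.0, 8.0):
    print(outage, node_declared_down(outage))
```

With these assumed values, a block of a couple of seconds is tolerated, while one that lasts several seconds on every cluster network at once is enough for the partner node to be treated as failed.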
Re: Server 2008 Cluster Issues
Dale Kiefer 11/24/2008 11:58:03 PM
Thanks for the replies. I will let you know where we get to in the next few
days.
Re: Server 2008 Cluster Issues
Dale Kiefer 11/25/2008 7:35:01 PM
It appears we have fixed the problem. IBM noticed in the support logs that
the server wasn't seeing 52 LUNs (26 being presented, 2 HBAs). This led us
to the HBA BIOS/drivers. We were running the drivers/BIOS from the IBM
xSeries site and not the IBM storage subsystem site for our DS4800.
Apparently this is an important distinction. We could not update the HBA
drivers because our DS4800 is not running the latest firmware (this is a
major upgrade and only recommended if required). We did update our HBA BIOS
version and it has now been 24 hours without any problems. The problem
occurred every 1-2 hours and sometimes multiple times in an hour prior to
this update so we are very hopeful that a fix has been found.
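
The LUN figure above is simple arithmetic: each presented LUN should be visible once through every HBA. A small Python sketch of that check follows; the function names are hypothetical, and the observed count used in the second call is a made-up example value, not a number from our logs.

```python
def expected_lun_paths(luns_presented: int, hbas: int) -> int:
    """Each presented LUN should appear once through every HBA."""
    return luns_presented * hbas


def missing_paths(observed: int, luns_presented: int, hbas: int) -> int:
    """How many paths the host fails to see (0 means all are visible)."""
    return max(0, expected_lun_paths(luns_presented, hbas) - observed)


# Figures from the post: 26 LUNs presented through 2 HBAs.
print(expected_lun_paths(26, 2))   # 52
# The observed count of 40 is a made-up example value.
print(missing_paths(40, 26, 2))    # 12
```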

Thank you for your responses. I will post any additional relevant
information if available.
Re: Server 2008 Cluster Issues
Dale Kiefer 11/26/2008 4:36:04 PM
Disregard my previous post; the issues have returned. I will provide further
information later.
RE: Server 2008 Cluster Issues
Priit Vosu 12/1/2008 1:32:19 AM
I seem to have come across similar problems:

We have 2 datacenters, both running identical hardware, and we are testing
an upgrade of Exchange to Exchange 2007.
I built 2 Windows 2008 SP1 clusters, one in each DC, and put the Exchange 2007
clustered mailbox role on them, with standby continuous replication between
them.

The servers are IBM BladeCenter HS21 blades.
Storage is on IBM System Storage DS4700 storage boxes.

We have the latest drivers and patches applied, as well as Exchange SP1 Rollup
Update 5. The system on the receiving end of the standby replication does not
seem to suffer from the issues; however, after a couple of days the Exchange
box that is running live gets the following errors:

Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Level: Error
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 9)' in
clustered service or application 'E2K7CLUS02' failed.

Source: Microsoft-Windows-FailoverClustering
Event ID: 1230
Level: Error
User: SYSTEM
Computer: E2K7NODE2.domain.local
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 10)'
(resource type '', DLL 'clusres.dll') either crashed or deadlocked. The
Resource Hosting Subsystem (RHS) process will now attempt to terminate, and
the resource will be marked to run in a separate monitor.
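
For anyone collecting these events to compare across nodes, a short Python sketch like the following could split pasted event text into records. It assumes the "Field: value" layout shown above, with wrapped description lines continuing the previous field; `parse_events` is a hypothetical helper, not part of any Windows tooling.

```python
import re

# Field names taken from the event text pasted above.
FIELD = re.compile(r"^(Source|Event ID|Level|User|Computer|Description):\s*(.*)$")


def parse_events(text: str) -> list[dict[str, str]]:
    """Split pasted event-log text into one dict per event."""
    events: list[dict[str, str]] = []
    current: dict[str, str] = {}
    last_key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD.match(line)
        if match:
            key, value = match.groups()
            if key == "Source" and current:
                events.append(current)   # a new Source line starts a new event
                current = {}
            current[key] = value
            last_key = key
        elif last_key:
            current[last_key] += " " + line  # wrapped continuation line
    if current:
        events.append(current)
    return events


sample = """Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Level: Error
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 9)' in
clustered service or application 'E2K7CLUS02' failed.

Source: Microsoft-Windows-FailoverClustering
Event ID: 1230
Level: Error
"""

for event in parse_events(sample):
    print(event["Event ID"], event["Level"])
```

Grouping the resulting records by Event ID and timestamp (if you paste that field too) makes it easier to see whether the 1069 and 1230 errors always arrive together.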

After these errors appear, Cluster Administrator hangs, and Windows
Explorer also hangs and is not able to browse the drives.
Exchange, however, continues to work: mail keeps flowing and
clients can still access mailboxes. So far I have found nothing short of a
reboot that restores the functionality of the admin tools.

When I move the cluster over to the standby cluster in the other DC, the same
problems start there.
I will try the firmware upgrade of the storage in one of the DCs in the next
couple of days to see if it has any impact. It does look like it is some sort
of issue between IBM storage or storage drivers and the Windows 2008 SP1
cluster. We have multiple Windows 2003 clusters running off the same storage
boxes and so far we have had no errors on any of them.
Re: Server 2008 Cluster Issues
"John Fullbright" <fjohn[ at ]donotspamnetappdotcom> 12/1/2008 9:59:55 AM
SCC. This sounds a lot like http://support.microsoft.com/kb/953652
RE: Server 2008 Cluster Issues
Dale Kiefer 12/4/2008 12:02:04 AM
We have the exact same issues you are experiencing with the cluster becoming
non-responsive. I have also posted about this in the Exchange forums at:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.exchange.clustering&p=1&tid=f9ee94b3-d92c-4273-a962-f8f2de77d68f

Please let me know how you make out with the storage firmware upgrade or any
other changes. I am not running in a test environment so I don't have the
ability to make changes as easily.

What version of firmware are you upgrading from/to?

I'm glad to see a few more people are bumping into these same problems.
Hopefully we can find a resolution soon.
RE: Server 2008 Cluster Issues
Priit Vosu 12/8/2008 12:54:01 AM
I upgraded the IBM DS4700 firmware from 6.23.05.02 to 7.36.08.00.
I also upgraded the storage MPIO drivers on the Windows 2008 server to version
10.36, and since then the system has been stable for a week. That is not a
very long time, but previously the error occurred much sooner than that.

I am also running standby replication off that server and I still get
occasional Event ID: 2082 errors, but the replication itself is not breaking
and the servers are working correctly so far.

We are planning to make the same upgrade in the production environment in a
couple of days if all is still OK at that time. So far it does seem that
the firmware and MPIO driver update have made the system more stable and
maybe even fixed it.
RE: Server 2008 Cluster Issues
Dale Kiefer 12/8/2008 4:50:01 PM
Thanks for letting me know. On Friday, we also upgraded our SAN firmware at
IBM's recommendation to version 7.36.08.00 (from 6.60). We are running the
latest MPIO as well. The issue still exists though.

I sure hope your issues have disappeared, but I would recommend doing some
failovers on your cluster if you haven't already. I had thought the issue
had been resolved after an HBA BIOS upgrade. However, after a few failovers
(using both the management console and simply rebooting the server), the
issue returned.

Keep me posted as to how things are going on your end.
Thanks.

"Priit Vosu" wrote:

[Quoted Text]
> I upgraded the IBM DS4700 firmware from 6.23.05.02 to 7.36.08.00.
> I also upgraded the storage MPIO drivers on the Windows 2008 server to version
> 10.36, and since then the system has been stable for a week, which is not
> a very long time, but previously the error occurred much sooner than that.
>
> I am also running standby replication off that server and I still get
> occasional Event ID: 2082 errors, but the replication itself is not breaking
> and the servers are working correctly so far.
>
> We are planning to make the same upgrade in a couple of days in the production
> environment if everything is still OK at that time. So far it does seem that
> the firmware and MPIO driver update have made the system more stable and
> maybe even fixed it.
RE: Server 2008 Cluster Issues
Priit Vosu 12/19/2008 2:04:02 AM
It appears the problems have not gone away. The system was up and running without
errors for 2 weeks, and this morning it bluescreened and rebooted with
Bugcheck: 0x00000018
After the server came up again 2 hours later I got Event ID: 1230 again:
Cluster resource 'FileServer-(EXCLUSTER02)(Cluster Disk 1)' (resource type
'', DLL 'clusres.dll') either crashed or deadlocked.
So the storage issues seem to be back. By now we already have a few hundred
users on that server, so it is somewhat of an annoyance. I will open a case
with IBM and see where we go from there.


"Dale Kiefer" wrote:

> Thanks for letting me know. On Friday, we also upgraded our SAN firmware at
> IBM's recommendation to version 7.36.08.00 (from 6.60). We are running the
> latest MPIO as well. The issue still exists, though.
>
> I sure hope your issues have disappeared, but I would recommend doing some
> failovers on your cluster if you haven't already. I had thought the issue
> had been resolved after an HBA BIOS upgrade. However, after a few failovers
> (using both the management console and simply rebooting the server), the
> issue returned.
>
> Keep me posted as to how things are going on your end.
> Thanks.
