> Thanks for letting me know. On Friday, we also upgraded our SAN firmware at
> IBM's recommendation to version 7.36.08.00 (from 6.60). We are running the
> latest MPIO as well. The issue still exists though.
>
> I sure hope your issues have disappeared, but I might recommend doing some
> failovers on your cluster if you haven't already. I had thought the issue
> had been resolved after an HBA BIOS upgrade. However, after a few failovers
> (using both the management console and simply rebooting the server), the
> issue returned.
>
> Keep me posted as to how things are going on your end.
> Thanks.
>
> "Priit Vosu" wrote:
>
> > I upgraded the IBM DS4700 firmware from 6.23.05.02 to 7.36.08.00.
> > I also upgraded the storage MPIO drivers on the Windows 2008 server to version
> > 10.36, and since then the system has been stable for a week. That is not a
> > very long time, but previously the error occurred much sooner than that.
> >
> > I am also running standby replication off that server and I still get
> > occasional Event ID: 2082 errors, but the replication itself is not breaking
> > and the servers are working correctly so far.
> >
> > We are planning to make the same upgrade in a couple of days in the production
> > environment if all is still OK at that time. So far it does seem that
> > the firmware and MPIO driver update have made the system more stable and
> > maybe even fixed it.
> >
> >
> > "Dale Kiefer" wrote:
> >
> > > We have the exact same issues you are experiencing with the cluster becoming
> > > non-responsive. I have also posted about this in the Exchange forums at:
> > >
> > >
> > > > http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.exchange.clustering&p=1&tid=f9ee94b3-d92c-4273-a962-f8f2de77d68f
> > > >
> > > Please let me know how you make out with the storage firmware upgrade or any
> > > other changes. I am not running in a test environment so I don't have the
> > > ability to make changes as easily.
> > >
> > > What version of firmware are you upgrading from/to?
> > >
> > > I'm glad to see a few more people are bumping into these same problems.
> > > Hopefully we can find a resolution soon.
> > >
> > > "Priit Vosu" wrote:
> > >
> > > > I seem to have come across similar problems:
> > > >
> > > > > We have 2 datacenters, both running identical hardware, and we are testing
> > > > > an upgrade of Exchange to Exchange 2007.
> > > > > I built 2 x Windows 2008 SP1 clusters, one in each DC, and put the Exchange 2007
> > > > > clustered mailbox role on them, with standby continuous replication between
> > > > > them.
> > > >
> > > > Hardware is running on IBM BladeCenter HS21 blades
> > > > Storage is on IBM System Storage DS4700 storage boxes.
> > > >
> > > > > We have the latest drivers and patches applied, and also Exchange SP1 Rollup Update 5.
> > > > > The system on the receiving end of the standby replication does not
> > > > > seem to suffer from the issues; however, the Exchange box that is running live
> > > > > gets the following errors after a couple of days:
> > > >
> > > > Source: Microsoft-Windows-FailoverClustering
> > > > Event ID: 1069
> > > > Level: Error
> > > > Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 9)' in
> > > > clustered service or application 'E2K7CLUS02' failed.
> > > >
> > > > Source: Microsoft-Windows-FailoverClustering
> > > > Event ID: 1230
> > > > Level: Error
> > > > User: SYSTEM
> > > > Computer: E2K7NODE2.domain.local
> > > > Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 10)'
> > > > (resource type '', DLL 'clusres.dll') either crashed or deadlocked. The
> > > > Resource Hosting Subsystem (RHS) process will now attempt to terminate, and
> > > > the resource will be marked to run in a separate monitor.
> > > >
> > > > > After these errors appear, the Cluster Administrator hangs, and Windows
> > > > > Explorer also hangs and is not able to browse the drives.
> > > > > Exchange, however, continues to work: mail keeps flowing and
> > > > > clients can still access mailboxes. So far I have found nothing short of a
> > > > > reboot that restores the functionality of the admin tools.
> > > >
> > > > > When I move the cluster over to the standby cluster in the other DC, the same
> > > > > problems start there.
> > > > > I will try the firmware upgrade of the storage in the next couple of days in
> > > > > one of the DCs to see if it has any impact. It does look like some sort
> > > > > of issue between the IBM storage or storage drivers and the Windows 2008 SP1
> > > > > cluster. We have multiple Windows 2003 clusters running off the same storage
> > > > > boxes and so far we have had no errors on any of them.
> > > >
> > > >
> > > >
> > > > "Dale Kiefer" wrote:
> > > >
> > > > > There was an identical incident posted at:
> > > > >
> > > > > > http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.windows.server.clustering&p=1&tid=c086d541-b2f7-4db1-899a-a0ff03f55183
> > > > > >
> > > > > > However, I thought I would pose it as a separate question in the hope of
> > > > > > bringing additional attention to this problem. I am basically copying and
> > > > > > editing the previous poster's information as it is so close to our issue.
> > > > >
> > > > > We have a problem with a Windows 2008 SP1 cluster. It has Exchange 2007 SP1
> > > > > with Cumulative Rollup 4 installed. The second node is continually going
> > > > > down. Here is what is going on:
> > > > >
> > > > > > • Cluster servers are IBM xSeries 3650s
> > > > > > • Using an IBM DS4800 SAN for shared storage
> > > > > > • The NIC configuration on the nodes is as follows:
> > > > > >   o Onboard Broadcom adapter - v4.4.16.0
> > > > > >   o 2 Intel PCI-X adapters
> > > > > >   o 3 network connections set up:
> > > > > >     - public - 10.2.105.x, Intel, switch VLAN 1
> > > > > >     - private - 10.2.109.x, Intel, switch VLAN 2
> > > > > >     - private - 192.168.1.x, Broadcom, crossover cable
> > > > > >   o We set up the 3 network connections to help eliminate the
> > > > > >     network as the issue.
> > > > > >   o IPv4 connectivity only, no teaming
> > > > > >   o Windows cluster validation does not report any issues.
> > > > >
> > > > > The issue that we are seeing is that intermittently Node 2 gets kicked out
> > > > > of the cluster and shuts down the cluster service generating an 1177 error in
> > > > > the event log. Basically, this means that it lost quorum due to losing
> > > > > connectivity with the cluster nodes. This sometimes happens 3 times an hour,
> > > > > but might not happen for a few hours. The cluster service will always
> > > > > automatically restart and everything is fine again for a period of time.
> > > > >
> > > > > > The problem is NOT isolated to Node 2, however. If we make Node 2 the cluster
> > > > > > owner, then Node 1 exhibits the problem; if Node 1 is the owner, then Node 2
> > > > > > exhibits the problem. We are using Node and Disk Majority as the quorum setting.
> > > > >
> > > > > It looks like the nodes are losing network connectivity to each other based
> > > > > on the cluster logs indicating the routes as down, but we now have 3 network
> > > > > connections between the 2 nodes using 3 different adapters from 2 different
> > > > > vendors. So I doubt this is the issue.
> > > > >
> > > > > MS believes the issue to be storage related due to "error 170" appearances
> > > > > in the cluster logs and indicates these are related to persistent reservation
> > > > > problems. We have installed the latest MPIO from IBM which supposedly
> > > > > resolves some of these types of issues. However, the problem continues. IBM
> > > > > is also looking into this, but we await a solution.
> > > > >
> > > > > > Has anyone else run into this problem? Suggestions? Any help is greatly
> > > > > appreciated.
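
For anyone following the quorum reasoning in the original post: with Node and Disk Majority on a two-node cluster there are three votes (two nodes plus the witness disk), and a partition needs a strict majority to keep running. The sketch below is only a simplified vote-counting model, not how the cluster service is implemented; the function names `has_quorum` and `partition_votes` are made up for illustration.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_held > total_votes // 2


def partition_votes(nodes_reachable: int, owns_witness_disk: bool) -> int:
    """Count votes for one partition: one per node, plus one for the disk."""
    return nodes_reachable + (1 if owns_witness_disk else 0)


# Two nodes plus one witness disk, as in the configuration described above.
TOTAL_VOTES = 2 + 1

# A node that loses contact with its partner and does not hold the disk.
isolated_without_disk = partition_votes(1, owns_witness_disk=False)
# The other node, which still holds the witness disk reservation.
isolated_with_disk = partition_votes(1, owns_witness_disk=True)

print(has_quorum(isolated_without_disk, TOTAL_VOTES))  # False
print(has_quorum(isolated_with_disk, TOTAL_VOTES))     # True
```

Under this model, whichever node cannot see its partner and does not hold the disk drops out. That would be consistent with the problem following the non-owner node, and with reservation trouble on the storage side mattering as much as the network links.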