Friday, August 17, 2007

Strange behavior of Biztalk 2006 on SQL cluster environment

While testing Biztalk application on server farm mode at one of my client, I came across something interesting. This biztalk application used SQL adapter to connect from orchestration to execute stored proc. All tests for server farm went OK. Biztalk server fail over sequence worked fine. Net scala worked fine for web farm. We ran into issue when we tried failing over SQL cluster from original (initial) node to another. On the original node, everything worked perfect. On failing over to second node, I found that Biztalk application would not connect to the custom database to execute stored procedure. We thought may be it is a connection pool caching issue. We tried restarting the physical biztalk servers and hosts. Nothing worked. Then I thought that may be it is something to do with 32 Bit SQL adapter not working properly on 64 bit cluster (not a very smart assumption on retrospection). We also played with MSDTC settings by reducing the authentication level. Nothing worked. The fact that threw me off was that everything was working fine on original node. We only had one physical copy of entire Biztalk database on cluster and not multiple database instances. So, it did not make sense to me as if why the same database was acting differently with different instances of SQL on each node.

I thought logically for 4 hours and tried different things. Nothing worked. I came back next day and starting thinking out of box or rather illogically. We gave explicit permissions to Biztalk service account on the "custom" database that contained stored procedures. We made Biztalk service account database owner and wolla! it worked! Yeah that simple! We could not understand the behavior but it worked. We thought of doing further post mortem. We took the permission away and failed over to original cluster node. It worked as it was working before. For some reasons Biztalk did not care for permissions while working on original node. We failed over again to second node without the permissions and it did not work again. We granted the permissions and it worked!

We left the solution in that working state! But could not explain the reason for this behavior. We have couple of theories but none of them convincing. Is it a Bug? May be. Is it a cluster configuration? May be.

No comments: