One of the potential symptoms of the issue is that we have intermittent long running web method calls. The web methods are fairly simple - we make a sproc call, and then populate and return objects with the data. The SQL DBAs have checked the performance of the sproc calls during the time that the web methods are performing slowly and have assured us that the sproc calls are returning in 200ms or less.
The CPU load on the load balanced web servers seem to be reasonable, so we have been speculating that the slow web method calls are either tied to the mysterious SQL Server connection issue, or due to some unknown issue while trying to connect to the database.
We decided to take a number of steps to try to improve performance. If the issue is due to the MSC (mysterious SQL connection) issue, then our changes won't make much of a difference. However, we might find that our code is either exacerbating the MSC issue, or the direct cause of the slow web method calls. It's doubtful that our code is the sole cause of the MSC issue since multiple non-related SQL servers have experienced the MSC issue.
Step 1: Add extra performance counters to try to pinpoint where the slow performance issue is occurring
We added performance counters around some of our database connection code, and around the sproc call, to see if there are any performance issues. The tool we use to view performance counters was missing the new counters we added that would help show how often new connections are being made. That should hopefully be fixed soon.
Step 2: Change the connection from using named pipes to TCP
Is it better to use TCP than named pipes? According to MSDN, the performance of using named pipes and TCP is virtually identical on fast networks. However, if you are on a slow network, and the SQL Server database is running on a different machine than the application that will be connecting to it, then TCP could give you better performance. The documentation made it sound as though named pipes might give better performance if the SQL Server instance was on the same machine as the app that is accessing the database. That is not our configuration, so it appears that there is no benefit to using named pipes. However, using TCP will be less "talky" (needs less communication back and forth to perform reads) than named pipes, and it can also help streamline communication by using backlog queues.
I hadn't realized that there might be a potential performance benefit of using TCP. It shows me that I really should read through the documentation for technologies that I use regularly. It will help me be a lot more thoughtful about how and why I am doing something. Does using the default protocol (named pipes) work? Yes. However, we also need to be mindful of the performance our services have when fetching data from our databases.
Step 3: Change the connections to be in a form that will allow connection pooling (if the app isn't already using connection pooling)
I'm pretty new to C# / .Net, so some of what I've been told by my teammates has been taken on faith. However, I want to learn, so I did some reading. It appears that the connection pooling should happen regardless of using SqlConnection or DbConnection. I'll need to look closer at the code, and config files, to find out whether or not our app is disabling pooling, but it might be that we are using pooling, but the following is occurring due to the MSC issue. From an MSDN article on pooling:
When connection pooling is enabled, and if a timeout error or other login error occurs, an exception will be thrown and subsequent connection attempts will fail for the next five seconds, the "blocking period". If the application attempts to connect within the blocking period, the first exception will be thrown again. After the blocking period ends, another connection failure by the application will result in a blocking period that is twice as long as the previous blocking period. Subsequent failures after a blocking period ends will result in a new blocking periods that is twice as long as the previous blocking period, up to a maximum of five minutes.So, if the mysterious SQL connection issue is causing a connection to fail, then we should expect exponentially longer waits (up to 5 minutes) for each failure to connect. It would explain why our web method call is taking hardly any time to return usually, and then takes 30+ seconds at other times.
I just need to prove that we are using connection pooling, or find out why we are not using connection pooling. I hope that we are using connection pooling, because the connection failure blocking period seems to be a great explanation for what we are seeing.
Leave a comment, or send an email, if you have any troubleshooting suggestions.