Sunday, August 15, 2010

Have you had a vendor troubleshooting nightmare?
Ok, so I'm clearly still on my vendor kick. Last night I was lucky enough to sit in on a troubleshooting call with one of our app teams and their vendor. They were having issues with DB performance. Typically when I get on these types of calls I prefer to just lurk instead of getting really involved, because the vendors are usually so stupid I have a hard time not getting upset. They also typically override me when I say something, and app teams usually side with the vendor since it's their product and they should know it best. However, most of the time the problem comes down to straight SQL knowledge and not their product.

So anyway, here I was on this call and I was listening to this tech support guy, we'll call him 'asdf' cause it's easy to type, make one outrageous claim after another. This guy clearly knows so much more about SQL than I do cause he was able to definitively diagnose the problem (several times) with very minimal information. And again, all the while I was merely lurking because I think it's best for everyone if I don't get involved unless I need to. And this is really what I wanted to talk about today... techs who make wild assumptions and diagnose problems before they have anywhere close to enough info to be making such claims. And it's not necessarily vendors who are the most guilty of this, but since I'm picking on vendors these days we'll go ahead and make it about that.

So here's what happened, and I'd like you guys to use this as a template for how NOT to troubleshoot your systems.

The vendor asked me to run a script that pulled version info and dumped the process table so he could see how many users were on the box. He then used that to surmise that the issue was index fragmentation. I said very little and continued to listen. The app team asked him if he was sure this was the problem, and he said yes, so they asked me to rebuild the indexes. I said sure. It took about 6 minutes and they began testing again. The problem was still there, and the tech was confused about why that didn't work, so he looked at the script results again and boldly claimed that our cache/hit ratio wasn't high enough. It was in the low 90s and he said it should be at least 95. So he asked if that was a change I'd be willing to make on the server without testing it in dev first. The exchange went more or less like this:



Asdf: "Your cache/hit ratio isn't high enough".

Me: "OK".

Asdf: "Is that a change you'd be willing to make in prod without testing it in dev first?"

Me: "Is what a change I'd be willing to make?"

Asdf: "The cache/hit ratio."

Me: "Oh, I get it. Yeah, let me see what I can do. I'll check a couple things out. Oh wait, here's the problem. I accidentally set the caching knob to piss-poor. Here, I'll change it to lightning fast and see if that works better."

Asdf: "What are you talking about? I just asked you to set the cache/hit ratio higher."

Me: "I doubt we can add RAM to the server right this second. This system has been running for a while now with no increase in activity so I doubt the RAM is the issue."

Asdf: "Now wait, nobody asked you to add RAM to the box, I'm talking about cache/hit ratio, not RAM."

Me: "Well, the only real way to improve that ratio is by adding RAM, so yeah you did ask me to add RAM."

Asdf: "No, that's not the only way to change it. I've done it before and I'll lookup the setting if you like."

Me: "YES. As a matter of fact I'd love you to lookup that setting."

(never heard anything back from him even though I reminded him a couple times)
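For anyone following along at home, that ratio he was so fixated on is a measurement, not a setting; it comes out of the Buffer Manager counters, and short of adding RAM or shrinking the workload there's no knob for it. A rough sketch of how to read it, assuming nothing about his product:

-- Buffer cache hit ratio is computed from two perf counters; it's read-only.
SELECT (CAST(r.cntr_value AS float) / b.cntr_value) * 100 AS buffer_cache_hit_ratio
FROM sys.dm_os_performance_counters AS r
JOIN sys.dm_os_performance_counters AS b
  ON b.object_name = r.object_name
WHERE r.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base'
  AND r.object_name LIKE '%Buffer Manager%';

And going back a step, the fragmentation diagnosis at the start of the call is just as easy to check instead of guess. Something along these lines (the index and table names in the rebuild are made up):

-- How fragmented are the indexes in this database, worst first?
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- The rebuild they asked for, one index at a time (hypothetical names):
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD;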



Ok, so back to the story. After we exhausted the cache/hit avenue he turned his sights to indexes. The indexes must not be configured correctly, he says. Everyone else says, ok, how do we fix that? He says, well, your DBA is going to need to analyze the indexes to make sure they're efficient enough for the application. Still I said nothing... trust me, it's best.

Then almost as quickly as he'd thought of the index idea, someone said that the query was returning quickly from a different location, so would that still indicate an index problem? He said no, it's got to be network related. (Notice I'm still not saying anything?) So he had them run a ping between all the different locations and the times came back well within normal parameters.

That's when he jumped back onto the index raft, but he quickly jumped back off when he remembered a hotfix for a Windows memory leak. We have to apply this patch right now, he declares. So they run and get someone from the server team who applies the patch, completely untested, on a prod box. We test again and still have the same issue.

I honestly think he forgot about the index solution, because he went straight to MaxDOP. We've got 16 CPUs on that box and the MaxDOP is set too high. It has to be lowered right now before the world explodes. I did ask a small question... how do you know the MaxDOP is too high? What should it be for your app? It should be set to 1, he said. Anything higher is too much. But how do you know it's too high now? Because that's the only thing it could be. Well, it's already at 1, so what would you like me to set it to?

WAIT, we have a log that tells us how long the queries are taking. Hey, you there... app guy, go check your log and see how long your queries are taking. Ok, so they're taking 50 seconds and timing out. So the queries are taking 50 seconds in the DB, and that's way too long, so the DBA needs to fix the indexes. They're probably not as efficient as they can be, so just get him to iron them out and you'll be ok.
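For what it's worth, there's no need to guess what MaxDOP is set to, either; it's sitting right there in sys.configurations. A minimal sketch:

-- What is MaxDOP actually set to right now?
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'max degree of parallelism';

-- If it genuinely needed changing (it didn't), it's an sp_configure change;
-- 'show advanced options' has to be on first:
-- EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
-- EXEC sp_configure 'max degree of parallelism', 1; RECONFIGURE;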

Unfortunately, this is where I jumped back in, and it went something like this:

"You have no way of knowing whether the query was in the DB for 50secs or in the network, etc. You just know that it took 50secs to get data back."

"No, the log says the query took 50secs. That's DB time."

"But there's no way for you to know that. All you're gathering is roundtrip stats. You can't say with any authority at all that the query took a long time in the DB itself."

"Yes we can. This is the problem."

"But you've claimed everything that's crossed your mind so far is the problem. I've sat here and listened to you fumble your way through this call and grasp at every straw you could find and make wild claims without anything to back them up. Now you 're saying that I have to tune the indexes on YOUR product. Your product, which has been working just fine until today."

"Well there's not another explanation."

"Of course there is. I can guarantee you that unless you went in there and physically dropped an index, the indexes are fine. Oh, and btw, I noticed that your fill factors are set at 100% so I'm changing them to 80% so we have some breathing room between maint windows."

"Yeah, you're supposed to set them up correctly, would you like me to send you our maintenance guide so you can see what we recommend as our best practices?"

"Does this guide talk about the appropriate fill factors for your app?"

"Yes it does. It gives full recommendations."

"How does this DB get installed. Did we create the schema ourselves?"

"No, the application creates all the tables and everything."

"But we can change the tables if we wanted to, right? I could go in and change the data types, and column names?"

"No, it has to stay the way it is or the app will stop working."

"But you made us create all of our own indexes. Why doesn't the system come with indexes already created on the tables?"

"It does come with indexes. These are the indexes we created."

"So you're telling me that you installed a DB on our system with indexes that don't follow your best practices in your own guide? Why wouldn't you create the DB initially with your own best practices on it?"

(nothing)

"Dude, that wasn't a rhetorical question. Why did you create indexes that don't follow the best practices you expect us to follow?"

(nothing)
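Side note on that fill factor change I mentioned: it's nothing exotic, just a rebuild option plus a quick sanity check afterwards. A sketch, with a made-up index and table name:

-- Rebuild with some free space left on each page
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD WITH (FILLFACTOR = 80);

-- Verify what the indexes are actually set to (0 means the default, i.e. 100%)
SELECT OBJECT_NAME(object_id) AS table_name, name AS index_name, fill_factor
FROM sys.indexes
ORDER BY fill_factor;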

When all of the dust settled, it turned out there was a problem with a switch that sat between some of the clients and the DB.
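And for the record, if you really want to know whether those 50 seconds were spent in the DB or somewhere on the wire, the engine keeps its own numbers. A rough sketch of that kind of comparison (the TOP 10 cutoff is arbitrary):

-- Average elapsed time per execution, as measured inside SQL Server itself
SELECT TOP (10)
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count / 1000 AS avg_elapsed_ms,
       SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;

-- If the DB-side numbers are nowhere near 50 seconds, the time is being lost
-- between the app and the DB, which is exactly what the bad switch turned out to be.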

OK, I know this post is getting long so I'll just bottom-line this for you now. What I'd like to see is more of the vendors spending the time and effort to train their people in basic troubleshooting techniques. Too many of them right now just do the equivalent of reading tea leaves.
