Monitoring Service Broker

 

This is an update to my old post [[Service Broker Monitoring Routine]].  The SB monitoring routine from that post was a little difficult to understand, and it tried to combine setting up Service Broker with monitoring a Service Broker solution.  Bad idea.  This version of my monitoring routine is far easier to understand.  

Service Broker scares folks because, well, it's a bit esoteric...with its "poison messages", "dropped queue monitors", and "conversation population explosions".  That gives Service Broker a reputation for being mysterious, a reputation I've tried to debunk in my Service Broker Demystified Blog Series.  

Most people tend to learn a new technology by first using the GUI tools...and SB has almost no GUI tools, especially for monitoring.  My SB_Monitor script is a handy way to monitor SSB.  It starts at the top of the stack with basic checks, like ensuring SSB is enabled, and works its way down until it finally displays any errors that individual activator procedures may be throwing.  

This monitoring routine is MOST valuable when it is run in dev and QA environments, where your code is most volatile.  Unfortunately some shops don't allow developers to have the necessary permissions to run SB_Monitor.  In that case, have your DBA run the routine for you.  

What Does It Monitor?

I won't belabor everything SB_Monitor does and I won't bore you with a bunch of screenshots.  Here is a summary of the tests it runs:  

| Test | Description | Cause and Symptom |
|------|-------------|-------------------|
| 1 | Ensure SSB is enabled in the db | If SSB isn't enabled pretty much nothing will run as expected. |
| 2 | Look for DROPPED queue monitors on activated queues | A dropped queue monitor means that an activator IS NOT running and therefore your queue is not "processing".  This generally means your activator proc is throwing errors.  SB_Monitor diagnoses this further by showing the errors from the SQL Server error log (described below). |
| 3 | Look for queues in the NOTIFIED state for > 10 seconds | This is a WARNING only.  This may mean your queue is REALLY busy, or it could mean your activator is throwing errors.  In this case you need to see if your queues are "functionally" working as you expect. |
| 4 | Activated queues exist in a DISABLED state | This generally means your activator is throwing errors.  In this case try running ALTER QUEUE <queue name> WITH ACTIVATION (DROP);.  Then manually run the activator to determine what the problem is, correct it, and enable activation again. |
| 5 | "Poison Message" detection (queues that are no longer is_receive_enabled) | Strictly speaking, this test looks for queues that are not able to receive messages.  That is almost always due to poison messages.  A "poison message" is any message that causes your activator to ROLLBACK 5 times in a row.  There may be nothing wrong with the message itself; it could be that your activator isn't handling an edge case you didn't code for.  Once you fix the problem, simply run ALTER QUEUE <queue name> WITH STATUS = ON; |
| 6 | Conversation Population Explosion | This checks for an "excessive" number of conversations NOT in a CLOSED state (hardcoded to 500...feel free to change).  Normally this means a "receiver" is not ENDing CONVERSATION properly and therefore the sender is keeping the conversations open.  In this case conversations can actually stay "open" forever, which can lead to problems months or years later. |
| 7 | Conversations "stuck" in sys.transmission_queue | Either you have VERY busy queues or something is misconfigured and messages are not getting to where they need to go.  You may have a remote server down or a queue is disabled. |
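
To make tests 1 through 5 a little more concrete, here is a minimal sketch of the kind of catalog and DMV queries they boil down to.  This is NOT the actual SB_Monitor code.  The "finding" labels are mine, and so are two assumptions: that a "dropped" queue monitor shows up either as a missing row in sys.dm_broker_queue_monitors or as a state of DROPPED, and that last_activated_time is a fair proxy for how long a queue has been sitting in NOTIFIED.

```sql
-- Test 1: is Service Broker enabled in the current database?
SELECT name, is_broker_enabled
FROM sys.databases
WHERE database_id = DB_ID();

-- Tests 2-5: one row per user queue, with a rough diagnosis.
SELECT
    s.name                 AS schema_name,
    q.name                 AS queue_name,
    q.activation_procedure,
    q.is_activation_enabled,
    q.is_receive_enabled,
    m.state                AS monitor_state,
    m.last_activated_time,
    m.tasks_waiting,
    CASE
        -- Test 5: queue cannot receive (usually poison message handling kicked in)
        WHEN q.is_receive_enabled = 0
            THEN 'Receive disabled (possible poison message)'
        -- Test 4 (assumption): an activator is configured but activation is off
        WHEN q.activation_procedure IS NOT NULL AND q.is_activation_enabled = 0
            THEN 'Activation disabled'
        -- Test 2 (assumption): no monitor row, or a DROPPED state, on an activated queue
        WHEN q.is_activation_enabled = 1
             AND (m.queue_id IS NULL OR m.state = 'DROPPED')
            THEN 'Queue monitor dropped or missing'
        -- Test 3 (assumption): NOTIFIED and not activated in the last 10 seconds;
        -- this compares last_activated_time against the server's local clock
        WHEN m.state = 'NOTIFIED'
             AND m.last_activated_time < DATEADD(SECOND, -10, GETDATE())
            THEN 'WARNING: NOTIFIED for more than 10 seconds'
        ELSE 'OK'
    END                    AS finding
FROM sys.service_queues AS q
JOIN sys.schemas AS s
    ON s.schema_id = q.schema_id
LEFT JOIN sys.dm_broker_queue_monitors AS m
    ON m.queue_id = q.object_id
   AND m.database_id = DB_ID()
WHERE q.is_ms_shipped = 0;
```

The DMV in the second query needs server-level permissions (VIEW SERVER STATE on most versions), which is one reason developers sometimes can't run this kind of check themselves.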

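The fix-up steps for tests 4 and 5 might look something like the sketch below.  The queue and procedure names (dbo.TargetQueue, dbo.TargetQueue_Activator) are made up for illustration, and the activation settings in the last statement are placeholders...use whatever your queue was originally created with.

```sql
-- Hypothetical names: dbo.TargetQueue and dbo.TargetQueue_Activator.

-- 1. Stop automatic activation so the activator can be run by hand.
ALTER QUEUE dbo.TargetQueue WITH ACTIVATION (DROP);

-- 2. If poison message handling disabled the queue, turn it back on
--    so that RECEIVE works again.
ALTER QUEUE dbo.TargetQueue WITH STATUS = ON;

-- 3. Run the activator manually; any error now comes back to your session
--    instead of only going to the error log.
EXEC dbo.TargetQueue_Activator;

-- 4. Once the activator is fixed, restore activation.
--    After ACTIVATION (DROP) the settings must be specified again.
ALTER QUEUE dbo.TargetQueue
WITH ACTIVATION (
    STATUS = ON,
    PROCEDURE_NAME = dbo.TargetQueue_Activator,
    MAX_QUEUE_READERS = 1,      -- placeholder value
    EXECUTE AS OWNER            -- placeholder security context
);
```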
 
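Tests 6 and 7 come down to counting rows in two catalog views.  Again, this is a rough sketch and not the SB_Monitor source; the variable name and the grouping columns are my own choices, and 500 is simply the hardcoded threshold mentioned in the table.

```sql
-- Test 6: conversations that are not CLOSED, compared to a threshold.
DECLARE @max_open_conversations int = 500;   -- feel free to change

SELECT
    COUNT(*) AS open_conversations,
    CASE WHEN COUNT(*) > @max_open_conversations
         THEN 'WARNING: conversation population explosion'
         ELSE 'OK'
    END      AS finding
FROM sys.conversation_endpoints
WHERE state_desc <> 'CLOSED';

-- Breakdown by state and far service, to see who is not ENDing CONVERSATION.
SELECT state_desc, far_service, COUNT(*) AS conversations
FROM sys.conversation_endpoints
WHERE state_desc <> 'CLOSED'
GROUP BY state_desc, far_service
ORDER BY conversations DESC;

-- Test 7: messages waiting in the transmission queue.
-- transmission_status usually says why a message has not been delivered.
SELECT
    to_service_name,
    transmission_status,
    COUNT(*)          AS messages,
    MIN(enqueue_time) AS oldest_enqueue_time
FROM sys.transmission_queue
GROUP BY to_service_name, transmission_status
ORDER BY messages DESC;
```

A handful of rows that come and go is normal on a busy system.  Rows that keep getting older, or a transmission_status that keeps repeating the same error, are the ones to chase.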

After these tests are completed SB_Monitor will output a series of result sets that indicate various "statuses".  I like to monitor these results looking for things that seem fishy.  Here's what I output:

  • Contents of various "system queues" and SSB configuration settings.  
  • What queues are currently activating and their status.  
  • Various PerfMon counters around SSB showing message throughputs.  I especially like:
    • Broker Transaction Rollbacks:  this should be 0 in a properly designed system.  
    • Corrupted Messages Total:  better be 0.  If not, why?  
    • Activation Errors Total:  better be 0.  If not, why?  
    • Task Limit Reached:  this means that another activator could've been spun up to process requests but wasn't, because the queue had already hit MAX_QUEUE_READERS.  If this keeps climbing, check MAX_QUEUE_READERS against your expected throughput.  
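
If you want to eyeball those counters without opening PerfMon, something like the following works.  It's a sketch under the assumption that your instance exposes the Broker counter objects under names containing "Broker" (the prefix differs between default and named instances, hence the LIKE); the watch_closely column is just my way of flagging the four counters listed above.

```sql
-- Service Broker counters as exposed through the performance counter DMV.
SELECT
    RTRIM(object_name)   AS object_name,
    RTRIM(counter_name)  AS counter_name,
    RTRIM(instance_name) AS instance_name,
    cntr_value,
    CASE WHEN counter_name IN (N'Broker Transaction Rollbacks',
                               N'Corrupted Messages Total',
                               N'Activation Errors Total',
                               N'Task Limit Reached')
         THEN 1 ELSE 0
    END                  AS watch_closely
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Broker%'
ORDER BY watch_closely DESC, object_name, counter_name;
```

Keep in mind these values are cumulative since the instance last started, so a non-zero number may be old news.  Run it twice and compare if you want to know whether something is still happening.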

Finally I show the last hour of SQL Server error log entries.  Why?  When an activator throws an error it is written to the error log.  An activator runs in the background with no client connection to return the error to, so without the log the error wouldn't be viewable by anyone.  It would be lost forever.  
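
One way to pull that last hour of log entries yourself is sketched below.  I'm assuming the usual three-column output of sp_readerrorlog (LogDate, ProcessInfo, Text); the temp table name is made up, and this only reads the current log file, so anything written before the log was last cycled won't show up.

```sql
-- Load the current SQL Server error log into a temp table, then filter by time.
CREATE TABLE #errorlog (
    LogDate     datetime,
    ProcessInfo nvarchar(100),
    [Text]      nvarchar(max)
);

-- First argument 0 = current log file, second argument 1 = SQL Server error log
INSERT INTO #errorlog (LogDate, ProcessInfo, [Text])
EXEC sys.sp_readerrorlog 0, 1;

SELECT LogDate, ProcessInfo, [Text]
FROM #errorlog
WHERE LogDate >= DATEADD(HOUR, -1, GETDATE())
ORDER BY LogDate DESC;

DROP TABLE #errorlog;
```

Reading the error log requires elevated permissions, so this is another one you may need your DBA to run.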


You have just read "[[Monitoring Service Broker]]" on davewentzel.com.