Skip to main content

Blogging about SharePoint

Go Search
Home
Blogging about SharePoint
Public Speaking
  

Michael Blumenthal's BlumenthalIT.NET > Blogging about SharePoint > Posts > Digging deep to understand MOSS's behavior
Digging deep to understand MOSS's behavior

I blogged about this problem earlier in my post titled Corrupted Workflow Association?.  At the time, I decided to work around it by renaming the workflow association and then NOT EVER TYRING TO DELETE IT AGAIN.  That was fine in my development environment, however then this happened in our QA environment.  The symptom was similar in the QA environment: I try to delete the workflow, and the page never returns, meanwhile the CPU on the SQL server gets very busy - in this case, sticking at 50-80%  for a long time.  Long enough for us to find a database administrator to kill the blocking process in SQL Server.  Soon as the process dies, the CPU goes back to nearly idle.

So now I had to figure out what it was doing so we could ensure it would NEVER HAPPEN AGAIN.

So some background first. 

We have several environments, and we do code migrations from one to another with a fair amount of structured process.  I develop workflow solutions on my virtual machine (VM), a MOSS Windows 2003 server with a local SQL Server and Visual Studio and a number of other tools.  The VM is a domain member. I package my code into a Web Solution Package (WSP) and upload the WSP to the "Deployment Control List" - a SharePoint List that the Farm Admin created to help structure the deployment process.  The farm admin downloads the WSP, other related files, and the link to the install doc from that list.  He then deploys it to the INTEGRATION environment.  Successfully INTEGRATION deployment means we can deploy to QA later in the week.  Likewise, successful deployment to QA means we can go to PRODUCTION.  At this client, we only go to QA once a week and to PRODUCTION once a week.  Moves to production can happen either the following day or the following week after a successful deployment to QA, depending on how much time in QA is desired.

To summarize, the code migration path is DEV > INT > QA > PROD. 

All four of our environments are MOSS Enterprise Edition.  The size of each farm varies from an all-in-one (though NOT using SQL Express!) environment for my Dev VM to QA and PROD that have two Web Front Ends, several application servers (including a SQL Server Reporting Services server), and a clustered SQL Server back end.  QA is virtualized, while I believe PROD is physical servers.

This is release 2 of this particular workflow solution, and yesterday was about the fourth time it was deployed to QA, and about the 9th time it had been deployed to INT.  In other words, we had been finding and fixing bugs for the last four weeks, as well as adding features.  The business analyst I work with is understandably more than ready to get this version into production and done with.

In hindsight, it's important to note that an intermediate build of V2 of this solution - call it v2.2, with the current build being 2.9 - had a bug in it where it could end up in a loop that created many, many tasks in the task list (THOUSANDS apparently).  This was fixed in build 2.7.  Build 2.2 made it as far as QA, but did not go to PRODUCTION.  The trigger that would send the v2.2 workflow into the task creation loop was reassigning a task.  Our testing process leans toward the informal, and because it wasn't the focus of our change in v2.2, I think we had only tested the reassign functionality in QA on a whim.  When it broke, I went back into DEV, and recreated the problem.  The net effect was that both Dev and QA had thousands of tasks, whereas Integration did not.  V1 of this workflow has been running in Production for over a month and has only 300-400 tasks in Production.

Why is the number of tasks important? Well, I'm getting ahead of my self, but you will see.  Let's go back to the front end and the primary symptom.

When you delete a particular workflow association in my Dev environment, the CPU hits the roof, stays there, and interactive performance takes a nosedive.

First of all, what is "removing a workflow association"?

Remove Workflows Screenshot

It's selecting Remove, then clicking OK, on the Remove Workflows page (get there from List Settings, then Workflow Settings).

So what happens when you click OK? Well, on the MOSS Web Front End side, you could crack open .NET Reflector and take a look. However, the symptom her is that the SQL Server process goes nuts, NOT W3P.  Therefore, let's not look at the .NET code, but rather we crack open SQL Profiler and have it trace the SQL activity between the Web Front End and the database.  We start the trace right before we click the OK button.  We stop the trace once we see that the process has definitely gone nuts.  Confirmation that SQL has definitely gone nuts is as simple as running Task Manager and watching the CPU Usage graph go solid bright green and the CPU History Graph(s) jump up and stay high.  Soon as you see that, it's time to restart SQL Server or kill the offending SQL Server Process.  During my investigations, I had the Services MMC open so that I could quickly restart SQLServer in one click.  In my Dev environment, if you let it run to long, the machine becomes unusable (e.g. you click and it may respond a minute later if it can spare the CPU cycles to deal with users and UI).

We also had SQL Server Management Studio up, and had the Activity Monitor going.  This time we got lucky and captured a particularly relevant SQL Statement as the statement that was causing SQL anguish:

image

Now we cross referenced this back to our SQL Trace in SQL Profiler, and did a search for DropWorkflowAssociation, and sure enough we found the following statement:

exec proc_DropWorkflowAssociation '7082EA6F-E14E-4B3E-9581-C10AB17F0A12','9FE5442A-C3DA-443E-B755-48A50D9BD9A4'

Oh, great...GUIDs.  <Sarcasm>They make everything so much more readable!</Sarcasm>.  However, a web search of proc_DropWorkflowAssociation reveals that the stored procedures in MOSS are actually documented in a PDF on MSDN!  http://msdn.microsoft.com/en-us/library/cc313104.aspx

In fact, if you look at section 3.1.4.29 on page 70, it provides the signature for the stored procedure call, and explains the parameters.

PROCEDURE dbo.proc_DropWorkflowAssociation ( @SiteId uniqueidentifier, @Id uniqueidentifier, @DropAll int = 0 );

So the first parameter is the site collection GUID, the second parameter is the workflow association GUID, and the third parameter is not specified so it defaults to 0.

So, is '7082EA6F-E14E-4B3E-9581-C10AB17F0A12' the GUID of the site collection that contains the site that has the list with the bad workflow association?  It sure is, and a line of PowerShell proves it:

Get-SPSite http://server/managedpath/pathToSiteCollectionRootSite | foreach {$_.Id.ToString(); $_.Dispose()}

That returns the GUID that is used in parameter 1.  See www.codeplex.com/PSBB and my other posts on PowerShell for more information about using Get-SPSite.

The second parameter is indeed the GUID for the workflow association.  We can verify that later on.

So now that we know we have found the relevant statement, why is it driving the processor nuts?  What does DropWrokflowAssociation actually do? Well, I don't want to post the entire stored proc here (might be a violation of MSFT's Intellectual Property), but if you have access to the SQL Server for a MOSS installation, you can take a look at it yourself - just look at the stored procedures for the relevant Content Database.  Nonetheless, I think I can discuss the procedure at the pseudocode level without getting in trouble.  What it does

  1. Get the GUID of the SPWeb and the GUID of the SPList from the WorkflowAssociation Table where they are for the SPSite and WorkflowAssociation that we passed in.
  2. If we find it, and since DropAll is zero, we call the following procedure:

    proc_AutoDropWorkflows @SiteId, NULL, @ListId, NULL, @Id, NULL, 1 , 200

  3. If #2 deleted all workflows, then delete the workflow association from the workflow association table.

So let's take a look at step #2 because it calls another stored procedure.  What does AutoDropWorkflows do?  Back to the PDF documentation, this time section 3.1.4.9 on page 50.

PROCEDURE dbo.proc_AutoDropWorkflows ( @SiteId uniqueidentifier, @WebId uniqueidentifier, @ListId uniqueidentifier, @ListItemId int, @TemplateId uniqueidentifier, @AutoCleanupDate datetime, @ForceDelete int, @TopBeforeQuick int = 2147483647 );

Ok, so we are passing a null for the Web Id, a Null for the List Item Id, a null for the Auto Cleanup Date, a 1 for ForceDelete, and Top Before Quick is 200.  Reading through the PDF document [apparently referred to as MS-WSSPROG],  if WebID is null, the action is for all sites (SPWebs) in the Site Collection.  Ok, in our scenario, that seems unnecessary since we are deleting a workflow association in a specific site (SPWeb).  If you run this procedure and specify the GUID of the SPWeb that has the list with the workflow association, does that make this work?   Well, there's a problem with trying to answer that question - you are modifying the Content Database directly via T-SQL, NOT through through the SharePoint API.  In other words, DON'T DO THIS.  If you execute any T-SQL commands other than select statements, Microsoft Support will no longer help you should you call them.  What? You say you want to do it anyway? It's a development database after all.  OK, then you MUST do a Full Backup of the Content Database before you run any T-SQL command that will edit anything.  Now that you have done a Full Backup, are you ready to play with the T-SQL command? NO. NO. NO.  You don't know you have a valid backup until you successfully restore from it.  So edit a list item somewhere, then restore the backup and make sure the site comes up with the list item in its previous state.

Now that you have backed up the content database, AND PROVEN VIA A RESTORE THAT THE BACKUP WORKS, you can now muck around with T-SQL in the Content Database because you can always get back to a supported state.  So let's try this: If we run AutoDropWorkflows @SiteId, @WebId instead of NULL, @ListId, NULL for the List Item ID, @Id for the workflow association, NULL for the AutoCleanupDate, 1 for ForceDelete , 200 for TopBeforeQuick, does it work better? No, the processor still goes nuts.  So let's take a look at the rest of what it does.

ListItemID - is being passed as NULL, so all list items are being worked with.  That seems ok, though what list items are these?  The items in the list with the workflow association. 

TemplateID is the workflow association. 

AutoCleanupDate is irrelevant since ForceDelete is 1 - again, per the MS-WSSPROG doc.

ForceDelete is 1, and when this is one, it deletes the first 200 workflows. The 200 is specified in the TopBeforeQuick parameter.

Note that the return value of this procedure is 1 only if it deleted all the workflows without hitting the TopBeforeQuick limit.  Only if the return value is 1 does the calling procedure, Drop All Workflows, actually do step 3 above (deleting the workflow association from the association table) However, since when I ran this, it never returned, I don't know if it will return  1 or 0.

Now thinking about my setup, I only have one or two workflow associations set up on this list, so if I am interpreting that right, since 2 < 200, I will never hit that limit.  Let's take a look into the body of Auto Drop Workflows to see why it would peg the CPU and apparently never return.

As I look at the body of Auto Drop Workflows, the fourth line after the AS statement defines a cursor.  Ah-ha, a looping construct!  A loop gone bad is consistent with the CPU going nuts. Cursors are what you use in T-SQL when you have to walk through a list item by item instead of being able to deal with a set of records as a whole.    What does the cursor do?  Well, for each workflow instance, it calls

proc_DropWorkflow @WorkflowInstanceId, @SiteId, @WebId, @ListId

Drop Workflow does the following:

  1. Looks up the TaskListID (among other things) in the first select statement
  2. Conditionally decrements an instance counter in the WorkflowAssociation table
  3. Conditionally calls proc_RemoveFailover
  4. Deletes from the Workflow table where the site and Workflow instance ID match.
  5. Calls proc_DropWorkflowTasks @WorkflowInstanceId, @SiteId, @WebID, @TaskListId
  6. Conditionally calls proc_DeleteContextCollectionEventReceivers and deletes from ScheduledWorkItems

Step 5 is the important one.  Looking at it, it declares another cursor based on data in the AllUserData table, and walks through that data calling proc_DropListRecord on each row.

Let's take a closer look at that.  Note that the documentation for this stored proc is in MS-WSSDLIM.pdf, not MS-WSSPROG.pdf.  You should download the whole package of PDFs and set up Desktop Search or MOSS Search to index them.  This way it is easy to find which document contains a given stored procedure.

EXEC proc_DropListRecord @SiteID, @WebID, @TaskListId, 0, @ItemId, 1, 0, NULL, NULL, 0, 0, NULL, 3

Look at the definition of this stored procedure.  This isn't  just a simple delete from a table either.  This stored proc has to take into account use of the recycle bin, the need to audit, and perhaps other list settings.  Also, there is no one to one correspondence between that Tasks list and a "Tasks" table in the database. 

More specifically, Drop List Record does the following:

  1. Looks up a bunch of values
  2. If it finds them, then it calls proc_VerifyUpdateConditions
  3. It calls proc_DeleteContextCollectionEventRecievers
  4. If Auditing is needed, it calls proc_AddAuditEntryFromSql
  5. it calls proc_DeleteUrlCore
  6. If it's a meeting list it ... well, never mind, it's not a meeting list
  7. It calls proc_UpdateDiskUsed

That's a lot of stuff to do for deleting a list item!  I suspect that some of those may call other stored procedures, but I have to end this somewhere, so I think this is deep enough for now.  We are now 4 layers deep, inside two cursors.  No wonder this takes a while and makes a SQL Server busy.

The net effect of the call we started with, Drop Workflow Association, is that it is trying to individually drop each task associated with each list item that used the workflow used by the workflow association in question.  When you have several thousand tasks, and you have to do stuff to each one of them, it's not quick to get rid of them.  From this I can conclude that the process to drop a workflow, wasn't really stuck spinning doing nothing.  It was working its way through a very long list, but in my two environments, doing it so slowly that it would have taken hours (at least in my Dev VM) to complete.  No wonder it looked like the SQL server was going nuts.

Overall, my resulting request to the SharePoint Product Team is that if there is more than some number of tasks in the task list, then deletion of a workflow association should be treated as a long running operation, at least from a UI perspective. It should present a UI similar to the long running operation screen you get when you click on the "hurricane button" to update a Business Data column in a list.  Unless of course they have a better way.

--Michael

Comments

Thanks Michael!

I have just spent a considerable amount of time trying to understand why my workflow instances were hanging around after I had removed a workflow from a list. Now I understand, we have a LOT of instances and only the first 200 were actually being deleted, the rest just being flagged for cleanup.
To make our matters worse we have disabled auto workflow cleanup :-(
Your article also provides some great pointers for how to dig deep, the tools to use and how to avoid becoming unsupported.
Much appreciated.
Louise P
at 8/4/2009 9:22 AM

Re: Digging deep to understand MOSS's behavior

 A nice & a meaningful article. the article  provides some great pointers for how to dig deep, the tools to use and how to avoid becoming unsupported. Thanks.
<a href="http://www.profitablegreensolutions.com">going green solutions</a>
at 10/29/2009 11:03 PM

Digging deep into MOSS

Thank you for posting this article. This has provided some valuable insight for our team!
at 5/11/2010 6:28 AM

Add Comment

Items on this list require content approval. Your submission will not appear in public views until approved by someone with proper rights. More information on content approval.

Title


Body *


Please put your name at the end of your comment as the Name field doesn't display in the main view of the comments.

Name: *


Your name.  If you first name is Michael, please provide initials or the like to clearly distinguish your name from Michael Blumenthal's.
Also, this field does not appear in the main view of the comments.

Need Paid Consulting Services?

Are you interested in engaging a SharePoint consultant for fee-based consulting services?  If so, I can have an Account Executive from my employer contact you.  Other site visitors will not see this column.

Additional Contact Information

Please provide additional contact information if you need a direct response.  Other site visitors will not see this column.

What is tomorrow's date? *

This is a weak CAPTCHA based on advice from a tweet from @ToddKlindt.
Attachments