Skip to main content

Blogging about SharePoint

Go Search
Home
Blogging about SharePoint
Public Speaking
  

Michael Blumenthal's BlumenthalIT.NET > Blogging about SharePoint > Categories
MOSS Search Fails during Infrastructure Updates Install (Problem and Resolution)

image

“Failed to start service SearchServiceInstance on this server after completing upgrade.  Please start it manually.”

The service it is talking about is Office SharePoint Search.  PSConfig stays on the “task 8” screen for a while, though it does later complete “successfully”. 

Here’s the other symptoms:

When I am logged in as MOSSsa, my Farm Admin account which is a member of Administrators, and I try to start the service I get:

image

Let’s help the search engines:  The error shown is “Error 183: Cannot create a file when that file already exists.”

When I logged in as Administrator (my box/domain admin – this machine is a Domain Controller), I got a different message once (approximately: service started then stopped, which sometimes is OK), but on a subsequent attempt, got the same message as above.

Resolution/Workaround: Add MOSSSearchSvc to the Administrators group, start the service.  The service starts successfully.  Stop the service, remove it from the group, then restart the service.  No complaints.  Apparently, it just needed extra permissions long enough to remove the offending file. My guess is that the file was created by MOSSsa since I was logged in as MOSSsa when I ran the Infrastructure Update.

--Michael

UPDATE 24 April 2009: The following day, SharePoint Search stopped again, with the same error.  I added it permanently to the Administrators group.  A friend of mine pointed out that TechNet does say that the Search Service should be in the Administrators group, so this is actually the recommended configuration.

Need More Permissions than Sec Admin and DB Creator to Attach Content Database in SQL 2008

So I decided that the best way for me to get some content into this new MOSS farm that I created was to attach a content database.

I logged into my farm server (it’s a demo environment so it’s all on one box). I logged in as my Farm Admin account.

I went into SQL Server Management Studio, right clicked the Databases folder, and then clicked Attach.  Got this:

image

That’s not what I was expecting.  After asking one of our SQL gurus to take a look and having him tell me that he’d never seen this one before, I feared my install of SQL Server 2008 was broken and I’d have to reinstall….  But one last thing came to mind – I was logged in as my Farm Admin account, not the box admin/domain admin (machine is a Domain Controller so box admin is domain admin), so I only had Security Admin and DB Creator roles, not full control / system admin.  Sure enough, when I logged into the machine as domain admin, the Attach dialog worked.

Hopefully, the attach will achieve what I want as well…

--Michael

Building a Demo VM in a Highly Scripted Way

Currently, I am setting up a single server MOSS virtual machine which will host a demo site that will be used by part of Magenic’s sales force.  The demo site will represent a repeatable service offering.  Additionally, another project team in my office is working on a MOSS and Commerce Server solution that will have multiple concurrent MOSS developers.  We are planning to give each developer their own single server farm.  I believe the best way to do this is through a scripted install.  Because of these two projects, I have been building out a MOSS VM using command line scripts and answer files as much as I could.

To this end, I have created an answer file for DCPROMO, a SQL Server 2008 answer file.  a batch file that runs PSCONFIG, and batch files that run several STSADM commands.  I also wrote  a PowerShell script to create users, configure them, and put them in the right OrganizationalUnit.  Plus I am documenting almost everything.

I’ll do my best to share these later in the week. In the mean time, I wanted to cite a few resources: Alpesh Nakars on having good examples of these: http://www.alpesh.nakars.com/blog/stsadm-command-line

Also http://www.powershellpro.com/powershell-tutorial-introduction/powershell-tutorial-active-directory/ This one showed me how to get the user to show up in the right  OU.

 

That’s enough for the moment.

Michael

Digging deep to understand MOSS's behavior

I blogged about this problem earlier in my post titled Corrupted Workflow Association?.  At the time, I decided to work around it by renaming the workflow association and then NOT EVER TYRING TO DELETE IT AGAIN.  That was fine in my development environment, however then this happened in our QA environment.  The symptom was similar in the QA environment: I try to delete the workflow, and the page never returns, meanwhile the CPU on the SQL server gets very busy - in this case, sticking at 50-80%  for a long time.  Long enough for us to find a database administrator to kill the blocking process in SQL Server.  Soon as the process dies, the CPU goes back to nearly idle.

So now I had to figure out what it was doing so we could ensure it would NEVER HAPPEN AGAIN.

So some background first. 

We have several environments, and we do code migrations from one to another with a fair amount of structured process.  I develop workflow solutions on my virtual machine (VM), a MOSS Windows 2003 server with a local SQL Server and Visual Studio and a number of other tools.  The VM is a domain member. I package my code into a Web Solution Package (WSP) and upload the WSP to the "Deployment Control List" - a SharePoint List that the Farm Admin created to help structure the deployment process.  The farm admin downloads the WSP, other related files, and the link to the install doc from that list.  He then deploys it to the INTEGRATION environment.  Successfully INTEGRATION deployment means we can deploy to QA later in the week.  Likewise, successful deployment to QA means we can go to PRODUCTION.  At this client, we only go to QA once a week and to PRODUCTION once a week.  Moves to production can happen either the following day or the following week after a successful deployment to QA, depending on how much time in QA is desired.

To summarize, the code migration path is DEV > INT > QA > PROD. 

All four of our environments are MOSS Enterprise Edition.  The size of each farm varies from an all-in-one (though NOT using SQL Express!) environment for my Dev VM to QA and PROD that have two Web Front Ends, several application servers (including a SQL Server Reporting Services server), and a clustered SQL Server back end.  QA is virtualized, while I believe PROD is physical servers.

This is release 2 of this particular workflow solution, and yesterday was about the fourth time it was deployed to QA, and about the 9th time it had been deployed to INT.  In other words, we had been finding and fixing bugs for the last four weeks, as well as adding features.  The business analyst I work with is understandably more than ready to get this version into production and done with.

In hindsight, it's important to note that an intermediate build of V2 of this solution - call it v2.2, with the current build being 2.9 - had a bug in it where it could end up in a loop that created many, many tasks in the task list (THOUSANDS apparently).  This was fixed in build 2.7.  Build 2.2 made it as far as QA, but did not go to PRODUCTION.  The trigger that would send the v2.2 workflow into the task creation loop was reassigning a task.  Our testing process leans toward the informal, and because it wasn't the focus of our change in v2.2, I think we had only tested the reassign functionality in QA on a whim.  When it broke, I went back into DEV, and recreated the problem.  The net effect was that both Dev and QA had thousands of tasks, whereas Integration did not.  V1 of this workflow has been running in Production for over a month and has only 300-400 tasks in Production.

Why is the number of tasks important? Well, I'm getting ahead of my self, but you will see.  Let's go back to the front end and the primary symptom.

When you delete a particular workflow association in my Dev environment, the CPU hits the roof, stays there, and interactive performance takes a nosedive.

First of all, what is "removing a workflow association"?

Remove Workflows Screenshot

It's selecting Remove, then clicking OK, on the Remove Workflows page (get there from List Settings, then Workflow Settings).

So what happens when you click OK? Well, on the MOSS Web Front End side, you could crack open .NET Reflector and take a look. However, the symptom her is that the SQL Server process goes nuts, NOT W3P.  Therefore, let's not look at the .NET code, but rather we crack open SQL Profiler and have it trace the SQL activity between the Web Front End and the database.  We start the trace right before we click the OK button.  We stop the trace once we see that the process has definitely gone nuts.  Confirmation that SQL has definitely gone nuts is as simple as running Task Manager and watching the CPU Usage graph go solid bright green and the CPU History Graph(s) jump up and stay high.  Soon as you see that, it's time to restart SQL Server or kill the offending SQL Server Process.  During my investigations, I had the Services MMC open so that I could quickly restart SQLServer in one click.  In my Dev environment, if you let it run to long, the machine becomes unusable (e.g. you click and it may respond a minute later if it can spare the CPU cycles to deal with users and UI).

We also had SQL Server Management Studio up, and had the Activity Monitor going.  This time we got lucky and captured a particularly relevant SQL Statement as the statement that was causing SQL anguish:

image

Now we cross referenced this back to our SQL Trace in SQL Profiler, and did a search for DropWorkflowAssociation, and sure enough we found the following statement:

exec proc_DropWorkflowAssociation '7082EA6F-E14E-4B3E-9581-C10AB17F0A12','9FE5442A-C3DA-443E-B755-48A50D9BD9A4'

Oh, great...GUIDs.  <Sarcasm>They make everything so much more readable!</Sarcasm>.  However, a web search of proc_DropWorkflowAssociation reveals that the stored procedures in MOSS are actually documented in a PDF on MSDN!  http://msdn.microsoft.com/en-us/library/cc313104.aspx

In fact, if you look at section 3.1.4.29 on page 70, it provides the signature for the stored procedure call, and explains the parameters.

PROCEDURE dbo.proc_DropWorkflowAssociation ( @SiteId uniqueidentifier, @Id uniqueidentifier, @DropAll int = 0 );

So the first parameter is the site collection GUID, the second parameter is the workflow association GUID, and the third parameter is not specified so it defaults to 0.

So, is '7082EA6F-E14E-4B3E-9581-C10AB17F0A12' the GUID of the site collection that contains the site that has the list with the bad workflow association?  It sure is, and a line of PowerShell proves it:

Get-SPSite http://server/managedpath/pathToSiteCollectionRootSite | foreach {$_.Id.ToString(); $_.Dispose()}

That returns the GUID that is used in parameter 1.  See www.codeplex.com/PSBB and my other posts on PowerShell for more information about using Get-SPSite.

The second parameter is indeed the GUID for the workflow association.  We can verify that later on.

So now that we know we have found the relevant statement, why is it driving the processor nuts?  What does DropWrokflowAssociation actually do? Well, I don't want to post the entire stored proc here (might be a violation of MSFT's Intellectual Property), but if you have access to the SQL Server for a MOSS installation, you can take a look at it yourself - just look at the stored procedures for the relevant Content Database.  Nonetheless, I think I can discuss the procedure at the pseudocode level without getting in trouble.  What it does

  1. Get the GUID of the SPWeb and the GUID of the SPList from the WorkflowAssociation Table where they are for the SPSite and WorkflowAssociation that we passed in.
  2. If we find it, and since DropAll is zero, we call the following procedure:

    proc_AutoDropWorkflows @SiteId, NULL, @ListId, NULL, @Id, NULL, 1 , 200

  3. If #2 deleted all workflows, then delete the workflow association from the workflow association table.

So let's take a look at step #2 because it calls another stored procedure.  What does AutoDropWorkflows do?  Back to the PDF documentation, this time section 3.1.4.9 on page 50.

PROCEDURE dbo.proc_AutoDropWorkflows ( @SiteId uniqueidentifier, @WebId uniqueidentifier, @ListId uniqueidentifier, @ListItemId int, @TemplateId uniqueidentifier, @AutoCleanupDate datetime, @ForceDelete int, @TopBeforeQuick int = 2147483647 );

Ok, so we are passing a null for the Web Id, a Null for the List Item Id, a null for the Auto Cleanup Date, a 1 for ForceDelete, and Top Before Quick is 200.  Reading through the PDF document [apparently referred to as MS-WSSPROG],  if WebID is null, the action is for all sites (SPWebs) in the Site Collection.  Ok, in our scenario, that seems unnecessary since we are deleting a workflow association in a specific site (SPWeb).  If you run this procedure and specify the GUID of the SPWeb that has the list with the workflow association, does that make this work?   Well, there's a problem with trying to answer that question - you are modifying the Content Database directly via T-SQL, NOT through through the SharePoint API.  In other words, DON'T DO THIS.  If you execute any T-SQL commands other than select statements, Microsoft Support will no longer help you should you call them.  What? You say you want to do it anyway? It's a development database after all.  OK, then you MUST do a Full Backup of the Content Database before you run any T-SQL command that will edit anything.  Now that you have done a Full Backup, are you ready to play with the T-SQL command? NO. NO. NO.  You don't know you have a valid backup until you successfully restore from it.  So edit a list item somewhere, then restore the backup and make sure the site comes up with the list item in its previous state.

Now that you have backed up the content database, AND PROVEN VIA A RESTORE THAT THE BACKUP WORKS, you can now muck around with T-SQL in the Content Database because you can always get back to a supported state.  So let's try this: If we run AutoDropWorkflows @SiteId, @WebId instead of NULL, @ListId, NULL for the List Item ID, @Id for the workflow association, NULL for the AutoCleanupDate, 1 for ForceDelete , 200 for TopBeforeQuick, does it work better? No, the processor still goes nuts.  So let's take a look at the rest of what it does.

ListItemID - is being passed as NULL, so all list items are being worked with.  That seems ok, though what list items are these?  The items in the list with the workflow association. 

TemplateID is the workflow association. 

AutoCleanupDate is irrelevant since ForceDelete is 1 - again, per the MS-WSSPROG doc.

ForceDelete is 1, and when this is one, it deletes the first 200 workflows. The 200 is specified in the TopBeforeQuick parameter.

Note that the return value of this procedure is 1 only if it deleted all the workflows without hitting the TopBeforeQuick limit.  Only if the return value is 1 does the calling procedure, Drop All Workflows, actually do step 3 above (deleting the workflow association from the association table) However, since when I ran this, it never returned, I don't know if it will return  1 or 0.

Now thinking about my setup, I only have one or two workflow associations set up on this list, so if I am interpreting that right, since 2 < 200, I will never hit that limit.  Let's take a look into the body of Auto Drop Workflows to see why it would peg the CPU and apparently never return.

As I look at the body of Auto Drop Workflows, the fourth line after the AS statement defines a cursor.  Ah-ha, a looping construct!  A loop gone bad is consistent with the CPU going nuts. Cursors are what you use in T-SQL when you have to walk through a list item by item instead of being able to deal with a set of records as a whole.    What does the cursor do?  Well, for each workflow instance, it calls

proc_DropWorkflow @WorkflowInstanceId, @SiteId, @WebId, @ListId

Drop Workflow does the following:

  1. Looks up the TaskListID (among other things) in the first select statement
  2. Conditionally decrements an instance counter in the WorkflowAssociation table
  3. Conditionally calls proc_RemoveFailover
  4. Deletes from the Workflow table where the site and Workflow instance ID match.
  5. Calls proc_DropWorkflowTasks @WorkflowInstanceId, @SiteId, @WebID, @TaskListId
  6. Conditionally calls proc_DeleteContextCollectionEventReceivers and deletes from ScheduledWorkItems

Step 5 is the important one.  Looking at it, it declares another cursor based on data in the AllUserData table, and walks through that data calling proc_DropListRecord on each row.

Let's take a closer look at that.  Note that the documentation for this stored proc is in MS-WSSDLIM.pdf, not MS-WSSPROG.pdf.  You should download the whole package of PDFs and set up Desktop Search or MOSS Search to index them.  This way it is easy to find which document contains a given stored procedure.

EXEC proc_DropListRecord @SiteID, @WebID, @TaskListId, 0, @ItemId, 1, 0, NULL, NULL, 0, 0, NULL, 3

Look at the definition of this stored procedure.  This isn't  just a simple delete from a table either.  This stored proc has to take into account use of the recycle bin, the need to audit, and perhaps other list settings.  Also, there is no one to one correspondence between that Tasks list and a "Tasks" table in the database. 

More specifically, Drop List Record does the following:

  1. Looks up a bunch of values
  2. If it finds them, then it calls proc_VerifyUpdateConditions
  3. It calls proc_DeleteContextCollectionEventRecievers
  4. If Auditing is needed, it calls proc_AddAuditEntryFromSql
  5. it calls proc_DeleteUrlCore
  6. If it's a meeting list it ... well, never mind, it's not a meeting list
  7. It calls proc_UpdateDiskUsed

That's a lot of stuff to do for deleting a list item!  I suspect that some of those may call other stored procedures, but I have to end this somewhere, so I think this is deep enough for now.  We are now 4 layers deep, inside two cursors.  No wonder this takes a while and makes a SQL Server busy.

The net effect of the call we started with, Drop Workflow Association, is that it is trying to individually drop each task associated with each list item that used the workflow used by the workflow association in question.  When you have several thousand tasks, and you have to do stuff to each one of them, it's not quick to get rid of them.  From this I can conclude that the process to drop a workflow, wasn't really stuck spinning doing nothing.  It was working its way through a very long list, but in my two environments, doing it so slowly that it would have taken hours (at least in my Dev VM) to complete.  No wonder it looked like the SQL server was going nuts.

Overall, my resulting request to the SharePoint Product Team is that if there is more than some number of tasks in the task list, then deletion of a workflow association should be treated as a long running operation, at least from a UI perspective. It should present a UI similar to the long running operation screen you get when you click on the "hurricane button" to update a Business Data column in a list.  Unless of course they have a better way.

--Michael

A Sign the Farm is Quiescent

In Central Admin, in Operations, you have an option to quiesce the farm.  This stops all new connections, and is a good idea before, for example, restoring a content database.  it's the next to last operation under Global Configuration.

image

When the farm is quiescent, you will get certain errors, such as the following.

image

"The server is currently being taken down for maintenance" is InfoPath Forms Services' way of saying the farm is being quiesced, and we can't do any new connections - although the fact that the web pages leading up to this screen didn't object seems  a little inconsistent.

To fix this, go to the Quiesce Farm option in Central admin and click on Reset Farm.

image

 

--Michael

Search Keyword to search for just sites: ContentClass:STS_Web

I know I'm overdue to respond to some blog comments you all have been kind enough to leave, but in the mean time I want to share this item I came across.

I had a requirement (one of several) from a business user that had a site hierarchy with a site (Projects) that had many subsites (Project 1, Project 2, Project 3...) and wanted a searchable list of these sites.  In the process of figuring out the best way to meet this requirements, I discovered this:

You know how you can specify properties in a keyword search?  Property:PropertyValue.  Well, there is a property called ContentClass that you can use to specify sites only. The property value is STS_Web.

Without the ContentClass filter:

image

With it:

image

 

--

Symptoms of your DB Server going down

Tonight I was supposed to be involved in the deployment of a workflow Solution I had built, but when the deployment window opened, I went to my client's SharePoint intranet site and got three terrifying error messages:

http://server.dom.tld/ returned a 404.

http://server.dom.tld/managedPath/siteCollectionRootSite returned a 404.

http://server.dom.tld/managedPath/siteCollectionRootSite/default.aspx returned a "An unexpected error has occurred. "

I don't have access to Central Admin on this farm, but the Farm Admin who was going to deploy my Solution (WSP) quickly identified the problem - the database server that holds all SharePoint's databases was down.

In as much as it's not a problem with my WSP, I am relieved.  In as much as it keeps me from deploying my solution and getting to my weekend, I am frustrated.  Such is life.

 

-- Michael

When a Solution Won't Deploy

We loaded a WSP into the solution store with STSADM, and then used STSADM to deploy the solution and execute admin service jobs.  STSADM reported a success, but when we looked in Central Admin, we saw that the solution deployment was scheduled but never starting. When we canceled the deployment job, the status page for the solution package said that it had deployed to WFE (Web Front End) Server 2 (which hosts Central Admin too),  but not server 1 which is just a WFE for the user-facing web apps.

The net of it is that the OWS Timer on Server 1 was not running (even though it said it started, but when we right-clicked to restart, it gave us a start option instead of stop or restart). It was not running due to a bad config file that we had made for it because we needed to provide some application settings and some WCF settings for a workflow solution we built.  Both OWSTimer as well as the web.config for the user-facing MOSS site needed the config settings for our workflow because our workflow has a delay task and the workflow is owned by the w3wp process (assuming it was started via MOSS ) until it hits a delay task. Once a workflow hits a delay task, it resumes under the context of OWSTimer.

The error message that we get in the ULS log is:

The timer service could not initialize its configuration, please check the configuration database.  Will retry later.

It's categorized as Unexpected and is produced by OWSTIMER.EXE.

Removing the config file let the timer service start and then deployments started working again.  We just need to figure out what was wrong with our config file.

--Michael

Great Deployment and Planning Pocket Guide

I've been meaning to blog about this for several weeks now.  There's this great little pocket guide called the "SharePoint Deployment Guide and Checklists", and it was written by some people from Microsoft and some MVPs and the like, including Joel Oleson and  Rob Bogue, among others.

I highly recommend it.  If you have a Microsoft Account Executive or other Microsoft contact, you can ask them for one.  It's Microsoft part number 098-108909, and looks like this:

Image040

If that doesn't work, and you live and work in the US, an account executive at the consulting firm I work for probably can get you one.  Leave a comment in this post with your contact information if you want that.  In the event that demand for these gets overwhelming, I may have to discontinue this offer.  Since this blog is currently read by about 40 people (according to Feedburner), that shouldn't be a problem, but you never know.

Links to Information about Records Management with MOSS 2007

Here are some handy links to Records Management with Microsoft solutions, primarily MOSS 2007:

Here is a screenshot of the Record Center site when it has just been created.

1 - 10 Next