The Problem
I was recently at a client site sorting out a Perforce repository that had some database errors (due to disk problems on the NAS server). A quick look at the server log showed entries like:
Perforce server error:
Date 2005/12/01 10:03:12:
Operation: user-fstat
Operation 'dbscan' failed.
Database scan error on db.have!
dbscan: db.have: Cannot create a file when that file already exists.
Corrupt tree
The Solution
The easiest solution initially seemed to be to restore from a previous checkpoint and the current journal. (I started by copying all the db.* and journal files before starting to make any changes.)
Then things started becoming more complicated. The journal file turned out to be 9 GB in size (and yes, that meant they hadn’t done a checkpoint for a looong time!).
I tried the journal recovery with high hopes (after removing db.* files in that directory), but…
E:\Perforce>p4d -r . -jr g:\backup.ckp.51 journal
Recovering from g:\backup.ckp.51...
Recovering from journal...
Perforce server error:
Journal file 'journal' replay failed at line 42327877!
Bad opcode '' journal record!
Note the size of the journal file (43M lines!):
E:\Perforce>time /t && wc journal && time /t
11:58
43,611,160 398236488 9536756331 journal
12:01
Fortunately I had installed some Unix utilities for Windows (from unxutils.sf.net, and yes, that is unx not unix) including wc and, as it later turned out, sed.
Trying to edit a 9 GB file with Notepad or Write (all that were available on a Windows 2003 Server) was not possible. I couldn’t find any other easily downloadable editor capable of such feats, and I was wondering how I could do anything sensible with this journal file. But then I realised I had sed available to me, and it was then fairly easy to start identifying the problem and set about resolving it.
Identifying and Fixing The Corrupted Lines
To print out the block of lines around the error:
E:\perforce>sed -n -e "42327850,42327900{p;}" journal > extract.txt
This showed some spaces or other strange characters at the start of line 42,327,877 (lines chopped for brevity):
@pv@ 1 @db.have@ @//FSmith/x-platform/packages/doc/html-tool/spur.gif@ ...
@rv@ 3 @db.user@ @bjones@ @bjones@@somecompany.com@ @@ 1121675402 ...
@rv@ 3 @db.domain@ @UK-A7993@ 99 @UK-A7993@ @d:\dev@ @@ @@ @fredb@ 1125593143 ...
(the problem line is the second with @rv@). I was able to “fix” this line with the following (all on one line):
time /t &&
sed -n -e "1,42327876{p;};42327877,42327877{s/^[^@]*//;p;};42327878,${p;}" journal >
new.journal && time /t
The sed command prints every line unchanged except the offending one, on which it first runs a regular expression that removes the unwanted characters at the start of the line (it turns out they weren’t just spaces either). This created new.journal with the offending line fixed (“time /t” just shows the current time; processing the 9 GB file took 5-6 minutes).
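For a machine where sed is not to hand, the same two steps (pull out a range of lines, then rewrite one line while copying the rest) can be sketched in Python. This is only an illustration of the approach described above, not what was run on site. The function names extract_lines and fix_line are made up for this example, and the files are read as raw bytes on the assumption that the stray characters may not be valid text.

```python
import itertools
import re

# Same idea as sed's ^[^@]* : everything before the first '@' on the line.
# CR and LF are excluded so the line ending is left untouched.
LEADING_JUNK = re.compile(rb"^[^@\r\n]*")


def extract_lines(path, first, last):
    """Return lines first..last (1-based, inclusive) as raw bytes.

    The file is streamed, and reading stops once `last` has been reached.
    """
    with open(path, "rb") as src:
        return list(itertools.islice(src, first - 1, last))


def fix_line(src_path, dst_path, bad_line_no):
    """Copy src to dst line by line, stripping leading junk from one line only."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line_no, line in enumerate(src, start=1):
            if line_no == bad_line_no:
                line = LEADING_JUNK.sub(b"", line, count=1)
            dst.write(line)
```

With the numbers from this story, the calls would be extract_lines("journal", 42327850, 42327900) and fix_line("journal", "new.journal", 42327877). Both stream the file, so memory use stays small however large the journal is.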
As it later turned out, this new journal file still had problems: it appeared to be out of order with respect to the latest checkpoint file (the db.counter records for the journal value were not what they should have been). As a result, I started to lose confidence in the reliability of the journal file altogether.
So in the end I took a different tack, using the “undocumented” (see “p4 help undoc”) commands p4d -xv/-xr first to validate the various database tables (db.*) and then to recover them. There only appeared to be an error in the db.have table, which is not that worrying (it is a list of all files synced to client workspaces and thus can be reset by the users in the last resort).
Validating db.review
Validating db.have
Problems Summary: pages which are not connect to tree or freelist
Validating db.label
Validating db.integ
(The -xr option just fixed things).
And The Moral of the Story is…
Well there are potentially lots of morals here, but a selection is:
- Keep your journal file on a different disk (volume) to your database (db.*) files if you can to avoid a single disk problem corrupting both.
- Do regular checkpoints! (Once a week is probably the bare minimum, though once every 24 hours is usually ideal.) There are various mechanisms for dealing with large databases if checkpoint times become a burden (e.g. many tens of minutes).
- When dealing with large files, don’t forget Unix tools such as sed, which are always there, very powerful and also easy to install on Windows.
- Remember to talk to support since they will know about relevant undocumented or otherwise commands (in some circumstances they have “fixed” a checkpoint or journal file using internal tools and resent it to the client). At the very least they will act as a sounding board and confirm that what you are planning to do makes sense – always worth doing given that you are often dealing with the “crown jewels” of a company’s intellectual property and also that commands often take a reasonable amount of time to run – hours can flit by unnoticed (well unless you are holding up a project team when every minute is begrudged).
- Consider disaster recovery up front (all part of business continuity – look at ITIL/BS15000 for some ideas on this). Spend an appropriate amount of money on your server and disks (RAID etc) to try and avoid these errors in the first place. However, Murphy’s law is always lurking and it is often the little things that catch you out (e.g. air conditioning dies and server then dies). Thus you need the backup strategies (checkpointing etc) in place as appropriate.
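As a small illustration of the checkpointing moral, here is a sketch of a watchdog that a scheduled job could run. It is an assumption-laden example rather than a recommended setup: the function names, the journal file name and the 1 GB threshold are all invented for this example, and a size check is no substitute for a regular scheduled checkpoint. The only Perforce-specific part is "p4d -r <root> -jc", which takes a checkpoint and rotates the journal.

```python
import os
import subprocess

# Hypothetical threshold: treat a journal above 1 GB as overdue for a checkpoint.
DEFAULT_MAX_BYTES = 1024 ** 3


def journal_overdue(journal_path, max_bytes=DEFAULT_MAX_BYTES):
    """Return True when the live journal has grown past max_bytes."""
    return os.path.getsize(journal_path) > max_bytes


def checkpoint_command(server_root):
    """Build the p4d invocation that checkpoints and rotates the journal."""
    return ["p4d", "-r", server_root, "-jc"]


def checkpoint_if_overdue(server_root, journal_name="journal",
                          max_bytes=DEFAULT_MAX_BYTES):
    """Take a checkpoint only if the journal is over the threshold.

    Assumes the journal lives in the server root under journal_name;
    adjust the path if it is kept on a separate volume.
    Returns True if a checkpoint was run, False otherwise.
    """
    journal_path = os.path.join(server_root, journal_name)
    if journal_overdue(journal_path, max_bytes):
        subprocess.run(checkpoint_command(server_root), check=True)
        return True
    return False
```

Calling checkpoint_if_overdue(r"E:\Perforce") from a nightly scheduled task would at least have flagged a journal long before it reached 9 GB.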