[Dart] Re: Shared source builds

Thu, 31 Jan 2002 11:21:19 -0500

I think the key issue to be solved is synchronization. With shared
source builds, it is important that the actual state of the CVS
repository matches that of the Update.xml, and this is crucial for
continuous builds with automatic emailing.

This is a long email. The summary: use lock files everywhere to aid
synchronization. Having to go delete a lock file from time to time is
okay, because lock files lying around imply some error, and errors
should be investigated, not swept under the rug.

I think some kind of "in progress" lock is essential, even when there
are no shared builds. Suppose I launch the continuous build every 15
minutes, and for some reason, it takes longer than 15 minutes to
update. Then the next invocation could interfere with the current
invocation. The easiest way to prevent this is to create a lockfile
with status information, and for the process to abort *with a warning*
when it unexpectedly encounters a lockfile. If all the progress information is
put to stdout, and only critical errors are put to stderr, then I can
invoke my continuous build as  gmake Continuous > logfile. In the
normal case, the process does not produce output, and crond does
nothing. In the case of an error, crond will mail me the error
message, and I can go find out what went wrong.
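
A minimal sketch of that idea in Python, just to make it concrete. The
names (acquire_lock, run_continuous) and the status line format are
made up for illustration and are not part of Dart. The lock is created
with O_CREAT | O_EXCL so the existence check and the creation are a
single atomic step, progress goes to stdout, and only the warning goes
to stderr.

```python
import os
import sys
import time


def acquire_lock(path, status):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        # Status information for whoever has to investigate a stale lock.
        handle.write("pid=%d started=%s status=%s\n"
                     % (os.getpid(), time.ctime(), status))
    return True


def run_continuous(lock_path, work):
    """Run work() under the lock; abort with a warning if the lock exists."""
    if not acquire_lock(lock_path, "running"):
        # Critical message: stderr, so crond mails it.
        print("warning: lock file %s exists; aborting" % lock_path,
              file=sys.stderr)
        return 1
    # Progress message: stdout, redirected to the logfile.
    print("progress: lock acquired, starting")
    work()
    # Only reached if work() did not raise; a failure leaves the lock behind.
    os.remove(lock_path)
    return 0
```

Note that the lock is deliberately not removed if work() raises, so a
failed run leaves evidence behind for the next invocation to complain
about.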

One way to make sure that lock files are not left lying around is to
have a catch-all exception handler in the main routine that will
delete the lock files. I think this is a bad idea, though, because a
lock file lying around implies something went wrong, and you should
have a look and make sure it was a transient error.

I think each stage should create a lock file in the Temporary
directory. Something like:
  1. Process starts: create lock file "running"
  2.   Update: create lock "update"
               delete lock "update"
  3.   Build:  create lock "build"
               delete lock "build"
      ...
  *. Process ends: delete lock file "running"
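
The sequence above could be sketched roughly as follows. This is an
illustration only: create_lock, run_stage and run_process are
hypothetical names, and the stage actions are passed in as plain
callables. There is no exception handler on purpose, so a failing
stage leaves both its own lock and "running" in place.

```python
import os


def create_lock(directory, name):
    """Create lock file `name`; raises FileExistsError if it is already there."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return path


def run_stage(directory, name, action):
    """Create the stage lock, run the stage, delete the lock on success."""
    path = create_lock(directory, name)
    action()
    os.remove(path)


def run_process(directory, stages):
    """stages is a list of (lock name, callable) pairs, run in order."""
    running = create_lock(directory, "running")
    for name, action in stages:
        run_stage(directory, name, action)
    os.remove(running)
```

It would be called as something like
run_process(tempdir, [("update", do_update), ("build", do_build)]),
where tempdir, do_update and do_build are placeholders.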

Each process and stage would check if the appropriate lock file exists
before launching. The way to handle shared source would be that a
secondary build would check for the "update" lock in the *primary*
build's directory. (I think the process should then sleep and check
again instead of aborting with an error, with some reasonable timeout
to detect failures.)
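
The sleep-and-check loop for the secondary might look like this
sketch. The function name and the default timeout and poll interval
are assumptions picked for illustration, not values from Dart.

```python
import os
import time


def wait_for_lock_release(lock_path, timeout=600.0, poll_interval=5.0):
    """Poll until lock_path disappears.

    Returns True once the lock is gone, or False if it is still there
    after `timeout` seconds (which probably means the primary failed).
    """
    deadline = time.monotonic() + timeout
    while os.path.exists(lock_path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

The secondary would call it with the path of the "update" lock in the
primary's directory and treat a False result as an error worth
reporting on stderr.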

I think having a LastUpdate.xml and the secondary process searching
for the last update are identical, because (as it stands) the primary
process will delete the LastUpdate.xml if no updates have
occurred. With shared source, you can run into synchronization
issues. For example, suppose the primary build updates twice before
the secondary build runs. Then, the secondary build will only get
information about the last update, instead of from both updates.

If you wanted to be really clever, each secondary build could keep
track of the buildstamp from which it last obtained an update, and
then merge the information in all subsequent updates in the primary
builds. I think this creates too much work for the moment. Rather put
the onus on the maintainer of the shared builds to make sure the
continuous processes are launched at appropriate times to minimise the
likelihood of this happening. If the secondary builds do a wait loop
for the updates, as I suggested above, then you could launch all the
secondary builds a few seconds (30 or so) after the primary build, and
the processes will wait for the primary to end and immediately grab
the update file, and not have to wait for the next cron invocation. To
further reduce the likelihood of race conditions, the dart process
could sleep a minute or so after it finishes everything before
removing the "running" lock. This will reduce the probability of
another primary running before the secondaries get a chance to
copy the Update.xml.

[I just thought of the next idea, and I'm too lazy to go back and
incorporate it into what I've written above...]

Actually, I think each secondary should create a "running.[secondary
buildname]" lockfile in the *primary's* build directory, and the
primary will not launch an update while one of these exists. This will
prevent the primary from running an update while a secondary is still
building. If the primary finds such a file, it should probably abort
with an error. This will serve as a warning to the maintainer that the
time between continuous builds may need to be increased.
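
A rough sketch of this last idea, with hypothetical function names.
Each secondary registers itself in the primary's directory, and the
primary refuses to update while any "running.<buildname>" file is
present. The primary's own plain "running" lock does not match the
prefix, so it is ignored.

```python
import os
import sys

PREFIX = "running."


def register_secondary(primary_dir, build_name):
    """Secondary side: create running.<build_name> in the primary's directory."""
    path = os.path.join(primary_dir, PREFIX + build_name)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return path


def active_secondaries(primary_dir):
    """Primary side: names of the secondaries that still hold a lock."""
    return sorted(name[len(PREFIX):]
                  for name in os.listdir(primary_dir)
                  if name.startswith(PREFIX))


def primary_may_update(primary_dir):
    """Return True if no secondary is running; otherwise complain on stderr."""
    busy = active_secondaries(primary_dir)
    if busy:
        print("error: secondary builds still running: %s" % ", ".join(busy),
              file=sys.stderr)
        return False
    return True
```

The secondary removes its file with os.remove() when it finishes
normally, and leaves it in place on failure like every other lock.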