[Midas] Problem with Midas + Batchmake + Condor

Sorina Camarasu Pop sorina.pop at creatis.insa-lyon.fr
Mon Mar 3 12:44:59 EST 2014



On 03/03/2014 18:06, Michael Grauer wrote:
> Where did you see the message: "DC_AUTHENTICATE: authentication of 
> <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped user name, 
> which is required for this command (1112 QMGMT_WRITE_CMD), so aborting."

In the Condor log: /home/condor/localcondor/log/SchedLog

> Was there any other output included there?

I copied parts of the log file into the attached file, which contains 
the output printed both when using BatchMake and when running Condor 
commands directly.

> Do you have a "condor" user on your VM?

Yes.

>  When you successfully run jobs by doing "condor_submit_dag" from the 
> command line as the apache user,

apache   22773 29731  0 18:18 ?        00:00:00 
condor_scheduniv_exec.26.0 -f -l . -Lockfile challenge.dagjob.lock 
-AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion 
$CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $ -Force -Dagman 
/bin/condor_dagman


> when you watch your job run with ps or top, which user runs the actual 
> execution process (whatever job batchmake will run for you) ?

When launching it with batchmake (through the web interface), I do not 
manage to get the corresponding Condor process... I only get an httpd 
process run by apache.
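In case the DAGMan process is simply short-lived, polling for a few seconds while the web interface triggers the submission might catch it. A minimal sketch using only standard tools (the process-name pattern is just an assumption based on the command line shown above):

```shell
# Poll for condor-related processes for a few seconds while the web
# interface triggers the submission; short-lived children are easy to
# miss with a single ps. The [c] keeps grep from matching itself.
for i in 1 2 3 4 5; do
    ps -eo user,pid,ppid,args | grep '[c]ondor' || true
    sleep 1
done
```

If a process does appear, the `user` column would show which account Condor actually spawns it under.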

>
> Can you include your "challenge.bms" script in an email?

Of course, here it is attached.

>
> Can you show me the output of "ls" from a directory where the submit 
> failed and then again from one where the submit succeeded, at the end 
> of the job processing run?

Failed :
ls -la 52/
total 56
drwxrwxr-x  3 apache apache 4096  3 mars  18:25 .
drwxr-xr-x 35 apache apache 4096  3 mars  18:25 ..
-rw-r--r--  1 apache apache  140  3 mars  18:25 adminconfig.cfg
-rw-r--r--  1 apache apache  355  3 mars  18:25 challenge.0.dagjob
-rw-r--r--  1 apache apache  332  3 mars  18:25 challenge.1.dagjob
-rw-r--r--  1 apache apache  564  3 mars  18:25 challenge.2.dagjob
-rw-r--r--  1 apache apache  355  3 mars  18:25 challenge.3.dagjob
lrwxrwxrwx  1 apache apache   56  3 mars  18:25 challenge.bms -> 
/var/www/miccai4/modules/challenge/library/challenge.bms
-rw-r--r--  1 apache apache 1473  3 mars  18:25 challenge.config.bms
-rw-r--r--  1 apache apache 1593  3 mars  18:25 challenge.dagjob
-rw-r--r--  1 apache apache 1043  3 mars  18:25 challenge.dagjob.condor.sub
lrwxrwxrwx  1 apache apache   70  3 mars  18:25 
challenge_validator_app.bms -> 
/var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
drwxrwxr-x  4 apache apache 4096  3 mars  18:25 data
lrwxrwxrwx  1 apache apache   50  3 mars  18:25 PHP.bmm -> 
/var/www/miccai4/modules/challenge/library/PHP.bmm
-rw-r--r--  1 apache apache  138  3 mars  18:25 userconfig.cfg
lrwxrwxrwx  1 apache apache   67  3 mars  18:25 ValidateImageAveDist.bmm 
-> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm


OK (created by batchmake and relaunched by hand):
-bash-4.2$ ls -la 48
total 104
drwxrwxr-x  3 apache apache  4096  3 mars  18:18 .
drwxr-xr-x 35 apache apache  4096  3 mars  18:25 ..
-rw-r--r--  1 apache apache   140  3 mars  18:09 adminconfig.cfg
-rw-r--r--  1 apache apache     0  3 mars  18:13 bmGrid.0.error.txt
-rw-r--r--  1 apache apache  1968  3 mars  18:18 bmGrid.0.log.txt
-rw-r--r--  1 apache apache   148  3 mars  18:18 bmGrid.0.out.txt
-rw-r--r--  1 apache apache   355  3 mars  18:09 challenge.0.dagjob
-rw-r--r--  1 apache apache   332  3 mars  18:09 challenge.1.dagjob
-rw-r--r--  1 apache apache   564  3 mars  18:09 challenge.2.dagjob
-rw-r--r--  1 apache apache   355  3 mars  18:09 challenge.3.dagjob
lrwxrwxrwx  1 apache apache    56  3 mars  18:09 challenge.bms -> 
/var/www/miccai4/modules/challenge/library/challenge.bms
-rw-r--r--  1 apache apache  1473  3 mars  18:09 challenge.config.bms
-rw-r--r--  1 apache apache  1593  3 mars  18:09 challenge.dagjob
-rw-r--r--  1 apache apache  1042  3 mars  18:18 challenge.dagjob.condor.sub
-rw-r--r--  1 apache apache   610  3 mars  18:18 challenge.dagjob.dagman.log
-rw-r--r--  1 apache apache 16074  3 mars  18:18 challenge.dagjob.dagman.out
-rw-r--r--  1 apache apache   256  3 mars  18:18 challenge.dagjob.dot
-rw-r--r--  1 apache apache     0  3 mars  18:18 challenge.dagjob.lib.err
-rw-r--r--  1 apache apache    29  3 mars  18:18 challenge.dagjob.lib.out
-rw-r--r--  1 apache apache   970  3 mars  18:18 challenge.dagjob.nodes.log
-rw-r--r--  1 apache apache   243  3 mars  18:18 challenge.dagjob.rescue001
-rw-r--r--  1 apache apache   243  3 mars  18:13 
challenge.dagjob.rescue001.old
lrwxrwxrwx  1 apache apache    70  3 mars  18:09 
challenge_validator_app.bms -> 
/var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
drwxrwxr-x  4 apache apache  4096  3 mars  18:09 data
lrwxrwxrwx  1 apache apache    50  3 mars  18:09 PHP.bmm -> 
/var/www/miccai4/modules/challenge/library/PHP.bmm
-rw-r--r--  1 apache apache   138  3 mars  18:09 userconfig.cfg
lrwxrwxrwx  1 apache apache    67  3 mars  18:09 
ValidateImageAveDist.bmm -> 
/var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm

> I'm not sure what is going on, just trying to get more context...
>
> I recall I ran into a problem where one machine was the submitter, and 
> there was a midas user there, with uid 100, and a midas user on 
> another machine (the execution node) with a uid 200, and I got what 
> sounded like a similar message--I had to make sure their uids were the 
> same across machines to deal with permissions across an NFS mount on 
> both machines. This sounds nothing like your problem, but I wanted to 
> include it in case it gives you any ideas.

Thank you for the hint.
My problem seems similar, in the sense that it looks like a user 
problem. However, I cannot find the difference between the two 
potential users: apache and who else?

I noticed the following line in the Condor log (the one attached):
03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
uid 48 does correspond to apache. What surprises me is that the log 
prints out "Switching to user uid: 48". Does that mean that until that 
moment it was executing as some other user?
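Since the authentication happens before that switch, it might help to compare the identity and environment that Condor sees in the two cases. A minimal sketch (the CONDOR_CONFIG and X509_USER_PROXY variables are just the usual suspects, not something confirmed from this setup): run it once from the apache console and once through the web path (e.g. via PHP's exec()), then diff the two outputs.

```shell
# Print the effective identity and the Condor/Globus-related environment
# of the calling process; authentication differences between a console
# shell and the web server often show up here.
echo "user=$(whoami) uid=$(id -u) gid=$(id -g)"
echo "HOME=${HOME:-unset}"
echo "CONDOR_CONFIG=${CONDOR_CONFIG:-unset}"
echo "X509_USER_PROXY=${X509_USER_PROXY:-unset}"
```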

>
> Can you explain more about the library issues you ran into earlier 
> that prevented you from running jobs?

I don't remember exactly, but I spent quite some time on that one too.
In that case, jobs were submitted but stayed idle: if I remember 
correctly, some library problem prevented one of the Condor daemons 
from launching/executing correctly. I really don't think this could be 
connected...

Thank you,
Sorina

>
>
>
> Thanks,
> Mike
>
>
>
>
>
> On Mon, Mar 3, 2014 at 11:50 AM, Sorina Camarasu Pop 
> <sorina.pop at creatis.insa-lyon.fr> wrote:
>
>     Hi Mike,
>
>     Thank you for your prompt reply.
>
>     On 03/03/2014 17:27, Michael Grauer wrote:
>>     Hi Sorina,
>>
>>     These are tough to track down.
>
>     I know, I've spent my afternoon on it...
>
>
>>     Can you tell me more about your environment?  Specifically, the 3
>>     machines (possibly all the same machine) that are your condor
>>     submit, condor manager, and condor execute nodes?
>
>     I use the same machine (virtual machine configured as a dual core)
>     for my condor submit, condor manager, and condor execute nodes.
>
>
>>     What operating system is your web server, and what version of
>>     Condor are you using?
>
>     Fedora 18.
>     For Condor, I had compiled the latest version available, but some
>     library problems prevented me from launching any job. I finally
>     got it to work with the version available via yum install:
>     condor_version
>     $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $
>     $CondorPlatform: X86_64-Fedora_18 $
>
>
>
>>      Is your condor submit node the same as your web server (most
>>     likely yes)?
>
>     yes.
>
>
>>     Are you running your web server as the apache user (most likely
>>     yes),
>
>     Yes, I even printed out "whoami" to check that it really runs as
>     apache.
>
>
>>     and is it your web server that is calling the php code that
>>     results in condor_dag_submit (most likely yes, again) ?
>
>     Yes.
>     I use the "standard" batchmake config, i.e. the condorSubmitDag
>     function from KWBatchmakeComponent.php
>
>
>>     Can you show the permissions and ownership of the temporary work
>>     directory where the condor_dag_submit command is executed?
>
>     ls -la
>     ...
>     drwxrwxr-x  3 apache apache 4096  3 mars  16:53 45
>     drwxrwxr-x  3 apache apache 4096  3 mars  17:41 46
>
>     -bash-4.2$ cd 46
>     -bash-4.2$ ls -la
>     total 92
>     drwxrwxr-x  3 apache apache 4096  3 mars  17:41 .
>     drwxr-xr-x 29 apache apache 4096  3 mars  17:40 ..
>     -rw-r--r--  1 apache apache  140  3 mars  17:40 adminconfig.cfg
>     -rw-r--r--  1 apache apache    0  3 mars  17:41 bmGrid.0.error.txt
>     lrwxrwxrwx  1 apache apache   56  3 mars  17:40 challenge.bms ->
>     /var/www/miccai4/modules/challenge/library/challenge.bms
>     ...
>
>
>
>>     When you tested as the apache user, did you do this test from the
>>     same temporary work directory that Midas/apache would have tried
>>     this from?
>
>     Yes, from folder /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/46
>     (drwxrwxr-x , owned by apache)
>
>
>>     Is there any more information in the logs or error logs generated
>>     by Condor in the temp work directory that you could share?
>
>     tail -f challenge.dagjob.condor.sub
>     # Note: default on_exit_remove expression:
>     # ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0
>     && ExitCode <= 2))
>     # attempts to ensure that DAGMan is automatically
>     # requeued by the schedd if it exits abnormally or
>     # is killed (e.g., during a reboot).
>     on_exit_remove  = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED
>     && ExitCode >=0 && ExitCode <= 2))
>     copy_to_spool   = False
>     arguments       = "-f -l . -Lockfile challenge.dagjob.lock
>     -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion
>     $CondorVersion:' '7.9.1' 'Aug' '24' '2012' 'PRE-RELEASE-UWCS' '$
>     -Dagman /usr/bin/condor_dagman"
>     environment     =
>     _CONDOR_DAGMAN_LOG=challenge.dagjob.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
>     queue
>
>     tail -f challenge.0.dagjob
>     # More information at: http://www.batchmake.org
>     Universe       = vanilla
>     Output         = bmGrid.0.out.txt
>     Error          = bmGrid.0.error.txt
>     Log            = bmGrid.0.log.txt
>     Notification   = NEVER
>     Executable    = /usr/bin/php
>     Arguments     = "'--version'"
>     Queue 1
>
>     I hope this can help with debugging the problem...
>
>     Thank you,
>     Sorina
>
>
>>     Thanks,
>>     Mike
>>
>>
>>     On Mon, Mar 3, 2014 at 11:16 AM, Sorina Camarasu Pop
>>     <sorina.pop at creatis.insa-lyon.fr> wrote:
>>
>>         Dear Midas users and developers,
>>
>>         I am trying to configure Midas with the Challenge and
>>         BatchMake modules, but I encounter problems when executing
>>         the condor_submit_dag command.
>>
>>         The error printed by Condor when executing the
>>         condor_submit_dag command using the Batchmake module looks
>>         like this : "DC_AUTHENTICATE: authentication of
>>         <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped user
>>         name, which is required for this command (1112
>>         QMGMT_WRITE_CMD), so aborting."
>>
>>         Nevertheless, if I execute exactly the same command line as
>>         apache in a console, everything works fine. I do not
>>         understand where the difference comes from.
>>
>>         Do you know if there's any special configuration for Condor
>>         to work with the Batchmake module ?
>>
>>         Thank you for your help,
>>         Sorina
>>
>>         -- 
>>         Sorina Pop, PhD
>>         CNRS Research Engineer
>>         CREATIS
>>         Tel : +33 (0)4 72 43 72 99
>>
>>         _______________________________________________
>>         Midas mailing list
>>         Midas at public.kitware.com
>>         http://public.kitware.com/cgi-bin/mailman/listinfo/midas
>>
>>
>>
>>
>>
>>
>
>
>     -- 
>     Sorina Pop, PhD
>     CNRS Research Engineer
>     CREATIS
>     Tel : +33 (0)4 72 43 72 99
>
>
>
>
> -- 
> Thanks,
> Michael Grauer
> R & D Engineer
> Kitware, Inc.
> 919 969 6990 x322
>
>


-- 
Sorina Pop, PhD
CNRS Research Engineer
CREATIS
Tel : +33 (0)4 72 43 72 99

-------------- next part --------------
Include(challenge.config.bms)

SetApp(php @PHP)
SetAppOption(php.script --version)
CondorPostScript(php ${cfg_exe} ${cfg_condordagpostscript} START ${cfg_resultsrunID} ${cfg_output_directory} ${cfg_taskID} challenge)
Run(output ${php})


# Loop through the sequence of job indices
ForEach(jobInd ${cfg_jobInds})

  # get each of the param values for this particular job
  GetParam(truthImage ${cfg_truthItems} ${jobInd})
  GetParam(resultImage ${cfg_resultItems} ${jobInd})

  # MiccaiErrors
  GetParam(resultRunItemIds ${ValidateImageAveDist_resultRunItemIds} ${jobInd})
  Set(outputFile aveDist.${jobInd}.txt)
  SetApp(ValidatorApp @ValidateImageAveDist)
  Include(challenge_validator_app.bms)

EndForEach()


SetApp(php @PHP)
SetAppOption(php.script --version)
CondorPostScript(php ${cfg_exe} ${cfg_condordagpostscript} END ${cfg_resultsrunID})
Run(output ${php})


Exit()
-------------- next part --------------

With batchmake

3/03/14 18:09:56 authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Valid credentials could not be found in any of the possible locations specified by the credential search order.
Valid credentials could not be found in any of the possible locations specified by the credential search order.
Attempt 1
globus_credential: Error reading host credential
globus_sysconfig: Could not find a valid certificate file: The host cert could not be found in:
1) env. var. X509_USER_CERT
2) /etc/grid-security/hostcert.pem
3) $GLOBUS_LOCATION/etc/hostcert.pem
4) $HOME/.globus/hostcert.pem

The host key could not be found in:
1) env. var. X509_USER_KEY
2) /etc/grid-security/hostkey.pem
3) $GLOBUS_LOCATION/etc/hostkey.pem
4) $HOME/.globus/hostkey.pem


Attempt 2
globus_credential: Error reading proxy credential
globus_sysconfig: Could not find a valid proxy certificate file location
globus_sysconfig: Error with key filename
globus_sysconfig: File does not exist: /tmp/x509up_u0 is not a valid file
Attempt 3
globus_credential: Error reading user credential
globus_sysconfig: Error with certificate filename: The user cert could not be found in:
1) env. var. X509_USER_CERT
2) $HOME/.globus/usercert.pem
3) $HOME/.globus/usercred.p12



03/03/14 18:09:56 DC_AUTHENTICATE: authentication of <192.168.122.187:57170> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
03/03/14 18:09:56 DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI::Failed to authenticate.  Globus is reporting error (851968:550).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXVpwC9j)



Console command line :
03/03/14 18:39:33 OwnerCheck retval 1 (success),no ad
03/03/14 18:39:33 schedd: NewCluster rval 30 errno 0
03/03/14 18:39:33 OwnerCheck retval 1 (success),no ad
03/03/14 18:39:33 schedd: NewProc rval 0 errno 0
03/03/14 18:39:33 Prioritized runnable job list will be rebuilt, because ClassAd attribute JobPrio=0 changed
03/03/14 18:39:34 New job: 30.0
03/03/14 18:39:34 Writing record to user logfile=/var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/47/challenge.dagjob.dagman.log owner=apache
03/03/14 18:39:34 directory_util::rec_touch_file: Creating directory /tmp
03/03/14 18:39:34 directory_util::rec_touch_file: Creating directory /tmp/condorLocks
03/03/14 18:39:34 directory_util::rec_touch_file: Creating directory /tmp/condorLocks/93
03/03/14 18:39:34 directory_util::rec_touch_file: Creating directory /tmp/condorLocks/93/02
03/03/14 18:39:34 FileLock object is updating timestamp on: /tmp/condorLocks/93/02/537977980945501.lockc
03/03/14 18:39:34 FileLock::obtain(1) - @1393868374.012168 lock on /tmp/condorLocks/93/02/537977980945501.lockc now WRITE
03/03/14 18:39:34 FileLock::obtain(2) - @1393868374.036904 lock on /tmp/condorLocks/93/02/537977980945501.lockc now UNLOCKED
03/03/14 18:39:34 FileLock::obtain(1) - @1393868374.037020 lock on /tmp/condorLocks/93/02/537977980945501.lockc now WRITE
03/03/14 18:39:34 directory_util::rec_clean_up: file /tmp/condorLocks/93/02/537977980945501.lockc has been deleted.
03/03/14 18:39:34 Lock file /tmp/condorLocks/93/02/537977980945501.lockc has been deleted.
03/03/14 18:39:34 FileLock::obtain(2) - @1393868374.037118 lock on /tmp/condorLocks/93/02/537977980945501.lockc now UNLOCKED
03/03/14 18:39:34 New job: 30.0, Duplicate Keys: 3, Total Keys: 5
03/03/14 18:39:34 QMGR Connection closed
03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
03/03/14 18:39:34 Checking file /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/47/challenge.dagjob.lib.err for write permission.
03/03/14 18:39:34 Switching back to old priv state.
03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
03/03/14 18:39:34 Checking file /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/47/challenge.dagjob.lib.out for write permission.
03/03/14 18:39:34 Switching back to old priv state.
03/03/14 18:39:34 -------- Begin starting jobs --------
03/03/14 18:39:34 -------- Done starting jobs --------
03/03/14 18:39:34 JobsRunning = 0
03/03/14 18:39:34 JobsIdle = 0
03/03/14 18:39:34 JobsHeld = 0
03/03/14 18:39:34 JobsRemoved = 0
03/03/14 18:39:34 LocalUniverseJobsRunning = 0
03/03/14 18:39:34 LocalUniverseJobsIdle = 0
03/03/14 18:39:34 SchedUniverseJobsRunning = 0
03/03/14 18:39:34 SchedUniverseJobsIdle = 1
03/03/14 18:39:34 N_Owners = 1
03/03/14 18:39:34 MaxJobsRunning = 10000
03/03/14 18:39:34 TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
03/03/14 18:39:34 Trying to update collector <127.0.0.1:9618>
03/03/14 18:39:34 Attempting to send update via UDP to collector localhost <127.0.0.1:9618>
03/03/14 18:39:34 Sent HEART BEAT ad to 1 collectors. Number of submittors=1
03/03/14 18:39:34 Changed attribute: RunningJobs = 0
03/03/14 18:39:34 Changed attribute: IdleJobs = 0
03/03/14 18:39:34 Changed attribute: HeldJobs = 0
03/03/14 18:39:34 Changed attribute: FlockedJobs = 0
03/03/14 18:39:34 Changed attribute: Name = apache@*.creatis.insa-lyon.fr
03/03/14 18:39:34 Sent ad to central manager for apache@*.creatis.insa-lyon.fr
03/03/14 18:39:34 Trying to update collector <127.0.0.1:9618>
03/03/14 18:39:34 Attempting to send update via UDP to collector localhost <127.0.0.1:9618>
03/03/14 18:39:34 Sent ad to 1 collectors for apache@*.creatis.insa-lyon.fr
03/03/14 18:39:34 ============ Begin clean_shadow_recs =============
03/03/14 18:39:34 ============ End clean_shadow_recs =============
03/03/14 18:39:34 Found idle scheduler universe job 30.0
03/03/14 18:39:34 Job prep for 30.0 will not block, calling aboutToSpawnJobHandler() directly
03/03/14 18:39:34 aboutToSpawnJobHandler() completed for job 30.0
03/03/14 18:39:34 Starting sched universe job 30.0
03/03/14 18:39:34 Create_Process: using fast clone() to create child process.
03/03/14 18:39:34 Successfully created sched universe process
03/03/14 18:39:34 Cleared dirty attributes for job 30.0
03/03/14 18:39:34 Writing record to user logfile=/var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/47/challenge.dagjob.dagman.log owner=apache

