[Midas] Problem with Midas + Batchmake + Condor

Sorina Camarasu Pop sorina.pop at creatis.insa-lyon.fr
Tue Mar 4 09:46:28 EST 2014


Hello again,

I've discovered an interesting Condor config option: 
SHADOW_ALLOW_UNSAFE_REMOTE_EXEC seems to allow shell calls via the libc 
'system()' function. Have any of you already used it to allow calls 
through the Midas executor->exec?
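
To see how this option is currently set on a given machine, here is a 
minimal sketch. It assumes the Condor command-line tools are on the PATH 
and that condor_config_val exits with a non-zero status when a variable 
is not defined (worth double-checking on 7.9.1). The helper name 
condor_config_value is made up for this example.

```python
import shutil
import subprocess


def condor_config_value(name, tool="condor_config_val"):
    """Return the configured value of `name`, or None if it cannot be read."""
    exe = shutil.which(tool)
    if exe is None:
        # The Condor command-line tools are not on PATH for this user.
        return None
    result = subprocess.run([exe, name], capture_output=True, text=True)
    if result.returncode != 0:
        # Assumption: a non-zero exit means "not defined" or "could not read config".
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(condor_config_value("SHADOW_ALLOW_UNSAFE_REMOTE_EXEC"))
```

Running it once from a console and once from the web server context 
would also show whether both see the same Condor configuration.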

I tried turning it on and using the Condor shadow daemon, but I get an 
error saying "Assertion ERROR on (job_ad_file)" at line 166 in file 
shadow_v61_main.cpp ...

So before going to the trouble of trying to solve this error, I was 
wondering whether you know about this "shadow" config option and can 
confirm that it is necessary.
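
Since the same command works as apache in a console but fails through 
the web server (see the quoted messages below), one way to narrow this 
down is to dump the process identity and a few environment variables in 
both contexts and compare them. This is only a sketch: the function 
names and the list of environment variables are choices made for this 
example, not something taken from Midas or Condor.

```python
import json
import os
import pwd


def name_for_uid(uid):
    """Return the login name for `uid`, or None if there is no passwd entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def process_identity(env_keys=("PATH", "HOME", "USER", "CONDOR_CONFIG")):
    """Collect the real/effective ids, groups, cwd and selected environment."""
    return {
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "gid": os.getgid(),
        "egid": os.getegid(),
        "groups": sorted(os.getgroups()),
        "user": name_for_uid(os.geteuid()),
        "cwd": os.getcwd(),
        # Variables that are unset show up as null in the JSON output.
        "env": {key: os.environ.get(key) for key in env_keys},
    }


if __name__ == "__main__":
    print(json.dumps(process_identity(), indent=2, sort_keys=True))
```

Saving the output from a console session as apache and from a call made 
by the web server, then running diff on the two files, should show 
whether the two really run with the same user, groups and environment.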

Best regards,
Sorina

On 03/03/2014 18:44, Sorina Camarasu Pop wrote:
>
>
> On 03/03/2014 18:06, Michael Grauer wrote:
>> Where did you see the message: "DC_AUTHENTICATE: authentication of 
>> <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped user name, 
>> which is required for this command (1112 QMGMT_WRITE_CMD), so aborting."
>
> In the condor log : /home/condor/localcondor/log/SchedLog
>
>> Was there any other output included there?
>
> I copied parts of the log file in the attached file containing the 
> output printed both when using batchmake and directly condor commands.
>
>> Do you have a "condor" user on your VM?
>
> Yes.
>
>>  When you successfully run jobs by doing "condor_submit_dag" from the 
>> command line as the apache user,
>
> apache   22773 29731  0 18:18 ?        00:00:00 
> condor_scheduniv_exec.26.0 -f -l . -Lockfile challenge.dagjob.lock 
> -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion 
> $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $ -Force -Dagman 
> /bin/condor_dagman
>
>
>> when you watch your job run with ps or top, which user runs the 
>> actual execution process (whatever job batchmake will run for you) ?
>
> When launching it with batchmake (through the web interface) I cannot 
> find the corresponding condor process... I only see an 
> httpd process run by apache....
>
>>
>> Can you include your "challenge.bms" script in an email?
>
> Of course, here it is attached.
>
>>
>> Can you show me the output of "ls" from a directory where the submit 
>> failed and then again from one where the submit succeeded, at the end 
>> of the job processing run?
>
> Failed :
> ls -la 52/
> total 56
> drwxrwxr-x  3 apache apache 4096  3 mars  18:25 .
> drwxr-xr-x 35 apache apache 4096  3 mars  18:25 ..
> -rw-r--r--  1 apache apache  140  3 mars  18:25 adminconfig.cfg
> -rw-r--r--  1 apache apache  355  3 mars  18:25 challenge.0.dagjob
> -rw-r--r--  1 apache apache  332  3 mars  18:25 challenge.1.dagjob
> -rw-r--r--  1 apache apache  564  3 mars  18:25 challenge.2.dagjob
> -rw-r--r--  1 apache apache  355  3 mars  18:25 challenge.3.dagjob
> lrwxrwxrwx  1 apache apache   56  3 mars  18:25 challenge.bms -> 
> /var/www/miccai4/modules/challenge/library/challenge.bms
> -rw-r--r--  1 apache apache 1473  3 mars  18:25 challenge.config.bms
> -rw-r--r--  1 apache apache 1593  3 mars  18:25 challenge.dagjob
> -rw-r--r--  1 apache apache 1043  3 mars  18:25 
> challenge.dagjob.condor.sub
> lrwxrwxrwx  1 apache apache   70  3 mars  18:25 
> challenge_validator_app.bms -> 
> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
> drwxrwxr-x  4 apache apache 4096  3 mars  18:25 data
> lrwxrwxrwx  1 apache apache   50  3 mars  18:25 PHP.bmm -> 
> /var/www/miccai4/modules/challenge/library/PHP.bmm
> -rw-r--r--  1 apache apache  138  3 mars  18:25 userconfig.cfg
> lrwxrwxrwx  1 apache apache   67  3 mars  18:25 
> ValidateImageAveDist.bmm -> 
> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>
>
> OK (created by batchmake and relaunched by hand):
> -bash-4.2$ ls -la 48
> total 104
> drwxrwxr-x  3 apache apache  4096  3 mars  18:18 .
> drwxr-xr-x 35 apache apache  4096  3 mars  18:25 ..
> -rw-r--r--  1 apache apache   140  3 mars  18:09 adminconfig.cfg
> -rw-r--r--  1 apache apache     0  3 mars  18:13 bmGrid.0.error.txt
> -rw-r--r--  1 apache apache  1968  3 mars  18:18 bmGrid.0.log.txt
> -rw-r--r--  1 apache apache   148  3 mars  18:18 bmGrid.0.out.txt
> -rw-r--r--  1 apache apache   355  3 mars  18:09 challenge.0.dagjob
> -rw-r--r--  1 apache apache   332  3 mars  18:09 challenge.1.dagjob
> -rw-r--r--  1 apache apache   564  3 mars  18:09 challenge.2.dagjob
> -rw-r--r--  1 apache apache   355  3 mars  18:09 challenge.3.dagjob
> lrwxrwxrwx  1 apache apache    56  3 mars  18:09 challenge.bms -> 
> /var/www/miccai4/modules/challenge/library/challenge.bms
> -rw-r--r--  1 apache apache  1473  3 mars  18:09 challenge.config.bms
> -rw-r--r--  1 apache apache  1593  3 mars  18:09 challenge.dagjob
> -rw-r--r--  1 apache apache  1042  3 mars  18:18 
> challenge.dagjob.condor.sub
> -rw-r--r--  1 apache apache   610  3 mars  18:18 
> challenge.dagjob.dagman.log
> -rw-r--r--  1 apache apache 16074  3 mars  18:18 
> challenge.dagjob.dagman.out
> -rw-r--r--  1 apache apache   256  3 mars  18:18 challenge.dagjob.dot
> -rw-r--r--  1 apache apache     0  3 mars  18:18 challenge.dagjob.lib.err
> -rw-r--r--  1 apache apache    29  3 mars  18:18 challenge.dagjob.lib.out
> -rw-r--r--  1 apache apache   970  3 mars  18:18 
> challenge.dagjob.nodes.log
> -rw-r--r--  1 apache apache   243  3 mars  18:18 
> challenge.dagjob.rescue001
> -rw-r--r--  1 apache apache   243  3 mars  18:13 
> challenge.dagjob.rescue001.old
> lrwxrwxrwx  1 apache apache    70  3 mars  18:09 
> challenge_validator_app.bms -> 
> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
> drwxrwxr-x  4 apache apache  4096  3 mars  18:09 data
> lrwxrwxrwx  1 apache apache    50  3 mars  18:09 PHP.bmm -> 
> /var/www/miccai4/modules/challenge/library/PHP.bmm
> -rw-r--r--  1 apache apache   138  3 mars  18:09 userconfig.cfg
> lrwxrwxrwx  1 apache apache    67  3 mars  18:09 
> ValidateImageAveDist.bmm -> 
> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>
>> I'm not sure what is going on, just trying to get more context...
>>
>> I recall I ran into a problem where one machine was the submitter, 
>> and there was a midas user there, with uid 100, and a midas user on 
>> another machine (the execution node) with a uid 200, and I got what 
>> sounded like a similar message--I had to make sure their uids were 
>> the same across machines to deal with permissions across an NFS mount 
>> on both machines. This sounds nothing like your problem, but I wanted 
>> to include it in case it gives you any ideas.
>
> Thank you for the hint.
> My problem seems to be similar, in the sense that it looks like a user 
> problem. However, I cannot find the difference between the two 
> potential users: apache and who else?...
>
> I noticed in the condor log (the one attached) the following line :
> 03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
> uid 48 does correspond to apache. What surprises me is that the log 
> prints "Switching to user uid: 48". Does that mean that until that 
> moment it was running as some other user?...
>
>>
>> Can you explain more about the library issues you ran into earlier 
>> that prevented you from running jobs?
>
> I don't remember exactly, but I spent quite some time on that one too.
> In that case, jobs were submitted, but stayed idle : if I remember 
> correctly, there was some library preventing one of the condor daemons 
> from launching/executing correctly. I really don't think this could be 
> connected...
>
> Thank you,
> Sorina
>
>>
>>
>>
>> Thanks,
>> Mike
>>
>>
>>
>>
>>
>> On Mon, Mar 3, 2014 at 11:50 AM, Sorina Camarasu Pop 
>> <sorina.pop at creatis.insa-lyon.fr> wrote:
>>
>>     Hi Mike,
>>
>>     Thank you for your prompt reply.
>>
>>     On 03/03/2014 17:27, Michael Grauer wrote:
>>>     Hi Sorina,
>>>
>>>     These are tough to track down.
>>
>>     I know, I've spent my afternoon on it...
>>
>>
>>>     Can you tell me more about your environment?  Specifically, the
>>>     3 machines (possibly all the same machine) that are your condor
>>>     submit, condor manager, and condor execute nodes?
>>
>>     I use the same machine (virtual machine configured as a dual
>>     core) for my condor submit, condor manager, and condor execute
>>     nodes.
>>
>>
>>>     What operating system is your web server, and what version of
>>>     Condor are you using?
>>
>>     Fedora 18.
>>     For Condor, I had compiled the latest version available, but had
>>     some library problems preventing me from launching any job. I
>>     finally got it working with the version available via yum install:
>>     condor_version
>>     $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $
>>     $CondorPlatform: X86_64-Fedora_18 $
>>
>>
>>
>>>      Is your condor submit node the same as your web server (most
>>>     likely yes)?
>>
>>     yes.
>>
>>
>>>     Are you running your web server as the apache user (most likely
>>>     yes),
>>
>>     Yes, I even printed out "whoami" to check that it really runs as
>>     apache.
>>
>>
>>>     and is it your web server that is calling the php code that
>>>     results in condor_submit_dag (most likely yes, again) ?
>>
>>     Yes.
>>     I use the "standard" batchmake config, i.e. the condorSubmitDag
>>     function from KWBatchmakeComponent.php
>>
>>
>>>     Can you show the permissions and ownership of the temporary work
>>>     directory where the condor_submit_dag command is executed?
>>
>>     ls -la
>>     ...
>>     drwxrwxr-x  3 apache apache 4096  3 mars  16:53 45
>>     drwxrwxr-x  3 apache apache 4096  3 mars  17:41 46
>>
>>     -bash-4.2$ cd 46
>>     -bash-4.2$ ls -la
>>     total 92
>>     drwxrwxr-x  3 apache apache 4096  3 mars  17:41 .
>>     drwxr-xr-x 29 apache apache 4096  3 mars  17:40 ..
>>     -rw-r--r--  1 apache apache  140  3 mars  17:40 adminconfig.cfg
>>     -rw-r--r--  1 apache apache    0  3 mars  17:41 bmGrid.0.error.txt
>>     lrwxrwxrwx  1 apache apache   56  3 mars  17:40 challenge.bms ->
>>     /var/www/miccai4/modules/challenge/library/challenge.bms
>>     ...
>>
>>
>>
>>>     When you tested as the apache user, did you do this test from
>>>     the same temporary work directory that Midas/apache would have
>>>     tried this from?
>>
>>     Yes, from folder /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/46
>>     (drwxrwxr-x , owned by apache)
>>
>>
>>>     Is there any more information in the logs or error logs
>>>     generated by Condor in the temp work directory that you could share?
>>
>>     tail -f challenge.dagjob.condor.sub
>>     # Note: default on_exit_remove expression:
>>     # ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0
>>     && ExitCode <= 2))
>>     # attempts to ensure that DAGMan is automatically
>>     # requeued by the schedd if it exits abnormally or
>>     # is killed (e.g., during a reboot).
>>     on_exit_remove  = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED
>>     && ExitCode >=0 && ExitCode <= 2))
>>     copy_to_spool   = False
>>     arguments       = "-f -l . -Lockfile challenge.dagjob.lock
>>     -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion
>>     $CondorVersion:' '7.9.1' 'Aug' '24' '2012' 'PRE-RELEASE-UWCS' '$
>>     -Dagman /usr/bin/condor_dagman"
>>     environment     =
>>     _CONDOR_DAGMAN_LOG=challenge.dagjob.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
>>     queue
>>
>>     tail -f challenge.0.dagjob
>>     # More information at: http://www.batchmake.org
>>     Universe       = vanilla
>>     Output         = bmGrid.0.out.txt
>>     Error          = bmGrid.0.error.txt
>>     Log            = bmGrid.0.log.txt
>>     Notification   = NEVER
>>     Executable    = /usr/bin/php
>>     Arguments     = "'--version'"
>>     Queue 1
>>
>>     I hope this can help with debugging the problem...
>>
>>     Thank you,
>>     Sorina
>>
>>
>>>     Thanks,
>>>     Mike
>>>
>>>
>>>     On Mon, Mar 3, 2014 at 11:16 AM, Sorina Camarasu Pop
>>>     <sorina.pop at creatis.insa-lyon.fr> wrote:
>>>
>>>         Dear Midas users and developers,
>>>
>>>         I am trying to configure Midas with the Challenge and
>>>         BatchMake modules, but I encounter problems when executing
>>>         the condor_submit_dag command.
>>>
>>>         The error printed by Condor when executing the
>>>         condor_submit_dag command using the Batchmake module looks
>>>         like this : "DC_AUTHENTICATE: authentication of
>>>         <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped
>>>         user name, which is required for this command (1112
>>>         QMGMT_WRITE_CMD), so aborting."
>>>
>>>         Nevertheless, if I execute exactly the same command line as
>>>         apache in a console, everything works fine. I do not
>>>         understand where the difference comes from.
>>>
>>>         Do you know if there's any special configuration for Condor
>>>         to work with the Batchmake module ?
>>>
>>>         Thank you for your help,
>>>         Sorina
>>>
>>>         -- 
>>>         Sorina Pop, PhD
>>>         CNRS Research Engineer
>>>         CREATIS
>>>         Tel : +33 (0)4 72 43 72 99
>>>
>>>         _______________________________________________
>>>         Midas mailing list
>>>         Midas at public.kitware.com
>>>         http://public.kitware.com/cgi-bin/mailman/listinfo/midas
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>     -- 
>>     Sorina Pop, PhD
>>     CNRS Research Engineer
>>     CREATIS
>>     Tel : +33 (0)4 72 43 72 99
>>
>>
>>
>>
>> -- 
>> Thanks,
>> Michael Grauer
>> R & D Engineer
>> Kitware, Inc.
>> 919 969 6990 x322
>>
>>
>
>
> -- 
> Sorina Pop, PhD
> CNRS Research Engineer
> CREATIS
> Tel : +33 (0)4 72 43 72 99
>
>


-- 
Sorina Pop, PhD
CNRS Research Engineer
CREATIS
Tel : +33 (0)4 72 43 72 99
