[Midas] Problem with Midas + Batchmake + Condor
Sorina Camarasu Pop
sorina.pop at creatis.insa-lyon.fr
Tue Mar 4 09:46:28 EST 2014
Hello again,
I've discovered an interesting config option within condor:
SHADOW_ALLOW_UNSAFE_REMOTE_EXEC seems to allow shell calls via the libc
'system()' function. Have any of you already used it in order to allow
calls made through the Midas executor->exec?
I tried turning it on and using the Condor shadow daemon, but I get an
error saying "Assertion ERROR on (job_ad_file)" at line 166 in file
shadow_v61_main.cpp ...
So before going into the trouble of trying to solve this error, I was
wondering if you know about this "shadow" config option and if you can
confirm it is necessary.
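
For comparing the two contexts discussed further down in this thread
(the web server versus a console, both nominally the apache user), a
small identity dump can help. The following is only a minimal sketch,
not something taken from the Midas, BatchMake or Condor code: the
function name describe_identity and the choice of environment variables
(USER, HOME, PATH, CONDOR_CONFIG) are assumptions made for illustration.
It reports the real and effective uid/gid of the current process, which
is a little more than whoami shows.

```python
import os
import pwd
import grp


def _user_name(uid):
    """Map a uid to a user name, or None if it has no passwd entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def _group_name(gid):
    """Map a gid to a group name, or None if it has no group entry."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return None


def describe_identity():
    """Collect the identity and a few environment variables of this process.

    The variable list is an assumption: these are values that commonly
    differ between a login shell and a process started by a web server.
    """
    env_names = ("USER", "HOME", "PATH", "CONDOR_CONFIG")
    return {
        "real_uid": os.getuid(),
        "real_user": _user_name(os.getuid()),
        "effective_uid": os.geteuid(),
        "effective_user": _user_name(os.geteuid()),
        "real_gid": os.getgid(),
        "real_group": _group_name(os.getgid()),
        "effective_gid": os.getegid(),
        "effective_group": _group_name(os.getegid()),
        "cwd": os.getcwd(),
        "env": {name: os.environ.get(name) for name in env_names},
    }


if __name__ == "__main__":
    for key, value in describe_identity().items():
        print(f"{key}: {value}")
```

Running it once from a console as apache and once from the code path
used by the web server, then comparing the two outputs, would show
whether the identity or the environment differs between the two cases.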
Best regards,
Sorina
On 03/03/2014 18:44, Sorina Camarasu Pop wrote:
>
>
> On 03/03/2014 18:06, Michael Grauer wrote:
>> Where did you see the message: "DC_AUTHENTICATE: authentication of
>> <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped user name,
>> which is required for this command (1112 QMGMT_WRITE_CMD), so aborting."
>
> In the condor log : /home/condor/localcondor/log/SchedLog
>
>> Was there any other output included there?
>
> I copied parts of the log file into the attached file, which contains
> the output printed both when using batchmake and when running the
> condor commands directly.
>
>> Do you have a "condor" user on your VM?
>
> Yes.
>
>> When you successfully run jobs by doing "condor_submit_dag" from the
>> command line as the apache user,
>
> apache 22773 29731 0 18:18 ? 00:00:00
> condor_scheduniv_exec.26.0 -f -l . -Lockfile challenge.dagjob.lock
> -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion
> $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $ -Force -Dagman
> /bin/condor_dagman
>
>
>> when you watch your job run with ps or top, which user runs the
>> actual execution process (whatever job batchmake will run for you) ?
>
> When launching it with batchmake (through the web interface) I cannot
> find the corresponding condor process... I only see an httpd process
> run by apache....
>
>>
>> Can you include your "challenge.bms" script in an email?
>
> Of course, here it is attached.
>
>>
>> Can you show me the output of "ls" from a directory where the submit
>> failed and then again from one where the submit succeeded, at the end
>> of the job processing run?
>
> Failed :
> ls -la 52/
> total 56
> drwxrwxr-x 3 apache apache 4096 3 mars 18:25 .
> drwxr-xr-x 35 apache apache 4096 3 mars 18:25 ..
> -rw-r--r-- 1 apache apache 140 3 mars 18:25 adminconfig.cfg
> -rw-r--r-- 1 apache apache 355 3 mars 18:25 challenge.0.dagjob
> -rw-r--r-- 1 apache apache 332 3 mars 18:25 challenge.1.dagjob
> -rw-r--r-- 1 apache apache 564 3 mars 18:25 challenge.2.dagjob
> -rw-r--r-- 1 apache apache 355 3 mars 18:25 challenge.3.dagjob
> lrwxrwxrwx 1 apache apache 56 3 mars 18:25 challenge.bms ->
> /var/www/miccai4/modules/challenge/library/challenge.bms
> -rw-r--r-- 1 apache apache 1473 3 mars 18:25 challenge.config.bms
> -rw-r--r-- 1 apache apache 1593 3 mars 18:25 challenge.dagjob
> -rw-r--r-- 1 apache apache 1043 3 mars 18:25
> challenge.dagjob.condor.sub
> lrwxrwxrwx 1 apache apache 70 3 mars 18:25
> challenge_validator_app.bms ->
> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
> drwxrwxr-x 4 apache apache 4096 3 mars 18:25 data
> lrwxrwxrwx 1 apache apache 50 3 mars 18:25 PHP.bmm ->
> /var/www/miccai4/modules/challenge/library/PHP.bmm
> -rw-r--r-- 1 apache apache 138 3 mars 18:25 userconfig.cfg
> lrwxrwxrwx 1 apache apache 67 3 mars 18:25
> ValidateImageAveDist.bmm ->
> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>
>
> OK (created by batchmake and relaunched by hand):
> -bash-4.2$ ls -la 48
> total 104
> drwxrwxr-x 3 apache apache 4096 3 mars 18:18 .
> drwxr-xr-x 35 apache apache 4096 3 mars 18:25 ..
> -rw-r--r-- 1 apache apache 140 3 mars 18:09 adminconfig.cfg
> -rw-r--r-- 1 apache apache 0 3 mars 18:13 bmGrid.0.error.txt
> -rw-r--r-- 1 apache apache 1968 3 mars 18:18 bmGrid.0.log.txt
> -rw-r--r-- 1 apache apache 148 3 mars 18:18 bmGrid.0.out.txt
> -rw-r--r-- 1 apache apache 355 3 mars 18:09 challenge.0.dagjob
> -rw-r--r-- 1 apache apache 332 3 mars 18:09 challenge.1.dagjob
> -rw-r--r-- 1 apache apache 564 3 mars 18:09 challenge.2.dagjob
> -rw-r--r-- 1 apache apache 355 3 mars 18:09 challenge.3.dagjob
> lrwxrwxrwx 1 apache apache 56 3 mars 18:09 challenge.bms ->
> /var/www/miccai4/modules/challenge/library/challenge.bms
> -rw-r--r-- 1 apache apache 1473 3 mars 18:09 challenge.config.bms
> -rw-r--r-- 1 apache apache 1593 3 mars 18:09 challenge.dagjob
> -rw-r--r-- 1 apache apache 1042 3 mars 18:18
> challenge.dagjob.condor.sub
> -rw-r--r-- 1 apache apache 610 3 mars 18:18
> challenge.dagjob.dagman.log
> -rw-r--r-- 1 apache apache 16074 3 mars 18:18
> challenge.dagjob.dagman.out
> -rw-r--r-- 1 apache apache 256 3 mars 18:18 challenge.dagjob.dot
> -rw-r--r-- 1 apache apache 0 3 mars 18:18 challenge.dagjob.lib.err
> -rw-r--r-- 1 apache apache 29 3 mars 18:18 challenge.dagjob.lib.out
> -rw-r--r-- 1 apache apache 970 3 mars 18:18
> challenge.dagjob.nodes.log
> -rw-r--r-- 1 apache apache 243 3 mars 18:18
> challenge.dagjob.rescue001
> -rw-r--r-- 1 apache apache 243 3 mars 18:13
> challenge.dagjob.rescue001.old
> lrwxrwxrwx 1 apache apache 70 3 mars 18:09
> challenge_validator_app.bms ->
> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
> drwxrwxr-x 4 apache apache 4096 3 mars 18:09 data
> lrwxrwxrwx 1 apache apache 50 3 mars 18:09 PHP.bmm ->
> /var/www/miccai4/modules/challenge/library/PHP.bmm
> -rw-r--r-- 1 apache apache 138 3 mars 18:09 userconfig.cfg
> lrwxrwxrwx 1 apache apache 67 3 mars 18:09
> ValidateImageAveDist.bmm ->
> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>
>> I'm not sure what is going on, just trying to get more context...
>>
>> I recall I ran into a problem where one machine was the submitter,
>> and there was a midas user there, with uid 100, and a midas user on
>> another machine (the execution node) with a uid 200, and I got what
>> sounded like a similar message--I had to make sure their uids were
>> the same across machines to deal with permissions across an NFS mount
>> on both machines. This sounds nothing like your problem, but I wanted
>> to include it in case it gives you any ideas.
>
> Thank you for the hint.
> My problem seems to be similar, in the sense that it looks like a user
> problem. However, I cannot work out which 2 users could be involved:
> apache and who else?...
>
> I noticed in the condor log (the one attached) the following line :
> 03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
> uid 48 does correspond to apache. What surprises me is that the log
> prints out "Switching to user uid: 48". Does that mean that until that
> moment it was executing as some other user?...
>
>>
>> Can you explain more about the library issues you ran into earlier
>> that prevented you from running jobs?
>
> I don't remember exactly, but I spent quite some time on that one too.
> In that case, jobs were submitted, but stayed idle : if I remember
> correctly, there was some library preventing one of the condor daemons
> from launching/executing correctly. I really don't think this could be
> connected...
>
> Thank you,
> Sorina
>
>>
>>
>>
>> Thanks,
>> Mike
>>
>>
>>
>>
>>
>> On Mon, Mar 3, 2014 at 11:50 AM, Sorina Camarasu Pop
>> <sorina.pop at creatis.insa-lyon.fr> wrote:
>>
>> Hi Mike,
>>
>> Thank you for your prompt reply.
>>
>> On 03/03/2014 17:27, Michael Grauer wrote:
>>> Hi Sorina,
>>>
>>> These are tough to track down.
>>
>> I know, I've spent my afternoon on it...
>>
>>
>>> Can you tell me more about your environment? Specifically, the
>>> 3 machines (possibly all the same machine) that are your condor
>>> submit, condor manager, and condor execute nodes?
>>
>> I use the same machine (virtual machine configured as a dual
>> core) for my condor submit, condor manager, and condor execute
>> nodes.
>>
>>
>>> What operating system is your web server, and what version of
>>> Condor are you using?
>>
>> Fedora 18.
>> For Condor, I had compiled the latest version available, but had
>> some library problems preventing me from launching any job. I
>> finally got it to work with the version available through yum install:
>> condor_version
>> $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $
>> $CondorPlatform: X86_64-Fedora_18 $
>>
>>
>>
>>> Is your condor submit node the same as your web server (most
>>> likely yes)?
>>
>> yes.
>>
>>
>>> Are you running your web server as the apache user (most likely
>>> yes),
>>
>> Yes, I even printed out "whoami" to check that it really runs as
>> apache.
>>
>>
>>> and is it your web server that is calling the php code that
>>> results in condor_submit_dag (most likely yes, again)?
>>
>> Yes.
>> I use the "standard" batchmake config, i.e. the condorSubmitDag
>> function from KWBatchmakeComponent.php
>>
>>
>>> Can you show the permissions and ownership of the temporary work
>>> directory where the condor_submit_dag command is executed?
>>
>> ls -la
>> ...
>> drwxrwxr-x 3 apache apache 4096 3 mars 16:53 45
>> drwxrwxr-x 3 apache apache 4096 3 mars 17:41 46
>>
>> -bash-4.2$ cd 46
>> -bash-4.2$ ls -la
>> total 92
>> drwxrwxr-x 3 apache apache 4096 3 mars 17:41 .
>> drwxr-xr-x 29 apache apache 4096 3 mars 17:40 ..
>> -rw-r--r-- 1 apache apache 140 3 mars 17:40 adminconfig.cfg
>> -rw-r--r-- 1 apache apache 0 3 mars 17:41 bmGrid.0.error.txt
>> lrwxrwxrwx 1 apache apache 56 3 mars 17:40 challenge.bms ->
>> /var/www/miccai4/modules/challenge/library/challenge.bms
>> ...
>>
>>
>>
>>> When you tested as the apache user, did you do this test from
>>> the same temporary work directory that Midas/apache would have
>>> tried this from?
>>
>> Yes, from folder /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/46
>> (drwxrwxr-x , owned by apache)
>>
>>
>>> Is there any more information in the logs or error logs
>>> generated by Condor in the temp work directory that you could share?
>>
>> tail -f challenge.dagjob.condor.sub
>> # Note: default on_exit_remove expression:
>> # ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0
>> && ExitCode <= 2))
>> # attempts to ensure that DAGMan is automatically
>> # requeued by the schedd if it exits abnormally or
>> # is killed (e.g., during a reboot).
>> on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED
>> && ExitCode >=0 && ExitCode <= 2))
>> copy_to_spool = False
>> arguments = "-f -l . -Lockfile challenge.dagjob.lock
>> -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion
>> $CondorVersion:' '7.9.1' 'Aug' '24' '2012' 'PRE-RELEASE-UWCS' '$
>> -Dagman /usr/bin/condor_dagman"
>> environment =
>> _CONDOR_DAGMAN_LOG=challenge.dagjob.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
>> queue
>>
>> tail -f challenge.0.dagjob
>> # More information at: http://www.batchmake.org
>> Universe = vanilla
>> Output = bmGrid.0.out.txt
>> Error = bmGrid.0.error.txt
>> Log = bmGrid.0.log.txt
>> Notification = NEVER
>> Executable = /usr/bin/php
>> Arguments = "'--version'"
>> Queue 1
>>
>> I hope this can help with debugging the problem...
>>
>> Thank you,
>> Sorina
>>
>>
>>> Thanks,
>>> Mike
>>>
>>>
>>> On Mon, Mar 3, 2014 at 11:16 AM, Sorina Camarasu Pop
>>> <sorina.pop at creatis.insa-lyon.fr> wrote:
>>>
>>> Dear Midas users and developers,
>>>
>>> I am trying to configure Midas with the Challenge and
>>> BatchMake modules, but I encounter problems when executing
>>> the condor_submit_dag command.
>>>
>>> The error printed by Condor when executing the
>>> condor_submit_dag command using the Batchmake module looks
>>> like this : "DC_AUTHENTICATE: authentication of
>>> <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped
>>> user name, which is required for this command (1112
>>> QMGMT_WRITE_CMD), so aborting."
>>>
>>> Nevertheless, if I execute exactly the same command line as
>>> apache in a console, everything works fine. I do not
>>> understand where the difference comes from.
>>>
>>> Do you know if there's any special configuration for Condor
>>> to work with the Batchmake module ?
>>>
>>> Thank you for your help,
>>> Sorina
>>>
>>> --
>>> Sorina Pop, PhD
>>> CNRS Research Engineer
>>> CREATIS
>>> Tel : +33 (0)4 72 43 72 99
>>>
>>> _______________________________________________
>>> Midas mailing list
>>> Midas at public.kitware.com
>>> http://public.kitware.com/cgi-bin/mailman/listinfo/midas
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Sorina Pop, PhD
>> CNRS Research Engineer
>> CREATIS
>> Tel : +33 (0)4 72 43 72 99
>>
>>
>>
>>
>> --
>> Thanks,
>> Michael Grauer
>> R & D Engineer
>> Kitware, Inc.
>> 919 969 6990 x322
>>
>>
>
>
> --
> Sorina Pop, PhD
> CNRS Research Engineer
> CREATIS
> Tel : +33 (0)4 72 43 72 99
>
>
> _______________________________________________
> Midas mailing list
> Midas at public.kitware.com
> http://public.kitware.com/cgi-bin/mailman/listinfo/midas
--
Sorina Pop, PhD
CNRS Research Engineer
CREATIS
Tel : +33 (0)4 72 43 72 99