[Midas] Problem with Midas + Batchmake + Condor

Michael Grauer michael.grauer at kitware.com
Tue Mar 4 10:21:38 EST 2014


Hi Sorina,

I never had to deal with the SHADOW_ALLOW_UNSAFE_REMOTE_EXEC property, but
I had been using Condor 7.4.4, and you are using a newer version.

I Googled around a bit on your error message, and saw a couple of posts
that might help.

Also, looking at the top of your attached condor.sched.log file I see (but
haven't Googled for this)

3/03/14 18:09:56 authenticate_self_gss: acquiring self credentials failed.
Please check your Condor configuration file if this is a server process. Or
the user environment variable if this is a user process.


And there may be more helpful hints below that message in the file.
Together these suggest an authentication configuration problem. Looking
at the posts below and checking the Condor reference for the relevant
configuration settings might help.

https://www-auth.cs.wisc.edu/lists/htcondor-users/2013-February/msg00129.shtml

http://comments.gmane.org/gmane.comp.distributed.condor.user/27728
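
To pull the authentication-related lines out of a long log, a small filter
along these lines might help. This is only a sketch: the patterns are the
message prefixes quoted in this thread, the function name auth_lines is made
up for illustration, and the log path is taken from the command line rather
than assumed.

```python
import sys

# Message prefixes copied from the log excerpts quoted in this thread.
PATTERNS = ("authenticate_self_gss", "DC_AUTHENTICATE", "ATTEMPT_ACCESS")


def auth_lines(log_text, context=2):
    """Return (line_number, line) pairs for every line containing one of
    PATTERNS, plus up to `context` lines following each match."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if any(p in line for p in PATTERNS):
            # Keep the match and the lines right after it, where
            # follow-up hints tend to be printed.
            keep.update(range(i, min(i + context + 1, len(lines))))
    return [(i + 1, lines[i]) for i in sorted(keep)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python auth_lines.py <path to the schedd log>
    with open(sys.argv[1], errors="replace") as f:
        for number, line in auth_lines(f.read()):
            print(f"{number}: {line}")
```

Running it once on the log from a failed web submission and once on the log
from a successful manual submission gives two short lists that are easier to
compare than the full files.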


You can also email the Condor mailing list. I have done so in the past and
the community has been quite helpful.  Before you do this, I suggest you
get this down to the simplest possible example, as all of the
Midas/BatchMake stuff may just add confusion.  Can you try a very simple
example where you take a single PHP file and try to do a condor_submit_dag
with it in the same way that the challenge module does?  If you can repeat
the problem with that test case, and then do the same thing successfully
as the apache user, you will have an easier problem to describe and get
help with.
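
As a starting point for such a test case, here is a rough sketch that writes
a one-node DAG into a work directory. The submit description mirrors the
challenge.0.dagjob quoted further down in this thread (vanilla universe,
/usr/bin/php --version). The file names test.sub and test.dag, the output
file names, and the helper names are assumptions chosen for illustration.
The submit step is skipped when condor_submit_dag is not on the PATH.

```python
import os
import shutil
import subprocess

# Submit description modelled on the challenge.0.dagjob shown in this thread.
# The output, error and log file names are hypothetical.
SUBMIT = """\
Universe     = vanilla
Executable   = /usr/bin/php
Arguments    = "'--version'"
Output       = test.out.txt
Error        = test.error.txt
Log          = test.log.txt
Notification = NEVER
Queue 1
"""

# A DAG with a single node named "test" that runs test.sub.
DAG = "JOB test test.sub\n"


def write_minimal_dag(workdir):
    """Write test.sub and test.dag into workdir and return the DAG path."""
    with open(os.path.join(workdir, "test.sub"), "w") as f:
        f.write(SUBMIT)
    dag_path = os.path.join(workdir, "test.dag")
    with open(dag_path, "w") as f:
        f.write(DAG)
    return dag_path


def submit(dag_path):
    """Run condor_submit_dag from the DAG's directory.

    Returns the CompletedProcess, or None if condor_submit_dag is not found.
    """
    exe = shutil.which("condor_submit_dag")
    if exe is None:
        return None
    return subprocess.run(
        [exe, os.path.basename(dag_path)],
        cwd=os.path.dirname(dag_path),
        capture_output=True,
        text=True,
    )
```

Calling write_minimal_dag on an apache-owned directory and then running
condor_submit_dag on test.dag twice, once from a shell as apache and once
from a PHP page served by the web server, should show whether the failure
depends on Midas/BatchMake at all.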

Let us know how it goes, and good luck!

Thanks,
Mike




On Tue, Mar 4, 2014 at 9:46 AM, Sorina Camarasu Pop <
sorina.pop at creatis.insa-lyon.fr> wrote:

>
> Hello again,
>
> I've discovered an interesting config option within condor: the
> SHADOW_ALLOW_UNSAFE_REMOTE_EXEC seems to allow shell calls via the libc
> 'system()' function. Is it something any of you have already used in order
> to allow calls with the Midas executor->exec ?
>
> I tried turning it on and using the Condor shadow daemon, but I get an error
> saying "Assertion ERROR on (job_ad_file)" at line 166 in file
> shadow_v61_main.cpp ...
>
> So before going into the trouble of trying to solve this error, I was
> wondering if you know about this "shadow" config option and if you can
> confirm it is necessary.
>
> Best regards,
> Sorina
>
> On 03/03/2014 18:44, Sorina Camarasu Pop wrote:
>
>
>
> On 03/03/2014 18:06, Michael Grauer wrote:
>
> Where did you see the message: "DC_AUTHENTICATE: authentication of
> <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped user name, which
> is required for this command (1112 QMGMT_WRITE_CMD), so aborting."
>
>
> In the condor log : /home/condor/localcondor/log/SchedLog
>
>  Was there any other output included there?
>
>
> I copied parts of the log file into the attached file, which contains the
> output printed both when using batchmake and when running condor commands
> directly.
>
>  Do you have a "condor" user on your VM?
>
>
> Yes.
>
>   When you successfully run jobs by doing "condor_submit_dag" from the
> command line as the apache user,
>
>
> apache   22773 29731  0 18:18 ?        00:00:00 condor_scheduniv_exec.26.0 -f -l . -Lockfile challenge.dagjob.lock -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $ -Force -Dagman /bin/condor_dagman
>
>
>  when you watch your job run with ps or top, which user runs the actual
> execution process (whatever job batchmake will run for you) ?
>
>
> When launching it with batchmake (through the web interface) I do not
> manage to get the corresponding condor process... I only get an httpd
> process run by apache....
>
>
>  Can you include your "challenge.bms" script in an email?
>
>
> Of course, here it is attached.
>
>
>  Can you show me the output of "ls" from a directory where the submit
> failed and then again from one where the submit succeeded, at the end of
> the job processing run?
>
>
> Failed :
> ls -la 52/
> total 56
> drwxrwxr-x  3 apache apache 4096  3 mars  18:25 .
> drwxr-xr-x 35 apache apache 4096  3 mars  18:25 ..
> -rw-r--r--  1 apache apache  140  3 mars  18:25 adminconfig.cfg
> -rw-r--r--  1 apache apache  355  3 mars  18:25 challenge.0.dagjob
> -rw-r--r--  1 apache apache  332  3 mars  18:25 challenge.1.dagjob
> -rw-r--r--  1 apache apache  564  3 mars  18:25 challenge.2.dagjob
> -rw-r--r--  1 apache apache  355  3 mars  18:25 challenge.3.dagjob
> lrwxrwxrwx  1 apache apache   56  3 mars  18:25 challenge.bms -> /var/www/miccai4/modules/challenge/library/challenge.bms
> -rw-r--r--  1 apache apache 1473  3 mars  18:25 challenge.config.bms
> -rw-r--r--  1 apache apache 1593  3 mars  18:25 challenge.dagjob
> -rw-r--r--  1 apache apache 1043  3 mars  18:25 challenge.dagjob.condor.sub
> lrwxrwxrwx  1 apache apache   70  3 mars  18:25 challenge_validator_app.bms -> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
> drwxrwxr-x  4 apache apache 4096  3 mars  18:25 data
> lrwxrwxrwx  1 apache apache   50  3 mars  18:25 PHP.bmm -> /var/www/miccai4/modules/challenge/library/PHP.bmm
> -rw-r--r--  1 apache apache  138  3 mars  18:25 userconfig.cfg
> lrwxrwxrwx  1 apache apache   67  3 mars  18:25 ValidateImageAveDist.bmm -> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>
>
> OK (created by batchmake and relaunched by hand):
> -bash-4.2$ ls -la 48
> total 104
> drwxrwxr-x  3 apache apache  4096  3 mars  18:18 .
> drwxr-xr-x 35 apache apache  4096  3 mars  18:25 ..
> -rw-r--r--  1 apache apache   140  3 mars  18:09 adminconfig.cfg
> -rw-r--r--  1 apache apache     0  3 mars  18:13 bmGrid.0.error.txt
> -rw-r--r--  1 apache apache  1968  3 mars  18:18 bmGrid.0.log.txt
> -rw-r--r--  1 apache apache   148  3 mars  18:18 bmGrid.0.out.txt
> -rw-r--r--  1 apache apache   355  3 mars  18:09 challenge.0.dagjob
> -rw-r--r--  1 apache apache   332  3 mars  18:09 challenge.1.dagjob
> -rw-r--r--  1 apache apache   564  3 mars  18:09 challenge.2.dagjob
> -rw-r--r--  1 apache apache   355  3 mars  18:09 challenge.3.dagjob
> lrwxrwxrwx  1 apache apache    56  3 mars  18:09 challenge.bms -> /var/www/miccai4/modules/challenge/library/challenge.bms
> -rw-r--r--  1 apache apache  1473  3 mars  18:09 challenge.config.bms
> -rw-r--r--  1 apache apache  1593  3 mars  18:09 challenge.dagjob
> -rw-r--r--  1 apache apache  1042  3 mars  18:18 challenge.dagjob.condor.sub
> -rw-r--r--  1 apache apache   610  3 mars  18:18 challenge.dagjob.dagman.log
> -rw-r--r--  1 apache apache 16074  3 mars  18:18 challenge.dagjob.dagman.out
> -rw-r--r--  1 apache apache   256  3 mars  18:18 challenge.dagjob.dot
> -rw-r--r--  1 apache apache     0  3 mars  18:18 challenge.dagjob.lib.err
> -rw-r--r--  1 apache apache    29  3 mars  18:18 challenge.dagjob.lib.out
> -rw-r--r--  1 apache apache   970  3 mars  18:18 challenge.dagjob.nodes.log
> -rw-r--r--  1 apache apache   243  3 mars  18:18 challenge.dagjob.rescue001
> -rw-r--r--  1 apache apache   243  3 mars  18:13 challenge.dagjob.rescue001.old
> lrwxrwxrwx  1 apache apache    70  3 mars  18:09 challenge_validator_app.bms -> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
> drwxrwxr-x  4 apache apache  4096  3 mars  18:09 data
> lrwxrwxrwx  1 apache apache    50  3 mars  18:09 PHP.bmm -> /var/www/miccai4/modules/challenge/library/PHP.bmm
> -rw-r--r--  1 apache apache   138  3 mars  18:09 userconfig.cfg
> lrwxrwxrwx  1 apache apache    67  3 mars  18:09 ValidateImageAveDist.bmm -> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>
>  I'm not sure what is going on, just trying to get more context...
>
>  I recall I ran into a problem where one machine was the submitter, and
> there was a midas user there, with uid 100, and a midas user on another
> machine (the execution node) with a uid 200, and I got what sounded like a
> similar message--I had to make sure their uids were the same across
> machines to deal with permissions across an NFS mount on both machines.
> This sounds nothing like your problem, but I wanted to include it in case
> it gives you any ideas.
>
>
> Thank you for the hint.
> My problem seems to be similar, in the sense that it looks like a user
> problem. However, I cannot find the difference between the 2
> potential users : apache and who else ?...
>
> I noticed in the condor log (the one attached) the following line :
> 03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
> uid 48 does correspond to apache. What surprises me is that the log prints
> out "Switching to user uid: 48". Does that mean that until that moment it
> is executed as some other user ?...
>
>
>  Can you explain more about the library issues you ran into earlier that
> prevented you from running jobs?
>
>
> I don't remember exactly, but I spent quite some time on that one too.
> In that case, jobs were submitted, but stayed idle : if I remember
> correctly, there was some library preventing one of the condor daemons from
> launching/executing correctly. I really don't think this could be
> connected...
>
> Thank you,
> Sorina
>
>
>
>
>  Thanks,
> Mike
>
>
>
>
>
> On Mon, Mar 3, 2014 at 11:50 AM, Sorina Camarasu Pop <
> sorina.pop at creatis.insa-lyon.fr> wrote:
>
>>  Hi Mike,
>>
>> Thank you for your prompt reply.
>>
>> On 03/03/2014 17:27, Michael Grauer wrote:
>>
>> Hi Sorina,
>>
>>  These are tough to track down.
>>
>>
>>  I know, I've spent my afternoon on it...
>>
>>
>>  Can you tell me more about your environment?  Specifically, the 3
>> machines (possibly all the same machine) that are your condor submit,
>> condor manager, and condor execute nodes?
>>
>>
>>  I use the same machine (virtual machine configured as a dual core) for
>> my condor submit, condor manager, and condor execute nodes.
>>
>>
>>  What operating system is your web server, and what version of Condor
>> are you using?
>>
>>
>>  Fedora 18.
>> For Condor, I had compiled the latest version available, but had some
>> library problems preventing me from launching any job. I finally got it
>> to work with the version available via yum install :
>> condor_version
>> $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $
>> $CondorPlatform: X86_64-Fedora_18 $
>>
>>
>>
>>   Is your condor submit node the same as your web server (most likely
>> yes)?
>>
>>
>>  yes.
>>
>>
>>  Are you running your web server as the apache user (most likely yes),
>>
>>
>>  Yes, I even printed out "whoami" to check that it really runs as apache.
>>
>>
>>  and is it your web server that is calling the php code that results in
>> condor_submit_dag (most likely yes, again) ?
>>
>>
>>  Yes.
>> I use the "standard" batchmake config, i.e. the condorSubmitDag function
>> from KWBatchmakeComponent.php
>>
>>
>>  Can you show the permissions and ownership of the temporary work
>> directory where the condor_submit_dag command is executed?
>>
>>
>>  ls -la
>> ...
>> drwxrwxr-x  3 apache apache 4096  3 mars  16:53 45
>> drwxrwxr-x  3 apache apache 4096  3 mars  17:41 46
>>
>> -bash-4.2$ cd 46
>> -bash-4.2$ ls -la
>> total 92
>> drwxrwxr-x  3 apache apache 4096  3 mars  17:41 .
>> drwxr-xr-x 29 apache apache 4096  3 mars  17:40 ..
>> -rw-r--r--  1 apache apache  140  3 mars  17:40 adminconfig.cfg
>> -rw-r--r--  1 apache apache    0  3 mars  17:41 bmGrid.0.error.txt
>> lrwxrwxrwx  1 apache apache   56  3 mars  17:40 challenge.bms -> /var/www/miccai4/modules/challenge/library/challenge.bms
>> ...
>>
>>
>>
>>  When you tested as the apache user, did you do this test from the same
>> temporary work directory that Midas/apache would have tried this from?
>>
>>
>>  Yes, from folder /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/46
>> (drwxrwxr-x , owned by apache)
>>
>>
>>  Is there any more information in the logs or error logs generated by
>> Condor in the temp work directory that you could share?
>>
>>
>>  tail -f challenge.dagjob.condor.sub
>> # Note: default on_exit_remove expression:
>> # ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
>> # attempts to ensure that DAGMan is automatically
>> # requeued by the schedd if it exits abnormally or
>> # is killed (e.g., during a reboot).
>> on_exit_remove  = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
>> copy_to_spool   = False
>> arguments       = "-f -l . -Lockfile challenge.dagjob.lock -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion $CondorVersion:' '7.9.1' 'Aug' '24' '2012' 'PRE-RELEASE-UWCS' '$ -Dagman /usr/bin/condor_dagman"
>> environment     = _CONDOR_DAGMAN_LOG=challenge.dagjob.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
>> queue
>>
>> tail -f challenge.0.dagjob
>> # More information at: http://www.batchmake.org
>> Universe       = vanilla
>> Output         = bmGrid.0.out.txt
>> Error          = bmGrid.0.error.txt
>> Log            = bmGrid.0.log.txt
>> Notification   = NEVER
>> Executable    = /usr/bin/php
>> Arguments     = "'--version'"
>> Queue 1
>>
>> I hope this can help with debugging the problem...
>>
>> Thank you,
>> Sorina
>>
>>
>>  Thanks,
>> Mike
>>
>>
>> On Mon, Mar 3, 2014 at 11:16 AM, Sorina Camarasu Pop <
>> sorina.pop at creatis.insa-lyon.fr> wrote:
>>
>>> Dear Midas users and developers,
>>>
>>> I am trying to configure Midas with the Challenge and BatchMake modules,
>>> but I encounter problems when executing the condor_submit_dag command.
>>>
>>> The error printed by Condor when executing the condor_submit_dag command
>>> using the Batchmake module looks like this : "DC_AUTHENTICATE:
>>> authentication of <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped
>>> user name, which is required for this command (1112 QMGMT_WRITE_CMD), so
>>> aborting."
>>>
>>> Nevertheless, if I execute exactly the same command line as apache in a
>>> console, everything works fine. I do not understand where the
>>> difference comes from.
>>>
>>> Do you know if there's any special configuration for Condor to work with
>>> the Batchmake module ?
>>>
>>> Thank you for your help,
>>> Sorina
>>>
>>> --
>>> Sorina Pop, PhD
>>> CNRS Research Engineer
>>> CREATIS
>>> Tel : +33 (0)4 72 43 72 99
>>>
>>> _______________________________________________
>>> Midas mailing list
>>> Midas at public.kitware.com
>>> http://public.kitware.com/cgi-bin/mailman/listinfo/midas
>>>
>>
>
>
>  --
> Thanks,
> Michael Grauer
> R & D Engineer
> Kitware, Inc.
> 919 969 6990 x322
>