[Midas] Problem with Midas + Batchmake + Condor
Sorina Camarasu Pop
sorina.pop at creatis.insa-lyon.fr
Tue Mar 4 11:44:07 EST 2014
Hi Mike,
Thank you for your reply.
I had also found the first link you sent, but didn't manage to properly
configure Condor with password authentication.
I will follow your advice and contact the Condor mailing list.
Thank you for your help !
Best regards,
Sorina
On 04/03/2014 16:21, Michael Grauer wrote:
> Hi Sorina,
>
> I never had to deal with the SHADOW_ALLOW_UNSAFE_REMOTE_EXEC property,
> but I had been using Condor 7.4.4, and you are using a newer version.
>
> I Googled around a bit on your error message, and saw a couple posts
> that might help.
>
> Also, looking at the top of your attached condor.sched.log file I see
> (but haven't Googled for this)
>
> 3/03/14 18:09:56 authenticate_self_gss: acquiring self credentials
> failed. Please check your Condor configuration file if this is a
> server process. Or the user environment variable if this is a user
> process.
>
>
> And there may be more helpful hints below that message in the file.
> Together these suggest it is some authentication configuration
> problem; looking at these posts and checking the Condor
> reference for this configuration might help.
>
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2013-February/msg00129.shtml
>
> http://comments.gmane.org/gmane.comp.distributed.condor.user/27728
>
>
> You can also email the Condor mailing list; I have done so in the past
> and the community has been quite helpful. Before you do this, I
> suggest you get this down to the simplest possible example, as all of
> the Midas/BatchMake stuff may just add confusion. Can you try a very
> simple example where you take a single PHP file and try to do a
> condor_submit_dag with it in the same way that the challenge module
> does? If you can reproduce the problem with that test case, and then
> do the same thing successfully as the apache user, you will have an
> easier problem to describe and get help with.
>
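A minimal sketch of the kind of stripped-down test case suggested above, assuming a one-node DAG is enough to trigger the error. The submit description is modelled on the challenge.0.dagjob file quoted further down in this thread; the file names (minimal.dag, minimal.sub) and the helper name are hypothetical. The script only writes the files and prints the command; it does not submit anything.

```python
import os
import tempfile

# Submit description modelled on challenge.0.dagjob from this thread.
# The output/log file names are hypothetical.
SUBMIT_TEMPLATE = """\
Universe = vanilla
Output = minimal.out.txt
Error = minimal.error.txt
Log = minimal.log.txt
Notification = NEVER
Executable = /usr/bin/php
Arguments = "'--version'"
Queue 1
"""


def write_minimal_dag(workdir):
    """Write a one-node DAG plus its submit file; return the DAG path."""
    submit_path = os.path.join(workdir, "minimal.sub")
    dag_path = os.path.join(workdir, "minimal.dag")
    with open(submit_path, "w") as f:
        f.write(SUBMIT_TEMPLATE)
    with open(dag_path, "w") as f:
        # DAGMan syntax: JOB <node name> <submit description file>
        f.write("JOB A minimal.sub\n")
    return dag_path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp(prefix="minimal_dag_")
    dag = write_minimal_dag(workdir)
    # Nothing is submitted here. Run the printed command by hand, once from
    # a console as apache and once from PHP through the web server.
    print("cd %s && condor_submit_dag %s" % (workdir, os.path.basename(dag)))
```

If the same two files fail through the web server and succeed from a console, the Midas and BatchMake layers can probably be left out of the report to the Condor list.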
> Let us know how it goes, and good luck!
>
> Thanks,
> Mike
>
>
>
>
> On Tue, Mar 4, 2014 at 9:46 AM, Sorina Camarasu Pop
> <sorina.pop at creatis.insa-lyon.fr> wrote:
>
>
> Hello again,
>
> I've discovered an interesting config option within condor: the
> SHADOW_ALLOW_UNSAFE_REMOTE_EXEC seems to allow shell calls via the
> libc 'system()' function. Is it something any of you have already
> used in order to allow calls with the Midas executor->exec ?
>
> I tried turning it on and using the Condor shadow daemon, but I get
> an error saying "Assertion ERROR on (job_ad_file)" at line 166 in
> file shadow_v61_main.cpp ...
>
> So before going into the trouble of trying to solve this error, I
> was wondering if you know about this "shadow" config option and if
> you can confirm it is necessary.
>
> Best regards,
> Sorina
>
> On 03/03/2014 18:44, Sorina Camarasu Pop wrote:
>>
>>
>> On 03/03/2014 18:06, Michael Grauer wrote:
>>> Where did you see the message: "DC_AUTHENTICATE: authentication
>>> of <xxx.xxx.xxx.xxx:59888> did not result in a valid mapped user
>>> name, which is required for this command (1112 QMGMT_WRITE_CMD),
>>> so aborting."
>>
>> In the condor log : /home/condor/localcondor/log/SchedLog
>>
>>> Was there any other output included there?
>>
>> I copied parts of the log file into the attached file, which contains
>> the output printed both when using BatchMake and when running the
>> Condor commands directly.
>>
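A small sketch for pulling the relevant lines out of a Condor log such as the SchedLog mentioned above. The three markers are the message prefixes quoted in this thread; treating them as the interesting ones is an assumption, and the function names are hypothetical.

```python
import re

# Message prefixes taken from the log excerpts quoted in this thread.
MARKERS = ("DC_AUTHENTICATE", "authenticate_self_gss", "ATTEMPT_ACCESS")
PATTERN = re.compile("|".join(re.escape(m) for m in MARKERS))


def find_auth_lines(lines):
    """Return (line_number, text) pairs for lines that mention a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Apply find_auth_lines to a log file on disk."""
    with open(path, errors="replace") as f:
        return find_auth_lines(f)
```

For example, `scan_log("/home/condor/localcondor/log/SchedLog")` would list the matching lines with their line numbers, which makes it easier to compare a BatchMake submission with a console submission.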
>>> Do you have a "condor" user on your VM?
>>
>> Yes.
>>
>>> When you successfully run jobs by doing "condor_submit_dag"
>>> from the command line as the apache user,
>>
>> apache 22773 29731 0 18:18 ? 00:00:00
>> condor_scheduniv_exec.26.0 -f -l . -Lockfile
>> challenge.dagjob.lock -AutoRescue 1 -DoRescueFrom 0 -Dag
>> challenge.dagjob -CsdVersion $CondorVersion: 7.9.1 Aug 24 2012
>> PRE-RELEASE-UWCS $ -Force -Dagman /bin/condor_dagman
>>
>>
>>> when you watch your job run with ps or top, which user runs the
>>> actual execution process (whatever job batchmake will run for
>>> you) ?
>>
>> When launching it with BatchMake (through the web interface) I cannot
>> find the corresponding Condor process... I only
>> see an httpd process run by apache...
>>
>>>
>>> Can you include your "challenge.bms" script in an email?
>>
>> Of course, here it is attached.
>>
>>>
>>> Can you show me the output of "ls" from a directory where the
>>> submit failed and then again from one where the submit
>>> succeeded, at the end of the job processing run?
>>
>> Failed :
>> ls -la 52/
>> total 56
>> drwxrwxr-x 3 apache apache 4096 3 mars 18:25 .
>> drwxr-xr-x 35 apache apache 4096 3 mars 18:25 ..
>> -rw-r--r-- 1 apache apache 140 3 mars 18:25 adminconfig.cfg
>> -rw-r--r-- 1 apache apache 355 3 mars 18:25 challenge.0.dagjob
>> -rw-r--r-- 1 apache apache 332 3 mars 18:25 challenge.1.dagjob
>> -rw-r--r-- 1 apache apache 564 3 mars 18:25 challenge.2.dagjob
>> -rw-r--r-- 1 apache apache 355 3 mars 18:25 challenge.3.dagjob
>> lrwxrwxrwx 1 apache apache 56 3 mars 18:25 challenge.bms -> /var/www/miccai4/modules/challenge/library/challenge.bms
>> -rw-r--r-- 1 apache apache 1473 3 mars 18:25 challenge.config.bms
>> -rw-r--r-- 1 apache apache 1593 3 mars 18:25 challenge.dagjob
>> -rw-r--r-- 1 apache apache 1043 3 mars 18:25 challenge.dagjob.condor.sub
>> lrwxrwxrwx 1 apache apache 70 3 mars 18:25 challenge_validator_app.bms -> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
>> drwxrwxr-x 4 apache apache 4096 3 mars 18:25 data
>> lrwxrwxrwx 1 apache apache 50 3 mars 18:25 PHP.bmm -> /var/www/miccai4/modules/challenge/library/PHP.bmm
>> -rw-r--r-- 1 apache apache 138 3 mars 18:25 userconfig.cfg
>> lrwxrwxrwx 1 apache apache 67 3 mars 18:25 ValidateImageAveDist.bmm -> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>>
>>
>> OK (created by batchmake and relaunched by hand):
>> -bash-4.2$ ls -la 48
>> total 104
>> drwxrwxr-x 3 apache apache 4096 3 mars 18:18 .
>> drwxr-xr-x 35 apache apache 4096 3 mars 18:25 ..
>> -rw-r--r-- 1 apache apache 140 3 mars 18:09 adminconfig.cfg
>> -rw-r--r-- 1 apache apache 0 3 mars 18:13 bmGrid.0.error.txt
>> -rw-r--r-- 1 apache apache 1968 3 mars 18:18 bmGrid.0.log.txt
>> -rw-r--r-- 1 apache apache 148 3 mars 18:18 bmGrid.0.out.txt
>> -rw-r--r-- 1 apache apache 355 3 mars 18:09 challenge.0.dagjob
>> -rw-r--r-- 1 apache apache 332 3 mars 18:09 challenge.1.dagjob
>> -rw-r--r-- 1 apache apache 564 3 mars 18:09 challenge.2.dagjob
>> -rw-r--r-- 1 apache apache 355 3 mars 18:09 challenge.3.dagjob
>> lrwxrwxrwx 1 apache apache 56 3 mars 18:09 challenge.bms -> /var/www/miccai4/modules/challenge/library/challenge.bms
>> -rw-r--r-- 1 apache apache 1473 3 mars 18:09 challenge.config.bms
>> -rw-r--r-- 1 apache apache 1593 3 mars 18:09 challenge.dagjob
>> -rw-r--r-- 1 apache apache 1042 3 mars 18:18 challenge.dagjob.condor.sub
>> -rw-r--r-- 1 apache apache 610 3 mars 18:18 challenge.dagjob.dagman.log
>> -rw-r--r-- 1 apache apache 16074 3 mars 18:18 challenge.dagjob.dagman.out
>> -rw-r--r-- 1 apache apache 256 3 mars 18:18 challenge.dagjob.dot
>> -rw-r--r-- 1 apache apache 0 3 mars 18:18 challenge.dagjob.lib.err
>> -rw-r--r-- 1 apache apache 29 3 mars 18:18 challenge.dagjob.lib.out
>> -rw-r--r-- 1 apache apache 970 3 mars 18:18 challenge.dagjob.nodes.log
>> -rw-r--r-- 1 apache apache 243 3 mars 18:18 challenge.dagjob.rescue001
>> -rw-r--r-- 1 apache apache 243 3 mars 18:13 challenge.dagjob.rescue001.old
>> lrwxrwxrwx 1 apache apache 70 3 mars 18:09 challenge_validator_app.bms -> /var/www/miccai4/modules/challenge/library/challenge_validator_app.bms
>> drwxrwxr-x 4 apache apache 4096 3 mars 18:09 data
>> lrwxrwxrwx 1 apache apache 50 3 mars 18:09 PHP.bmm -> /var/www/miccai4/modules/challenge/library/PHP.bmm
>> -rw-r--r-- 1 apache apache 138 3 mars 18:09 userconfig.cfg
>> lrwxrwxrwx 1 apache apache 67 3 mars 18:09 ValidateImageAveDist.bmm -> /var/www/miccai4/modules/challenge/library/ValidateImageAveDist.bmm
>>
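The two listings above can also be compared mechanically. This is a rough sketch that only compares file names in two work directories (such as 52 and 48 above); the function name is hypothetical, and sizes and timestamps are ignored.

```python
import os


def compare_listings(failed_dir, ok_dir):
    """Report which file names exist in only one of the two directories."""
    failed = set(os.listdir(failed_dir))
    ok = set(os.listdir(ok_dir))
    return {
        "only_in_ok": sorted(ok - failed),
        "only_in_failed": sorted(failed - ok),
    }
```

Files that appear only in the successful directory (for instance the bmGrid.0.* and challenge.dagjob.dagman.* files in the listing above) show how far the failed submission got before it stopped.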
>>> I'm not sure what is going on, just trying to get more context...
>>>
>>> I recall I ran into a problem where one machine was the
>>> submitter, and there was a midas user there, with uid 100, and a
>>> midas user on another machine (the execution node) with a uid
>>> 200, and I got what sounded like a similar message--I had to
>>> make sure their uids were the same across machines to deal with
>>> permissions across an NFS mount on both machines. This sounds
>>> nothing like your problem, but I wanted to include it in case it
>>> gives you any ideas.
>>
>> Thank you for the hint.
>> My problem seems to be similar, in the sense that it looks like a
>> user problem. However, I cannot find the difference
>> between the 2 potential users: apache and who else?
>>
>> I noticed in the condor log (the one attached) the following line :
>> 03/03/14 18:39:34 ATTEMPT_ACCESS: Switching to user uid: 48 gid: 48.
>> uid 48 does correspond to apache. What surprises me is that the
>> log prints out "Switching to user uid: 48". Does that mean that until
>> that moment it was running as some other user?
>>
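One way to look for the difference between the console case and the web server case is to record the identity and environment of the process in both and compare them. The sketch below is only an assumption about where to look, not a diagnosis; the function names are hypothetical, and the script would have to be launched from PHP in the same way as condor_submit_dag to capture the web server side.

```python
import json
import os
import pwd


def describe_identity():
    """Collect user, group, working directory and environment of this process."""
    uid = os.getuid()
    try:
        user = pwd.getpwuid(uid).pw_name
    except KeyError:
        user = None  # the uid has no passwd entry
    return {
        "uid": uid,
        "euid": os.geteuid(),
        "gid": os.getgid(),
        "egid": os.getegid(),
        "user": user,
        "cwd": os.getcwd(),
        "env": dict(os.environ),
    }


def diff_env(env_a, env_b):
    """Return variables that are missing on one side or differ in value."""
    keys = sorted(set(env_a) | set(env_b))
    return {
        k: (env_a.get(k), env_b.get(k))
        for k in keys
        if env_a.get(k) != env_b.get(k)
    }


if __name__ == "__main__":
    # Save this output once per launch context, then compare the two files.
    print(json.dumps(describe_identity(), indent=2, sort_keys=True))
```

Running it once from a console as apache and once through the web server gives two JSON dumps; `diff_env` applied to their "env" entries lists the variables that differ.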
>>>
>>> Can you explain more about the library issues you ran into
>>> earlier that prevented you from running jobs?
>>
>> I don't remember exactly, but I spent quite some time on that one
>> too.
>> In that case, jobs were submitted, but stayed idle : if I
>> remember correctly, there was some library preventing one of the
>> condor daemons from launching/executing correctly. I really don't
>> think this could be connected...
>>
>> Thank you,
>> Sorina
>>
>>>
>>>
>>>
>>> Thanks,
>>> Mike
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Mar 3, 2014 at 11:50 AM, Sorina Camarasu Pop
>>> <sorina.pop at creatis.insa-lyon.fr> wrote:
>>>
>>> Hi Mike,
>>>
>>> Thank you for your prompt reply.
>>>
>>> On 03/03/2014 17:27, Michael Grauer wrote:
>>>> Hi Sorina,
>>>>
>>>> These are tough to track down.
>>>
>>> I know, I've spent my afternoon on it...
>>>
>>>
>>>> Can you tell me more about your environment? Specifically,
>>>> the 3 machines (possibly all the same machine) that are
>>>> your condor submit, condor manager, and condor execute nodes?
>>>
>>> I use the same machine (virtual machine configured as a dual
>>> core) for my condor submit, condor manager, and condor
>>> execute nodes.
>>>
>>>
>>>> What operating system is your web server, and what version
>>>> of Condor are you using?
>>>
>>> Fedora 18.
>>> For Condor, I had compiled the latest version available, but
>>> had some library problems preventing me from launching any
>>> job. I finally got it working with the version available through
>>> yum install:
>>> condor_version
>>> $CondorVersion: 7.9.1 Aug 24 2012 PRE-RELEASE-UWCS $
>>> $CondorPlatform: X86_64-Fedora_18 $
>>>
>>>
>>>
>>>> Is your condor submit node the same as your web server
>>>> (most likely yes)?
>>>
>>> yes.
>>>
>>>
>>>> Are you running your web server as the apache user (most
>>>> likely yes),
>>>
>>> Yes, I even printed out "whoami" to check that it really
>>> runs as apache.
>>>
>>>
>>>> and is it your web server that is calling the php code that
>>>> results in condor_submit_dag (most likely yes, again) ?
>>>
>>> Yes.
>>> I use the "standard" batchmake config, i.e. the
>>> condorSubmitDag function from KWBatchmakeComponent.php
>>>
>>>
>>>> Can you show the permissions and ownership of the temporary
>>>> work directory where the condor_submit_dag command is executed?
>>>
>>> ls -la
>>> ...
>>> drwxrwxr-x 3 apache apache 4096 3 mars 16:53 45
>>> drwxrwxr-x 3 apache apache 4096 3 mars 17:41 46
>>>
>>> -bash-4.2$ cd 46
>>> -bash-4.2$ ls -la
>>> total 92
>>> drwxrwxr-x 3 apache apache 4096 3 mars 17:41 .
>>> drwxr-xr-x 29 apache apache 4096 3 mars 17:40 ..
>>> -rw-r--r-- 1 apache apache 140 3 mars 17:40 adminconfig.cfg
>>> -rw-r--r-- 1 apache apache 0 3 mars 17:41 bmGrid.0.error.txt
>>> lrwxrwxrwx 1 apache apache 56 3 mars 17:40 challenge.bms -> /var/www/miccai4/modules/challenge/library/challenge.bms
>>> ...
>>>
>>>
>>>
>>>> When you tested as the apache user, did you do this test
>>>> from the same temporary work directory that Midas/apache
>>>> would have tried this from?
>>>
>>> Yes, from folder
>>> /var/www/miccai4/tmp/misc/batchmake/tmp/SSP/7/46 (drwxrwxr-x,
>>> owned by apache)
>>>
>>>
>>>> Is there any more information in the logs or error logs
>>>> generated by Condor in the temp work directory that you
>>>> could share?
>>>
>>> tail -f challenge.dagjob.condor.sub
>>> # Note: default on_exit_remove expression:
>>> # ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
>>> # attempts to ensure that DAGMan is automatically
>>> # requeued by the schedd if it exits abnormally or
>>> # is killed (e.g., during a reboot).
>>> on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
>>> copy_to_spool = False
>>> arguments = "-f -l . -Lockfile challenge.dagjob.lock -AutoRescue 1 -DoRescueFrom 0 -Dag challenge.dagjob -CsdVersion $CondorVersion:' '7.9.1' 'Aug' '24' '2012' 'PRE-RELEASE-UWCS' '$ -Dagman /usr/bin/condor_dagman"
>>> environment = _CONDOR_DAGMAN_LOG=challenge.dagjob.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
>>> queue
>>>
>>> tail -f challenge.0.dagjob
>>> # More information at: http://www.batchmake.org
>>> Universe = vanilla
>>> Output = bmGrid.0.out.txt
>>> Error = bmGrid.0.error.txt
>>> Log = bmGrid.0.log.txt
>>> Notification = NEVER
>>> Executable = /usr/bin/php
>>> Arguments = "'--version'"
>>> Queue 1
>>>
>>> I hope this can help with debugging the problem...
>>>
>>> Thank you,
>>> Sorina
>>>
>>>
>>>> Thanks,
>>>> Mike
>>>>
>>>>
>>>> On Mon, Mar 3, 2014 at 11:16 AM, Sorina Camarasu Pop
>>>> <sorina.pop at creatis.insa-lyon.fr> wrote:
>>>>
>>>> Dear Midas users and developers,
>>>>
>>>> I am trying to configure Midas with the Challenge and
>>>> BatchMake modules, but I encounter problems when
>>>> executing the condor_submit_dag command.
>>>>
>>>> The error printed by Condor when executing the
>>>> condor_submit_dag command using the Batchmake module
>>>> looks like this : "DC_AUTHENTICATE: authentication of
>>>> <xxx.xxx.xxx.xxx:59888> did not result in a valid
>>>> mapped user name, which is required for this command
>>>> (1112 QMGMT_WRITE_CMD), so aborting."
>>>>
>>>> Nevertheless, if I execute exactly the same command
>>>> line as apache in a console, everything works fine. I do
>>>> not understand where the difference comes from.
>>>>
>>>> Do you know if there's any special configuration for
>>>> Condor to work with the Batchmake module ?
>>>>
>>>> Thank you for your help,
>>>> Sorina
>>>>
>>>> --
>>>> Sorina Pop, PhD
>>>> CNRS Research Engineer
>>>> CREATIS
>>>> Tel : +33 (0)4 72 43 72 99
>>>>
>>>> _______________________________________________
>>>> Midas mailing list
>>>> Midas at public.kitware.com
>>>> http://public.kitware.com/cgi-bin/mailman/listinfo/midas
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Michael Grauer
>>> R & D Engineer
>>> Kitware, Inc.
>>> 919 969 6990 x322
>>>
>>>
>>
>>
>
>
--
Sorina Pop, PhD
CNRS Research Engineer
CREATIS
Tel : +33 (0)4 72 43 72 99