Difference between revisions of "Proposals:Condor"

From KitwarePublic

Revision as of 22:05, 20 April 2011

Introduction

Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is complex and flexible, and can execute jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This wiki page documents our working experience using Condor.

Downloading Condor

Different versions of Condor can be downloaded from here. This documentation focuses on our experience installing and configuring Condor version 7.2.0. The official detailed documentation for this version can be found here.

Preparation

As Condor is a flexible system, there are different ways of configuring it in your computing infrastructure. Hence, before starting the installation, make the following important decisions.

  1. What machine will be the central manager?
  2. What machines should be allowed to submit jobs?
  3. Will Condor run as root or not?
  4. Do I have enough disk space for Condor?
  5. Do I need MPI configured?

Condor can be installed as a manager node, an execute node, a submit node, or any combination of these. See The Different Roles a Machine Can Play.

  • Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.
  • Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.
  • Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.

For more information regarding other required preparatory work, refer to the documentation.

Installation

Unix

The official instructions on how to install Condor on Unix can be found here. Below we present some of the tweaks we had to make to get it to work on our Unix machines.

Prerequisites

  • Be sure the server has a hostname and a domain name
 hostname

should return mymachine.mydomain.com (or .org, .edu, etc.). If it only returns mymachine, then your server does not have a fully qualified domain name.

To set the domain name, edit /etc/hosts and add your domain name to the first line. You might see something like

10.171.1.124 mymachine

change this to

10.171.1.124 mymachine.mydomain.com

Also edit /etc/hostname to be

mymachine.mydomain.com

Then reboot so that the hostname changes take effect.

  • Make sure the following packages are installed:
apt-get install mailutils
  • Download the Condor package for your platform.

For example, you could run a command similar to the following to download the desired package:

wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz
  • You should install Condor as root or as a user with equivalent privileges.
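
The hostname prerequisite above can be checked with a small helper. This is only a sketch: the function name is_fqdn is ours, and it assumes that "fully qualified" simply means the name contains a dot with text on both sides, which is what the mymachine versus mymachine.mydomain.com comparison above relies on.

```shell
#!/bin/sh
# is_fqdn NAME: succeed when NAME looks like host.domain,
# i.e. it has at least one dot with text on both sides.
# (Hypothetical helper, not part of Condor.)
is_fqdn() {
  case "$1" in
    ?*.?*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Small demonstration with the example name used in this section.
is_fqdn mymachine.mydomain.com && echo "fully qualified"
```

On a real server you would probably call it as is_fqdn "$(hostname)" and fix /etc/hosts and /etc/hostname when it fails.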

Configuring a Condor Manager in Unix

  • Make sure the Condor archive is in your home directory (for example /home/kitware), then untar it.
cd ~
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz
cd ./condor-7.2.X
  • If not yet done, create a condor user
adduser condor
  • Run the installation script condor_install
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor

After running the installation script, you should get the following output:

Installing Condor from /root/condor-7.2.X to /root/condor

Condor has been installed into:
    /root/condor

Configured condor using these configuration files:
  global: /root/condor/etc/condor_config
  local:  /home/condor/localcondor/condor_config.local
Created scripts which can be sourced by users to setup their
Condor environment variables.  These are:
   sh: /root/condor/condor.sh
  csh: /root/condor/condor.csh
  • Switch to the directory where condor is now installed
cd /root/condor
  • Edit /etc/environment and update the PATH variable to include the directories /root/condor/bin and /root/condor/sbin
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
  • Add the following line
CONDOR_CONFIG="/root/condor/etc/condor_config"
  • Save the file and apply the change by running
source /etc/environment
  • Make sure CONDOR_CONFIG and PATH are set correctly
root@rigel:~$ echo $PATH
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

root@rigel:~$ echo $CONDOR_CONFIG
/root/condor/etc/condor_config
  • You can now log out and log in, or even restart the machine, and you should be able to check that the CONDOR_CONFIG and PATH environment variables are still set.
  • Edit the Condor manager configuration file and update the lines as shown below:
cd ~/condor
vi ./etc/condor_config
RELEASE_DIR              = /root/condor
LOCAL_DIR                = /home/condor/localcondor
CONDOR_ADMIN             = email@website.com
UID_DOMAIN               = website.com
FILESYSTEM_DOMAIN        = website.com
HOSTALLOW_READ           = *.website.com
HOSTALLOW_WRITE          = *.website.com
HOSTALLOW_CONFIG         = $(CONDOR_HOST)
  • If you have MIDAS integration, create a link to /root/condor/etc/condor_config in /home/condor so that MIDAS can run Condor commands
cd /home/condor
ln -s /root/condor/etc/condor_config condor_config
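
The PATH and CONDOR_CONFIG checks from the steps above can be scripted. The following is a minimal sketch; path_contains and check_condor_env are helper names we made up for illustration, and the directories are passed in as arguments rather than hard-coded.

```shell
#!/bin/sh
# path_contains PATHSTRING DIR: succeed when DIR is one of the
# colon-separated entries of PATHSTRING. (Hypothetical helper.)
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# check_condor_env BINDIR SBINDIR: print one line per problem found
# in the current environment, and nothing when everything looks right.
check_condor_env() {
  path_contains "$PATH" "$1" || echo "PATH is missing $1"
  path_contains "$PATH" "$2" || echo "PATH is missing $2"
  [ -n "${CONDOR_CONFIG:-}" ] || echo "CONDOR_CONFIG is not set"
}
```

With the layout used on this page, the call would be check_condor_env /root/condor/bin /root/condor/sbin; no output means both directories are on PATH and CONDOR_CONFIG is set.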

Configuring an Executer/Submitter in Unix

The files that allow the server to also be used as a Condor submitter/executer were updated automatically when the installation script condor_install ran. Nevertheless, you still need to update its local configuration file.

  • Edit the Condor node's condor_config.local and update the line as shown below:
vi /home/condor/localcondor/condor_config.local
CONDOR_ADMIN        = email@website.com

If the installation went well, the lines for UID_DOMAIN and FILESYSTEM_DOMAIN should already be set to website.com.

Windows

The official documentation on how to install Condor on Windows can be found here. Below we describe our experience installing Condor on Windows 7.

  1. Download the Windows install MSI and run it, installing to "C:\condor".
  2. Accept the license agreement.
  3. Decide whether you are installing a central controller or a submit/execute node.
    1. If installing a central controller, select "create a new central pool" and set the name of the pool.
    2. Otherwise select "Join an existing Pool" and enter the hostname of the central manager (full address).
  4. Decide whether the machine should be a submitter node, and select the appropriate option.
  5. Decide when Condor should run jobs, if the machine will be an executor.
    1. Decide what happens to jobs when the machine stops being idle.
  6. For the accounting domain, enter your domain (e.g. yourdomaininternal.com).
  7. For Email settings, I skipped this by clicking next.
  8. For Java settings, I skipped this by clicking next, as we weren't using Java.

Set the following settings when prompted

Host Permission Settings:
 hosts with read: *
 hosts with write: *
 hosts with administrator access: $(FULL_HOSTNAME)
 enable vm universe: no
 enable hdfs support: no

When asked if you want a custom install or install, choose install. This will install condor to C:\condor.


The install will ask you to reboot your machine. If you want to access the condor command line programs from anywhere on your system, add C:\condor\bin to your system's PATH environment variable.


When running Condor commands, start a Cygwin or cmd prompt with elevated/administrator privileges.


After the install, I could see the condor system by running

 condor_status

and this helped me fix some problems. At first, condor_status gave me "unable to resolve COLLECTOR_HOST". There are many helpful log files to review in C:\condor\log. The first of these I looked at was .master_address. The IP address listed there was incorrect: my machine has multiple IP addresses, and I needed to specify the IP address of my wired connection, which is on the same subnet as my COLLECTOR_HOST. Your IP address might incorrectly be a localhost loopback address like 127.0.0.1, or perhaps just an IP that you would not want.


I shut down Condor, right clicked on C:\condor in Windows Explorer, turned off "read only" and set permissions to allow for writing. Then I edited the file C:\condor\condor_config.local, which started out empty, so that it could pick up some replacement values (some of them didn't seem to be set properly during the install). These values are:

 NETWORK_INTERFACE = <IP address>
 UID_DOMAIN = *.yourdomaininternal.com
 FILESYSTEM_DOMAIN = *.yourdomaininternal.com
 COLLECTOR_NAME = PoolName
 ALLOW_READ = *
 ALLOW_WRITE = *
 # Choose one of the following:
 #
 #  For a submit/execute node:
    DAEMON_LIST = MASTER, SCHEDD, STARTD
 #  For a central collector host and submit/execute node:
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, KBDD
 TRUST_UID_DOMAIN = True
 START = True
 

You may want to add DEFAULT_DOMAIN_NAME = yourdomaininternal.com if your machine comes up without a domain name in Condor.


After saving these values, I restarted Condor. If you run condor_status and see that the Windows machine has State=Owner rather than Unclaimed, be sure that you have added START = True. This may not be the best configuration for a Windows workstation that is in use, though. Some additional configuration is probably needed to make sure a Condor job doesn't use the machine while a human user is working on it.


At this point I was able to get one of the Condor Windows examples working, but with a bit of tweaking for Windows 7.

Here is the batch file contents for the actual job (printname.bat)

 @echo off
 echo Here is the output from "net USER" :
 net USER


And here is the printname.sub Condor submit file:

 universe = vanilla
 environment = path=c:\Windows\system32
 executable = printname.bat
 output = printname.out
 error = printname.err
 log = printname.log
 queue

which I submitted with

 condor_submit printname.sub


Useful Condor Commands on Windows

To run these commands, get a command prompt or Cygwin terminal, right click the icon to start it up, and click "run with elevated privileges" or "run as administrator".

condor_master runs as a service on Windows, which controls the other daemons.

To stop condor

 net stop condor

To start condor

 net start condor

Before you can submit a job to Condor on Windows for the first time, you need to store your user's credentials (password). Run

 condor_store_cred add 

then enter your password.

Running Condor

The official user's manual on how to perform distributed computing in Condor is here


  • Run the Condor manager
condor_master
  • Assuming that during installation you set up the machine type as manager,execute,submit (the default), run the following command
ps -e | egrep condor_
  • You should get something similar to:
1063 ?        00:00:00 condor_master
1064 ?        00:00:00 condor_collecto
1065 ?        00:00:00 condor_negotiat
1066 ?        00:00:00 condor_schedd
1067 ?        00:00:00 condor_startd
1068 ?        00:00:00 condor_procd
  • If you run the command ps -e | egrep condor_ just after you started condor, you may also see the following line
1077 ?        00:00:00 condor_starter


  • Check the status
kitware@rigel:~$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
 
slot1@rigel        LINUX      X86_64 Unclaimed Idle     0.010  1006  0+00:10:04
slot2@rigel        LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:10:05

                    Total Owner Claimed Unclaimed Matched Preempting Backfill

       X86_64/LINUX     2     0       0         2       0          0        0

              Total     2     0       0         2       0          0        0
  • Set up Condor to start automatically at boot
cp /root/condor/etc/example/condor.boot /etc/init.d/
  • Update MASTER parameter in condor.boot to match your current setup
vi /etc/init.d/condor.boot

MASTER=/root/condor/sbin/condor_master
  • Add the condor.boot service to all runlevels
kitware@rigel:~$ update-rc.d condor.boot defaults

/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot
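
The daemon check from earlier in this section (ps -e | egrep condor_) can be wrapped in a small filter that reports what is missing. This is a sketch under a few assumptions: missing_daemons is a name we invented, the expected list matches a manager,execute,submit machine as shown above, and the names are matched in the truncated 15-character form that ps printed in our listing (condor_collecto, condor_negotiat).

```shell
#!/bin/sh
# missing_daemons: read a process listing on stdin and print the name of
# each expected Condor daemon that does not appear in it.
# (Hypothetical helper; names are truncated the way ps -e showed them.)
missing_daemons() {
  listing=$(cat)
  for d in condor_master condor_collecto condor_negotiat condor_schedd condor_startd; do
    case "$listing" in
      *"$d"*) ;;            # daemon found, nothing to report
      *)      echo "$d" ;;  # daemon absent from the listing
    esac
  done
}
```

Typical use would be ps -e | missing_daemons, where empty output means all five daemons were found.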

A simple example demonstrating the use of Condor

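 #include <unistd.h>
 #include <stdio.h>
 int main( int argc, char** argv )
 {
   /* require the message argument, otherwise argv[1] is NULL */
   if( argc < 2 )
   {
     fprintf( stderr, "usage: %s <message>\n", argv[0] );
     return 1;
   }
   printf( "%s\n", argv[1] );
   fflush( stdout );
   sleep( 30 );
   return 0;
 }

This executable will print the command line argument it is given, wait 30 seconds, then exit.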

Save this file as foo.c, then compile it with static linking:

 gcc foo.c -o foo --static


Create a condor job description, saving the file as condorjob:

 universe   = vanilla
 executable = foo 
 should_transfer_files = YES
 when_to_transfer_output = ON_EXIT
 log        = condorjob.log
 error      = condorjob.err
 output     = condorjob.out
 arguments  = "helloworld"
 Queue

then submit the job to condor:

 condor_submit condorjob

After this job finishes, you should have three files in the submission directory:

condorjob.err (contains the standard error, empty in this case)

condorjob.out (should contain standard output, in this case "helloworld")

condorjob.log (should contain info about the execution of the job, such as the machine that submitted the job and the machine that executed the job)


If you want to test this job on multiple slots (say 2 at once so you can see how Condor will execute the job on multiple execute resources), you can change the condorjob file to be like this:

 universe   = vanilla
 executable = foo 
 should_transfer_files = YES
 when_to_transfer_output = ON_EXIT
 log        = condorjob1.log
 error      = condorjob1.err
 output     = condorjob1.out
 arguments  = "helloworld1"
 Queue
 log        = condorjob2.log
 error      = condorjob2.err
 output     = condorjob2.out
 arguments  = "helloworld2"
 Queue
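
Repeating the log, error, output and arguments lines for every job gets tedious beyond two or three jobs. As a sketch of an alternative, the script below writes a submit file that uses Condor's $(Process) macro together with a queue count. The helper name write_submit_file is ours, and note that $(Process) starts counting at 0, so the files would be named condorjob0.*, condorjob1.* and so on rather than condorjob1.* and condorjob2.*.

```shell
#!/bin/sh
# write_submit_file COUNT FILE: write a submit description to FILE that
# queues COUNT copies of foo, one set of log/err/out files per job.
# (Hypothetical helper; \$ keeps the shell from expanding $(Process).)
write_submit_file() {
  count=$1
  out=$2
  cat > "$out" <<EOF
universe   = vanilla
executable = foo
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
log        = condorjob\$(Process).log
error      = condorjob\$(Process).err
output     = condorjob\$(Process).out
arguments  = "helloworld\$(Process)"
queue $count
EOF
}
```

For example, write_submit_file 2 condorjob followed by condor_submit condorjob should queue the same two jobs as the hand-written file above.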

We had a case where we had 6 slots, 2 were 32 bit with Arch=INTEL, 4 were 64 bit with Arch=X86_64 (but we were unaware of the bit difference at first). We ran 6 jobs and then were wondering why they would only execute on the submitting machine. So we changed the condorjob file to specify a certain architecture by including

 Requirements = Arch == "INTEL"

and submitted this from the X86_64 machine. This told Condor to execute only on machines with an Arch of INTEL, so it did not attempt to execute on the X86_64 submitting machine. We then saw an error in condorjob1.log saying

 Exec format error

and we realized we had tried to run an executable compiled for the wrong architecture. I've included this story in case it helps with debugging.
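
To avoid this kind of surprise, it can help to work out which Arch value a build machine corresponds to before submitting. The sketch below maps the output of uname -m to the two Arch values we saw in our pool. The function name condor_arch_for and the exact set of uname strings are our assumptions; other platforms would need their own entries.

```shell
#!/bin/sh
# condor_arch_for MACHINE: map a `uname -m` string to the Condor Arch
# value seen in our pool (INTEL for 32-bit x86, X86_64 for 64-bit).
# (Hypothetical helper; anything else is reported as UNKNOWN.)
condor_arch_for() {
  case "$1" in
    x86_64|amd64)        echo "X86_64" ;;
    i386|i486|i586|i686) echo "INTEL" ;;
    *)                   echo "UNKNOWN" ;;
  esac
}

# Report the value for the machine this script runs on.
echo "Arch for this machine: $(condor_arch_for "$(uname -m)")"
```

If the value printed on the build machine differs from the Arch in your Requirements line, the executable was most likely compiled for the wrong architecture.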

Additional Information

Troubleshooting Condor

Our experience with Condor involved a lot of errors that we had to systematically understand and overcome. Here are some lessons from our experience.

  • Be sure that your executable is statically linked.
  • For Unix submission/execution, we found that we needed to run jobs as a user that exists on all machines on the Condor grid, because we do not have shared Unix users on our network. In our case this user was condor.
  • When building BatchMake, you need to build with grid support on.

Condor supplies a number of utility programs and log files. These are extremely helpful in understanding and correcting problems. Our setups have Condor log files in /home/condor/localcondor/log and C:\condor\log.


condor_status

Allows you to check the status of the Condor Pool. This is the first command line program you should get working, as it will help you debug other problems.

condor_q

Shows you which of the jobs you have submitted to Condor are active. It will give you a cluster and process ID for each one.

condor_q -analyze CID.PID
condor_q -better-analyze CID.PID

When given the cluster ID and process ID, these will tell you how many execution machines matched each of your requirements for the job.


condor_config_val <CONDOR_VARIABLE>

Will tell you the value of that CONDOR_VARIABLE for your condor setup. This can help maintain your sanity.

condor_rm CID.PID

Will remove the job with cluster ID CID and process ID PID from your Condor pool. Useful for killing Held or Idle jobs.

The remaining entries describe the Condor daemons. For more details on these daemons, see section "3.1.2 The Condor Daemons" in the Condor Manual.

condor_master

On any machine, no matter its configuration, there will be a condor_master daemon. This process will start and stop all other condor daemons. condor_master will write a MasterLog and have a .master_address file. Be sure that the .master_address contains the correct IP address. If the correct IP address doesn't appear, set the correct value in the parameter

 NETWORK_INTERFACE = <desired IP>

in your condor_config.local file. This must be an interface the machine actually has, but condor may have picked a different IP than you would have wanted, perhaps the local loopback address 127.0.0.1 or a wireless instead of ethernet adapter.

condor_startd

This daemon runs on execution nodes, and represents a machine ready to do work for Condor. Its log file describes its requirements. This daemon starts up the condor_starter daemon.

condor_starter

This daemon runs on execution nodes, and is responsible for starting up actual execution processes, and logging their details.

condor_schedd

This daemon runs on submit nodes, and represents data about submitted jobs. It tracks the queueing of jobs and tries to get resources for all of its jobs to be run. When a submitted job runs, this daemon spawns a condor_shadow daemon.

condor_shadow

This daemon runs on the submission machine while a job is executing. It takes care of system calls that need to be executed on the submitting machine for a process. There is one condor_shadow process for each executing process submitted from a machine, so on a machine with a large number of submitted processes, the number of shadow daemons that memory or other resources can support could be a limitation.

condor_collector

This daemon runs on the Pool Collector machine, and is responsible for keeping track of resources within the Pool. All nodes in the pool let this daemon on the Pool Collector machine know that they exist, what services they support, and what requirements they have.

condor_negotiator

This daemon typically runs on the Pool Collector machine, and negotiates between submitted jobs and executing nodes to match a job with an execution. Log files of interest include NegotiatorLog and MatchLog

condor_kbdd

This daemon is used to detect user activity on an execute node, so that Condor knows whether to allow execution of a job or to disallow it because a human user is currently engaged in some task.

The right processor architecture

The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz. IA64 is the 64-bit Intel Itanium architecture; it does not cover all 64-bit Intel processors, and in particular not X86_64.

When we tried to run condor_master, the shell returned the error message: cannot execute binary file

Using the program readelf, it's possible to extract the header of an executable and understand if a given executable could run on a given platform.

kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel IA-64
  Version:                           0x1
  Entry point address:               0x40000000000bf3e0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          9382744 (bytes into file)
  Flags:                             0x10, 64-bit
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         32
  Section header string table index: 31
kitware@rigel:~$ readelf -h /bin/ls
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4023c0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          104384 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         28
  Section header string table index: 27

kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master
ELF Header:
 Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
 Class:                             ELF64
 Data:                              2's complement, little endian
 Version:                           1 (current)
 OS/ABI:                            UNIX - System V
 ABI Version:                       0
 Type:                              EXEC (Executable file)
 Machine:                           Advanced Micro Devices X86-64
 Version:                           0x1
 Entry point address:               0x4b9450
 Start of program headers:          64 (bytes into file)
 Start of section headers:          4553256 (bytes into file)
 Flags:                             0x0
 Size of this header:               64 (bytes)
 Size of program headers:           56 (bytes)
 Number of program headers:         8
 Size of section headers:           64 (bytes)
 Number of section headers:         31
 Section header string table index: 30

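Comparing the outputs, the Machine field of the first binary is Intel IA-64, whereas /bin/ls and the working condor_master are both built for X86-64. The IA64 package was therefore the wrong one for this machine.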
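
Since only the Machine line matters for this comparison, it can be pulled out of the readelf output directly. The following is a minimal sketch; machine_from_header is a name we chose, and it assumes the header layout shown in the listings above.

```shell
#!/bin/sh
# machine_from_header: read `readelf -h` output on stdin and print only
# the value of the Machine field. (Hypothetical helper.)
machine_from_header() {
  sed -n 's/^ *Machine: *//p'
}
```

For example, readelf -h /bin/ls | machine_from_header and readelf -h sbin/condor_master | machine_from_header should print the same string when the Condor package matches the machine.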



Links

  • Detailed Condor documentation is also available on the website here

MPICH2 on Windows

Here we record our experience using MPICH2 on Windows, working towards using MPICH2+Condor in (ideally) a mixed Windows and Linux environment or else in a homogeneous environment.

This will have to be cleaned up to make more sense as we gain more experience.

MPICH2 Environment on Windows 7

First install MPICH2. I installed version 1.3.2p1, the Windows 32 bit binary. Then add the location of the MPICH2\bin directory to your system path. Add mpiexec.exe and smpd.exe to the list of exceptions in the Windows firewall.

I created the same user/password combination (with administrative rights) on two different Windows 7 machines. All work was done logged in as that user.

I reset the smpd passphrase

 smpd -remove

then set the same smpd passphrase on both machines

 smpd -install -phrase mypassphrase

and could then check the status using

 smpd -status


I had to register my user credentials on both machines by

 mpiexec -register

then accepted the user it suggested by hitting enter, then entered my user's password.

You can check which user you are with

 mpiexec -whoami

And now you should be able to validate with

 mpiexec -validate

It will ask you for an authentication password for smpd; this is the mypassphrase you entered above. If all is correct at this point, you will get a result of SUCCESS.

To test this, you can run the cpi.exe example application in the MPICH2\examples directory (I ran it with 4 processors since my machine has 4 cores):

 mpiexec -n 4 full_path_to\cpi.exe

You can also test that both machines are participating in a computation by specifying the IPs of the two machines, and how many processes each machine should run, and the executable hostname, which should return the two different hostnames of the two machines.

 mpiexec -hosts 2 ip_1 #_cores_1 ip_2 #_cores_2 hostname
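
The -hosts argument list is easy to get wrong by hand, because the host count has to match the number of host/process pairs that follow. As a sketch for anyone working from a Cygwin terminal, the helper below builds that argument string from host:count pairs. The name mpiexec_hosts_args and the host:count input format are our own invention, and it assumes IPv4 addresses or hostnames without colons.

```shell
#!/bin/sh
# mpiexec_hosts_args HOST:COUNT...: print the "-hosts N host1 n1 host2 n2"
# argument string for the given host/process-count pairs.
# (Hypothetical helper; hosts must not contain a colon.)
mpiexec_hosts_args() {
  args="-hosts $#"
  for pair in "$@"; do
    host=${pair%%:*}    # text before the colon
    procs=${pair##*:}   # text after the colon
    args="$args $host $procs"
  done
  echo "$args"
}
```

The resulting string is placed between mpiexec and the executable name, in the same position as in the command above.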

Creating an MPI program on Windows

I used MS Visual Studio Express 2008 (MSVSE08) to build a C++ executable.

Here is a very simple working C++ example program:

 #include "stdafx.h"
 #include <iostream>
 #include "mpi.h"
 using namespace std;
 //
 int main(int argc, char* argv[]) 
   {
   // initialize the MPI world
   MPI::Init(argc,argv);
   //
   // get this process's rank
   int rank = MPI::COMM_WORLD.Get_rank();
   //
   // get the total number of processes in the computation
   int size = MPI::COMM_WORLD.Get_size();
   //
   // print out where this process ranks in the total
   std::cout << "I am " << rank << " out of " << size << std::endl;
   //
   // Finalize the MPI world
   MPI::Finalize();
   return 0;
   }

To compile this in MSVSE08, you must right click on the project file (not the solution file), then click properties, and make the following additions:

  • Under C/C++ menu, Additional Include Directories property, add the full path to the MPICH2\include directory
  • Under Linker/General menu, Additional Library Directories, add the full path to the MPICH2\lib directory
  • Under Linker/Input menu, Additional Dependencies, add mpi.lib and cxx.lib

You can test your application using:

 mpiexec -n 4 full_path_to\your.exe