Proposals:Condor

Introduction

Condor is an open source distributed computing software framework. It can be used to manage workload on a dedicated cluster of computers, and/or to farm out work to idle desktop computers. Condor is a cross-platform system that runs on Unix and Windows operating systems. It is a complex and flexible system that can run jobs in serial and parallel mode. For parallel jobs, it supports the MPI standard. This Wiki page documents our working experience using Condor.

Downloading Condor

Different versions of Condor can be downloaded from here. This documentation focuses on our experience installing and configuring Condor Version 7.2.0. Detailed documentation for this version can be found here.

Preparation

Because Condor is a flexible system, it can be configured in several ways within your computing infrastructure. Before starting the installation, answer the following questions.

  1. What machine will be the central manager?
  2. What machines should be allowed to submit jobs?
  3. Will Condor run as root or not?
  4. Do I have enough disk space for Condor?
  5. Do I need MPI configured?

Condor can be installed as a manager node, an execute node, a submit node, or any combination of these. See The Different Roles a Machine Can Play.

  • Manager: There can be only one central manager for your pool. This machine is the collector of information, and the negotiator between resources and resource requests.
  • Execute: Any machine in your pool (including the Central Manager) can be configured to execute Condor jobs.
  • Submit: Any machine in your pool (including the Central Manager) can be configured to allow Condor jobs to be submitted.

Installation

Unix

The official instructions on how to install Condor on Unix can be found here. Below we present some of the tweaks we had to make to get it to work on our Unix machines.

Prerequisites

  • Be sure the server has a hostname and a domain name
 hostname

This should return mymachine.mydomain.com (or .org, .edu, etc.). If it only returns mymachine, your server does not have a fully qualified domain name.

To set the domain name, edit /etc/hosts and add your domain name to the first line. You might see something like

10.171.1.124 mymachine

change this to

10.171.1.124 mymachine.mydomain.com

Also edit /etc/hostname to be

mymachine.mydomain.com

Then reboot so that the hostname changes take effect.
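
The check described above can be scripted. The following is a minimal sketch, assuming that "fully qualified" simply means the name contains at least one dot with text on both sides; is_fqdn is a hypothetical helper name and is not part of Condor.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): succeed only if the name
# contains at least one dot and has no empty label around a dot.
is_fqdn() {
    case "$1" in
        .*|*.|*..*) return 1 ;;  # leading dot, trailing dot, or empty label
        *.*)        return 0 ;;  # at least one dot: treat as fully qualified
        *)          return 1 ;;  # bare short name (or empty string)
    esac
}

# Demonstration with the example names used on this page.
for name in mymachine.mydomain.com mymachine; do
    if is_fqdn "$name"; then
        echo "$name: looks fully qualified"
    else
        echo "$name: short name only"
    fi
done
```

On a real server, the same function can be called as is_fqdn "$(hostname)" after the reboot.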

  • Make sure the following packages are installed:
apt-get install mailutils

  • Download the desired Condor package, for example with a command similar to:

wget http://parrot.cs.wisc.edu//symlink/20090223121502/7/7.2/7.2.1/fec3779ab6d2d556027f6ae4baffc0d6/condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz
  • You should install Condor as root or with a user having equivalent privileges

Configuring a Condor Manager in Unix

  • Make sure the condor archive is in your home directory (/home/kitware), then untar it.
cd ~
tar -xzvf condor-7.2.X-linux-x86_64-rhel5-dynamic.tar.gz
cd ./condor-7.2.X
  • If not yet done, create a condor user
adduser condor
  • Run the installation script condor_install
./condor_install --install=. --prefix=/root/condor --local-dir=/home/condor/localcondor

After running the installation script, you should get the following output:

Installing Condor from /root/condor-7.2.X to /root/condor

Condor has been installed into:
    /root/condor

Configured condor using these configuration files:
  global: /root/condor/etc/condor_config
  local:  /home/condor/localcondor/condor_config.local
Created scripts which can be sourced by users to setup their
Condor environment variables.  These are:
   sh: /root/condor/condor.sh
  csh: /root/condor/condor.csh
  • Switch to the directory where condor is now installed
cd /root/condor
  • Edit /etc/environment and update the PATH variable to include the directories /root/condor/bin and /root/condor/sbin
PATH="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
  • Add the following line
CONDOR_CONFIG="/root/condor/etc/condor_config"
  • Save file and apply the change by running
source /etc/environment
  • Make sure CONDOR_CONFIG and PATH are set correctly
root@rigel:~$ echo $PATH
/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

root@rigel:~$ echo $CONDOR_CONFIG
/root/condor/etc/condor_config
  • You can now log out and log in again, or even restart the machine, and check that the CONDOR_CONFIG and PATH environment variables are still set.
  • Edit the Condor manager configuration file and update the lines shown below:
cd ~/condor
vi ./etc/condor_config
RELEASE_DIR              = /root/condor
LOCAL_DIR                = /home/condor/localcondor
CONDOR_ADMIN             = email@website.com
UID_DOMAIN               = website.com
FILESYSTEM_DOMAIN        = website.com
HOSTALLOW_READ           = *.website.com
HOSTALLOW_WRITE          = *.website.com
HOSTALLOW_CONFIG         = $(CONDOR_HOST)
  • To allow Midas to run Condor commands, create a link to /root/condor/etc/condor_config in /home/condor
cd /home/condor
ln -s /root/condor/etc/condor_config condor_config
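
The PATH verification step earlier in this section can also be done without reading the value by eye. The sketch below is illustrative only: path_has_dir is a hypothetical helper name, and the sample value is the PATH line shown earlier on this page.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $2 is an entry of the
# colon-separated list $1 (exact entry match, not a substring match).
path_has_dir() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Demonstration with the PATH value shown earlier; on a real machine,
# pass "$PATH" as the first argument instead.
sample_path="/root/condor/bin:/root/condor/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
for dir in /root/condor/bin /root/condor/sbin; do
    if path_has_dir "$sample_path" "$dir"; then
        echo "$dir: found"
    else
        echo "$dir: MISSING"
    fi
done
```

Wrapping the list in colons before matching keeps an entry such as /root/condor/bin from being confused with a longer directory name that merely starts with it.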


Configuring an Executer/Submitter in Unix

The files that allow the server to also act as a Condor submitter/executer were updated automatically when the installation script condor_install ran. You still need to update the local configuration file.

  • Edit the node's condor_config.local (the local configuration file reported by condor_install) and update the line shown below:
vi /home/condor/localcondor/condor_config.local
CONDOR_ADMIN        = email@website.com

If the installation went well, UID_DOMAIN and FILESYSTEM_DOMAIN should already be set to website.com.

Windows

The official documentation on how to install Condor on Windows can be found here.

Running Condor

The official user's manual on how to perform distributed computing can be found here.


  • Run the condor manager
condor_master
  • Assuming you set the type to manager,execute,submit (the default) during installation, run the following command
ps -e | egrep condor_
  • You should get something similar to:
1063 ?        00:00:00 condor_master
1064 ?        00:00:00 condor_collecto
1065 ?        00:00:00 condor_negotiat
1066 ?        00:00:00 condor_schedd
1067 ?        00:00:00 condor_startd
1068 ?        00:00:00 condor_procd
  • If you run the command ps -e | egrep condor_ just after you started condor, you may also see the following line
1077 ?        00:00:00 condor_starter
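
The visual check of the process list can be automated. This is a minimal sketch, assuming the set of daemons shown in the listing above is the one you expect; missing_daemons is a hypothetical helper name, and the daemon names are written in the shortened form that appears in the listing (for example condor_collecto).

```shell
#!/bin/sh
# Hypothetical helper: read `ps -e` style output on stdin and print the
# name of each expected Condor daemon that does not appear in it.
# Names are matched in the shortened form shown in the listing above.
missing_daemons() {
    ps_output=$(cat)
    for daemon in condor_master condor_collecto condor_negotiat \
                  condor_schedd condor_startd; do
        printf '%s\n' "$ps_output" | grep -q "$daemon" || echo "$daemon"
    done
}

# Demonstration on sample text in which condor_startd is absent.
sample='1063 ?        00:00:00 condor_master
1064 ?        00:00:00 condor_collecto
1065 ?        00:00:00 condor_negotiat
1066 ?        00:00:00 condor_schedd'
printf '%s\n' "$sample" | missing_daemons
```

On a live machine the function would be fed real data with ps -e | missing_daemons; no output means every expected daemon was found.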


  • Check the status
kitware@rigel:~$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
 
slot1@rigel        LINUX      X86_64 Unclaimed Idle     0.010  1006  0+00:10:04
slot2@rigel        LINUX      X86_64 Unclaimed Idle     0.000  1006  0+00:10:05

                    Total Owner Claimed Unclaimed Matched Preempting Backfill

       X86_64/LINUX     2     0       0         2       0          0        0

              Total     2     0       0         2       0          0        0
  • Set up Condor to start automatically at boot
cp /root/condor/etc/example/condor.boot /etc/init.d/
  • Update MASTER parameter in condor.boot to match your current setup
vi /etc/init.d/condor.boot

MASTER=/root/condor/sbin/condor_master
  • Add the condor.boot service to all runlevels
kitware@rigel:~$ update-rc.d condor.boot defaults

/etc/rc0.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc1.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc6.d/K20condor.boot -> ../init.d/condor.boot
/etc/rc2.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc3.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc4.d/S20condor.boot -> ../init.d/condor.boot
/etc/rc5.d/S20condor.boot -> ../init.d/condor.boot

Additional Information

  • The right processor architecture

The initial installation was done with the package condor-7.2.0-linux-ia64-rhel3-dynamic.tar.gz. IA64 refers to the 64-bit Intel Itanium architecture only; it does not cover 64-bit Intel processors in general.

When we tried to run condor_master, the shell returned the following error message: cannot execute binary file

The program readelf can extract the header of an executable, which shows whether that executable can run on a given platform.

kitware@rigel:~/$ readelf -h ~/condor-7.2.0_IA64/sbin/condor_master
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel IA-64
  Version:                           0x1
  Entry point address:               0x40000000000bf3e0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          9382744 (bytes into file)
  Flags:                             0x10, 64-bit
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         32
  Section header string table index: 31
kitware@rigel:~$ readelf -h /bin/ls
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4023c0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          104384 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         28
  Section header string table index: 27

kitware@rigel:~$ readelf -h ./condor-7.2.0/sbin/condor_master
ELF Header:
 Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
 Class:                             ELF64
 Data:                              2's complement, little endian
 Version:                           1 (current)
 OS/ABI:                            UNIX - System V
 ABI Version:                       0
 Type:                              EXEC (Executable file)
 Machine:                           Advanced Micro Devices X86-64
 Version:                           0x1
 Entry point address:               0x4b9450
 Start of program headers:          64 (bytes into file)
 Start of section headers:          4553256 (bytes into file)
 Flags:                             0x0
 Size of this header:               64 (bytes)
 Size of program headers:           56 (bytes)
 Number of program headers:         8
 Size of section headers:           64 (bytes)
 Number of section headers:         31
 Section header string table index: 30

Comparing the Machine field of the three outputs shows that the Intel IA-64 build does not match the machine's own binaries (Advanced Micro Devices X86-64), whereas the second Condor build does.
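
The comparison can be reduced to the single field that matters. The following is a rough sketch rather than a complete compatibility test: elf_machine is a hypothetical helper name, and it only extracts the Machine line from readelf -h output.

```shell
#!/bin/sh
# Hypothetical helper: read `readelf -h` output on stdin and print the
# value of the "Machine:" field with surrounding whitespace removed.
elf_machine() {
    sed -n 's/^[[:space:]]*Machine:[[:space:]]*//p' | sed 's/[[:space:]]*$//'
}

# Demonstration on two header lines copied from the listings above.
printf '  Machine:                           Intel IA-64\n' | elf_machine
printf '  Machine:                           Advanced Micro Devices X86-64\n' | elf_machine
```

With readelf installed, comparing the value from readelf -h /bin/ls | elf_machine against the value for the downloaded condor_master gives a quick mismatch check before installing.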

Links

  • Detailed Condor documentation is also available on the website here