Gueguen Mikael, 05/11/2015 05:18 PM


Mésocentre SPIN Calcul

MPI compute cluster: THOR

Thor is an SGI ICE-X MPI cluster with 2300 cores. It comprises 115 dual-socket compute blades, each fitted with Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors (20 cores per blade) and 32 GB of memory per socket, i.e. 3.2 GB per core.
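The figures above can be cross-checked with a few lines of shell. This is only a sketch: the variable names are illustrative, and it assumes the 20 cores of a blade are split evenly across its two sockets.

```shell
# Hardware figures quoted in the description above.
blades=115
cores_per_blade=20
sockets_per_blade=2
mem_per_socket_gb=32

# Total core count of the cluster.
total_cores=$((blades * cores_per_blade))

# Cores per socket, assuming an even split across the two sockets.
cores_per_socket=$((cores_per_blade / sockets_per_blade))

# Memory per core; awk is used because POSIX shell arithmetic is integer-only.
mem_per_core_gb=$(LC_ALL=C awk -v m="$mem_per_socket_gb" -v c="$cores_per_socket" \
    'BEGIN { printf "%.1f", m / c }')

echo "total cores: $total_cores"
echo "memory per core: $mem_per_core_gb GB"
```

When sizing a job, the per-core memory figure is a useful upper bound on what each MPI rank can use if all cores of a blade are occupied.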

Each blade is connected to an enhanced-hypercube FDR InfiniBand network running at 56 Gbit/s. Compute data is stored on a Lustre 2.5 parallel file system accessed over the InfiniBand network. It provides 56 TB for the /scratch directory and 21 TB for the /home directory.

Connect to the machine over ssh on port 86 at thor.univ-poitiers.fr:

ssh -p 86 -X homer@thor.univ-poitiers.fr

Monitoring: http://thor-ganglia.univ-poitiers.fr

PBS

Exit codes

The PBS exit value of a job may fall in one of four ranges:
  • X = 0 (= JOB_EXEC_OK)
    This is a PBS special return value indicating that the job executed successfully.
  • X < 0
    This is a PBS special return value indicating that the job could not be executed. These negative values are:

-1 = JOB_EXEC_FAIL1 : Job exec failed, before files, no retry
-2 = JOB_EXEC_FAIL2 : Job exec failed, after files, no retry
-3 = JOB_EXEC_RETRY : Job exec failed, do retry
-4 = JOB_EXEC_INITABT : Job aborted on MOM initialization
-5 = JOB_EXEC_INITRST : Job aborted on MOM initialization, checkpoint, no migrate
-6 = JOB_EXEC_INITRMG : Job aborted on MOM initialization, checkpoint, ok migrate
-7 = JOB_EXEC_BADRESRT : Job restart failed
-8 = JOB_EXEC_GLOBUS_INIT_RETRY : Initialization of Globus job failed. Do retry.
-9 = JOB_EXEC_GLOBUS_INIT_FAIL : Initialization of Globus job failed. Do not retry.
-10 = JOB_EXEC_FAILUID : Invalid UID/GID for job
-11 = JOB_EXEC_RERUN : Job was rerun
-12 = JOB_EXEC_CHKP : Job was checkpointed and killed
-13 = JOB_EXEC_FAIL_PASSWORD : Job failed due to a bad password
-14 = JOB_EXEC_RERUN_ON_SIS_FAIL : Job was requeued (if rerunnable) or deleted (if not) due to a communication failure between Mother Superior and a Sister

  • 0 <= X < 128 (or 256 depending on the system)

    This is the exit value of the top process in the job, typically the shell. This may be the exit value of the last command executed in the shell or the .logout script if the user has such a script (csh).

  • X >= 128 (or 256 depending on the system)
    This means the job was killed with a signal. The signal number is X modulo 128 (or 256). For example, an exit value of 137 means the job's top process was killed with signal 9 (137 % 128 = 9).
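The ranges above can be turned into a small helper for interpreting a job's exit value. The following is a minimal sketch, not a PBS tool: the function name `describe_pbs_exit` is hypothetical, and it assumes the 128 threshold (some systems use 256 instead).

```shell
# Classify a PBS exit value X using the ranges listed above.
# Assumption: the signal threshold is 128; adjust it if your system uses 256.
describe_pbs_exit() {
    x=$1
    if [ "$x" -lt 0 ]; then
        # Negative values are PBS special codes (see the JOB_EXEC_* list).
        echo "PBS could not execute the job (special value $x)"
    elif [ "$x" -eq 0 ]; then
        echo "job executed successfully"
    elif [ "$x" -lt 128 ]; then
        # Exit status of the job's top process, typically the shell.
        echo "top process exited with status $x"
    else
        # The signal number is the exit value modulo the threshold.
        echo "top process killed by signal $((x % 128))"
    fi
}

describe_pbs_exit 137   # prints "top process killed by signal 9"
```

Signal 9 is SIGKILL, which often points to a job killed for exceeding a resource limit such as walltime or memory, although the exit value alone does not say which.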

Documentation to download

Installed codes

Installed libraries

  • VTK
  • BOOST
  • blcr: utility for checkpoint/restart
  • perfboost: utility that improves the performance of MPI code not compiled with SGI's MPT library

archi_thor.png - architecture (356 KB) Gueguen Mikael, 05/22/2015 03:27 PM

thor.jpg (519 KB) Gueguen Mikael, 05/22/2015 03:28 PM

IMG_1361.JPG - view of Thor (103 KB) Gueguen Mikael, 09/01/2015 11:17 AM

initiation_linux_15-03-2021.pdf (4.61 MB) Laplaceta Pierre Francois, 03/15/2021 04:39 PM