April 22nd, 2009
Power failure again (and again, and again). This time, because of Easter, everything stayed down until the 26th.
n0003 followed in the footsteps of n0004 and n0008: yet another power supply failed (which takes us to three out of eight in less than three months). Pooh.
The behaviour of slurm is commendable: two jobs were running on the failed node. When the node stopped responding, slurm set it to 'down', and re-queued the jobs. One of them started immediately on two cores that were not allocated, the other still awaits resources. Nice.
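For the record, here is a toy model of that requeue behaviour in Python. It is only a sketch of the idea, not how slurm is implemented: the core counts, the job names (job_a, job_b) and the helper names (try_start, fail_node) are all made up for illustration, and only the node names come from this log.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cores: int            # total cores (assumed value, not from the log)
    used: int = 0         # cores currently allocated
    state: str = "idle"   # this toy model only knows "idle" and "down"


@dataclass
class Job:
    name: str
    cores: int
    node: Optional[str] = None   # None means the job is pending


def try_start(job: Job, nodes: List[Node]) -> bool:
    """Start the job on the first healthy node with enough free cores."""
    for node in nodes:
        if node.state != "down" and node.cores - node.used >= job.cores:
            node.used += job.cores
            job.node = node.name
            return True
    return False


def fail_node(name: str, nodes: List[Node], jobs: List[Job]) -> List[Job]:
    """Mark a node down, requeue its jobs, and try to restart each one."""
    for node in nodes:
        if node.name == name:
            node.state = "down"
            node.used = 0
    requeued = [job for job in jobs if job.node == name]
    for job in requeued:
        job.node = None          # back to the pending queue
    for job in requeued:
        try_start(job, nodes)    # stays pending if nothing fits
    return requeued


# Hypothetical situation: n0001 has two free cores, n0003 runs both jobs.
nodes = [Node("n0001", cores=8, used=6), Node("n0003", cores=8, used=6)]
jobs = [Job("job_a", 2, "n0003"), Job("job_b", 4, "n0003")]

fail_node("n0003", nodes, jobs)
for job in jobs:
    print(job.name, job.node or "pending")
```

With these made-up numbers, job_a lands on the two free cores of n0001 and job_b stays pending, which mirrors what happened here.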





