April 22nd, 2009

Power failure again (and again, and again). This time, due to Easter, everything stayed down until the 26th.

2009/04/26 16:45

April 16th, 2009

n0003 followed in the footsteps of n0004 and n0008: yet another power supply failed (which brings us to three out of eight in less than three months). Pooh.

The behaviour of slurm is commendable. Two jobs were running on the failed node. When the node stopped responding, slurm set it to 'down' and re-queued the jobs. One of them started immediately on two cores that were not allocated; the other is still waiting for resources. Nice.

2009/04/16 17:42


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)