Dec 15th, 2009

Take everything down due to scheduled electrical work on the building. Bring everything back up (uneventfully) the next day.

2010/01/08 14:42

Dec 1st, 2009

Power failure again, but this time with a Murphy's touch: the power went down just long enough to kill the jobs, but not long enough to shut the nodes down permanently. The nodes came up again, slurm restarted the jobs (from their state three weeks ago), and this was when things went to hell in a handbasket: the restart files that would be needed to continue the long jobs were overwritten, making a proper restart impossible. As a result, a --no-requeue flag was added to the NAMDjob script.
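For reference, a minimal sketch of what such a change can look like in a slurm batch script. This is not the actual NAMDjob script: the job name and the refuse_if_requeued helper are hypothetical, and the extra guard is an assumption added for illustration. It relies on slurm setting SLURM_RESTART_COUNT for jobs that have been requeued and restarted.

```shell
#!/bin/bash
#SBATCH --job-name=namd_long   # hypothetical job name
#SBATCH --no-requeue           # ask slurm never to requeue this job after a node failure

# Extra guard (an assumption, not part of the original script): slurm sets
# SLURM_RESTART_COUNT when a job has been requeued and restarted, so a
# non-zero value means we would be starting over on top of newer restart files.
refuse_if_requeued() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "requeued job detected, refusing to overwrite restart files" >&2
        return 1
    fi
    return 0
}

refuse_if_requeued || exit 1

# The actual NAMD invocation would follow here (not shown in this log).
```

With --no-requeue in place, a job killed by a short outage simply stays dead and has to be resubmitted by hand from the latest restart files, which is the safer failure mode for multi-week runs.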

2009/12/01 12:52


The full maintenance archive is kept here

…and finally, The infamous MBG's Power Failure Log

about/maintenance.txt · Last modified: 2011/01/31 17:56 (external edit)