Dec 15th, 2009
Take everything down for scheduled electrical work on the building. Bring everything back up (uneventfully) the next day.
Power failure again, but this time with a Murphy's touch: the power went down just long enough to kill the jobs, but not long enough to shut the nodes down for good. The nodes came up again, Slurm restarted the jobs (from their state three weeks ago), and this was when things went to hell in a handbasket: the restart files that would be needed to continue the long jobs were overwritten, making a proper restart impossible. As a result, a --no-requeue flag was added to the NAMDjob script.
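For the record, a rough sketch of what such a job script header might look like. Only the --no-requeue directive comes from the entry above; the job name, the backup_restart_files helper, the RESTART_DIR variable and the *.restart.* file pattern are assumptions made for illustration, and the backup step is an extra safeguard, not something that was actually deployed here.

```shell
#!/bin/bash
#SBATCH --job-name=namd-long   # hypothetical job name
#SBATCH --no-requeue           # never let Slurm requeue this job after a node failure

# Hypothetical helper (an assumption, not part of the original script):
# copy existing restart files into a timestamped subdirectory before the
# run starts, so an unexpected restart cannot overwrite the only copy.
backup_restart_files() {
    local dir="$1"
    local dest
    dest="$dir/restart-backup-$(date +%Y%m%d-%H%M%S)"
    local f
    for f in "$dir"/*.restart.*; do
        [ -f "$f" ] || continue   # skip when the glob matches nothing
        mkdir -p "$dest"          # create the backup directory only when needed
        cp -p "$f" "$dest/"
    done
}

# RESTART_DIR is a hypothetical variable; it defaults to the current directory.
backup_restart_files "${RESTART_DIR:-.}"

# The actual NAMD invocation is site-specific and would follow here.
```

With --no-requeue set, a job killed by a power blip stays dead until someone resubmits it by hand, which leaves time to check the restart files first.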





