I'm thinking of moving away from Maui/Torque for my cluster, but I don't really have a handle on what the better options are.
My cluster runs 64-bit Scientific Linux 5 and has 60 dual quad-core nodes with 16 GB of RAM each. The jobs users normally submit run for a few hours and typically use one core and ~2 GB of RAM. Recently we've changed some of the things we're studying, and now the memory requirements of the jobs can skyrocket (some poor guy was killing the nodes today with jobs using ~30 GB of memory, including virtual memory).
So I'm wondering whether there are any nice scheduler/batch system combinations that would give me better control than Maui/Torque over how jobs that exceed their requested resources are treated, and ideally give me some nice metrics too.