Hello all,
I run a fairly large HPC cluster with a little more than 800 users. I recently completed a migration from a PanFS-based parallel file storage appliance to a GPFS-based parallel file storage appliance.
The upgrade went well, although slowly. The rsyncs took several days to complete, which is not too bad over 10GbE and InfiniBand. I used GNU parallel with an input list of all my users to parallelize the file transfer. I ran my parallel rsync command quite a few times in the run-up to the shutdown period, when I could do the final transfer. After I kicked all the users off the cluster and stopped all the jobs, I ran the parallel rsync several more times to be sure that everything was transferred.
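For readers trying to picture the setup, here is a minimal sketch of a per-user parallel transfer that also records each rsync's exit status. It is written in Python rather than GNU parallel and is not the exact command used here: the `-aH` flags, the directory layout (`<root>/<user>/`), and the names `build_rsync_cmd`, `run_one`, and `sync_users` are all assumptions made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_rsync_cmd(user, src_root, dst_root):
    """Build the rsync command for one user's home directory.

    Assumed flags: -a (archive) and -H (preserve hard links).
    The trailing slash on the source copies the directory's contents.
    """
    return ["rsync", "-aH", f"{src_root}/{user}/", f"{dst_root}/{user}/"]


def run_one(cmd):
    """Run one command and return its exit status (0 means success)."""
    return subprocess.run(cmd).returncode


def sync_users(users, src_root, dst_root, jobs=8, runner=run_one):
    """Run one rsync per user, `jobs` at a time.

    Returns a dict of {user: exit_status} for every run that did not exit 0,
    so failed or partial transfers are not lost in the parallel output.
    """
    cmds = {u: build_rsync_cmd(u, src_root, dst_root) for u in users}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map yields results in the same order as its input
        codes = list(pool.map(runner, cmds.values()))
    return {u: rc for u, rc in zip(cmds, codes) if rc != 0}
```

Calling something like `sync_users(user_list, "/panfs/home", "/gpfs/home")` (hypothetical paths) would return an empty dict only if every rsync exited cleanly. rsync uses exit status 23 for a partial transfer due to errors and 24 for source files that vanished during the run, so a non-empty result is worth investigating before trusting the copy.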
After I allowed users back on the cluster, two users complained that they had files missing. I double-checked the old PanFS appliance against the GPFS appliance and, sure enough, they did have files missing. This sent me into a bit of a panic as I scrambled to go through all 800+ users to find out whether anyone else had files missing.
I fingered every user on the system to find users who hadn't logged in since the shutdown, then quickly wrote a script that counted every file in each of their home directories on the new GPFS appliance and compared it with the count on the old PanFS appliance. I found several users who were missing lots of files that the multiple rsyncs had missed.
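A minimal sketch of that kind of per-user count comparison, in Python. This is not the actual script: the directory layout (`<root>/<user>`) and the names `count_files` and `find_mismatches` are assumptions for illustration.

```python
import os


def _raise(err):
    # os.walk silently skips unreadable directories by default;
    # raising here makes permission or I/O errors visible instead.
    raise err


def count_files(root):
    """Count non-directory entries under root (0 if root does not exist)."""
    if not os.path.isdir(root):
        return 0
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root, onerror=_raise):
        total += len(filenames)
    return total


def find_mismatches(users, old_root, new_root):
    """Return {user: (old_count, new_count)} for users whose counts differ."""
    mismatches = {}
    for user in users:
        old_n = count_files(os.path.join(old_root, user))
        new_n = count_files(os.path.join(new_root, user))
        if old_n != new_n:
            mismatches[user] = (old_n, new_n)
    return mismatches
```

A matching count only shows that the number of files agrees, not that the same files or the same contents are present, so this is a quick triage check rather than a full verification.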
Now I am a bit puzzled how this could have happened... worse yet, I use a very similar method of parallel rsyncs on my backup servers.
Can rsync simply not handle the volume of files I am feeding it? Has anyone else run into this issue before?
Thanks!