Update on fisher

One move is in progress, and two more remain of the original 50+.

The system was brought down because, during a planned move, the md5sum of the source data did not match the md5sum of the destination data. After verifying the instance was down and rerunning md5sum twice locally, the checksums still did not match across any of those 4 runs.

The total amount of data md5sum'ed was around 200GB, and a single bit error anywhere in it would have changed the checksum. The mismatch does not indicate mass corruption, but I wanted to minimize the risk that bad data would be marked clean, which is why the power got pulled.
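To illustrate why one flipped bit is enough to change the checksum, here is a minimal Python sketch. It is not the procedure used on fisher: the helper names `md5_of_stream` and `flip_bit` are made up for this example, and the 1 MiB buffer is stand-in data rather than anything from the affected volumes.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read in fixed-size chunks."""
    digest = hashlib.md5()
    # Reading in chunks keeps memory use flat even for very large inputs.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted (hypothetical helper)."""
    flipped = bytearray(data)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(flipped)


# Stand-in data: 1 MiB of a repeating byte pattern, not real guest data.
original = bytes(range(256)) * 4096
damaged = flip_bit(original, 12345)

original_sum = md5_of_stream(io.BytesIO(original))
damaged_sum = md5_of_stream(io.BytesIO(damaged))

print(original_sum)
print(damaged_sum)
print("match" if original_sum == damaged_sum else "MISMATCH")
```

The same chunked loop works on a file opened with `open(path, "rb")`. A mismatch only says that at least one bit differs somewhere; it says nothing about how many bits differ or where.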

After bringing up the drives in a different chassis, md5sum was run twice. Both runs matched each other and also matched one of the md5sums collected from fisher, which gave us sufficient confidence that the data was OK to move.
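The reasoning above amounts to grouping checksum runs by their result and trusting a value only when independent runs agree on it. A rough Python sketch of that bookkeeping follows. The run labels and digest strings are placeholders invented for the example, not the values we actually collected, and `group_by_digest` and `agreed_digest` are hypothetical helper names.

```python
from collections import defaultdict


def group_by_digest(runs):
    """Map each distinct digest to the list of run labels that produced it."""
    groups = defaultdict(list)
    for label, digest in runs.items():
        groups[digest].append(label)
    return dict(groups)


def agreed_digest(runs, labels):
    """Return the shared digest if every named run agrees, otherwise None."""
    digests = {runs[label] for label in labels}
    return digests.pop() if len(digests) == 1 else None


# Placeholder results: labels and digests are illustrative only.
runs = {
    "old-chassis-run-1": "digest-A",
    "old-chassis-run-2": "digest-B",
    "new-chassis-run-1": "digest-A",
    "new-chassis-run-2": "digest-A",
}

new_chassis = ["new-chassis-run-1", "new-chassis-run-2"]
trusted = agreed_digest(runs, new_chassis)

if trusted is None:
    print("new-chassis runs disagree; do not move the data")
else:
    # List earlier runs that independently produced the same digest.
    confirming = [
        label
        for label in group_by_digest(runs)[trusted]
        if label not in new_chassis
    ]
    print("trusted digest:", trusted)
    print("also seen in:", confirming)
```

Agreement between repeated runs on known-good hardware is the main signal here. An earlier run that produced the same digest is extra corroboration, not proof on its own.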

We are performing a memory test on the old fisher box and will follow up with other tests to try to understand what happened and see if it can be detected more easily in the future.


About this Entry

This page contains a single entry by srn published on February 13, 2015 8:54 AM.

