Channel: linuxadmin: Expanding Linux SysAdmin knowledge

Production system hangs after "rescheduling sector"?

Hi,

I have a production Debian system that hangs often, and I am trying to determine exactly why. These are the last entries in the syslog before I power cycled it at 5:28 am this morning:

    Jan 9 05:28:22 gso-01 kernel: [424130.686902] sd 0:0:7:0: [sdh] Unhandled sense code
    Jan 9 05:28:22 gso-01 kernel: [424130.691801] sd 0:0:7:0: [sdh] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
    Jan 9 05:28:22 gso-01 kernel: [424130.699212] sd 0:0:7:0: [sdh] Sense Key : Medium Error [current]
    Jan 9 05:28:22 gso-01 kernel: [424130.705521] Info fld=0x45dcbcd
    Jan 9 05:28:22 gso-01 kernel: [424130.708670] sd 0:0:7:0: [sdh] Add. Sense: Unrecovered read error
    Jan 9 05:28:22 gso-01 kernel: [424130.714890] sd 0:0:7:0: [sdh] CDB: Read(10): 28 00 04 5d cb 88 00 00 58 00
    Jan 9 05:28:22 gso-01 kernel: [424130.722063] end_request: critical target error, dev sdh, sector 73255885
    Jan 9 05:28:22 gso-01 kernel: [424130.728889] md/raid10:md2: sdh2: rescheduling sector 58642312
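For anyone reading along, the numbers in those log lines can be cross-checked against each other. The following is only a rough sketch: it assumes the standard READ(10) layout (opcode in byte 0, starting LBA in bytes 2-5, transfer length in bytes 7-8, all big-endian), and the variable names are my own. It shows that the "Info fld" value is the same sector the end_request line reports, and that this sector falls inside the range the read command asked for.

```python
# Decode the Read(10) CDB from the kernel log and locate the failing sector.
# Assumes the standard READ(10) layout; variable names are illustrative only.
cdb = bytes.fromhex("28 00 04 5d cb 88 00 00 58 00")  # copied from the log

opcode = cdb[0]                              # 0x28 is READ(10)
start_lba = int.from_bytes(cdb[2:6], "big")  # bytes 2-5: starting LBA
length = int.from_bytes(cdb[7:9], "big")     # bytes 7-8: transfer length in blocks

info_fld = 0x45dcbcd       # "Info fld" from the sense data
failed_sector = 73255885   # sector from the end_request line

# Position of the unreadable block within the requested range.
offset = info_fld - start_lba

print(f"opcode=0x{opcode:02x} start_lba={start_lba} length={length}")
print(f"info_fld={info_fld} offset_in_request={offset}")
```

So the drive was asked for 88 blocks and reported a block partway through that range as unreadable, which is consistent with a single bad spot on the medium rather than a garbled command.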

Any idea how that could hang a system indefinitely until a power cycle? The machine goes completely unresponsive, and I need to use IPMI to power cycle it. I can't get a console with IPMI either until I power cycle the machine.

This is the SMART report for the drive:

    # smartctl -a /dev/sdh
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-2-amd64] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    Vendor:               SEAGATE
    Product:              ST31000424SS
    Revision:             KS68
    User Capacity:        1,000,204,886,016 bytes [1.00 TB]
    Logical block size:   512 bytes
    Logical Unit id:      0x5000c50025fe3b7b
    Serial number:        9WK35JMD
    Device type:          disk
    Transport protocol:   SAS
    Local Time is:        Sat Jan 9 12:03:53 2016 EST
    Device supports SMART and is Enabled
    Temperature Warning Disabled or Not Supported
    SMART Health Status: OK

    Current Drive Temperature:     32 C
    Drive Trip Temperature:        68 C
    Manufactured in week 04 of year 2011
    Specified cycle count over device lifetime:  10000
    Accumulated start-stop cycles:  64
    Specified load-unload count over device lifetime:  300000
    Accumulated load-unload cycles:  64
    Elements in grown defect list: 11
    Vendor (Seagate) cache information
      Blocks sent to initiator = 3367248242
      Blocks received from initiator = 629034529
      Blocks read from cache and sent to initiator = 1852703182
      Number of read and write commands whose size <= segment size = 3801627115
      Number of read and write commands whose size > segment size = 0
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 40730.05
      number of minutes until next internal SMART test = 9

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:   1434286960     1262         0  1434288222   1434288242     106983.677          15
    write:           0        0         0           0            0      34110.193           0
    verify:     480297        0         0      480297       480297          5.708           0

    Non-medium error count:        1

    SMART Self-test log
    Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                    number   (hours)
    # 1  Background long   Completed        16     40726         -       [-   -   -]
    # 2  Background short  Completed        16     40707         -       [-   -   -]
    # 3  Background short  Completed        16     40683         -       [-   -   -]
    # 4  Background short  Completed        16     40659         -       [-   -   -]
    # 5  Background short  Completed        16     40635         -       [-   -   -]
    # 6  Background short  Completed        16     40611         -       [-   -   -]
    # 7  Background short  Completed        16     40587         -       [-   -   -]
    # 8  Background long   Completed        16     40572         -       [-   -   -]
    # 9  Background short  Completed        16     40539         -       [-   -   -]
    #10  Background short  Completed        16     40515         -       [-   -   -]
    #11  Background short  Completed        16     40491         -       [-   -   -]
    #12  Background short  Completed        16     40467         -       [-   -   -]
    #13  Background long   Completed        16     40449         -       [-   -   -]
    #14  Background short  Completed        16     40371         -       [-   -   -]
    #15  Background short  Completed        16     40347         -       [-   -   -]
    #16  Background short  Completed        16     40323         -       [-   -   -]
    #17  Background short  Completed        16     40299         -       [-   -   -]
    #18  Background short  Completed        16     40275         -       [-   -   -]
    #19  Background short  Completed        16     40251         -       [-   -   -]
    #20  Background long   Completed        16     40238         -       [-   -   -]

    Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
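Note that the health status reads OK even though the report lists 15 uncorrected read errors and 11 grown defects, so the summary line on its own hides the interesting numbers. Here is a minimal sketch of pulling those two figures out of saved smartctl text. It assumes the usual one-row-per-line layout and that the last column of each error counter row is the total uncorrected errors, as in the smartctl 5.41 output above. The function names and the embedded sample are my own and not part of smartmontools.

```python
import re

# A few lines copied from the report above (not the full output).
report = """\
Elements in grown defect list: 11
read:   1434286960     1262         0  1434288222   1434288242     106983.677          15
write:           0        0         0           0            0      34110.193           0
verify:     480297        0         0      480297       480297          5.708           0
Non-medium error count:        1
"""

def parse_error_counters(text):
    """Return {operation: total uncorrected errors} from the error counter rows."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^\s*(read|write|verify):\s+(.*)$", line)
        if m:
            fields = m.group(2).split()
            # Assumption: the last column is "Total uncorrected errors".
            counters[m.group(1)] = int(fields[-1])
    return counters

def parse_grown_defects(text):
    """Return the grown defect list size, or None if the line is absent."""
    m = re.search(r"Elements in grown defect list:\s+(\d+)", text)
    return int(m.group(1)) if m else None

uncorrected = parse_error_counters(report)
grown = parse_grown_defects(report)
print(uncorrected, grown)
```

Logging these two values periodically would show whether they are still climbing, which the one-line health status does not reveal.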

I feel like there is something wrong other than just this disk if a simple error like this can take down an entire system.

Ideas?

Thanks,

David

submitted by djonesax
