Your friendly tchncs.de admin

  • 26 Posts
  • 30 Comments
Joined 2 years ago
Cake day: June 1st, 2023





  • It's funny: at one regional market here, the cashiers are fed up with people being stupid, needing help, or even stealing, so sadly the self checkout is disabled most of the time. At a different place there is always a loooong line, but literally no one uses the self checkout. It is so useful. If I forgot something or just gotta get coffee beans, I quickly swing by that place because I know that no matter how busy it is, I don't have to wait in line xD



















  • I am a bit confused now… the spare was at 98%, as you can read in my snippet above… where does it say “no spare available”? I think it is on me to request a swap, and that's what I did, as the one with slightly less wear also reported 255% used, which afaik is an approximate estimate of how much of the drive's lifetime is already used up, based on rw cycles (not sure about all factors).

    The one the hoster left in for me to play with said no:

    [Wed Jul 26 19:19:10 2023] nvme nvme1: I/O 9 QID 0 timeout, disable controller
    [Wed Jul 26 19:19:10 2023] nvme nvme1: Device shutdown incomplete; abort shutdown
    [Wed Jul 26 19:19:10 2023] nvme nvme1: Removing after probe failure status: -4
    

    Tried multiple kernel flags and stuff but couldn't get past that error. It would have been interesting to have the hoster ship the thing to me (and maybe that would have been a long enough cooldown to get the thing working again), but I assume that would have been expensive from Helsinki.
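
On the spare vs. used question: as far as I understand the NVMe spec, Available Spare is the remaining spare capacity, while Percentage Used is a vendor estimate of endurance already consumed. It may go above 100 and is capped at 255. A rough Python sketch of how I would read the three numbers together (the function name and the returned keys are made up for illustration):

```python
def assess_wear(spare_pct: int, spare_threshold_pct: int, percentage_used: int) -> dict:
    """Summarise the wear-related SMART fields of one NVMe drive."""
    return {
        # The drive only raises the spare warning once spare drops below the threshold.
        "spare_below_threshold": spare_pct < spare_threshold_pct,
        # 100 means the vendor's rated endurance estimate is used up (not necessarily a failure).
        "rated_endurance_consumed": percentage_used >= 100,
        # 255 is the largest value the field can report, so the real figure may be higher.
        "used_counter_saturated": percentage_used >= 255,
    }


if __name__ == "__main__":
    # Values from the old drive: 98% spare, 10% threshold, 255% used.
    print(assess_wear(98, 10, 255))
```

So a drive can still have plenty of spare and be past its rated endurance at the same time, which is what the old ones reported.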



  • I am not sure about those numbers on the new ones… it was one db restore, a few hrs of uptime, a scrub… then I rsynced some stuff over, and since then the thing has been idle 🤷

    Sample from the currently active system… I think at time of arrival it was 2+ TB written or something:

    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        37 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    0%
    Data Units Read:                    88,116,921 [45.1 TB]
    Data Units Written:                 43,968,235 [22.5 TB]
    Host Read Commands:                 689,015,212
    Host Write Commands:                409,762,513
    Controller Busy Time:               1,477
    Power Cycles:                       4
    Power On Hours:                     248
    Unsafe Shutdowns:                   0
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               37 Celsius
    Temperature Sensor 2:               46 Celsius
    
    Error Information (NVMe Log 0x01, 16 of 64 entries)
    No Errors Logged
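
In case it helps with reading those numbers: per the NVMe spec, one "data unit" is 1000 blocks of 512 bytes, so 512,000 bytes, and the bracketed TB values are just that multiplication. A small Python sketch of the arithmetic (the function names are mine, not anything from smartctl):

```python
# NVMe reports Data Units Read/Written in thousands of 512-byte blocks.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_bytes(units: int) -> int:
    """Convert an NVMe data-unit counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes (10**12 bytes)."""
    return data_units_to_bytes(units) / 1e12


if __name__ == "__main__":
    # Counters from the sample above.
    for label, units in (("read", 88_116_921), ("written", 43_968_235)):
        print(f"{label}: {data_units_to_tb(units):.3f} TB")
```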
    


  • Dang, the old host was deleted from the monitoring. However, looking at at least one SMART report from my emails, there were no errors logged before the drives gave up on life during replacement. They just had a ton read/written and the used counter at 255% (even though rw and age were not equal between them; it's weird, and one reason why I wanted to have at least one replaced in the first place). This is the one that had more:

    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x04
    Temperature:                        53 Celsius
    Available Spare:                    98%
    Available Spare Threshold:          10%
    Percentage Used:                    255%
    Data Units Read:                    7,636,639,249 [3.90 PB]
    Data Units Written:                 2,980,551,083 [1.52 PB]
    Host Read Commands:                 87,676,174,127
    Host Write Commands:                28,741,297,023
    Controller Busy Time:               705,842
    Power Cycles:                       7
    Power On Hours:                     17,437
    Unsafe Shutdowns:                   1
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               53 Celsius
    Temperature Sensor 2:               64 Celsius
    
    Error Information (NVMe Log 0x01, 16 of 64 entries)
    No Errors Logged
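
Side note on the `Critical Warning: 0x04` above: that field is a bit mask. A minimal Python sketch to decode it, with the bit descriptions paraphrased from the NVMe SMART / Health log definition as I understand it (the function name is made up):

```python
# Bit position -> meaning, paraphrased from the NVMe SMART / Health log page.
CRITICAL_WARNING_BITS = {
    0: "available spare has fallen below the threshold",
    1: "temperature is outside the configured thresholds",
    2: "NVM subsystem reliability is degraded",
    3: "media has been placed in read-only mode",
    4: "volatile memory backup device has failed",
}


def decode_critical_warning(value: int) -> list:
    """Return the descriptions of all warning bits set in the given byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


if __name__ == "__main__":
    for text in decode_critical_warning(0x04):
        print(text)
```

With 0x04 only bit 2 is set, so the drive flags degraded reliability, while the spare warning (bit 0) stays clear, which matches 98% spare against a 10% threshold.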
    

    The new ones now, where the zpool errors happened, look like this:

    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        24 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          5%
    Percentage Used:                    3%
    Data Units Read:                    122,135,021 [62.5 TB]
    Data Units Written:                 31,620,076 [16.1 TB]
    Host Read Commands:                 1,014,224,069
    Host Write Commands:                231,627,064
    Controller Busy Time:               3,909
    Power Cycles:                       2
    Power On Hours:                     117
    Unsafe Shutdowns:                   0
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      4
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               24 Celsius
    
    Error Information (NVMe Log 0x01, 16 of 256 entries)
    Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
      0          4     0  0x0000  0x8004  0x000            0     0     -
    
    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        24 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          5%
    Percentage Used:                    2%
    Data Units Read:                    153,193,333 [78.4 TB]
    Data Units Written:                 29,787,075 [15.2 TB]
    Host Read Commands:                 1,262,977,843
    Host Write Commands:                230,135,280
    Controller Busy Time:               4,804
    Power Cycles:                       11
    Power On Hours:                     119
    Unsafe Shutdowns:                   5
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      14
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               24 Celsius
    
    Error Information (NVMe Log 0x01, 16 of 256 entries)
    Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
      0         14     0  0x100d  0x8004  0x000            0     0     -
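
To compare dumps like these without eyeballing them, here is a rough Python sketch that pulls the `Key: value` lines of such a health section into a dict. It assumes the text layout shown above; `parse_smart_health` is a hypothetical helper, not part of smartmontools:

```python
import re


def parse_smart_health(text: str) -> dict:
    """Parse 'Key: value' lines of an NVMe health dump into a dict of ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headers and error-table rows have no colon
        # Grab the leading number; units such as '%', 'Celsius' or '[15.2 TB]' are ignored.
        match = re.match(r"\s*(0x[0-9a-fA-F]+|\d[\d,]*)", value)
        if not match:
            continue
        raw = match.group(1).replace(",", "")
        number = int(raw, 16) if raw.startswith("0x") else int(raw)
        # Collapse runs of spaces in keys like 'Warning  Comp. Temperature Time'.
        fields[" ".join(key.split())] = number
    return fields


if __name__ == "__main__":
    drive_a = parse_smart_health("Unsafe Shutdowns:   0\nError Information Log Entries:   4\n")
    drive_b = parse_smart_health("Unsafe Shutdowns:   5\nError Information Log Entries:   14\n")
    for key in drive_a:
        print(f"{key}: {drive_a[key]} vs {drive_b[key]}")
```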