Question

Proxmox crashed?

Created on 2018-02-24 14:46:40 (edited on 2024-09-04 13:11:23) in Dedicated Servers

A few days ago, my Proxmox 5.5 host crashed.
It is a Host64L with hardware RAID: SSD + 2 TB HD.

The websites were down or returning errors.
SSH was no longer reachable.

I can reboot into rescue (safe) mode:
- the graphical hardware tests report nothing: CPU, RAM, SSD and the 2 TB HD all come back clean
- in rescue mode, I don't see anything wrong either.

I cannot run fsck.ext4 on the HD because it demands a newer version of e2fsck.
But that should not be the cause, since the server boots from the SSDs, which are fine.

I rebooted the dedicated server from the OVH manager at around 9:30, but it never came back up properly.

Nothing visible in the logs!

Feb 21 06:41:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:41:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:41:01 rbx03 CRON[25662]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:42:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:42:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:42:01 rbx03 CRON[26330]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:42:49 rbx03 smartd[987]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 81 to 82
Feb 21 06:42:56 rbx03 rrdcached[2433]: flushing old values
Feb 21 06:42:56 rbx03 rrdcached[2433]: rotating journals
Feb 21 06:42:56 rbx03 rrdcached[2433]: started new journal /var/lib/rrdcached/journal/rrd.journal.1519195376.692787
Feb 21 06:42:56 rbx03 rrdcached[2433]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1519188176.692758
Feb 21 06:43:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:43:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:43:01 rbx03 CRON[27112]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:44:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:44:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:44:01 rbx03 CRON[28378]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:45:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:45:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:45:01 rbx03 CRON[29179]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:45:01 rbx03 CRON[29180]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 21 06:46:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:46:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:46:01 rbx03 CRON[3217]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:47:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:47:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:47:01 rbx03 CRON[3865]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:47:29 rbx03 ntpd[2437]: receive: Unexpected origin timestamp 0xde379375.32ca4ef1 does not match aorg 0000000000.00000000 from sym_active@188.165.214.188 xmt 0xde379481.5493c482
Feb 21 06:48:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:48:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:48:01 rbx03 CRON[4602]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 06:49:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 06:49:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 06:49:01 rbx03 CRON[5943]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
^@^@^@^@ [long run of ^@ (NUL) bytes, truncated]
Feb 21 09:32:21 rbx03 systemd-modules-load[366]: Inserted module 'iscsi_tcp'
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x4f8 with crng_init=0
Feb 21 09:32:21 rbx03 systemd-modules-load[366]: Inserted module 'ib_iser'
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] Linux version 4.13.13-5-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18)) #1 SMP PVE 4.13.13-38 (Fri, 26 Jan 2018 10:47:09 +0100) ()
Feb 21 09:32:21 rbx03 systemd-modules-load[366]: Inserted module 'vhost_net'
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.13.13-5-pve root=/dev/sda1 ro noquiet nosplash net.ifnames=0 biosdevname=0
Feb 21 09:32:21 rbx03 systemd-udevd[399]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] KERNEL supported cpus:
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] Intel GenuineIntel
Feb 21 09:32:21 rbx03 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] AMD AuthenticAMD
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] Centaur CentaurHauls
Feb 21 09:32:21 rbx03 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
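The excerpt above does carry one clue: normal entries stop at 06:49:01, a block of NUL bytes (`^@`) follows, and the next entry is a fresh kernel boot at 09:32:21. A run of NULs like this is often seen when a machine stops abruptly before the log file is flushed, so the last timestamp before the gap gives an approximate time of the hang. Here is a minimal Python sketch that finds the largest gap between consecutive syslog timestamps. The function name and the `year` parameter are illustrative assumptions (classic syslog lines do not record the year).

```python
from datetime import datetime


def largest_gap(lines, year=2018):
    """Return (gap_seconds, before, after) for the biggest jump between
    consecutive syslog timestamps, or None if fewer than two are found."""
    stamps = []
    for line in lines:
        # Drop real NUL bytes and their "^@" rendering before parsing.
        text = line.replace("\x00", "").replace("^@", "").strip()
        try:
            # Classic syslog prefix, e.g. "Feb 21 06:49:01" (15 characters).
            ts = datetime.strptime(f"{year} {text[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line
        stamps.append(ts)

    best = None
    for before, after in zip(stamps, stamps[1:]):
        gap = (after - before).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best
```

Run over the lines of the syslog, this should point at the 06:49:01 to 09:32:21 window, which is the period to look at in other sources (IPMI event log, OVH intervention history).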

The hardware RAID check reports nothing: both RAID arrays (SSD and HD) are healthy.

I was thinking of reinstalling on a new Host32 with software RAID, but I would still like to understand what happened.

Any ideas for continuing my investigation?

Thanks
Have a good weekend. Didier

27 Replies (Latest reply on 2018-05-25 15:17:54 by DidierM)

Buddy

Hello,

to force an fsck on the server's next (normal) boot, you can, from rescue mode, create this file on the root partition so that it is checked automatically at reboot

touch /forcefsck

The reboot can take a particularly long time while the fsck runs.

Careful: in rescue mode, make sure you create it in the right place (on the server's own root partition, not on the rescue system's).
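To make the "right place" warning concrete, here is a small Python sketch of the same idea. It assumes the server's root partition has already been mounted somewhere in the rescue system (for example under `/mnt`, which is an assumption, not something stated in the thread), and it refuses to write to the rescue system's own `/`. The function name is hypothetical, and whether the installed system honours a `forcefsck` file depends on its init system.

```python
import os


def request_fsck_on_next_boot(mounted_root):
    """Create <mounted_root>/forcefsck on the server's own root filesystem."""
    root = os.path.realpath(mounted_root)
    if root == "/":
        # In rescue mode "/" belongs to the rescue image, not to the server.
        raise ValueError("refusing to write to the rescue system's own root")
    if not os.path.isdir(os.path.join(root, "etc")):
        raise ValueError(f"{root} does not look like a mounted root filesystem")
    flag = os.path.join(root, "forcefsck")
    with open(flag, "a"):
        pass  # same effect as `touch`: create the file if it is missing
    return flag
```

With the root partition mounted on `/mnt`, `request_fsck_on_next_boot("/mnt")` would create `/mnt/forcefsck`, which becomes `/forcefsck` once the server boots normally.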

DidierM

Author

Hello Buddy
In rescue mode, I ran some checks

on the SSD:

root@rescue:/mnt# **fsck.ext4 /dev/sda1**
e2fsck 1.42.12 (29-Aug-2014)
/: clean, 77786/1281120 files, 881576/5119744 blocks

Nothing detected.

I could not test the 2 TB HD:
root@rescue:/mnt# **fsck.ext4 /dev/sdb1**
e2fsck 1.42.12 (29-Aug-2014)
/dev/sdb1 has unsupported feature(s): metadata_csum
e2fsck: Get a newer version of e2fsck!

But I don't think the 2 TB hard disk /dev/sdb is the culprit, because it would not have broken the boot, given that the system is on the SSD /dev/sda ...
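The `metadata_csum` refusal comes from the rescue image's e2fsck (1.42.12) being older than the filesystem features used on /dev/sdb1, not from damage on the disk. A short Python sketch of that version check follows; the 1.43 threshold is an assumption (to my knowledge, metadata checksum support arrived with the e2fsprogs 1.43 series, but check the release notes), and the function names are illustrative.

```python
import re

# Assumption: first e2fsprogs release series able to check metadata_csum filesystems.
MIN_FOR_METADATA_CSUM = (1, 43)


def e2fsck_version(banner):
    """Extract (major, minor, patch) from a banner such as 'e2fsck 1.42.12 (29-Aug-2014)'."""
    match = re.search(r"e2fsck (\d+)\.(\d+)(?:\.(\d+))?", banner)
    if not match:
        raise ValueError("no e2fsck version found in banner")
    return tuple(int(part) for part in match.groups(default="0"))


def supports_metadata_csum(banner):
    """True if the e2fsck that printed this banner is new enough (under the assumption above)."""
    return e2fsck_version(banner)[:2] >= MIN_FOR_METADATA_CSUM
```

For the banner shown above, `supports_metadata_csum("e2fsck 1.42.12 (29-Aug-2014)")` is false, so the check on /dev/sdb1 has to be run with a newer e2fsprogs, for instance from the installed system itself.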

sda5 is reported as "in use"

As for forcing fsck at boot, I'll look into how to do it, in particular from rescue mode!
But then, do I have to reboot the dedicated server in "normal" mode, not "rescue"?

Thanks, Didier

janus57

Hello,

otherwise, if this really is a server from the OVH range, you normally have access to KVM/IP, so why not take advantage of it to see exactly what happens at boot?

Regards, janus57

DidierM

Author

> you normally have access to KVM/IP,

With the Java applet, it just saves the Java file: kvm.jnlp
But if I try to open it, it goes back to Chromium to save it again...

I am on Ubuntu with Chromium or Firefox.
I tried the other mode, "SOL".
It opens a window... but I don't really know what to do with it.
I tried a reboot from the manager, but I immediately get this message:

_> IPMI module features are temporarily disabled because a task is in progress._

...
Didier

DidierM

Author

p... :o
**It rebooted!!!**
Even though I had already done that on Wednesday, and it fixed nothing!
Now it's OK ...

So how am I supposed to trust this server now?

I'm going to check the RAID arrays, the syslog...
Didier

Buddy

> It rebooted!!!
> Even though I had already done that on Wednesday, and it fixed nothing!
> Now it's OK ...
>
> So how am I supposed to trust this server now?

When did you run the fsck checks? Did you not try rebooting afterwards?

DidierM

Author

but in normal mode, all the disks are mounted!
so it's hard to run an fsck.
Unmounting the disks... means I have to stop Proxmox. Never done that.
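Before unmounting anything, it helps to know exactly where a partition is mounted. The sketch below is a hypothetical helper that reads text in the `/proc/mounts` format (device, mount point, filesystem type, options, ...) and lists the mount points of one device. The device names in the usage note are only examples taken from this thread.

```python
def mounted_at(device, mounts_text):
    """Return the mount points of `device` found in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 0 is the device, field 1 the mount point.
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points
```

On the running server, `mounted_at("/dev/sdb1", open("/proc/mounts").read())` would show whether the 2 TB volume is still in use. An empty list means the partition is not mounted and fsck can be run on it.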

**Checking the RAID volumes:**

# storcli /c0 /vall show
Controller = 0
Status = Success
Description = None

Virtual Drives :
==============

-----------------------------------------------------------
DG/VD TYPE State Access Consist Cache sCC Size Name
-----------------------------------------------------------
0/0 RAID1 Optl RW Yes RWTD - 446.625 GB
1/1 RAID1 Optl RW Yes RWBD - 1.818 TB
-----------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

-----------------------
that looks consistent.
"Consist" is "Yes" for both the SSD RAID and the HD RAID.
-----------------------
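The manual reading of the `storcli /c0 /vall show` table can be automated for regular monitoring. This Python sketch assumes the column order shown above (DG/VD, TYPE, State, Access, Consist, ...) and flags any virtual drive whose State is not `Optl` or whose Consist is not `Yes`. The function name is illustrative.

```python
def unhealthy_virtual_drives(storcli_text):
    """Return the DG/VD ids whose State is not Optl or whose Consist is not Yes."""
    bad = []
    for line in storcli_text.splitlines():
        fields = line.split()
        # Data rows start with "<DG>/<VD>", e.g. "0/0 RAID1 Optl RW Yes RWTD - 446.625 GB".
        if len(fields) < 5 or "/" not in fields[0]:
            continue
        dg, _, vd = fields[0].partition("/")
        if not (dg.isdigit() and vd.isdigit()):
            continue  # header row ("DG/VD ...") or legend text
        state, consist = fields[2], fields[4]
        if state != "Optl" or consist != "Yes":
            bad.append(fields[0])
    return bad
```

For the two rows above (`0/0` and `1/1`, both `Optl` and `Yes`) the result is an empty list, which matches the conclusion drawn by eye.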
nothing here either:

# MegaCli -PDList -aAll

Adapter #0

Enclosure Device ID: 252
Slot Number: 0
Device Id: 4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 447.130 GB [0x37e436b0 Sectors]
Non Coerced Size: 446.630 GB [0x37d436b0 Sectors]
Coerced Size: 446.625 GB [0x37d40000 Sectors]
Firmware state: Online
SAS Address(0): 0x4433221103000000
Connected Port Number: 3(path0)
Inquiry Data: PHYS738000LD480BGN INTEL SSDSC2KB480G7 SCV10100
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Solid State Device

Enclosure Device ID: 252
Slot Number: 1
Device Id: 5
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 447.130 GB [0x37e436b0 Sectors]
Non Coerced Size: 446.630 GB [0x37d436b0 Sectors]
Coerced Size: 446.625 GB [0x37d40000 Sectors]
Firmware state: Online
SAS Address(0): 0x4433221102000000
Connected Port Number: 2(path0)
Inquiry Data: PHYS738000FJ480BGN INTEL SSDSC2KB480G7 SCV10100
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Solid State Device

Enclosure Device ID: 252
Slot Number: 2
Device Id: 7
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Firmware state: Online
SAS Address(0): 0x4433221101000000
Connected Port Number: 1(path0)
Inquiry Data: K5HDMN3A HGST HUS726020ALA610 A5GNT920
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 252
Slot Number: 3
Device Id: 6
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Firmware state: Online
SAS Address(0): 0x4433221100000000
Connected Port Number: 0(path0)
Inquiry Data: K5HEDM2A HGST HUS726020ALA610 A5GNT920
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Exit Code: 0x00
-----------------------------
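The same check can be scripted for the `MegaCli -PDList -aAll` listing: every drive above reports zero for Media Error Count, Other Error Count and Predictive Failure Count, with firmware state Online. The sketch below assumes the `key: value` layout shown above and returns only the slots that deviate from that. The function name is hypothetical.

```python
COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")


def drives_with_errors(pdlist_text):
    """Map slot number -> findings for drives with a non-zero error counter
    or a firmware state that does not start with 'Online'."""
    problems = {}
    slot = None
    for line in pdlist_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot = int(value)  # following lines describe this slot
        elif slot is None:
            continue
        elif key in COUNTERS:
            if int(value) != 0:
                problems.setdefault(slot, {})[key] = int(value)
        elif key == "Firmware state" and not value.startswith("Online"):
            problems.setdefault(slot, {})[key] = value
    return problems
```

An empty dictionary corresponds to the "nothing here either" result above.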

**smartctl:**
on the two 2 TB HDs: nothing:

# smartctl -d megaraid,6 -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-5-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: HGST HUS726020ALA610
Serial Number: K5HEDM2A
LU WWN Device Id: 5 000cca 25ed42faf
Firmware Version: A5GNT920
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 24 23:21:36 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 113) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 288) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 108
3 Spin_Up_Time 0x0007 156 156 024 Pre-fail Always - 189 (Average 195)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 25
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 2301
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 25
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 120
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 120
194 Temperature_Celsius 0x0002 193 193 000 Old_age Always - 31 (Min/Max 18/38)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 105 -
# 2 Short offline Completed without error 00% 103 -
# 3 Short offline Completed without error 00% 101 -
# 4 Short offline Completed without error 00% 101 -
# 5 Short offline Completed without error 00% 101 -
# 6 Short offline Completed without error 00% 97 -
# 7 Short offline Completed without error 00% 96 -
# 8 Short offline Completed without error 00% 96 -
# 9 Short offline Completed without error 00% 95 -
#10 Short offline Completed without error 00% 92 -
#11 Short offline Completed without error 00% 91 -
#12 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

**the 2nd HD:**

# smartctl -d megaraid,7 -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-5-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: HGST HUS726020ALA610
Serial Number: K5HDMN3A
LU WWN Device Id: 5 000cca 25ed3d5b7
Firmware Version: A5GNT920
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 24 23:22:57 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 113) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 288) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 135 135 054 Pre-fail Offline - 112
3 Spin_Up_Time 0x0007 157 157 024 Pre-fail Always - 188 (Average 192)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 25
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 2301
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 25
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 121
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 121
194 Temperature_Celsius 0x0002 181 181 000 Old_age Always - 33 (Min/Max 19/40)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 105 -
# 2 Short offline Completed without error 00% 103 -
# 3 Short offline Completed without error 00% 101 -
# 4 Short offline Completed without error 00% 101 -
# 5 Short offline Completed without error 00% 101 -
# 6 Short offline Completed without error 00% 97 -
# 7 Short offline Completed without error 00% 96 -
# 8 Short offline Completed without error 00% 96 -
# 9 Short offline Completed without error 00% 95 -
#10 Short offline Completed without error 00% 92 -
#11 Short offline Completed without error 00% 91 -
#12 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
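For the two HDs, "nothing" means that the raw values of the attributes usually watched for failing disks are all zero. Here is a minimal Python sketch of that reading, assuming the ten-column attribute table printed by `smartctl -a` above. The choice of watched attributes is a common rule of thumb, not something prescribed in this thread.

```python
# Attributes whose raw value is normally expected to stay at 0 on a healthy disk (assumed list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(smartctl_text):
    """Return {attribute name: first token of RAW_VALUE} from the attribute table."""
    attrs = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE...
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[fields[1]] = fields[9]
    return attrs


def nonzero_watched(smartctl_text):
    """Return the watched attributes whose raw value is not 0."""
    attrs = parse_attributes(smartctl_text)
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw != "0"}
```

Applied to either HD report above, `nonzero_watched` returns an empty dictionary.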

==================================================
**On the SSDs, I have trouble interpreting some attributes with very high values!**

In particular, the **Program_Fail_Count_Chip** attribute is enormous!
But maybe that is not significant on an SSD?

# smartctl -d megaraid,4 -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-5-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2KB480G7
Serial Number: PHYS738000LD480BGN
LU WWN Device Id: 5 5cd2e4 14e2c9a75
Firmware Version: SCV10100
User Capacity: 480.103.981.056 bytes [480 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 24 23:24:28 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2187
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 11
175 Program_Fail_Count_Chip 0x0033 100 100 010 Pre-fail Always - 112193372184
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 070 069 000 Old_age Always - 30 (Min/Max 30/33)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 30
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 1328682
226 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 1864
227 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 17
228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 131183
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 099 099 000 Old_age Always - 0
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 1328682
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 287449
243 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1597053

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 10 -
# 2 Abort offline test Aborted by host 00% 10 -
# 3 Short offline Completed without error 00% 8 -
# 4 Abort offline test Aborted by host 00% 8 -
# 5 Short offline Completed without error 00% 6 -
# 6 Abort offline test Aborted by host 00% 6 -
# 7 Short offline Completed without error 00% 6 -
# 8 Abort offline test Aborted by host 00% 6 -
# 9 Short offline Completed without error 00% 4 -
#10 Abort offline test Aborted by host 00% 4 -
#11 Short offline Completed without error 00% 2 -
#12 Abort offline test Aborted by host 00% 2 -
#13 Short offline Completed without error 00% 0 -
#14 Abort offline test Aborted by host 00% 0 -
#15 Short offline Completed without error 00% 0 -
#16 Abort offline test Aborted by host 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

the 2nd SSD:

# smartctl -d megaraid,5 -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-5-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2KB480G7
Serial Number: PHYS738000FJ480BGN
LU WWN Device Id: 5 5cd2e4 14e2c9a13
Firmware Version: SCV10100
User Capacity: 480.103.981.056 bytes [480 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 24 23:25:15 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2187
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 11
175 Program_Fail_Count_Chip 0x0033 100 100 010 Pre-fail Always - 112194158616
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 070 069 000 Old_age Always - 30 (Min/Max 30/33)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 30
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 1328811
226 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 1884
227 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 15
228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 131197
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 099 099 000 Old_age Always - 0
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 1328811
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 249909
243 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1595445

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 10 -
# 2 Abort offline test Aborted by host 00% 10 -
# 3 Short offline Completed without error 00% 8 -
# 4 Abort offline test Aborted by host 00% 8 -
# 5 Short offline Completed without error 00% 6 -
# 6 Abort offline test Aborted by host 00% 6 -
# 7 Short offline Completed without error 00% 6 -
# 8 Abort offline test Aborted by host 00% 6 -
# 9 Short offline Completed without error 00% 4 -
#10 Abort offline test Aborted by host 00% 4 -
#11 Short offline Completed without error 00% 2 -
#12 Abort offline test Aborted by host 00% 2 -
#13 Short offline Completed without error 00% 0 -
#14 Abort offline test Aborted by host 00% 0 -
#15 Short offline Completed without error 00% 0 -
#16 Abort offline test Aborted by host 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Thanks, and have a good weekend.
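
As a rough way to read an attribute table like the one above, the sketch below parses rows in the `smartctl -A` layout and lists any attribute whose normalised value has dropped to or below its threshold, which is the condition smartctl reports as a failed attribute. The function names (`parse_attributes`, `failing`) are illustrative, and the sample reuses three rows from this post.

```python
import re

# One row of the attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(Pre-fail|Old_age)\s+\S+\s+(\S+)\s+(.*)$"
)

def parse_attributes(text):
    """Return one dict per attribute row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "type": m.group(6),
                "when_failed": m.group(7),
                "raw": m.group(8).strip(),
            })
    return rows

def failing(rows):
    """Attributes whose normalised value is at or below a non-zero threshold."""
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]

SAMPLE = """
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
175 Program_Fail_Count_Chip 0x0033 100 100 010 Pre-fail Always - 112194158616
190 Airflow_Temperature_Cel 0x0022 070 069 000 Old_age Always - 30 (Min/Max 30/33)
"""

for row in failing(parse_attributes(SAMPLE)):
    print(row["id"], row["name"], row["value"], "<=", row["thresh"])
```

Note that the check uses the normalised VALUE column, not the raw value: attribute 175 has a very large raw value here but a normalised value of 100 against a threshold of 10, so it is not flagged.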


DidierM

Author

More of my tests:
=================================================================

The RAID controller's battery looks fine:
it doesn't need replacing.

# MegaCli -AdpBbuCmd -aAll

BBU status for Adapter: 0

BatteryType: Unknown
Voltage: 9858 mV
Current: 0 mA
Temperature: 24 C

BBU Firmware Status:

Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No

Battery state:

GasGuageStatus:
Fully Discharged : Yes
Fully Charged : No
Discharging : Yes
Initialized : No
Remaining Time Alarm : Yes
Remaining Capacity Alarm: No
Discharge Terminated : Yes
Over Temperature : No
Charging Terminated : Yes
Over Charged : No

Adapter 0: Get BBU Capacity Info Failed.

BBU Design Info for Adapter: 0

Date of Manufacture: 04/07, 2017
Design Capacity: 283 mAh
Design Voltage: 9411 mV
Specification Info: 0
Serial Number: 3195
Pack Stat Configuration: 0x0000
Manufacture Name: LSI
Device Name: CVPM02
Device Chemistry: EDLC
Battery FRU: N/A

BBU Properties for Adapter: 0

Auto Learn Period: 2412000 Sec
Next Learn time: 574535506 Sec
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled

Exit Code: 0x00
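
To avoid reading this output by eye each time, here is a minimal sketch that pulls the `key : value` pairs out of a saved `MegaCli -AdpBbuCmd -aAll` dump and reports a few status flags when they are set to `Yes`. The watchlist and the function names are assumptions chosen for illustration, not an official list of critical fields. The dump above reports `Device Chemistry: EDLC`, i.e. an electric double-layer capacitor.

```python
# Flags that, when reported as "Yes", usually deserve a closer look.
# This watchlist is an assumption for illustration, not an official list.
WATCHLIST = (
    "Battery Pack Missing",
    "Battery Replacement required",
    "Remaining Capacity Low",
    "I2c Errors Detected",
)

def parse_fields(text):
    """Return "key : value" pairs; the first occurrence of a key wins."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key and value and key not in fields:
            fields[key] = value
    return fields

def bbu_warnings(text):
    """Names of watched flags that the dump reports as "Yes"."""
    fields = parse_fields(text)
    return [name for name in WATCHLIST if fields.get(name) == "Yes"]

# A few lines copied from the output above.
SAMPLE = """\
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Device Chemistry: EDLC
"""

print(bbu_warnings(SAMPLE))
```

Keeping the first occurrence of a key matters here because `Voltage` and `Temperature` each appear twice in the dump, once as a measurement and once as a status.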

===========================
I rechecked the syslog and found this for Wednesday, when I tried to reboot, without success:

Feb 21 09:35:37 rbx03 systemd[1]: Started PVE LXC Container: 308.
Feb 21 09:35:37 rbx03 pvestatd[2518]: status update time (37.194 seconds)
Feb 21 09:35:37 rbx03 pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/vz-hd: -1
Feb 21 09:35:37 rbx03 pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/local: -1
Feb 21 09:35:37 rbx03 kernel: [ 204.914243] audit: type=1400 audit(1519205737.737:14): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/pts/" pid=14856 comm="mount" flags="rw, remount"
Feb 21 09:35:37 rbx03 kernel: [ 205.116034] vmbr0: port 5(veth308i0) entered blocking state
Feb 21 09:35:37 rbx03 kernel: [ 205.116035] vmbr0: port 5(veth308i0) entered forwarding state
Feb 21 09:35:38 rbx03 pve-guests[14964]: starting CT 311: UPID:rbx03:00003A74:00005031:5A8D3D6A:vzstart:311:root@pam:
Feb 21 09:35:38 rbx03 pve-guests[2646]: starting task UPID:rbx03:00003A74:00005031:5A8D3D6A:vzstart:311:root@pam:
Feb 21 09:35:38 rbx03 systemd[1]: Starting PVE LXC Container: 311...
Feb 21 09:35:38 rbx03 kernel: [ 205.376695] audit: type=1400 audit(1519205738.200:15): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=15057 comm="(dovecot)" flags="rw, rslave"
Feb 21 09:35:38 rbx03 kernel: [ 205.827821] EXT4-fs warning (device loop4): ext4_multi_mount_protect:324: MMP interval 42 higher than expected, please wait.
Feb 21 09:35:38 rbx03 kernel: [ 205.827821]
Feb 21 09:36:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 09:36:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 09:36:01 rbx03 CRON[16024]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 09:36:23 rbx03 kernel: [ 250.900898] EXT4-fs (loop4): recovery complete
Feb 21 09:36:23 rbx03 systemd-udevd[16605]: Could not generate persistent MAC address for veth56KR6L: No such file or directory
Feb 21 09:36:23 rbx03 kernel: [ 250.954487] IPv6: ADDRCONF(NETDEV_UP): veth311i0: link is not ready
Feb 21 09:36:24 rbx03 kernel: [ 251.346316] vmbr0: port 6(veth311i0) entered blocking state
Feb 21 09:36:24 rbx03 kernel: [ 251.347765] device veth311i0 entered promiscuous mode
Feb 21 09:36:24 rbx03 systemd[1]: Started PVE LXC Container: 311.
Feb 21 09:36:24 rbx03 pvestatd[2518]: status update time (36.885 seconds)
Feb 21 09:36:24 rbx03 pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/local: -1
Feb 21 09:36:24 rbx03 pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/vz-hd: -1
Feb 21 09:36:24 rbx03 kernel: [ 252.067274] vmbr0: port 6(veth311i0) entered blocking state
Feb 21 09:36:24 rbx03 kernel: [ 252.068355] vmbr0: port 6(veth311i0) entered forwarding state
Feb 21 09:36:25 rbx03 pve-guests[2646]: starting task UPID:rbx03:00004651:0000628F:5A8D3D99:vzstart:312:root@pam:
Feb 21 09:36:25 rbx03 pve-guests[18001]: starting CT 312: UPID:rbx03:00004651:0000628F:5A8D3D99:vzstart:312:root@pam:
Feb 21 09:36:25 rbx03 systemd[1]: Starting PVE LXC Container: 312...
Feb 21 09:37:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
Feb 21 09:37:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
Feb 21 09:37:01 rbx03 CRON[20556]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
Feb 21 09:37:10 rbx03 kernel: [ 297.983777] EXT4-fs (loop5): 10 orphan inodes deleted
Feb 21 09:37:10 rbx03 kernel: [ 297.986711] EXT4-fs (loop5): mounted filesystem with ordered data mode. Opts: (null)
Feb 21 09:37:10 rbx03 systemd-udevd[20696]: Could not generate persistent MAC address for veth2A949V: No such file or directory
Feb 21 09:37:11 rbx03 kernel: [ 298.424647] vmbr0: port 7(veth312i0) entered blocking state
Feb 21 09:37:11 rbx03 kernel: [ 298.425395] vmbr0: port 7(veth312i0) entered disabled state
Feb 21 09:37:11 rbx03 kernel: [ 298.426130] device veth312i0 entered promiscuous mode
Feb 21 09:37:11 rbx03 systemd[1]: Started PVE LXC Container: 312.
Feb 21 09:37:11 rbx03 pvestatd[2518]: status update time (36.972 seconds)
Feb 21 09:37:11 rbx03 pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/local: -1
Feb 21 09:37:11 rbx03 pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/vz-hd: -1
Feb 21 09:37:11 rbx03 kernel: [ 299.157501] vmbr0: port 7(veth312i0) entered blocking state
Feb 21 09:37:11 rbx03 kernel: [ 299.158543] vmbr0: port 7(veth312i0) entered forwarding state
Feb 21 09:37:12 rbx03 pve-guests[21880]: starting CT 358: UPID:rbx03:00005578:000074ED:5A8D3DC8:vzstart:358:root@pam:
Feb 21 09:37:12 rbx03 pve-guests[2646]: starting task UPID:rbx03:00005578:000074ED:5A8D3DC8:vzstart:358:root@pam:
Feb 21 09:37:12 rbx03 systemd[1]: Starting PVE LXC Container: 358...

========================

**What are these lines?**

**pmxcfs[2369]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/rbx03/local: -1**

I'll keep looking, and then I'll try a reboot.

Thanks, and have a good weekend.


Buddy

I haven't tried rebooting yet.
Proxmox is running. All the containers are running...
But of course I'm going to run the fsck checks and reboot!
Thanks. Didier

What I meant was: what did you do during the 3 days in rescue mode?
Didn't you run an fsck on the SSD while you were in rescue?

Otherwise the disks look fine; it may be worth running a quick test:
smartctl -t short -d megaraid,4 -a /dev/sda (and so on for the other 3 disks).

Also, is Proxmox fully up to date?
apt-get update && apt-get upgrade

Because with the Meltdown and Spectre vulnerabilities, the first patches released aren't necessarily the most stable.


DidierM

Author

Hello Buddy
I didn't get as far as I wanted because I was down with a nasty cold, doctor's visit, etc.
;-)
I had already run all the tests from the server's Rescue HTTP interface: nothing detected.
Proxmox has always been kept up to date.
Not necessarily immediately, but never more than a week behind.
Yes, it could be a side effect of a Meltdown or Spectre patch...

smartctl test: actually, it was only yesterday that I found out how to use smartctl on disks behind a hardware RAID ;-)
I'm launching these short tests on the 2 SSDs and the 2 HDs.
Thanks. Didier


janus57

Hello,

I'm launching these short tests on the 2 SSDs and the 2 HDs.

You should run long tests in cases like this: I've already had a short test tell me "all good" while a long test said "careful, I'm seeing errors here".

Regards, janus57
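
A small sketch of how the long tests could be queued for every disk behind the MegaRAID controller, then read back once they have finished. It only builds and prints the `smartctl` command lines, so nothing is executed. The helper name is hypothetical, and the disk IDs are an assumption: 4 and 5 appear in the smartd log lines of this thread, while 6 and 7 are guesses for the two remaining disks.

```python
import shlex

def smartctl_command(disk_id, *args, device="/dev/sda"):
    """Build a smartctl command line for one disk behind a MegaRAID controller."""
    return ["smartctl", *args, "-d", f"megaraid,{disk_id}", device]

# Assumed IDs: 4 and 5 are seen in the thread's logs, 6 and 7 are guesses.
DISK_IDS = [4, 5, 6, 7]

print("# start an extended (long) self-test on each disk")
for disk_id in DISK_IDS:
    print(shlex.join(smartctl_command(disk_id, "-t", "long")))

print("# later, once the tests have finished, read the self-test log")
for disk_id in DISK_IDS:
    print(shlex.join(smartctl_command(disk_id, "-l", "selftest")))
```

A long test runs in the background on the drive itself, so the log has to be read in a separate step after the estimated duration that smartctl prints when the test starts.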


DidierM

Author

You can waste a silly amount of time when you're ill and your head is in a fog...
On the 1st day, I booted into rescue mode... mounted the SSDs and HDs...
and I DIDN'T EVEN find the syslog!!!
...
I found it 2 days later, right where it was supposed to be.
But my head was too foggy during my first search...
;-)


DidierM

Author

Latest smartctl LONG tests on the 2 HDs:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 **Extended offline Completed without error** 00% 2321 -
# 2 Short offline Completed without error 00% 2315 -

Everything looks OK.
No idea what caused Wednesday's crash.

I'm rebooting one last time.
Reboot OK, Proxmox OK

I consider this dedicated server sound and am putting it back into production.
Didier


DavidM12

I had exactly the same problem as you with a Proxmox 5 ZFS install that crashed twice for no apparent reason within 3 weeks.

According to support, some dedicated servers had problems with Proxmox 5. The Proxmox templates are in fact disabled on some of them, and Proxmox 5 is offered in beta mode to flag the system's instability. So I took my server out of production, because I no longer trust it.


DidierM

Author

I'd been told it might be the first versions of the Meltdown / Spectre CPU patches that were causing reboots?

Since my last reboot, Proxmox has been running without my having changed anything.
No problems at all since then.
_Proxmox currently seems stable_.

But frankly, on top of the ton of problems I had during the migration from Proxmox 3.4 OpenVZ --> Proxmox 4.3 (then Proxmox 5) LXC... not simple at all...
the trust isn't really there any more...

I had NEVER had this kind of problem or crash with Proxmox 3 OpenVZ!
That one was extremely stable and FLEXIBLE
(no need to stop a container when you wanted to change / add an IP, whereas shrinking a container is impossible in Proxmox 4 and 5... etc.)

I'm waiting for Ubuntu 18.04 LTS and then I'll switch to **VPS**
or possibly move the hosting of the Drupal sites to **OVH Cloud Web**...

But I'm fed up with these random Proxmox problems...
Not to mention that with a dedicated server... if you don't have a 2nd one for high availability... if you get a crash or a hardware failure... you're in trouble...
Even with backups of the sites and/or the Proxmox containers, you won't get it back up within 2 hours...


DavidM12

As it happens, I have two dedicated servers, one of which serves as a backup in case the other crashes or fails. The production LXCs on the first are backed up to the second. The first runs Proxmox 4 ZFS and the second Proxmox 5 ZFS beta.

The real joke is when SYS support tells you to go to the forum to solve your problem because they are unable to help you. But help from the forum is very hit-and-miss. So I've just cancelled the extra dedicated server that I was paying for for nothing. I no longer trust it; it went down twice. My clients started to wonder about my quality of service. My LXCs are backed up by snapshot to a Kimsufi, which, whatever people say, has never crashed in 3 years of use.

But like you, I'm increasingly considering a VPS for the strategic LXCs.


DidierM

Author

I'm increasingly considering a VPS for the strategic LXCs

I was also looking to cut costs, because in the end I have a lot of associations, clubs... i.e. clients that don't cover the costs.
I do have a few professional clients. But they need reliability!
Too few pro clients to keep 2 big dedicated servers (SSD, hardware RAID...).

On the other hand, we're moving more and more towards the cloud, i.e. VPS but also managed hosting, so I wouldn't even have to look after the Linux side any more.
I do enjoy it, but I don't have enough time.

In particular, I've spent EVENINGS and WEEKENDS on email configuration, security, etc.... and in the end my emails still land in spam at Hotmail / Live / Outlook.

In the end, email is a real responsibility!
Even with POP3, when it's down and the client is stuck because they're no longer receiving any mail...
And with mail, you can't charge TENS of euros per month per mailbox = not profitable
--> I'm going to move them gradually to Gmail, G Suite at €40 per year per mailbox.

As for the sites, a VPS provisionally at the end of April (Ubuntu 18.04 LTS), but I'll probably move them to Cloud Web once it's in production (I need 4 GB of RAM for Drupal 8 with Composer).

Even though I really enjoy tinkering with Linux, I don't have enough time. It's better for me to focus on the Drupal sites.


DidierM

Author

Another crash of this Proxmox 5 dedicated server this morning!

At that point, SSH login works, but it doesn't accept my SSH key.
I have to enter the password.

After that, nothing works any more (apart from "df -lh")...
Even though I'm connected over SSH, even reboot doesn't work.
--> hard reboot via the OVH manager

OK, the containers restart.
But pffffff... again!

I see nothing in the syslog, except that there is nothing at all between 6:15 and 6:43 GMT:

May 22 06:10:01 rbx03 CRON[10563]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
May 22 06:11:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
May 22 06:11:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
May 22 06:11:01 rbx03 CRON[20085]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
May 22 06:11:26 rbx03 smartd[1014]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
May 22 06:11:33 rbx03 rrdcached[2295]: flushing old values
May 22 06:11:33 rbx03 rrdcached[2295]: rotating journals
May 22 06:11:33 rbx03 rrdcached[2295]: started new journal /var/lib/rrdcached/journal/rrd.journal.1526969493.060273
May 22 06:11:33 rbx03 rrdcached[2295]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1526962293.060342
May 22 06:12:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
May 22 06:12:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
May 22 06:12:01 rbx03 CRON[20994]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
May 22 06:13:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
May 22 06:13:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
May 22 06:13:01 rbx03 CRON[21987]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
May 22 06:14:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
May 22 06:14:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
May 22 06:14:01 rbx03 CRON[23946]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
May 22 06:15:00 rbx03 systemd[1]: Starting Proxmox VE replication runner...
May 22 06:15:00 rbx03 systemd[1]: Started Proxmox VE replication runner.
May 22 06:15:01 rbx03 CRON[25139]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
May 22 06:15:01 rbx03 CRON[25138]: (root) CMD (/usr/local/rtm/bin/rtm 47 > /dev/null 2> /dev/null)
May 22 06:43:28 rbx03 systemd-modules-load[357]: Inserted module 'iscsi_tcp'
May 22 06:43:28 rbx03 kernel: [ 0.000000] Linux version 4.15.17-1-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-8 (Thu, 03 May 2018 08:43:38 +0200) ()
May 22 06:43:28 rbx03 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.17-1-pve root=/dev/sda1 ro noquiet nosplash net.ifnames=0 biosdevname=0
May 22 06:43:28 rbx03 kernel: [ 0.000000] KERNEL supported cpus:
May 22 06:43:28 rbx03 kernel: [ 0.000000] Intel GenuineIntel
May 22 06:43:28 rbx03 kernel: [ 0.000000] AMD AuthenticAMD
May 22 06:43:28 rbx03 kernel: [ 0.000000] Centaur CentaurHauls
May 22 06:43:28 rbx03 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 22 06:43:28 rbx03 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 22 06:43:28 rbx03 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 22 06:43:28 rbx03 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
May 22 06:43:28 rbx03 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
May 22 06:43:28 rbx03 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
May 22 06:43:28 rbx03 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009afff] usable
May 22 06:43:28 rbx03 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009b000-0x000000000009ffff] reserved
May 22 06:43:28 rbx03 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
May 22 06:43:28 rbx03 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000796f0fff] usable

No time to analyse this in detail this morning!
Have a good day. Didier
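
Silent periods like the one between 06:15 and 06:43 can be located automatically. The sketch below walks through syslog lines, parses the classic `Mon DD HH:MM:SS` prefix and reports every pair of consecutive timestamps that are more than a chosen interval apart. The function name, the 5-minute default and the `year` parameter are assumptions: syslog lines in this format carry no year, so it has to be supplied, and month names are parsed with the current locale (English here).

```python
from datetime import datetime, timedelta

def find_gaps(lines, year, threshold=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs separated by more than `threshold`."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # The timestamp prefix is always 15 characters: "May 22 06:15:01".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation lines or lines without a timestamp
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps

# Three lines copied from the log above (messages shortened).
SAMPLE = [
    "May 22 06:15:00 rbx03 systemd[1]: Started Proxmox VE replication runner.",
    "May 22 06:15:01 rbx03 CRON[25138]: (root) CMD (/usr/local/rtm/bin/rtm 47)",
    "May 22 06:43:28 rbx03 systemd-modules-load[357]: Inserted module 'iscsi_tcp'",
]

for start, end in find_gaps(SAMPLE, year=2018):
    print(f"no log entries between {start} and {end} ({end - start})")
```

On a machine that logs something every minute, as this one does with the replication runner and the rtm cron job, a gap of several minutes followed by kernel boot messages marks the window in which the host stopped writing to disk.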


DidierM

Author

Found nothing in the dedicated server's syslogs.

I rebooted Proxmox.
It's running!
The sites are up...
well... almost all of them!

I now realise that 2 containers that had several IPs only have the 1st IP responding!
But :o ...
Why have the other IPs stopped responding?

pffffffff...
OK, not critical, but what is going on with Proxmox?

I had kernel updates yesterday, a crash this morning (I hadn't been able to reboot straight away)...
and after the reboot, only 1 IP per container???

pfffff...
Have a good evening. Didier


ArnaudL1

Hi Didier,

I had two reboots on a server running Proxmox.
What I noticed is that the reboots happened at backup times. Is that the case for you?
When I contacted OVH, I learned that the reboots had been carried out by technicians because the server had not responded to ping, so I disabled the monitoring option to keep a technician from interfering with my tests; they no longer intervene as a result.

After the reboots, both Proxmox installs failed to come back up because of a kernel problem. I was able to see this at boot time with the IPMI (Java KVM).
I freed up space on the /boot partition and switched kernels so that it would boot again.

For my part, I had planned to move to a dedicated server with SSD+HD, so I didn't keep the machines, but I spent quite a lot of time on this problem.

I'm now wondering whether, in the end, I should let Proxmox update the kernels itself, because on a reboot that can be fatal ;-)
On the other hand, the IPMI is handy but not great: it's impossible to paste a password. Since I had set a 256-character password, I struggled.
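
Since a full /boot partition is mentioned above as one way a kernel update can leave the machine unbootable, here is a minimal sketch for checking the free space before installing a new kernel. It uses only the Python standard library; the function names and the 100 MiB threshold are arbitrary assumptions, not a Proxmox recommendation.

```python
import shutil

def free_mebibytes(path="/boot"):
    """Free space on the filesystem that contains `path`, in MiB."""
    return shutil.disk_usage(path).free / (1024 * 1024)

def has_room(path="/boot", min_free_mb=100):
    """True if at least `min_free_mb` MiB are free (threshold is an assumption)."""
    return free_mebibytes(path) >= min_free_mb
```

Calling `has_room()` before an upgrade, and removing old kernel packages first when it returns `False`, is one possible way to use it.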


DidierM

Author

Hello Arnaud
**Not during the backups**: the last crash was around 8 a.m.
My container backups run around 2 or 3 a.m....

**Nor are these reboots by OVH technicians**, because the Proxmox server is UP, i.e. it answers pings and I can SSH in.
However, it no longer accepts my SSH key. I have to type my password.
And the environment is reduced: no more coloured prompt, and most commands no longer work!
"ls" works
"reboot" doesn't work
I had to reboot it myself manually from the OVH manager.

Each time, without my doing anything else, Proxmox worked again after a "hardware" reboot.

*IPMI*: in general, a reboot is all I need.
No need to check the boot messages...

**Kernel?**
It's possible.
For the last crash, I had installed updates the day before, including a new kernel version, but I hadn't rebooted yet.

"Let Proxmox update the kernel itself"? Meaning?
I do my updates manually, as on all my Ubuntu or Debian servers / containers.

Frankly, I'm undecided.
Too many problems since Proxmox 4 LXC...
whereas the old Proxmox 3 OpenVZ was EXTREMELY stable and flexible!

Or else it's a hardware problem on the dedicated server... one that leaves no messages in the syslog?

Have a good weekend. Didier

