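Background:

In a production environment, a server suddenly lost its network link and SSH connections to it failed.

Initial diagnosis:

The kernel log shows the following NIC errors: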

Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period

Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period

Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it

Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: detected SFP+: 5

Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Jan 24 11:52:43 localhost kernel: bond5: link status definitely up for interface eth0, 10000 Mbps full duplex.

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Detected Tx Unit Hang

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx_buffer_info[next_to_clean]

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period
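A log like this is easier to triage when the recurring messages are counted by type. The following is a minimal Python sketch of that idea, not part of the original investigation. The message fragments are copied from the excerpt above, while the labels and the `summarize` helper are made up for illustration.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpt; the labels are illustrative only.
SIGNATURES = {
    "rx_queue_not_disabled": re.compile(r"RXDCTL\.ENABLE on Rx queue \d+ not cleared"),
    "tx_unit_hang": re.compile(r"Detected Tx Unit Hang"),
    "adapter_reset": re.compile(r"Reset adapter"),
    "bond_slave_down": re.compile(r"link status definitely down for interface \S+"),
}

def summarize(lines):
    """Count how many log lines match each signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts

# A few lines taken verbatim from the kernel log above.
SAMPLE = [
    "Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period",
    "Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it",
    "Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Detected Tx Unit Hang",
    "Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter",
    "Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter",
    "Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period",
]

for label, count in summarize(SAMPLE).items():
    print(label, count)
```

On a live system the same function could be fed the lines of the kernel log file instead of `SAMPLE`.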

NIC PCI information:

# lspci -vvv -s 84:00.0
84:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at f7e20000 (64-bit, non-prefetchable) [disabled] [size=128K]
        Region 2: I/O ports at f020 [disabled] [size=32]
        Region 4: Memory at f7e44000 (64-bit, non-prefetchable) [disabled] [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #4, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <2us, L1 <32us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB
        Capabilities: [e0] Vital Product Data
                Unknown small resource type 06, will not decode more.
        Capabilities: [100] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140] Device Serial Number 98-f5-37-ff-ff-e3-64-73
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 384, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
                Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe
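Two details in this dump stand out: the Control line shows `Mem-` and `BusMaster-` (memory decoding and bus mastering are off, which matches the `[disabled]` regions), and DevSta shows `CorrErr+ UncorrErr+ UnsuppReq+`. The sketch below, in Python, pulls such `Name+` / `Name-` flags out of individual `lspci -vvv` lines so they can be checked by a script. The `parse_flags` helper and its regular expression are assumptions for illustration and only cover the simple flag tokens seen on these lines.

```python
import re

def parse_flags(line):
    """Return {flag_name: bool} for tokens such as 'Mem-' or 'CorrErr+' on one lspci line."""
    pairs = re.findall(r"([^\s:]+?)([+-])(?=\s|$)", line)
    return {name: sign == "+" for name, sign in pairs}

# Lines copied from the lspci output above.
CONTROL_LINE = "Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-"
DEVSTA_LINE = "DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-"

control = parse_flags(CONTROL_LINE)
devsta = parse_flags(DEVSTA_LINE)

print("memory decoding enabled:", control["Mem"])
print("bus mastering enabled:", control["BusMaster"])
print("uncorrectable error logged:", devsta["UncorrErr"])
```

With memory decoding off, the device does not answer MMIO reads, so a register dump taken in this state would be expected to show all ones.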

NIC register information:

# ethtool -d  eth0
0x042A4: LINKS (Link Status register)                 0xFFFFFFFF
       Link Status:                                   up
       Link Speed:                                    10G
0x05080: FCTRL (Filter Control register)              0xFFFFFFFF
       Receive Flow Control Packets:                  enabled
       Receive Priority Flow Control Packets:         enabled
       Discard Pause Frames:                          enabled
       Pass MAC Control Frames:                       enabled
       Broadcast Accept:                              enabled
       Unicast Promiscuous:                           enabled
       Multicast Promiscuous:                         enabled
       Store Bad Packets:                             enabled
0x05088: VLNCTRL (VLAN Control register)              0xFFFFFFFF
       VLAN Mode:                                     enabled
       VLAN Filter:                                   enabled
0x02100: SRRCTL0 (Split and Replic Rx Control 0)      0xFFFFFFFF
       Receive Buffer Size:                           16KB
0x03D00: RMCS (Receive Music Control register)        0xFFFFFFFF
       Transmit Flow Control:                         enabled
       Priority Flow Control:                         enabled
0x04250: HLREG0 (Highlander Control 0 register)       0xFFFFFFFF
       Transmit CRC:                                  enabled
       Receive CRC Strip:                             enabled
       Jumbo Frames:                                  enabled
       Pad Short Frames:                              enabled
       Loopback:                                      enabled
0x00000: CTRL        (Device Control)                 0xFFFFFFFF
0x00008: STATUS      (Device Status)                  0xFFFFFFFF
0x00018: CTRL_EXT    (Extended Device Control)        0xFFFFFFFF
0x00020: ESDP        (Extended SDP Control)           0xFFFFFFFF
0x00028: EODSDP      (Extended OD SDP Control)        0xFFFFFFFF
0x00200: LEDCTL      (LED Control)                    0xFFFFFFFF

........

0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF
0x01050: RDH01       (Receive Descriptor Head 01)     0xFFFFFFFF
0x01090: RDH02       (Receive Descriptor Head 02)     0xFFFFFFFF
0x010D0: RDH03       (Receive Descriptor Head 03)     0xFFFFFFFF
0x01110: RDH04       (Receive Descriptor Head 04)     0xFFFFFFFF

..........

0x01028: RXDCTL00    (Receive Descriptor Control 00)  0xFFFFFFFF
0x01068: RXDCTL01    (Receive Descriptor Control 01)  0xFFFFFFFF
0x010A8: RXDCTL02    (Receive Descriptor Control 02)  0xFFFFFFFF

........

0x06010: TDH00       (Transmit Descriptor Head 00)    0xFFFFFFFF
0x06050: TDH01       (Transmit Descriptor Head 01)    0xFFFFFFFF
0x06090: TDH02       (Transmit Descriptor Head 02)    0xFFFFFFFF
0x060D0: TDH03       (Transmit Descriptor Head 03)    0xFFFFFFFF
0x06110: TDH04       (Transmit Descriptor Head 04)    0xFFFFFFFF
0x06150: TDH05       (Transmit Descriptor Head 05)    0xFFFFFFFF
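Every register in this dump reads 0xFFFFFFFF, so the decoded fields under each register (link up, 10G, loopback enabled, and so on) are just the all-ones value being interpreted and should not be trusted. As a rough illustration, the Python sketch below counts how many register lines in `ethtool -d` output read all ones. The `all_ones_count` helper and its line pattern are assumptions based only on the layout shown above.

```python
import re

# Matches lines such as "0x00000: CTRL        (Device Control)      0xFFFFFFFF".
REG_LINE = re.compile(r"^0x[0-9A-Fa-f]+:\s+\S+\s+\(.*\)\s+(0x[0-9A-Fa-f]{8})$")

def all_ones_count(dump_text):
    """Return (registers reading 0xFFFFFFFF, total registers parsed)."""
    total = 0
    all_ones = 0
    for line in dump_text.splitlines():
        match = REG_LINE.match(line.strip())
        if match is None:
            continue  # decoded field lines and elision markers are skipped
        total += 1
        if match.group(1).upper() == "0XFFFFFFFF":
            all_ones += 1
    return all_ones, total

# A few lines copied from the dump above.
SAMPLE_DUMP = """\
0x042A4: LINKS (Link Status register)                 0xFFFFFFFF
       Link Status:                                   up
       Link Speed:                                    10G
0x00000: CTRL        (Device Control)                 0xFFFFFFFF
0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF
0x06010: TDH00       (Transmit Descriptor Head 00)    0xFFFFFFFF
"""

bad, total = all_ones_count(SAMPLE_DUMP)
print(f"{bad} of {total} registers read 0xFFFFFFFF")
```

A single register reading all ones can be legitimate. It is the fact that every register does so that points at the bus rather than at the NIC configuration.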

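Possible cause:

The BAR0 address looks fine, but every register now reads 0xFFFFFFFF. The 82599 registers were normal at first and turned into all FFs after the card had been running for a while (about 10 hours). Reads that return all ones typically mean the device is no longer answering MMIO accesses, which is consistent with the Mem- flag and the [disabled] regions in the lspci output.

A likely cause is poor contact on the PCIe interface.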
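One way to narrow this down on a live system is to look at the PCI configuration header directly: an all-ones vendor ID would suggest the device has dropped off the bus entirely, while a valid vendor ID with the memory-enable bit cleared would suggest the device is still present but has lost its configuration. The Python sketch below decodes the first 8 bytes of configuration space under that assumption. The field layout is the standard PCI header, `/sys/bus/pci/devices/<BDF>/config` is the usual Linux sysfs path, and the helper names and the sample bytes are hypothetical.

```python
import struct

def decode_pci_header(raw):
    """Decode vendor ID, device ID and command bits from the start of PCI config space."""
    vendor, device, command, _status = struct.unpack_from("<HHHH", raw, 0)
    return {
        "vendor": vendor,
        "device": device,
        "present": vendor != 0xFFFF,         # all-ones vendor ID: config reads fail too
        "io_enabled": bool(command & 0x1),   # command register bit 0
        "mem_enabled": bool(command & 0x2),  # command register bit 1
        "bus_master": bool(command & 0x4),   # command register bit 2
    }

def read_pci_header(bdf):
    """Read the header of a device such as '0000:84:00.0' through sysfs (Linux only)."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as config:
        return decode_pci_header(config.read(8))

# Synthetic bytes for illustration: Intel vendor ID 0x8086, a placeholder
# device ID, and a command register with every enable bit cleared.
sample = struct.pack("<HHHH", 0x8086, 0x1234, 0x0000, 0x0010)
info = decode_pci_header(sample)
print("present:", info["present"], "mem_enabled:", info["mem_enabled"])
```

On the affected server, `read_pci_header("0000:84:00.0")` would be the call to try. The lspci output above already suggests the second case, since configuration space was still readable when it was captured.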
