问题背景:客户反馈Oracle rac集群节点宕机

1、首先查看宕机原因,归档日志满导致服务重启,查看归档日志路径是USE_DB_RECOVERY_FILE_DEST (默认路径),

安装的时候没有做调整,应该调整单独的归档目录,首先清理归档日志然后修改归档路径

2、节点一正常启动,节点二起不来    没有cluster服务
  检查集群服务
在rac2节点上检查集群服务的状态报错
[grid@rac2 ~]# /u01/app/11.2.0/grid/bin/crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
根据上面报错,可以判断出crs是有问题。
尝试启动也报错:注意需要使用root

尝试启动crs服务
root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
正常情况是:
[root@rac2 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
检查crs服务,发现有问题:
[grid@rac2 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services demon
CRS-4534: Cannot communicate with Event Manager‘

然后节点rac2查看ip情况,发现vip和scan ip都已经不在,可以判断出节点rac已经脱离了集群。
查看节点 ifconfig -a

3、尝试重新注册节点2加入集群
[root@rac2 ~]# sh /u01/app/11.2.0/grid/root.sh
Performing root user operation for Oracle 11g

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
User ignored Prerequisites during installation
Installing Trace File Analyzer
Configure Oracle Grid Infrastructure for a Cluster ... succeeded

4、还是有问题,清理节点2的配置信息,然后重新运行root.sh
[root@rac2 trace]$ /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
[root@rac2 ~]# /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
[root@rac2 bin]# /u01/app/11.2.0/grid/root.sh

报错:
[root@rac2 install]#  /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
Can't
locate Env.pm in @INC (@INC contains: /usr/local/lib64/perl5
/usr/local/share/perl5 /usr/lib64/perl5/vendor_perl
/usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .
/u01/app/11.2.0/grid/crs/install) at crsconfig_lib.pm line 703.
BEGIN failed--compilation aborted at crsconfig_lib.pm line 703.
Compilation failed in require at /u01/app/11.2.0/grid/crs/install/roothas.pl line 166.
BEGIN failed--compilation aborted at /u01/app/11.2.0/grid/crs/install/roothas.pl line 166.
缺少依赖包  安装命令 yum install perl-Env

已安装:
  perl-Env.noarch 0:1.04-2.el7

5、清理节点2配置信息
[root@rac2 install]#  /u01/app/11.2.0/grid/crs/install/roothas.pl -verbose -deconfig -force
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Stop failed, or completed with errors.
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Delete failed, or completed with errors.
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac2'
CRS-2677: Stop of 'ora.mdnsd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'rac2'
CRS-2677: Stop of 'ora.crf' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac2'
CRS-2677: Stop of 'ora.gpnpd' on 'rac2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac2' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully deconfigured Oracle Restart stack

6、重新注册到集群中
[root@rac2 install]# /u01/app/11.2.0/grid/root.sh
Performing root user operation for Oracle 11g
The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/11.2.0/grid
Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
User ignored Prerequisites during installation
Installing Trace File Analyzer
OLR initialization - successful
Adding Clusterware entries to inittab
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node rac1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Start of resource "ora.cssd" failed
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac2'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac2'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac2' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac2'
CRS-2676: Start of 'ora.diskmon' on 'rac2' succeeded
CRS-2674: Start of 'ora.cssd' on 'rac2' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'rac2'
CRS-2681: Clean of 'ora.cssd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'rac2'
CRS-2677: Stop of 'ora.cssdmonitor' on 'rac2' succeeded
CRS-5804: Communication error with agent process
CRS-4000: Command Start failed, or completed with errors.
Failed to start Oracle Grid Infrastructure stack
Failed
to start Cluster Synchorinisation Service in clustered mode at
/u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 1278.
/u01/app/11.2.0/grid/perl/bin/perl
-I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install
/u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed
依然失败

7、CSSD没有在第二个节点上启动。$grid_home/log/rac2子目录中查找cssd日志文件。查看日志信息。
/u01/app/11.2.0/grid/log/rac2/cssd
2019-10-12 15:41:19.013: [    CSSD][3199571712]clssgmDiscEndpcl: gipcDestroy 0x8a28
2019-10-12 15:41:19.064: [    CSSD][3181754112]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2019-10-12
15:41:19.844: [    CSSD][3186484992]clssnmvDHBValidateNcopy: node 1,
rac1, has a disk HB, but no network HB, DHB has rcfg 464729747, wrtcnt,
8055111, LATS 336904, lastSeqNo 8055110, uniqueness 1569234927,
timestamp 1570866136/3845241248
2019-10-12 15:41:20.064: [    CSSD][3181754112]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2019-10-12
15:41:20.845: [    CSSD][3186484992]clssnmvDHBValidateNcopy: node 1,
rac1, has a disk HB, but no network HB, DHB has rcfg 464729747, wrtcnt,
8055112, LATS 337904, lastSeqNo 8055111, uniqueness 1569234927,
timestamp 1570866137/3845242248

8、查看节点2的心跳
[grid@rac2 /]$ ping 20.20.20.201  --节点1的priv
PING 20.20.20.201 (20.20.20.201) 56(84) bytes of data.
From 20.20.20.202 icmp_seq=1 Destination Host Unreachable
From 20.20.20.202 icmp_seq=2 Destination Host Unreachable
From 20.20.20.202 icmp_seq=3 Destination Host Unreachable
From 20.20.20.202 icmp_seq=4 Destination Host Unreachable
 心跳不通、。。。。。心累,据客户说节点1的心跳出过好几次问题了,估计网卡有问题。

征得客户同意,先尝试节点1的网卡重启下,然后把服务重启下,节点1/2服务都正常起来了,
后续建议客户更换网卡消除隐患。

9、绕了一大圈是因为心跳的问题,解决问题就应该大胆假设小心求证,对可能的原因排错最终顺藤摸瓜抓住本质。

CRS-2674: Start of 'ora.cssd' on 'rac2' failed 引发的rac集群服务起不来问题的更多相关文章

  1. 管理集群中的 crs 管理员

     管理集群中的 crs 管理员 oracle Managing CRS Administrators in the Cluster Use the following commands to ma ...

  2. 11gR2 集群(CRS/GRID)新功能—— SCAN(Single Client Access Name)

    https://blogs.oracle.com/database4cn/11gr2-crsgrid-scansingle-client-access-name

  3. 11g RAC r2 的启停命令概述1

    目标: 熟悉主要进程的启停顺序 了解独占模式 -excl crsctl start crs与crsctl start cluster 区别 1.熟悉主要进程的启停顺序 1.1 启动节点rac1: [r ...

  4. oracle 11g rac集群重启顺序以及常用管理命令简介

    转至:https://www.cnblogs.com/yj411511/p/12459533.html 目录 1.关闭数据库 1.1 查看数据库实例状态 1.2 停止所有节点上实例 1.3 确认数据库 ...

  5. 11g RAC集群启动关闭、各种资源检查、配置信息查看汇总。

    简要:一:集群的启动与关闭 1. rac集群的手动启动[root@node1 bin]# ./crsctl start cluster -all2. 查看rac集群的状态[root@node1 bin ...

  6. Oracle 11g RAC运维总结

    转至:https://blog.csdn.net/qq_41944882/article/details/103560879 1 术语解释1.1 高可用(HA)什么是高可用?顾名思义我们能轻松地理解是 ...

  7. CRS-0184 Cannot communicate with the CRS daemon

    事件背景 rman清理脚本异常.导致磁盘空间爆满(一个环境变量没有设置正确) 释放磁盘空间,进行rman清理 之后,领导把实例重启,但是ams实例没有关闭 环境 系统 : AIX 数据库: Oracl ...

  8. Oracle Rac crs无法启动

    OS:ORACLE LINUX 5.7 DB:11.2.0.3 RAC:YES 故障:1.两节点RAC,节点分别为linuxdb1.linuxdb2,其中节点linuxdb2服务器出现故障,无法启动2 ...

  9. RAC2——11g Grid Infrastructure的新机制

    版权声明:本文为博主原创文章,未经博主允许不得转载. 可以看到,最初CRS(Cluster Ready Services)名词的起源就是因为10.1中作为集群软件的原因.后来经历了Clusterwar ...

随机推荐

  1. yzoj1657货仓选址 题解

    题面: 在一条数轴上有N家商店,它们的坐标分别为 A[1]~A[N].现在需要在数轴上建立一家货仓,每天清晨,从货仓到每家商店都要运送一车商品.为了提高效率,求把货仓建在何处,可以使得货仓到每家商店的 ...

  2. mysql之innodb存储引擎介绍

    一.Innodb体系架构 1.1.后台线程 后台任务主要负责刷新内存中的数据,保证缓冲池的数据是最近的数据,此外还将修改的数据刷新到文件磁盘,保证在数据库发生异常的情况下Innodb能恢复到正常的运行 ...

  3. redis之pipeline使用

    redis之pipeline 我们要完成一个业务,可能会对redis做连续的多个操作,这有很多个步骤是需要依次连续执行的.这样的场景,网络传输的耗时将是限制redis处理量的主要瓶颈. 那么此时就可以 ...

  4. 前端-HTML-web服务本质-HTTP协议-请求-标签-01(待完善)

    目录 前端 什么是前端 什么是后端 学习流程 前端三剑客的形容 web服务的本质 测试--浏览器作为客户端向服务器发起请求 浏览器输入网址回车发生了几件事 ***** HTTP协议(超文本传输协议) ...

  5. BeanCopier类

    网上学习了一番BeanCopier类. cglib是一款比较底层的操作java字节码的框架. 下面通过拷贝bean对象来测试BeanCopier的特性: public class OrderEntit ...

  6. Linux 中 Xampp 的 https 安全证书配置

    博客地址:http://www.moonxy.com 一.前言 HTTP 协议是不加密传输数据的,也就是用户跟你的网站之间传递数据有可能在途中被截获,破解传递的真实内容,所以使用不加密的 HTTP 的 ...

  7. Day 2 Bash shell 认识

    1.拍摄虚拟机的快照 2. 什么是Bash shell? 命令解释器,将用户输入的命令,翻译给内核程序,将用户输入的指令翻译给内核 程序,内核处理完成之后将结果返回给bash. 如何打开一个bash窗 ...

  8. [DE] How to learn Big Data

    打开一瞧:50G的文件! emptystacks jobstacks jobtickets stackrequests worker 大数据加数据分析,需要以python+scikit,sql作为基础 ...

  9. null,undefined,undeclared的区别

    1.null表示"没有对象",即该处不应该有值,转为数值时为0.典型用法是: (1) 作为函数的参数,表示该函数的参数不是对象. (2) 作为对象原型链的终点. 2.undefin ...

  10. hadoop之hdfs架构详解

    本文主要从两个方面对hdfs进行阐述,第一就是hdfs的整个架构以及组成,第二就是hdfs文件的读写流程. 一.HDFS概述 标题中提到hdfs(Hadoop Distribute File Syst ...