Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (文档 ID 1546004.1)
In this Document
| Purpose |
| Scope |
| Details |
| What does "split brain" mean? |
| Why is this a problem? |
| How does the clusterware resolve a "split brain" situation? |
| Identifying a split-brain eviction |
| Finding the cohort |
| Understanding the cohort message |
| Using the cohort message to identify interconnect network issues |
| Follow-up Action |
| Community Discussions |
| References |
APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
PURPOSE
The purpose of this note is to explain split-brain node evictions in Oracle Clusterware release 11.2
SCOPE
The intended audience of this note is Oracle Clusterware 11.2 administrators at any level of expertise. As written, this note applies only to 11.2.
DETAILS
Missed network heartbeat (NHB) evictions happen when ocssd of the surviving node loses contact with the evicted node over the interconnect. The nodes must be able to communicate over the interconnect to avoid a "split brain" situation. In the case of a "split brain" node eviction, one node aborted itself to avoid "split brain" when communication over the interconnect was compromised.
What does "split brain" mean?
"Split brain" means that there are 2 or more distinct sets of nodes, or "cohorts", with no communication between the two cohorts.
For example:
Suppose there are 4 nodes named A, B, C, D, in the following situation
* Nodes A,B can talk to each other; nodes C,D can talk to each other
* But A and B cannot talk to C or D, and vice versa
Then there are two cohorts: {A, B} and {C, D}.
Why is this a problem?
In a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption. So the split-brain must be resolved.
How does the clusterware resolve a "split brain" situation?
Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort.
If both of the cohorts are the same size, the cohort with the lowest numbered node in it survives.
The clusterware identifies the LARGEST cohort, and aborts all the nodes which do NOT belong to that cohort.
Identifying a split-brain eviction
In a split-brain node eviction, the following message is present in the ocssd log ($GRID_HOME/log/<hostname>/ocssd/ocssd.log) of the evicted node:
And earlier in the same log, within 10 minutes prior to "clssnmCheckDskInfo: Aborting local node" message:
clssnmPollingThread: node %s (%n) at <X>% heartbeat fatal, removal in...
Finding the cohort
The split-brain message in the ocssd.log will show "cohort" information. For example:
2012-12-28 20:26:25.803: [ CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4
2012-12-28 20:26:25.803: [ CSSD][1111296320](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02, based on map type 2
Understanding the cohort message
In a split-brain situation, ocssd on each node records on the voting disk the set of nodes it can communicate with. Each set is known as a "cohort". When there are two (or more) mutually non-intersecting sets, we have a "split-brain" situation. It means that there are two (or more) separate sets of nodes which cannot talk to each other over the interconnect.
For example, in the above quote
My cohort: 1
Surviving cohort: 2,3,4
The meaning of these messages is
* "My cohort: 1" => The list of nodes I can communicate with: 1
* "Surviving cohort: 2,3,4" => From the voting disk, I know that nodes 2,3,4 can all communicate with each other.
* "Cohort of 1 nodes with leader 1, sprora01, is smaller than cohort of 3 nodes led by node 2, sprora02"
=> Oracle Clusterware has identified that the cohort {1} is smaller than the cohort {2,3,4}.
Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort. In this case, the smaller cohort is {1}. Therefore, ocssd on node {1} aborts the node.
Using the cohort message to identify interconnect network issues
The cohort message describes which nodes can communicate with each other.
Each cohort is a set of nodes that can talk to each other, and cannot talk to the nodes NOT in the cohort.
In the above example, the cohort message tells us that nodes {2,3,4} are all in communication; node 1 is not in communication with any of them.
Follow-up Action
The private network between node 1 and the other 3 nodes should be checked.
Please refer to the following note to check private interconnect network: Document 1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions
Community Discussions
Still have questions? Use the communities window below to search for similar discussions or start a new discussion on this subject.
Note: Window is the LIVE community not a screenshot.
Click here to open in main browser window.
REFERENCES
NOTE:1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions
Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (文档 ID 1546004.1)的更多相关文章
- Oracle Recommended Patches -- "Oracle JavaVM Component Database PSU" (OJVM PSU) Patches (文档 ID 1929745.1)
From: https://support.oracle.com What is "Oracle JavaVM Component Database PSU" ? Oracle J ...
- Oracle版本发布规划 (文档 ID 742060.1)
Oracle Database Release Schedule of Current Database Releases (文档 ID 742060.1) Oracle Database RoadM ...
- Oracle Grid Infrastructure安装部署文档
1. 部署环境步骤 1.1 软件环境 操作系统: CentOS release 6.5 oracle安装包: linux.x64_11gR2_grid.zip linux.x64_11gR2_data ...
- Grid Infrastructure Single Client Access Name (SCAN) Explained (文档 ID 887522.1)
APPLIES TO: Oracle Database - Enterprise Edition - Version 11.2.0.1 and laterExalogic Elastic Cloud ...
- ORACLE LINUX 6.3 + ORACLE 11.2.0.3 RAC + VBOX安装文档
ORACLE LINUX 6.3 + ORACLE 11.2.0.3 RAC + VBOX安装文档 2015-10-21 12:51 525人阅读 评论(0) 收藏 举报 分类: Oracle RA ...
- oracle数据库 PSU,SPU(CPU),Bundle Patches 和 Patchsets 补丁号码快速参考 (文档 ID 1922396.1)
数据库 PSU,SPU(CPU),Bundle Patches 和 Patchsets 补丁号码快速参考 (文档 ID 1922396.1) 文档内容 用途 详细信息 Patchsets ...
- xtts v4for oracle 11g&12c(文档ID 2471245
xtts v4for oracle 11g&12c(文档ID 2471245.1) 序号 主机 操作项目 操作内容 备注: 阶段一:初始阶段 1.1 源端 环境验证 migrate_check ...
- 使用node.js 文档里的方法写一个web服务器
刚刚看了node.js文档里的一个小例子,就是用 node.js 写一个web服务器的小例子 上代码 (*^▽^*) //helloworld.js// 使用node.js写一个服务器 const h ...
- What is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1)
In this Document Purpose Scope Details 1. Clusterware layer 2. Real Application Cluster (d ...
随机推荐
- 【性能诊断】六、并发场景的性能分析(windbg案例,大量的内部异常造成CPU飙升)
在做产品的某个核心模块的性能优化时,发现压到100并发时应用服务器的CPU就飙升至90%以上,50并发以后TPS就基本定格在一个数值上.使用性能监视器收集应用服务器的数据,发现每秒的.NET CLR ...
- 【java】serialVersionUID作用
serialVersionUID适用于Java的序列化机制.简单来说,Java的序列化机制是通过判断类的serialVersionUID来验证版本一致性的.在进行反序列化时,JVM会把传来的字节流中的 ...
- Javascript中函数的四种调用方式
一.Javascript中函数的几个基本知识点: 1.函数的名字只是一个指向函数的指针,所以即使在不同的执行环境,即不同对象调用这个函数,这个函数指向的仍然是同一个函数. 2.函数中有两个特殊的内部属 ...
- 51nod 1056
n<=10000000000 然后欧拉函数的前缀和可以用莫比乌斯函数的前缀和快速求,注意各种取模 #include<cstdio> typedef long long i64; ,P ...
- 我的wordpress插件总结
酷壳(CoolShell.cn)WordPress的插件 注意: 下面的这些插件的链接是其插件主页的链接,你可以在WordPress后台管理中添加插件时直接搜索安装就可以了. 插件不是越多越好.WP的 ...
- iphone Dev 开发实例8: Parsing an RSS Feed Using NSXMLParser
From : http://useyourloaf.com/blog/2010/10/16/parsing-an-rss-feed-using-nsxmlparser.html Structure o ...
- 偷懒小工具 - SSO单点登录通用类(可跨域)(上)
目的 目的很明确,就是搭建单点登录的帮助类,并且是一贯的极简风格(调用方法保持5行以内). 并且与其他类库,关联性降低.所以,不使用WebAPI或者WebService等. 思路 因为上次有朋友 ...
- room-views-用窗口颜色清除背景(Clear Background with Window Colour)选项
这个选项是默认开启的,它的作用是在游戏每一帧绘制以前,都用一个颜色打底(覆盖整个游戏场景包括背景,从而实现背景清除),然后在这个基础上再画背景.场景等等. 如果关闭,则在游戏每一帧以前绘制背景(绘制背 ...
- Apriori 关联算法学习
1. 挖掘关联规则 1.1 什么是关联规则 一言蔽之,关联规则是形如X→Y的蕴涵式,表示通过X可以推导“得到”Y,其中X和Y分别称为关联规则的先导(antecedent或left-hand-sid ...
- PLSQL_性能优化工具系列05_SQL Trace/Event 10046 Trace
2014-06-25 Created By BaoXinjian