mongo数据库CPU使用率飙高及优化

出现的情况

某天高高兴兴上班,发现一台mongo数据库负载较高,达到了300多,且一直很稳定,难道是硬件过时了.

检查主机的资源使用发现一个mongod进程使用了几乎所有的CPU资源.

top - 13:55:32 up 134 days, 20:24,  4 users,  load average: 313.44, 304.76, 306.63

Tasks: 850 total,   2 running, 848 sleeping,   0 stopped,   0 zombie

%Cpu(s): 95.1 us,  3.2 sy,  0.0 ni,  1.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

KiB Mem : 39464614+total, 14463147+free, 79623008 used, 17039164+buff/cache

KiB Swap:        0 total,        0 free,        0 used. 31228908+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND

219329 mongo     20   0   33.1g  28.6g   9456 S  9286  7.6  21315:15 mongod

310443 mongo     20   0 9264864 672576   5432 S  57.4  0.2 904:50.28 mongos

229368 mongo     20   0   19.9g  13.6g  10460 S  54.2  3.6  33754:51 mongod

213236 mongo     20   0   16.1g  13.0g   8664 S  31.3  3.4 669:49.10 mongod

229190 mongo     20   0 3111000 819996   9856 S   1.9  0.2   1607:46 mongod

444095 mongo     20   0 3004816 720496   9736 S   1.6  0.2   1347:41 mongod

通过perf查看资源的使用情况,wt_btcur_next,wt_cursor_get_key两个函数使用了几乎所有资源,看名字就可以猜个大概是mongo数据库wt引擎负责b+树和游标读取数据的函数.大概率和mongo的读写消耗有很大关系.

使用mongostat查看了对应mongo分片的资源消耗情况,执行的sql比较少,但是队列特别长,猜测有mongo实例在作妖.

############################ xx.xx.xx.48 mongostat at 2024-11-07-13-59-01 ###################################

insert query update delete getmore command dirty  used flushes vsize   res    qrw   arw net_in net_out conn       set repl                time

    52   519     75    340     131   263|0  2.0% 58.2%       0 33.3G 28.8G 0|1776 1|126   478k    502k 2214 shardsrv2  PRI Nov  7 13:59:10.224

    50   556     66    370     118   231|0  2.0% 58.2%       0 33.3G 28.8G 0|1783 1|127   465k    515k 2214 shardsrv2  PRI Nov  7 13:59:12.217

    57   544     72    348     126   243|0  2.0% 58.2%       0 33.3G 28.8G 0|1777 1|128   474k    508k 2214 shardsrv2  PRI Nov  7 13:59:14.222

这个mongo是分片集群中的一个分片,看到其他分片的资源消耗都很低,判断有两种情况,一种是分片数据分配不均匀,另一种情况是有问题的表不是一个分片表.

首先查询了分片的balance情况,设置的凌晨开始,早上六点结束,看了几天的日志,发现并没有报错,查询每个分片中的数据也很均匀,可以排除是数据分配不均匀产生的问题.

再检查mongo的日志,发现了端倪,一个删除的sql执行了将近两个小时

2024-11-07T13:32:33.227+0800 I WRITE    [conn444] remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ndeleted:0 numYields:970758 locks:{ Global: { acquireCount: { r: 970759, w: 970759 } }, Database: { acquireCount: { w: 970759 } }, Collection: { acquireCount: { w: 970759 } } } 7993866ms

2024-11-07T13:32:33.239+0800 I COMMAND  [conn444] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { "Uin": 2527524969 }, limit: 0 } ], writeConcern: {}, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 970760, w: 970760 } }, Database: { acquireCount: { w: 970760 } }, Collection: { acquireCount: { w: 970760 } }, Metadata: { acquireCount: { W: 1 } } } protocol:op_command 7993879ms

可以确认就是这个sql消耗了所有的CPU资源,成为了电费杀手,简直就是电网战略合作伙伴的关系了.

查看慢SQL,先筛选执行时间特别长的,找到更多类似的SQL



		{

				"desc" : "conn2389",

				"threadId" : "140234727237376",

				"connectionId" : 2389,

				"client" : "xx.xx.xx.xx:38706",

				"clientMetadata" : {

						"driver" : {

								"name" : "MongoDB Internal Client",

								"version" : "x.x.xx"

						},

						"os" : {

								"type" : "Linux",

								"name" : xxxxxxx ",

								"architecture" : "x86_64",

								"version" : "Kernel xxxxxx"

						}

				},

				"active" : true,

				"opid" : 10218935,

				"secs_running" : 6566,

				"microsecs_running" : NumberLong("6566038844"),

				"op" : "remove",

				"ns" : "PS_DB.PS_ACCOUNT",

				"query" : {

						""Uin"" : NumberLong(1586603223)

				},

				"planSummary" : "COLLSCAN",

				"numYields" : 551828,

				"locks" : {

						"Global" : "r"

				},

				"waitingForLock" : false,

				"lockStats" : {

						"Global" : {

								"acquireCount" : {

										"r" : NumberLong(551829),

										"w" : NumberLong(551828)

								}

						},

						"Database" : {

								"acquireCount" : {

										"w" : NumberLong(551828)

								}

						},

						"Collection" : {

								"acquireCount" : {

										"w" : NumberLong(551828)

								}

						}

				}

}

执行时间都特别长,数量大,并且使用了全表扫描.

检查表上的索引情况

shardsrv2:PRIMARY> db.PS_ACCOUNT.getIndexes()

[

        {

                "v" : 2,

                "key" : {

                        "UIN" : 1

                },

                "name" : "UIN_1",

                "ns" : "PS_DB.PS_ACCOUNT"

        },

        {

                "v" : 2,

                "key" : {

                        "Uin" : 1

                },

                "name" : "Uin_1",

                "ns" : "PS_DB.PS_ACCOUNT"

        },

        {

                "v" : 2,

                "key" : {

                        "_id" : 1

                },

                "name" : "_id_",

                "ns" : "PS_DB.PS_ACCOUNT"

        }

]

表上是有合适索引可以使用的,但是并没有使用,难道是选择性不好,经过检查,UIN的条件还是不错的.

自己跑个执行计划试试看

mongos> db.PS_ACCOUNT.find({ Uin: { $gte: 0 } }).explain()

{

        "queryPlanner" : {

                "mongosPlannerVersion" : 1,

                "winningPlan" : {

                        "stage" : "SINGLE_SHARD",

                        "shards" : [

                                {

                                        "shardName" : "config",

                                        "connectionString" : "configsrv/xx.xx.xx.47:23000,xx.xx.xx.48:23000,xx.xx.xx.49:23000",

                                        "serverInfo" : {

                                                "host" : "mongo-016",

                                                "port" : 23000,

                                                "version" : "x.x.xx",

                                                "gitVersion" : "00000000000000000000"

                                        },

                                        "plannerVersion" : 1,

                                        "namespace" : "admin.PS_ACCOUNT",

                                        "indexFilterSet" : false,

                                        "parsedQuery" : {

                                                "Uin" : {

                                                        "$gte" : 0

                                                }

                                        },

                                        "winningPlan" : {

                                                "stage" : "EOF"

                                        },

                                        "rejectedPlans" : [ ]

                                }

                        ]

                }

        },

        "ok" : 1

}

mongos> db.PS_ACCOUNT.find({ Uin: "1294288363" }).explain()

{

        "queryPlanner" : {

                "mongosPlannerVersion" : 1,

                "winningPlan" : {

                        "stage" : "SINGLE_SHARD",

                        "shards" : [

                                {

                                        "shardName" : "config",

                                        "connectionString" : "configsrv/xx.xx.xx.47:23000,xx.xx.xx.48:23000,xx.xx.xx.49:23000",

                                        "serverInfo" : {

                                                "host" : "mongo-016",

                                                "port" : 23000,

                                                "version" : "x.x.xx",

                                                "gitVersion" : "0000000000000000000000"

                                        },

                                        "plannerVersion" : 1,

                                        "namespace" : "admin.PS_ACCOUNT",

                                        "indexFilterSet" : false,

                                        "parsedQuery" : {

                                                "Uin" : {

                                                        "$eq" : "1294288363"

                                                }

                                        },

                                        "winningPlan" : {

                                                "stage" : "EOF"

                                        },

                                        "rejectedPlans" : [ ]

                                }

                        ]

                }

        },

        "ok" : 1

}

自己跑了执行计划,也执行了条件相同的SQL,都可以正常使用到索引,带着疑问打开了详细的日志,等多记录点日志再分析.

2024-11-06T19:00:45.013+0800 I COMMAND  [conn2108] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { Uin: 3801898340 }, limit: 0 } ], writeConcern: { w: 1 }, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_command 12ms

2024-11-07T13:32:33.227+0800 I WRITE    [conn444] remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ndeleted:0 numYields:970758 locks:{ Global: { acquireCount: { r: 970759, w: 970759 } }, Database: { acquireCount: { w: 970759 } }, Collection: { acquireCount: { w: 970759 } } } 7993866ms

2024-11-07T13:32:33.239+0800 I COMMAND  [conn444] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { "Uin": 2527524969 }, limit: 0 } ], writeConcern: {}, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 970760, w: 970760 } }, Database: { acquireCount: { w: 970760 } }, Collection: { acquireCount: { w: 970760 } }, Metadata: { acquireCount: { W: 1 } } } protocol:op_command 7993879ms

日志中记录了大致相同的两个SQL语句,一个耗时12ms,另一个耗时7993879ms,同样是SQL,怎么差距就这么大呢.

通过二十年玩找不同游戏的经验,仔细观察发现第二个SQL的Uin外面有个双引号.

去测试环境测试了下正常SQL和不正常SQL的

--正常的SQL

db.PS_USERROUTER.find({"Uin" : 1234576}).explain()

{

        "queryPlanner" : {

                "plannerVersion" : 1,

                "namespace" : "PS_CACHE.PS_USERROUTER",

                "indexFilterSet" : false,

                "parsedQuery" : {

                        "Uin" : {

                                "$eq" : 1234576

                        }

                },

                "winningPlan" : {

                        "stage" : "FETCH",

                        "inputStage" : {

                                "stage" : "IXSCAN",

                                "keyPattern" : {

                                        "Uin" : 1

                                },

                                "indexName" : "Uin_1",

                                "isMultiKey" : false,

                                "isUnique" : false,

                                "isSparse" : false,

                                "isPartial" : false,

                                "indexVersion" : 1,

                                "direction" : "forward",

                                "indexBounds" : {

                                        "Uin" : [

                                                "[1234576.0, 1234576.0]"

                                        ]

                                }

                        }

                },

                "rejectedPlans" : [ ]

        },

        "serverInfo" : {

                "host" : "test9211",

                "port" : 27017,

                "version" : "3.4.10",

                "gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"

        },

        "ok" : 1

}

--不正常的SQL

db.PS_USERROUTER.find({"\"Uin\"" : 1234576}).explain()

{

        "queryPlanner" : {

                "plannerVersion" : 1,

                "namespace" : "PS_CACHE.PS_USERROUTER",

                "indexFilterSet" : false,

                "parsedQuery" : {

                        ""Uin"" : {

                                "$eq" : 1234576

                        }

                },

                "winningPlan" : {

                        "stage" : "COLLSCAN",

                        "filter" : {

                                ""Uin"" : {

                                        "$eq" : 1234576

                                }

                        },

                        "direction" : "forward"

                },

                "rejectedPlans" : [ ]

        },

        "serverInfo" : {

                "host" : "test9211",

                "port" : 27017,

                "version" : "3.4.10",

                "gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"

        },

        "ok" : 1

}

模拟出了想要的效果,第一个SQL会使用索引,第二个SQL不会使用索引.

解决方案

知道了是双引号引发的问题,也就知道了问题的原因.

对应的解决方案也就有了两种:

1.研发同事找到对应有问题的SQL并修改为正确的SQL.

2.创建一个奇葩的带有双引号的索引.

db.PS_USERROUTER.createIndex({"\"Uin\"": 1});

> db.PS_USERROUTER.find({"\"Uin\"" : 1234576}).explain()

{

        "queryPlanner" : {

                "plannerVersion" : 1,

                "namespace" : "PS_CACHE.PS_USERROUTER",

                "indexFilterSet" : false,

                "parsedQuery" : {

                        ""Uin"" : {

                                "$eq" : 1234576

                        }

                },

                "winningPlan" : {

                        "stage" : "FETCH",

                        "inputStage" : {

                                "stage" : "IXSCAN",

                                "keyPattern" : {

                                        ""Uin"" : 1

                                },

                                "indexName" : "\"Uin\"_1",

                                "isMultiKey" : false,

                                "multiKeyPaths" : {

                                        ""Uin"" : [ ]

                                },

                                "isUnique" : false,

                                "isSparse" : false,

                                "isPartial" : false,

                                "indexVersion" : 1,

                                "direction" : "forward",

                                "indexBounds" : {

                                        ""Uin"" : [

                                                "[1234576.0, 1234576.0]"

                                        ]

                                }

                        }

                },

                "rejectedPlans" : [ ]

        },

        "serverInfo" : {

                "host" : "test9211",

                "port" : 27017,

                "version" : "x.x.xx",

                "gitVersion" : "00000000000000000000000"

        },

        "ok" : 1

测试环境测试了使用带双引号的索引完美解决了这个问题.

负载的情况看着就吓人,第一个解决方案要等上线窗口才能操作.因此在生产库也创建了带有双引号的索引,先临时解决了这个问题.

总结

如果是关系型数据库,写错列名的情况一定会被解析器拦截.但是在mongo中,集合中并没有严格的列限制,如果给了一个错误名字或多加了东西的条件,没有命中索引,就会去表中查找一遍,结果注定是没有的,数据库的特性决定了一定要去集合中一丝不苟的查找一遍.