Investigating RemoteWorker Retries in Slony-I

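The customer's problem was:

An error occurred when adding a new slave DB node to a running Slony-I environment.

The log showed the same error over and over, with the copy starting again from the beginning each time (the log from the restarts is omitted):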

CONFIG remoteWorkerThread_1: connected to provider DB
CONFIG remoteWorkerThread_1: prepare to copy table "tst"."a_tbl"
CONFIG remoteWorkerThread_1: prepare to copy table "tst"."b_tbl"
CONFIG remoteWorkerThread_1: prepare to copy table "tst"."c_tbl"
CONFIG remoteWorkerThread_1: all tables for set  found on subscriber
CONFIG remoteWorkerThread_1: copy sequence "tst"."a_no_seq"
CONFIG remoteWorkerThread_1: copy sequence "tst"."b_no_seq"
CONFIG remoteWorkerThread_1: copy sequence "tst"."c_no_seq"
CONFIG remoteWorkerThread_1: copy table "tst"."a_tbl"
CONFIG remoteWorkerThread_1: Begin COPY of table "tst"."a_tbl"
NOTICE:  truncate of "tst"."a_tbl" succeeded
CONFIG remoteWorkerThread_1:  bytes copied for table "tst"."a_tbl"
CONFIG remoteWorkerThread_1: 27.97 seconds to copy table "tst"."a_tbl"
CONFIG remoteWorkerThread_1: copy table "tst"."b_tbl"
CONFIG remoteWorkerThread_1: Begin COPY of table "tst"."b_tbl"
ERROR  remoteWorkerThread_1: "select "_mycluster".copyFields(2);" 
WARN   remoteWorkerThread_1: data copy for set  failed  times - sleep  seconds
NOTICE:  Slony-I: Logswitch to sl_log_2 initiated
CONTEXT:  SQL statement "SELECT "_mycluster".logswitch_start()"
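
The WARN line is the one that signals a retry, so it is a natural thing to watch for when checking a slon log. The following is a minimal sketch in C of picking it out, assuming the message has the shape produced by the slon_log(SLON_WARN, ...) call in remoteWorkerThread_main quoted later in this post. The names parse_copy_retry_warning and CopyRetryWarning are hypothetical and not part of Slony-I.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one "data copy ... failed" warning. */
typedef struct
{
    int node_id;        /* N in remoteWorkerThread_N */
    int set_id;         /* replication set being copied */
    int retries;        /* failed attempts so far */
    int sleep_seconds;  /* announced delay before the next attempt */
} CopyRetryWarning;

/*
 * Return 1 and fill *out if line is a copy-retry warning, otherwise
 * return 0 and leave *out untouched.
 */
int
parse_copy_retry_warning(const char *line, CopyRetryWarning *out)
{
    const char *p;
    CopyRetryWarning w;

    assert(line != NULL && out != NULL);

    /* Skip whatever precedes the thread name (log level and so on). */
    p = strstr(line, "remoteWorkerThread_");
    if (p == NULL)
        return 0;

    /* All four numbers must be present for the line to count. */
    if (sscanf(p,
               "remoteWorkerThread_%d: data copy for set %d failed %d times"
               " - sleep %d seconds",
               &w.node_id, &w.set_id, &w.retries, &w.sleep_seconds) != 4)
        return 0;

    *out = w;
    return 1;
}
```

Given a line such as `WARN   remoteWorkerThread_1: data copy for set 1 failed 3 times - sleep 60 seconds` (the numbers here are made up for illustration), the function reports node 1, set 1, 3 retries and a 60 second delay. The CONFIG and NOTICE lines are rejected.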

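After consulting the documentation and talking with the customer, I found that the problem lay in their network environment: the existing nodes and the new node were on different network segments. They were also using a network tool to monitor the network, and under certain conditions that tool would cut network connections.

That was the cause of the error. I then analyzed the code and found that the remote worker is quite diligent: if a communication error occurs, it retries again and again.

The retrying is done by a while loop in the remoteWorkerThread_main function: in the ENABLE_SUBSCRIPTION branch, it keeps calling copy_set until the copy succeeds.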

/* ----------
 * slon_remoteWorkerThread
 *
 * Listen for events on the local database connection. This means, events
 * generated by the local node only.
 * ----------
 */
void *
remoteWorkerThread_main(void *cdata)
{
    …
    /*
     * Work until shutdown or node destruction
     */
    while (true)
    {
        …
        /*
         * Event type specific processing
         */
        if (strcmp(event->ev_type, "SYNC") == 0)
        {
            …
        }
        else    /* not SYNC */
        {
            …
            /*
             * Simple configuration events. Call the corresponding runtime
             * config function, add the query to call the configuration event
             * specific stored procedure.
             */
            if (strcmp(event->ev_type, "STORE_NODE") == 0)
            {
                …
            }
            …
            else if (strcmp(event->ev_type, "ENABLE_SUBSCRIPTION") == 0)
            {
                …
                int         copy_set_retries = 0;
                …
                if (sub_receiver == rtcfg_nodeid &&
                    event->ev_origin == node->no_id)
                {
                    ScheduleStatus  sched_rc;
                    int             sleeptime = …;
                    …
                    while (true)
                    {
                        …
                        /*
                         * If the copy succeeds, exit the loop and let the
                         * transaction commit.
                         */
                        if (copy_set(node, local_conn, sub_set, event) == 0)
                        {
                            …
                            copy_set_retries = 0;
                            break;
                        }
                        copy_set_retries++;
                        /*
                         * Data copy for new enabled set has failed. Rollback
                         * the transaction, sleep and try again.
                         */
                        slon_log(SLON_WARN, "remoteWorkerThread_%d: "
                                 "data copy for set %d failed %d times - "
                                 "sleep %d seconds\n",
                                 node->no_id, sub_set, copy_set_retries,
                                 sleeptime);
                        …
                    }
                }
                else
                {
                    …
                }
                …
            }
            …
            else
            {
                …
            }
            /*
             * All simple configuration events fall through here. Commit the
             * transaction.
             */
            …
        }
        …
    }
    …
}
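
Stripped of the Slony-I specifics, the inner loop above is a retry-until-success pattern. The following is only a sketch in C under stated assumptions, not Slony-I's code: retry_until_success, attempt_fn, sleep_fn, fail_n_times and record_sleep are hypothetical names, and doubling the delay up to a cap is an assumed backoff policy, since the excerpt elides what happens to sleeptime between attempts.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical callback: returns 0 on success and non-zero on failure,
 * mirroring how the result of copy_set() is checked in the excerpt. */
typedef int (*attempt_fn)(void *ctx);

/* Hypothetical sleep hook, so the sketch can run without really waiting. */
typedef void (*sleep_fn)(int seconds);

/*
 * Call attempt() until it succeeds. After each failure, log a warning,
 * sleep, then double the delay up to max_sleep (the doubling and the cap
 * are assumptions of this sketch). Returns the number of failed attempts.
 */
int
retry_until_success(attempt_fn attempt, void *ctx, sleep_fn do_sleep,
                    int initial_sleep, int max_sleep)
{
    int retries = 0;
    int sleeptime = initial_sleep;

    assert(attempt != NULL && do_sleep != NULL);
    assert(initial_sleep > 0 && max_sleep >= initial_sleep);

    while (1)
    {
        if (attempt(ctx) == 0)
            break;              /* success: leave the loop */

        retries++;
        fprintf(stderr, "attempt failed %d times - sleep %d seconds\n",
                retries, sleeptime);
        do_sleep(sleeptime);

        /* Back off: double the delay, but never exceed max_sleep. */
        if (sleeptime < max_sleep)
        {
            sleeptime *= 2;
            if (sleeptime > max_sleep)
                sleeptime = max_sleep;
        }
    }
    return retries;
}

/* Demo helpers (hypothetical): a sleep hook that only records the total
 * requested delay, and an attempt that fails a fixed number of times. */
int total_slept = 0;

void
record_sleep(int seconds)
{
    total_slept += seconds;
}

int
fail_n_times(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0)
    {
        (*remaining_failures)--;
        return -1;
    }
    return 0;
}
```

Calling retry_until_success(fail_n_times, &failures, record_sleep, 15, 60) with failures set to 3 makes four attempts and returns 3. The delays 15 and 60 are arbitrary example values. Note that the loop has no upper bound on attempts, which matches the "diligent" behaviour described above: it only ends when an attempt succeeds.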

/* ----------
 * copy_set
 * ----------
 */
static int
copy_set(SlonNode *node, SlonConn *local_conn, int set_id,
         SlonWorkMsg_event *event)
{
    …
    /*
     * Connect to the provider DB
     */
    …
    slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
             "connected to provider DB\n",
             node->no_id);
    …
    /*
     * For each table in the set
     */
    for (tupno1 = 0; tupno1 < ntuples1; tupno1++)
    {
        char       *tab_fqname = PQgetvalue(res1, tupno1, …);

        gettimeofday(&tv_start2, NULL);
        slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
                 "prepare to copy table %s\n",
                 node->no_id, tab_fqname);

        (void) slon_mkquery(&query3, "select * from %s limit 0;",
                            tab_fqname);
        res2 = PQexec(loc_dbconn, dstring_data(&query3));
        …
    }
    …
    slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
             "all tables for set %d found on subscriber\n",
             node->no_id, set_id);
    …
    for (tupno1 = 0; tupno1 < ntuples1; tupno1++)
    {
        …
        slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
                 "copy sequence %s\n",
                 node->no_id, seq_fqname);
        …
    }
    …
    /*
     * For each table in the set
     */
    for (tupno1 = 0; tupno1 < ntuples1; tupno1++)
    {
        …
        slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
                 "copy table %s\n",
                 node->no_id, tab_fqname);
        …
        if (omit_copy) {
            …
        } else {
            slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
                     "Begin COPY of table %s\n",
                     node->no_id, tab_fqname);

            (void) slon_mkquery(&query2, "select %s.copyFields(%d);",
                                rtcfg_namespace, tab_id);
            res3 = PQexec(pro_dbconn, dstring_data(&query2));

            /* A failed query on the provider aborts the whole set copy. */
            if (PQresultStatus(res3) != PGRES_TUPLES_OK)
            {
                slon_log(SLON_ERROR, "remoteWorkerThread_%d: \"%s\" %s\n",
                         node->no_id, dstring_data(&query2),
                         PQresultErrorMessage(res3));
                …
                return -1;
            }
        …
        slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
                 INT64_FORMAT " bytes copied for table %s\n",
                 node->no_id, copysize, tab_fqname);
        …
        slon_log(SLON_CONFIG, "remoteWorkerThread_%d: "
                 "%.3f seconds to copy table %s\n",
                 node->no_id,
                 TIMEVAL_DIFF(&tv_start2, &tv_now), tab_fqname);
    }
    …
    return 0;
}
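
One consequence of this structure is that a failure in the middle of copy_set throws away the work already done: the caller rolls back and calls copy_set again, which starts from the first table. The following is a rough model of that behaviour in C, a sketch rather than Slony-I code. FakeProvider, copy_set_model and subscribe_model are hypothetical names, and the failure is simulated rather than caused by a real connection.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the provider connection: copying table
 * number fail_at_table breaks while failures_left is still positive. */
typedef struct
{
    int fail_at_table;  /* index of the table whose copy breaks */
    int failures_left;  /* how many more times the break happens */
    int tables_copied;  /* successful table copies across all attempts */
} FakeProvider;

/* Simplified model of copy_set(): copy every table in order, return -1
 * at the first failure and 0 once all tables have been copied. */
int
copy_set_model(FakeProvider *pro, const char *const *tables, int ntables)
{
    int i;

    assert(pro != NULL && tables != NULL);

    for (i = 0; i < ntables; i++)
    {
        printf("copy table %s\n", tables[i]);
        if (i == pro->fail_at_table && pro->failures_left > 0)
        {
            pro->failures_left--;
            printf("ERROR while copying %s\n", tables[i]);
            return -1;
        }
        pro->tables_copied++;
    }
    return 0;
}

/* Caller in the style of the ENABLE_SUBSCRIPTION branch: every retry runs
 * the model again from the first table. Returns the failed attempts. */
int
subscribe_model(FakeProvider *pro, const char *const *tables, int ntables)
{
    int retries = 0;

    while (copy_set_model(pro, tables, ntables) != 0)
        retries++;      /* the real loop also rolls back and sleeps here */

    return retries;
}
```

With the tables a_tbl, b_tbl and c_tbl and a provider that breaks on the second table during the first two attempts, subscribe_model returns 2 and a_tbl ends up being copied three times. This is the same pattern as in the customer's log, where a_tbl is copied successfully, b_tbl fails, and the whole sequence starts over.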
