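Comparing three regular expression engines in C++ (C regex, C++ regex, boost regex)

I needed regular expressions in C++ at work; the three options below are offered for reference.

1. C regex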

#include <sys/types.h>
#include <regex.h>
#include <sys/time.h>
#include <cstdio>
#include <iostream>

using namespace std;

const int times = 1000000;

int main(int argc, char** argv)
{
    // Backslashes are doubled so that the regex engine, not the compiler, sees "\.".
    char pattern[512] = "finance\\.sina\\.cn|stock1\\.sina\\.cn|3g\\.sina\\.com\\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp.sina.cn/.*ch=9&";
    const size_t nmatch = 10;
    regmatch_t pm[10];
    int z;
    regex_t reg;
    char buf[3][256] = {"finance.sina.cn/google.com/baidu.com.google.sina.cndddddddddddddddddddddda.sdfasdfeoasdfnahsfonadsdf",
                        "3g.com.sina.cn.google.com.dddddddddddddddddddddddddddddddddddddddddddddddddddddbaidu.com.sina.egooooooooo",
                        "http://3g.sina.com.cn/google.baiduchannel=financegogo.sjdfaposif;lasdjf.asdofjas;dfjaiel.sdfaosidfj"};
    printf("input strings:\n");
    timeval end, start;
    gettimeofday(&start, NULL);
    // REG_NOSUB: only report match / no match; pm is not filled in.
    if (regcomp(&reg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    {
        printf("regcomp failed\n");
        return 1;
    }
    for (int i = 0; i < times; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            z = regexec(&reg, buf[j], nmatch, pm, REG_NOTBOL);
            /*
            if (z == REG_NOMATCH)
                printf("no match\n");
            else
                printf("ok\n");
            */
        }
    }
    gettimeofday(&end, NULL);
    regfree(&reg);  // release the compiled pattern
    // Elapsed time in microseconds.
    long long time = (end.tv_sec - start.tv_sec) * 1000000LL + end.tv_usec - start.tv_usec;
    cout << time / 1000000 << " s and " << time % 1000000 << " us." << endl;
    return 0;
}

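Using a regular expression breaks down into a few simple steps:

1. Compile the regular expression

2. Run the match

3. Free the memory

First, compile the regular expression

int regcomp(regex_t *preg, const char *regex, int cflags);

regcomp() compiles the regular expression into an internal form that makes the subsequent matching more efficient.

preg: the regex_t structure that receives the compiled regular expression;

regex: pointer to the regular expression string;

cflags: compilation flags

There are four compilation flags, which can be combined with bitwise OR:

REG_EXTENDED: use the more powerful extended regular expression syntax

REG_ICASE: ignore case

REG_NOSUB: do not store the positions of the matches

REG_NEWLINE: treat newlines specially, so that '$' matches at the end of each line and '^' matches at the start of each line. Without it, newlines get no special treatment and the whole text is handled as a single string.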
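
To make the effect of the compilation flags concrete, here is a minimal sketch. The helper name posix_match is my own and does not come from the original post; it simply compiles, matches once, and frees, so it is meant for experimenting with flags rather than for speed.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper (not part of the original post): compiles `pattern`
// with the given cflags, runs a single match against `text`, then frees the
// compiled regex. Returns true on a match, false on no match or when the
// pattern fails to compile.
bool posix_match(const std::string& pattern, const std::string& text, int cflags)
{
    regex_t reg;
    // REG_NOSUB is always added here because only a yes/no answer is needed.
    if (regcomp(&reg, pattern.c_str(), cflags | REG_NOSUB) != 0)
        return false;
    int rc = regexec(&reg, text.c_str(), 0, nullptr, 0);
    regfree(&reg);
    return rc == 0;
}
```

With this helper, posix_match("SINA", "finance.sina.cn", REG_EXTENDED | REG_ICASE) should report a match while the same call without REG_ICASE should not, and "^stock" should only match the second line of "finance\nstock" when REG_NEWLINE is set.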

Next, run the match

int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);

preg: pointer to the compiled regular expression;

string: the target string;

nmatch: length of the pmatch array;

pmatch: array of structures that receives the position of each match;

eflags: matching flags

There are two matching flags:

REG_NOTBOL: The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec and the beginning of the string should not be interpreted as the beginning of the line.

REG_NOTEOL: The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).
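
The pmatch array is easiest to understand with an example. The sketch below uses a helper name of my own (posix_capture, not from the original post) and assumes at most 10 groups, the same array size as in the test program above. It returns the text of one capture group using the rm_so and rm_eo offsets that regexec fills in.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper (not part of the original post): returns the text of
// capture group `group` (0 = the whole match), or an empty string when there
// is no match, the group did not participate, or the pattern is invalid.
std::string posix_capture(const std::string& pattern, const std::string& text,
                          size_t group, int eflags)
{
    const size_t nmatch = 10;  // assumed upper bound on groups for this sketch
    if (group >= nmatch)
        return "";
    regex_t reg;
    // No REG_NOSUB here, otherwise pmatch would not be filled in.
    if (regcomp(&reg, pattern.c_str(), REG_EXTENDED) != 0)
        return "";
    regmatch_t pm[nmatch];
    std::string out;
    if (regexec(&reg, text.c_str(), nmatch, pm, eflags) == 0 && pm[group].rm_so != -1)
    {
        // rm_so / rm_eo are byte offsets into `text`: [rm_so, rm_eo).
        out = text.substr(pm[group].rm_so, pm[group].rm_eo - pm[group].rm_so);
    }
    regfree(&reg);
    return out;
}
```

For example, matching "(channel|ch)=([a-z]+)" against a URL containing "channel=finance" and asking for group 2 should give back "finance". Passing REG_NOTBOL as eflags makes a pattern such as "^fin" fail even on "finance", because the start of the string no longer counts as the start of a line.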

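Finally, free the memory
void regfree(regex_t *preg);
Once you are done with a compiled regular expression, or before reusing the same regex_t to compile another pattern, be sure to call this function to release it.

Also: error handling
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
When regcomp or regexec fails, call this function to obtain a string describing the error.
errcode: the error code returned by regcomp or regexec.
preg: the regular expression that was compiled with regcomp; this may be NULL.
errbuf: points to the memory that receives the error message.
errbuf_size: the length of that buffer. If the message is longer than this, regerror truncates it, but it still returns the size needed for the full message, so calling it first with errbuf_size set to 0 tells you how large a buffer to allocate.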
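
The two-step regerror call described above can be sketched as follows. The function name compile_error is my own invention for illustration; the calls to regcomp, regerror and regfree are the standard POSIX ones.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original post): returns an empty
// string when `pattern` compiles as an extended regex, otherwise the error
// message produced by regerror().
std::string compile_error(const std::string& pattern)
{
    regex_t reg;
    int rc = regcomp(&reg, pattern.c_str(), REG_EXTENDED);
    if (rc == 0)
    {
        regfree(&reg);
        return "";
    }
    // First call: errbuf_size == 0, so the buffer is ignored and the return
    // value is the size needed for the full message (including the NUL).
    size_t needed = regerror(rc, &reg, nullptr, 0);
    // One spare zero byte keeps the buffer a valid C string in any case.
    std::vector<char> buf(needed + 1, '\0');
    // Second call: the buffer is now large enough, so nothing is truncated.
    regerror(rc, &reg, buf.data(), needed);
    return std::string(buf.data());
}
```

The exact wording of the message depends on the C library, so the sketch only promises a non-empty string for a broken pattern such as "(unclosed".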

My own tests were simple, so I just used these functions directly. More on speed in a moment!

2. C++ regex

/*  written by xingming
 *  time: 2012-10-19 15:51:53
 *  for: test regex
 */
#include <regex>
#include <iostream>
#include <cstdio>
#include <string>

using namespace std;

int main(int argc, char** argv)
{
    regex pattern("[[:digit:]]", regex_constants::extended);
    printf("input strings:\n");
    string buf;
    while (cin >> buf)
    {
        printf("*******\n%s\n********\n", buf.c_str());
        if (buf == "quit")
        {
            printf("quit just now!\n");
            break;
        }
        match_results<string::const_iterator> result;
        printf("run compare now!  '%s'\n", buf.c_str());
        // regex_match succeeds only if the whole input matches the pattern.
        bool valid = regex_match(buf, result, pattern);
        printf("compare over now!  '%s'\n", buf.c_str());
        if (!valid)
            printf("no match!\n");
        else
            printf("ok\n");
    }
    return 0;
}

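I really don't want to say much about the C++ version. During testing, 'a' matched, a+ matched, and [[:w:]] matched any single character, but [[:w:]]+ still matched only one character, as if the + had no effect. So I eventually gave up on this great C++ regex. If an expert can tell me what I did wrong here, I would be truly grateful. Thank you.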
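
I cannot confirm what went wrong in the test above. One possible explanation, assuming it was built with a GCC older than 4.9, is that libstdc++'s <regex> was still incomplete at that time. Separately, regex_match only succeeds when the entire input matches, whereas regex_search looks for a match anywhere in the input. The sketch below (helper names are mine, not from the post) shows the difference on a standard library with a complete <regex>.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers (not part of the original post). Both use the POSIX
// extended grammar, as the test program above does.

// True only if the whole of `text` matches `pattern`.
bool whole_string_matches(const std::string& text, const std::string& pattern)
{
    std::regex re(pattern, std::regex_constants::extended);
    return std::regex_match(text, re);
}

// True if any substring of `text` matches `pattern`.
bool contains_match(const std::string& text, const std::string& pattern)
{
    std::regex re(pattern, std::regex_constants::extended);
    return std::regex_search(text, re);
}
```

With these, "[[:digit:]]" matches "7" as a whole string but not "2012", "[[:digit:]]+" matches "2012" as a whole string, and contains_match finds the digit inside "abc7".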

3. boost regex

#include <iostream>
#include <string>
#include <sys/time.h>
#include "boost/regex.hpp"

using namespace std;
using namespace boost;

const int times = 10000000;

int main()
{
    regex pattern("finance\\.sina\\.cn|stock1\\.sina\\.cn|3g\\.sina\\.com\\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp\\.sina\\.cn/.*ch=9&");
    cout << "input strings:" << endl;
    timeval start, end;
    gettimeofday(&start, NULL);
    string input[] = {"finance.sina.cn/google.com/baidu.com.google.sina.cn",
                      "3g.com.sina.cn.google.com.baidu.com.sina.egooooooooo",
                      "http://3g.sina.com.cn/google.baiduchannel=financegogo"};
    for (int i = 0; i < times; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            cmatch what;
            // The result is deliberately ignored; only the matching time is measured.
            bool found = regex_search(input[j].c_str(), what, pattern);
            (void)found;
            // cout << (found ? "OK!" : "error!") << endl;
        }
    }
    gettimeofday(&end, NULL);
    // Elapsed time in microseconds.
    long long time = (end.tv_sec - start.tv_sec) * 1000000LL + end.tv_usec - start.tv_usec;
    cout << time / 1000000 << " s and " << time % 1000000 << " us." << endl;
    return 0;
}

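boost regex needs little introduction. Ask around how to use regular expressions in C++ and 90% of people will recommend boost regex: it is convenient to use, the library is powerful, and plenty of material is available, so I won't elaborate here.

4. Comparison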

All times are in microseconds (us). "Iterations" is the outer loop count; each iteration matches all three strings. Runs that were not recorded are marked "-".

boost regex
Iterations    Run 1        Run 2       Run 3       Run 4       Run 5       Average
10,000        218,699      -           -           -           -           218,700
100,000       2,186,109    2,194,524   2,188,762   2,186,343   2,192,902   2,191,350
1,000,000     25,606,021   28,633,984  28,956,997  26,912,245  26,909,788  27,669,546
10,000,000    218,126,580  -           -           -           -           218,126,581

C regex
Iterations    Run 1        Run 2       Run 3       Run 4       Run 5       Average
10,000        90,631       -           -           -           -           90,632
100,000       902,658      907,547     915,934     891,250     903,899     900,113
1,000,000     9,030,497    9,016,080   8,939,238   8,953,076   9,041,565   8,983,831
10,000,000    89,609,061   -           -           -           -           89,609,062

Pattern (boost regex): finance\\.sina\\.cn|stock1\\.sina\\.cn|3g\\.sina\\.com\\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp\\.sina\\.cn/.*ch=9&
Pattern (C regex): finance\.sina\.cn|stock1\.sina\.cn|3g\.sina\.com\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp.sina.cn/.*ch=9&
Strings (both engines):
    "finance.sina.cn/google.com/baidu.com.google.sina.cn"
    "3g.com.sina.cn.google.com.baidu.com.sina.egooooooooo"
    "http://3g.sina.com.cn/google.baiduchannel=financegogo"

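Summary:

The speed of C regex surprised me: it is almost 3 times as fast as boost, so the choice of regex engine looks settled!

In the table above, both engines used the same pattern and the same strings (the strings in the C regex code shown earlier were lengthened afterwards). The speed differs by almost a factor of 3: C regex manages a little over 300,000 matches per second, while boost stays under 150,000 per second, so the comparison speaks for itself!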
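
The matches-per-second figures can be reproduced from the table with a little arithmetic. The sketch below assumes that every regexec or regex_search call counts as one match attempt, so each outer iteration contributes three; the function name is my own.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the original post): converts a benchmark
// row into match attempts per second.
//   iterations            - outer loop count
//   strings_per_iteration - strings matched in each iteration (3 in the tests)
//   elapsed_us            - measured time in microseconds
double matches_per_second(std::uint64_t iterations,
                          std::uint64_t strings_per_iteration,
                          std::uint64_t elapsed_us)
{
    if (elapsed_us == 0)
        return 0.0;  // avoid dividing by zero on a meaningless measurement
    double attempts = static_cast<double>(iterations * strings_per_iteration);
    return attempts * 1e6 / static_cast<double>(elapsed_us);
}
```

Using the reported averages for 1,000,000 iterations, matches_per_second(1000000, 3, 8983831) comes out at roughly 334,000 per second for C regex, and matches_per_second(1000000, 3, 27669546) at roughly 108,000 per second for boost.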

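The speed of C regex had already surprised me, but my next test surprised me even more.

I used to work with .NET regular expressions quite a lot, so I wrote a .NET version for comparison: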

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace 平常测试
{
    class Program
    {
        static int times = 1000000;

        static void Main(string[] args)
        {
            Regex reg = new Regex(@"(?>finance\.sina\.cn|stock1\.sina\.cn|3g\.sina\.com\.cn.*(?:channel=finance|_finance$|ch=stock|/stock/)|dp.sina.cn/.*ch=9&)", RegexOptions.Compiled);
            string[] str = new string[]{@"finance.sina.cn/google.com/baidu.com.google.sina.cn",
                    @"3g.com.sina.cn.google.com.baidu.com.sina.egooooooooo",
                    @"http://3g.sina.com.cn/google.baiduchannel=financegogo"};
            int tt = 0;  // number of successful matches
            Stopwatch watch = Stopwatch.StartNew();
            for (int i = 0; i < times; ++i)
            {
                for (int j = 0; j < 3; ++j)
                {
                    if (reg.IsMatch(str[j]))
                        ++tt;
                }
            }
            watch.Stop();
            // Elapsed time in milliseconds, then the match count.
            Console.WriteLine(watch.Elapsed.TotalMilliseconds);
            Console.WriteLine(tt);
            Console.ReadKey();
        }
    }
}

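It turned out that without RegexOptions.Compiled the speed is about the same as C regex, and with it the speed is about twice that of C regex, which filled me with awe for the people at Microsoft.

But then I went back to the description of C regex on that blog and found that I could pass compilation flags when compiling the regular expression. I added REG_NOSUB as in the code above (the earlier tests did not have it), and the result was thrilling: C regex reached more than 3,000,000 matches per second, nearly 10 times faster than the code without REG_NOSUB.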
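
For anyone who wants to repeat the REG_NOSUB comparison, here is a rough sketch. It is not the program used for the numbers in this post: the struct and function names are my own, and it times with std::chrono instead of gettimeofday. Actual timings will depend on the machine and the C library.

```cpp
#include <regex.h>
#include <chrono>
#include <cstddef>

// Hypothetical types and names (not part of the original post).
struct BenchResult
{
    long long matched;     // successful regexec calls, or -1 on compile error
    long long elapsed_us;  // time spent in the matching loop, in microseconds
};

// Compiles `pattern` as an extended regex plus `extra_cflags` (0 or REG_NOSUB)
// and matches every input `iterations` times.
BenchResult run_bench(const char* pattern, const char* const* inputs,
                      std::size_t count, int iterations, int extra_cflags)
{
    BenchResult r = {0, 0};
    regex_t reg;
    if (regcomp(&reg, pattern, REG_EXTENDED | extra_cflags) != 0)
    {
        r.matched = -1;
        return r;
    }
    regmatch_t pm[10];  // ignored by regexec when REG_NOSUB was used
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        for (std::size_t j = 0; j < count; ++j)
            if (regexec(&reg, inputs[j], 10, pm, 0) == 0)
                ++r.matched;
    auto end = std::chrono::steady_clock::now();
    regfree(&reg);
    r.elapsed_us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    return r;
}
```

Calling run_bench once with extra_cflags set to 0 and once with REG_NOSUB, on the same pattern and strings, should give the same match count in both cases, so any difference in elapsed_us comes from the bookkeeping of match positions.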

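Afterwards I changed the test strings, doubling their length to about 100 characters each (as shown in the code), and the speed dropped, but it still reached about 1,000,000 matches per second, which certainly meets our current needs.

The result is clear: C regex is the obvious choice.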
