Learning the boost-tokenizer library

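tokenizer is a string-processing library dedicated to tokenization, that is, splitting a string into tokens.
It offers a simple way to break a string into a sequence of words.
The tokenizer class template is the core of the library; it exposes the token sequence through a container-like interface (begin() and end()).
TokenizerFunc is the function object that performs the actual splitting; the default one splits on whitespace and punctuation.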

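char_delimiters_separator    splits on whitespace and punctuation (the default; deprecated in favor of char_separator)
char_separator               splits on a set of delimiter characters
escaped_list_separator       splits comma-separated (CSV) fields
offset_separator             splits at fixed offsets (field widths)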

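Limitations:
1. Delimiters can only be single characters; a multi-character string cannot act as one delimiter.
2. Support for wstring (Unicode) is not well thought out.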

xpressive (regular expressions) and string_algo offer better alternatives and handle these string operations more flexibly.

C++ Code

/*
// Declaration of the tokenizer class template, quoted from <boost/tokenizer.hpp> for reference.
template <
    typename TokenizerFunc = char_delimiters_separator<char>,
    typename Iterator = std::string::const_iterator,
    typename Type = std::string
>
class tokenizer {
private:
    typedef token_iterator_generator<TokenizerFunc,Iterator,Type> TGen;

    // It seems that MSVC does not like the unqualified use of iterator,
    // Thus we use iter internally when it is used unqualified and
    // the users of this class will always qualify iterator.
    typedef typename TGen::type iter;

public:

    typedef iter iterator;
    typedef iter const_iterator;
    typedef Type value_type;
    typedef value_type& reference;
    typedef const value_type& const_reference;
    typedef value_type* pointer;
    typedef const pointer const_pointer;
    typedef void size_type;
    typedef void difference_type;

    tokenizer(Iterator first, Iterator last,
              const TokenizerFunc& f = TokenizerFunc())
        : first_(first), last_(last), f_(f) { }

    template <typename Container>
    tokenizer(const Container& c)
        : first_(c.begin()), last_(c.end()), f_() { }

    template <typename Container>
    tokenizer(const Container& c,const TokenizerFunc& f)
        : first_(c.begin()), last_(c.end()), f_(f) { }

    void assign(Iterator first, Iterator last){
        first_ = first;
        last_ = last;
    }

    void assign(Iterator first, Iterator last, const TokenizerFunc& f){
        assign(first,last);
        f_ = f;
    }

    template <typename Container>
    void assign(const Container& c){
        assign(c.begin(),c.end());
    }

    template <typename Container>
    void assign(const Container& c, const TokenizerFunc& f){
        assign(c.begin(),c.end(),f);
    }

    iter begin() const { return iter(f_,first_,last_); }
    iter end() const { return iter(f_,last_,last_); }

private:
    Iterator first_;
    Iterator last_;
    TokenizerFunc f_;
};
*/

/************************************************************************/
/* C++ standard library */
/************************************************************************/
#include <iostream>
#include <string>

/************************************************************************/
/* C++ Boost library */
/************************************************************************/
#include <boost/tokenizer.hpp>

using namespace boost;
using namespace std;

// Print every token of a tokenizer, each one wrapped in brackets.
template<typename T>
void print(const T &tok)
{
    for (typename T::const_iterator pos = tok.begin(); pos != tok.end(); ++pos)
    {
        cout << "[" << *pos << "]";
    }
    cout << endl;
}

int main(void)
{
    // char_delimiters_separator
    string str1 = "I love my town!xian";
    tokenizer<> tok1(str1);          // default: split on whitespace and punctuation
    print(tok1);

    string str2 = "I,love,my,town!";
    tokenizer<> tok2(str2);          // default: split on whitespace and punctuation
    print(tok2);

    // char_separator
    string str3("I love my town!xian");
    char_separator<char> sep;        // default: drop whitespace, keep punctuation as tokens
    tokenizer<char_separator<char> > tok3(str3, sep);
    print(tok3);

    string str4 = ";!!;Hello|world||-Michael--Joessy;yoo;handsome|";
    char_separator<char> sep1("-;|");   // drop '-', ';' and '|'; empty tokens are discarded
    tokenizer<char_separator<char> > tok4(str4, sep1);
    print(tok4);

    // drop '-' and ';', keep '|' as a token, and keep empty tokens
    char_separator<char> sep2("-;", "|", keep_empty_tokens);
    tokenizer<char_separator<char> > tok5(str4, sep2);
    print(tok5);

    // escaped_list_separator
    string str5 = "aa,Int32,localTag1,23";
    tokenizer<escaped_list_separator<char> > tok6(str5);
    print(tok6);

    // offset_separator
    string str6 = "1225200140023";
    // NOTE: the offset values were lost from the original listing.
    // {2, 2, 4} is an assumed example (field widths of 2, 2 and 4 characters).
    int offsets[] = {2, 2, 4};
    offset_separator f(offsets, offsets + 3);
    tokenizer<offset_separator> tok7(str6, f);
    print(tok7);

    cin.get();
    return 0;
}