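Counting English words and Chinese characters in text with C++

Source code download: http://download.csdn.net/detail/nuptboyzhb/4987141

1. Count how often each Chinese character occurs in a text, as groundwork for later text classification. To count Chinese characters, the program has to decide whether the byte it has just read begins a Chinese character. The source code is as follows: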

[C++ code]

/*
 * @author: 郑海波 http://blog.csdn.net/NUPTboyZHB
 * Reference: 实验室小熊
 * Note: modified from the original
 */
#pragma warning(disable:4786)
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <map>
#include <queue>
#include <functional>
#include <utility>
#include <ctime>
using namespace std;
void topK(const int &K)
{
    double t=clock();
    ifstream infile("test.txt");
    if (!infile)
    {
        cout<<"can not open file"<<endl;
        return;// nothing to count without the input file
    }
    string s="";
    map<string,int>wordcount;
    unsigned char temp[2];
    while(true)// the input is assumed to be GB2312-encoded
    {
        infile>>temp[0];
        if(!infile) break;// end of file or read error
        if (temp[0]>=0xB0)// in GB2312 the lead byte of a Chinese character is at least 0xB0
        {
            s+=temp[0];
            infile>>temp[1];
            if(!infile) break;// truncated character at the end of the file
            s+=temp[1];
        }
        else// non-Chinese characters are not counted
        {
            s="";
            continue;
        }
        wordcount[s]++;
        s="";
    }
    cout<<"Distinct Chinese characters: "<<wordcount.size()<<endl;
    // Min-heap priority queue (greater<>): the smallest count sits on top,
    // so it is the entry evicted once the queue holds more than K items.
    priority_queue< pair< int,string >,vector< pair< int,string > >,greater< pair< int,string> > > queueK;
    for (map<string,int>::iterator iter=wordcount.begin(); iter!=wordcount.end(); iter++)
    {
        queueK.push(make_pair(iter->second,iter->first));
        if(queueK.size()>static_cast<size_t>(K))
            queueK.pop();
    }
    pair<int,string>tmp;
    // Move the survivors into a max-heap so that larger counts are printed first.
    priority_queue< pair< int,string >,vector< pair< int,string > >,less< pair< int,string> > > queueKless;
    while (!queueK.empty())
    {
        tmp=queueK.top();
        queueK.pop();
        queueKless.push(tmp);
    }
    while(!queueKless.empty())
    {
        tmp=queueKless.top();
        queueKless.pop();
        cout<<tmp.second<<"\t"<<tmp.first<<endl;
    }
    cout<<"< Elapsed Time: "<<(clock()-t)/CLOCKS_PER_SEC<<" s>"<<endl;
}
int main()
{
    int k=0;
    cout<<"http://blog.csdn.net/NUPTboyZHB\n";
    while (true)
    {
        cout<<"Show the K most frequent Chinese characters, K=";
        if(!(cin>>k)||k<=0)break;// stop on invalid input or a non-positive K
        topK(k);
    }
    return 0;
}

[Figure 1]
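
The program above treats any byte of at least 0xB0 as the start of a Chinese character. GB2312 places its Chinese characters in a narrower area: lead byte 0xB0 to 0xF7 and trail byte 0xA1 to 0xFE. Below is a minimal sketch of a stricter check, assuming well-formed GB2312 input in which every non-ASCII character occupies exactly two bytes. The names `isGb2312Hanzi` and `countHanzi` are hypothetical helpers, not part of the original source.

```cpp
#include <cstddef>
#include <map>
#include <string>

// True when (lead, trail) lies in the GB2312 Chinese-character area:
// lead byte 0xB0..0xF7, trail byte 0xA1..0xFE.
bool isGb2312Hanzi(unsigned char lead, unsigned char trail)
{
    return lead >= 0xB0 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE;
}

// Hypothetical helper: count each Chinese character in a GB2312 byte string.
// Assumes well-formed input (every byte >= 0x80 starts a two-byte code).
std::map<std::string, int> countHanzi(const std::string &bytes)
{
    std::map<std::string, int> counts;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80)
        {
            ++i;            // ASCII takes a single byte and is not counted
            continue;
        }
        if (i + 1 >= bytes.size())
            break;          // truncated two-byte code at the end of the input
        unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
        if (isGb2312Hanzi(lead, trail))
            ++counts[bytes.substr(i, 2)];
        i += 2;             // consume both bytes, counted or not
    }
    return counts;
}
```

Checking the trail byte as well keeps two-byte punctuation such as the full-width comma (0xA3 0xAC) out of the counts, and stepping two bytes at a time keeps the scan aligned with character boundaries.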

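2. Count how often each English word occurs. This is easier than counting Chinese characters: words are separated by whitespace, so each one can be read directly into a string. Punctuation attached to a word stays part of the token, so "cat," and "cat" are counted as different words.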

[C++ code]

/*
 * @author: 郑海波 http://blog.csdn.net/NUPTboyZHB
 * Reference: 实验室小熊
 * Note: modified from the original
 */
#pragma warning(disable:4786)
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <map>
#include <queue>
#include <functional>
#include <utility>
#include <ctime>
using namespace std;
void topK(const int &K)
{
    double t=clock();
    ifstream infile;
    infile.open("test.txt");
    if (!infile)
    {
        cout<<"can not open file"<<endl;
        return;// nothing to count without the input file
    }
    string s;
    map<string,int>wordcount;
    // Testing the extraction itself (rather than eof()) also counts the last
    // word when the file does not end with whitespace.
    while(infile>>s)
    {
        wordcount[s]++;
    }
    cout<<"Distinct words: "<<wordcount.size()<<endl;
    // Min-heap priority queue (greater<>): the smallest count sits on top,
    // so it is the entry evicted once the queue holds more than K items.
    priority_queue< pair< int,string >,vector< pair< int,string > >,greater< pair< int,string> > > queueK;
    for (map<string,int>::iterator iter=wordcount.begin(); iter!=wordcount.end(); iter++)
    {
        queueK.push(make_pair(iter->second,iter->first));
        if(queueK.size()>static_cast<size_t>(K))
            queueK.pop();
    }
    pair<int,string>tmp;
    // Move the survivors into a max-heap so that larger counts are printed first.
    priority_queue< pair< int,string >,vector< pair< int,string > >,less< pair< int,string> > > queueKless;
    while (!queueK.empty())
    {
        tmp=queueK.top();
        queueK.pop();
        queueKless.push(tmp);
    }
    while(!queueKless.empty())
    {
        tmp=queueKless.top();
        queueKless.pop();
        cout<<tmp.second<<"\t"<<tmp.first<<endl;
    }
    cout<<"< Elapsed Time: "<<(clock()-t)/CLOCKS_PER_SEC<<" s>"<<endl;
}
int main()
{
    int k=0;
    cout<<"http://blog.csdn.net/NUPTboyZHB\n";
    while (true)
    {
        cout<<"PUT IN K: ";
        if(!(cin>>k)||k<=0)break;// stop on invalid input or a non-positive K
        topK(k);
    }
    return 0;
}
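
Both programs pick the K most frequent entries the same way: a min-heap keeps the K largest counts seen so far, and its contents are then reversed for printing. The sketch below isolates that step as a standalone function. `Entry` and `topKEntries` are hypothetical names introduced here, and the second max-heap is replaced by filling a vector from the back, which gives the same order.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int, std::string> Entry; // (count, key)

// Return the K entries with the highest counts, most frequent first.
// Ties on the count are ordered by the key, as in std::pair comparison.
std::vector<Entry> topKEntries(const std::map<std::string, int> &counts, std::size_t K)
{
    // Min-heap: the smallest current candidate is on top, so it is the one
    // evicted whenever the heap grows past K elements.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
    {
        heap.push(Entry(it->second, it->first));
        if (heap.size() > K)
            heap.pop();
    }
    // Popping a min-heap yields ascending order, so fill the result backwards.
    std::vector<Entry> result(heap.size());
    for (std::size_t i = result.size(); i > 0; --i)
    {
        result[i - 1] = heap.top();
        heap.pop();
    }
    return result;
}
```

With N distinct keys the heap never holds more than K + 1 entries, so selection costs O(N log K) rather than the O(N log N) of sorting every entry.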

[Figure 2]
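
Because the word counter splits on whitespace only, "The", "the" and "the," end up as three different keys. One possible refinement, not part of the original code, is to normalize each token before counting it. The sketch below assumes ASCII English text; `normalizeWord` and `countWords` are hypothetical helper names.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Lowercase a token and strip leading and trailing non-alphanumeric characters.
// Characters inside the word (for example the apostrophe in "don't") are kept.
std::string normalizeWord(const std::string &token)
{
    std::size_t begin = 0, end = token.size();
    while (begin < end && !std::isalnum(static_cast<unsigned char>(token[begin])))
        ++begin;
    while (end > begin && !std::isalnum(static_cast<unsigned char>(token[end - 1])))
        --end;
    std::string word = token.substr(begin, end - begin);
    for (std::size_t i = 0; i < word.size(); ++i)
        word[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(word[i])));
    return word;
}

// Count normalized words read from any input stream (a file or a string stream).
std::map<std::string, int> countWords(std::istream &in)
{
    std::map<std::string, int> counts;
    std::string token;
    while (in >> token)
    {
        std::string word = normalizeWord(token);
        if (!word.empty())  // tokens made only of punctuation are dropped
            ++counts[word];
    }
    return counts;
}
```

Taking a `std::istream&` lets the same function read from the `ifstream` used in the program above or from a `std::istringstream` when trying it on a short sample.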

Reference: 实验室小熊
