Codeforces Gym 100338B Spam Filter: string hashing + Bayes' formula

Original problem link: http://codeforces.com/gym/100338/attachments/download/2136/20062007-winter-petrozavodsk-camp-andrew-stankevich-contest-22-asc-22-en.pdf

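Problem statement

The task describes a spam-filtering algorithm known as the Bayesian algorithm. Its first step is training: from manually labeled emails it estimates, for every word, the probability that the word appears in spam and the probability that it appears in ordinary mail. Bayes' formula is then used to decide whether each email is spam. For the exact procedure, see the problem statement or Wikipedia.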
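The per-word probability from the training step can be sketched as a small function. This is only an illustration under assumptions: the name wordSpamProbability and its parameters are hypothetical, and returning 0 for a word that never occurs in the training set is a choice made here, not something the problem statement is quoted as requiring.

```cpp
#include <cmath>

// Hypothetical helper: P(spam | word) by Bayes' formula,
//   P(S|w) = P(w|S) P(S) / (P(w|S) P(S) + P(w|G) P(G)),
// where P(w|S) is the fraction of spam training mails containing the word
// and P(w|G) is the same fraction for good training mails.
double wordSpamProbability(int spamWithWord, int spamTotal,
                           int goodWithWord, int goodTotal) {
    const double kEps = 1e-10;
    int total = spamTotal + goodTotal;
    if (total == 0) return 0.0;  // no training data at all

    double pSpam = static_cast<double>(spamTotal) / total;
    double pGood = static_cast<double>(goodTotal) / total;
    double pWordGivenSpam =
        spamTotal ? static_cast<double>(spamWithWord) / spamTotal : 0.0;
    double pWordGivenGood =
        goodTotal ? static_cast<double>(goodWithWord) / goodTotal : 0.0;

    double denom = pWordGivenSpam * pSpam + pWordGivenGood * pGood;
    if (std::fabs(denom) < kEps) return 0.0;  // assumption: unseen word counts as 0
    return pWordGivenSpam * pSpam / denom;
}
```

With these definitions the priors cancel, so the result equals spamWithWord / (spamWithWord + goodWithWord) whenever the word occurs in at least one training mail.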

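Solution

Simply simulate the procedure described in the problem. To stay within the time limit, however, the words must be hashed: each word is converted to its lowercase form and stored as a hash value instead of as a string.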
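The normalization and hashing step can be sketched as follows. This is a minimal illustration, not the reference solution: the names hashWord and wordHashes are hypothetical, the base 131 and modulus 1000000009 are just one common choice for a polynomial hash, and a word is assumed to be a maximal run of ASCII letters.

```cpp
#include <cctype>
#include <set>
#include <string>

typedef long long ll;

// Polynomial hash of a word: sum of c_i * base^(i+1) modulo a large prime.
ll hashWord(const std::string& w) {
    const ll kBase = 131, kMod = 1000000009LL;
    ll power = kBase, res = 0;
    for (unsigned char c : w) {
        res = (res + c * power) % kMod;
        power = power * kBase % kMod;
    }
    return res;
}

// Splits a line into maximal runs of letters, lowercases each run,
// and returns the set of hashes of the distinct words.
std::set<ll> wordHashes(const std::string& line) {
    std::set<ll> out;
    std::string cur;
    for (unsigned char c : line) {
        if (std::isalpha(c)) {
            cur += static_cast<char>(std::tolower(c));
        } else if (!cur.empty()) {
            out.insert(hashWord(cur));
            cur.clear();
        }
    }
    if (!cur.empty()) out.insert(hashWord(cur));  // flush the last word
    return out;
}
```

Comparing 64-bit integers in a set is much cheaper than comparing strings, at the cost of a small collision risk from using a single modulus.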

Code

//#include<iostream>
#include<cstring>
#include<fstream>
#include<vector>
#include<map>
#include<algorithm>
#include<string>
#include<set>
#include<cmath>
#include<queue>

#define eps 1e-10
#define MAX_N 1234
#define Pr 131
#define mod 1000000009

using namespace std;

typedef long long ll;

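// s: spam training mails, g: good training mails, n: mails to classify, t: threshold
int s, g, n, t;

// Raw text of the spam, good, and query mails
string sp[MAX_N];
string go[MAX_N];
string ma[MAX_N];

// Hashes of the distinct words in each mail
set<ll> spam[MAX_N];
set<ll> good[MAX_N];
set<ll> mail[MAX_N];

// Hashes of every word seen in any mail
set<ll> allWord;

// Receives the rest of the first input line
string bankLine;

// Per-word frequency among spam mails and among good mails
map<ll, double> wordIsSpam;
map<ll, double> wordIsGood;

// Prior probabilities of spam and good mail
double pSpam;
double pGood;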

// Polynomial hash of a word with base Pr, modulo mod
ll Hash(string ss) {
    ll tmp = Pr;
    ll res = 0;
    for (auto c : ss) {
        res = (res + c * tmp) % mod;
        tmp = (tmp * Pr) % mod;
    }
    return res;
}

// Converts every uppercase letter to lowercase
string changeToSmall(string ss) {
    string res = "";
    for (auto c : ss) {
        if (c <= 'Z' && c >= 'A') c = c - 'A' + 'a';
        res += c;
    }
    return res;
}

// True if c is an ASCII letter
bool isAl(char c) {
    if (c <= 'Z' && c >= 'A') return true;
    return (c <= 'z' && 'a' <= c);
}

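// Splits each of the x lines in v into words and stores their hashes in G[i].
// Each line must end with a non-letter so that the last word is flushed.
void divi(string v[], set<ll> G[], int x) {
    for (int i = 0; i < x; i++) {
        int len = v[i].length();
        int p = 0;
        // Skip leading non-letters; check the bound before indexing
        while (p < len && !isAl(v[i][p])) p++;
        bool flag = true;  // true while inside a word that starts at p
        for (int j = p; j < len; j++) {
            if (!isAl(v[i][j]) && flag) {
                // The word [p, j) has just ended
                flag = false;
                string tmp;
                tmp.assign(v[i].begin() + p, v[i].begin() + j);
                ll h = Hash(changeToSmall(tmp));
                allWord.insert(h);
                G[i].insert(h);
            }
            else if (isAl(v[i][j]) && !flag) {
                // A new word starts here
                p = j;
                flag = true;
            }
        }
    }
}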

// Division that yields 0 when the denominator is (almost) zero
double divide(double a, double b) {
    if (fabs(b) < eps) return 0;
    return a / b;
}

// Probability that a mail containing this word is spam, by Bayes' formula
double P(ll word) {
    double ws, wg;
    if (wordIsSpam.find(word) == wordIsSpam.end()) ws = 0;
    else ws = wordIsSpam[word];
    if (wordIsGood.find(word) == wordIsGood.end()) wg = 0;
    else wg = wordIsGood[word];
    return divide(ws * pSpam, ws * pSpam + wg * pGood);
}

int main() {
    // File streams deliberately named cin/cout
    ifstream cin("spam.in");
    ofstream cout("spam.out");
    cin.sync_with_stdio(false);
    cin >> s >> g >> n >> t;
    pSpam = divide(s, s + g);
    pGood = divide(g, s + g);
    getline(cin, bankLine);  // consume the rest of the first line
    // Append a space to every mail so that divi() flushes the last word
    for (int i = 0; i < s; i++) {
        getline(cin, sp[i]);
        sp[i] = sp[i] + " ";
    }
    for (int i = 0; i < g; i++) {
        getline(cin, go[i]);
        go[i] = go[i] + " ";
    }
    for (int i = 0; i < n; i++) {
        getline(cin, ma[i]);
        ma[i] = ma[i] + " ";
    }
    divi(sp, spam, s);
    divi(go, good, g);
    divi(ma, mail, n);
    // Training: fraction of spam mails and of good mails containing each word
    for (auto word : allWord) {
        int cnt = 0;
        for (int i = 0; i < s; i++)
            if (spam[i].find(word) != spam[i].end()) cnt++;
        wordIsSpam[word] = divide(cnt, s);
        cnt = 0;
        for (int i = 0; i < g; i++)
            if (good[i].find(word) != good[i].end()) cnt++;
        wordIsGood[word] = divide(cnt, g);
    }
    // Classification: count the words whose spam probability is at least 0.5
    for (int i = 0; i < n; i++) {
        double ans = 0;
        for (auto word : mail[i]) {
            double p = P(word);
            //cout<<word<<": "<<p<<endl;
            if (p > 0.5 || fabs(p - 0.5) < eps) ans++;
        }
        // The multiplier is missing in the source listing; 100 is assumed here,
        // i.e. t is treated as a percentage of spam-like words.
        if (ans * 100 / mail[i].size() < t) cout << "good" << endl;
        else cout << "spam" << endl;
    }
    return 0;
}

