A Spam SMS Classifier Implemented in Python 2.7

I recently took part in a competition and wrote a spam SMS classifier for it. This post records how it works.

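The organizers provide the data as CSV files: a training set of 800,000 messages and a test set of 200,000 messages. Each training line contains a line number, a label (0 for a normal message, 1 for spam), and the message text. Each test line contains a line number and the message text. The output must list the line number and the predicted label of every test message, saved in CSV format.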

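The approach can be summarized in the following steps:

1. Read the files and load the data.
2. Split each line into its line number, label, and message text. Because the message text may itself contain spaces, a plain split() is not enough; the regular-expression module re is used to match and split the line instead.
3. Store the split results in a database (MySQL), so that later runs can read them straight from the database and skip the preceding steps.
4. Segment the message text into words. This step uses the third-party library jieba: https://github.com/fxsjy/jieba
5. Train the model on the segmentation results. The algorithm is naive Bayes, available through the third-party library Scikit-Learn: http://scikit-learn.org/stable
6. Read the test set from the database, classify each message, and write the results to a file.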
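
Step 2 can be sketched as follows. This is only an illustration under assumptions: the fields are taken to be separated by whitespace (tabs or spaces), and the names `TRAIN_PATTERN`, `TEST_PATTERN`, `parse_train_line`, and `parse_test_line` are hypothetical and do not appear in the project code. The patterns are also stricter than the ones used in ImportIntoDB.py.

```python
import re

# Assumed line layouts (whitespace-separated):
#   training line: "<id> <label> <message text>"
#   test line:     "<id> <message text>"
TRAIN_PATTERN = re.compile(r'([0-9]+)\s+([01])\s+(.*)')
TEST_PATTERN = re.compile(r'([0-9]+)\s+(.*)')


def parse_train_line(line):
    """Return (sms_id, sms_type, content), or None if the line is malformed."""
    match = TRAIN_PATTERN.match(line)
    if match is None:
        return None
    sms_id, sms_type, content = match.groups()
    return int(sms_id), int(sms_type), content.strip()


def parse_test_line(line):
    """Return (sms_id, content), or None if the line is malformed."""
    match = TEST_PATTERN.match(line)
    if match is None:
        return None
    sms_id, content = match.groups()
    return int(sms_id), content.strip()


# Spaces inside the message text are preserved because only the
# leading fields are matched explicitly.
print(parse_train_line('12\t1\tFree prize, reply now\n'))
print(parse_test_line('7 hello world\n'))
```

Returning None for malformed lines lets the caller skip bad records rather than crash on `match.group(...)`.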

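The final implementation consists of four .py files:

1. ImportIntoDB.py preprocesses the data and imports it into the database. It is only needed on the first run.
2. DataHandler.py reads data from the database, segments the messages into words, and prepares the data for training the model.
3. Classifier.py reads the test set from the database, classifies it with the trained model, and writes the results to a file.
4. Main.py is the entry point of the program.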

Each run of the final program takes about 260 to 270 seconds on average. The code follows:

ImportIntoDB.py:

    # -*- coding:utf-8 -*-
    __author__ = 'Jz'

    import MySQLdb
    import codecs
    import re
    import time

    # txt_path = 'D:/coding_file/python_file/Big Data/trash message/train80w.txt'
    txt_path = 'D:/coding_file/python_file/Big Data/trash message/test20w.txt'

    # use regular expressions to split each line into its fields
    # split_pattern_80w = re.compile(u'([0-9]+).*?([01])(.*)')
    split_pattern_20w = re.compile(u'([0-9]+)(.*)')

    txt = codecs.open(txt_path, 'r')
    lines = txt.readlines()
    start_time = time.time()

    # connect to the MySQL database
    con = MySQLdb.connect(host = 'localhost', port = 3306, user = 'root', passwd = '*****', db = 'TrashMessage', charset = 'UTF8')
    cur = con.cursor()

    # insert into the 'train' table
    # (enable this section together with the 80w path and pattern to import the training set)
    # sql = 'insert into train(sms_id, sms_type, content) values (%s, %s, %s)'
    # for line in lines:
    #     match = split_pattern_80w.match(line)
    #     if match is None:
    #         continue
    #     sms_id, sms_type, content = match.group(1), match.group(2), match.group(3).strip()
    #     cur.execute(sql, (sms_id, sms_type, content))
    #     print sms_id
    # # commit transaction
    # con.commit()

    # insert into the 'test' table
    sql = 'insert into test(sms_id, content) values (%s, %s)'
    for line in lines:
        match = split_pattern_20w.match(line)
        # skip lines that do not start with a line number (e.g. blank lines)
        if match is None:
            continue
        sms_id, content = match.group(1), match.group(2).strip()
        cur.execute(sql, (sms_id, content))
        print sms_id

    # commit transaction
    con.commit()
    cur.close()
    con.close()
    txt.close()
    end_time = time.time()
    print 'time-consuming: ' + str(end_time - start_time) + 's.'
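
The import script sends one `execute` call per message. A possible alternative is to send rows in batches with the DB-API `executemany` call, which MySQLdb cursors also provide. The sketch below is an illustration only: it uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a database server, the function name `insert_in_batches` and the batch size are hypothetical, and sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def insert_in_batches(con, rows, batch_size=1000):
    """Insert (sms_id, content) rows in batches and return the number inserted."""
    # sqlite3 placeholder style; with MySQLdb this would be "values (%s, %s)"
    sql = 'insert into test(sms_id, content) values (?, ?)'
    cur = con.cursor()
    batch = []
    total = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cur.executemany(sql, batch)
            total += len(batch)
            batch = []
    # flush the final, possibly smaller, batch
    if batch:
        cur.executemany(sql, batch)
        total += len(batch)
    con.commit()
    cur.close()
    return total


# In-memory stand-in for the 'test' table of the TrashMessage database
con = sqlite3.connect(':memory:')
con.execute('create table test(sms_id integer primary key, content text)')
rows = [(i, 'message %d' % i) for i in range(1, 2501)]
inserted = insert_in_batches(con, rows)
count = con.execute('select count(*) from test').fetchone()[0]
print('%d rows inserted, %d rows stored' % (inserted, count))
```

Whether batching shortens the import noticeably depends on the driver and server configuration, so it is worth timing on the real data.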

DataHandler.py:

    # -*- coding:utf-8 -*-
    __author__ = 'Jz'

    import MySQLdb
    import jieba

    class DataHandler:
        def __init__(self):
            self.con = None
            self.cur = None
            try:
                self.con = MySQLdb.connect(host = 'localhost', port = 3306, user = 'root', passwd = '*****', db = 'TrashMessage', charset = 'UTF8')
                self.cur = self.con.cursor()
            except MySQLdb.OperationalError as oe:
                print 'Connection error! Details:', oe
                # nothing below can work without a connection, so propagate the error
                raise

        def __del__(self):
            # close only what was actually opened
            if self.cur is not None:
                self.cur.close()
            if self.con is not None:
                self.con.close()

        def query(self, sql):
            self.cur.execute(sql)
            result_set = self.cur.fetchall()
            return result_set

        def segment(self, content):
            # divide a message into words with jieba, drop duplicate words,
            # and join the words with spaces so the vectorizer can tokenize them
            return ' '.join(set(jieba.cut(content)))

        def resultSetTransformer(self, train, test):
            # train records: (sms_id, sms_type, content)
            # test records: (sms_id, content)
            # list of classification of each message
            train_class = [record[1] for record in train]
            # lists of space-separated words of each message after de-duplication
            train_division = [self.segment(record[2]) for record in train]
            test_division = [self.segment(record[1]) for record in test]
            return train_division, train_class, test_division
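
To make the transformation in `resultSetTransformer` concrete, here is a small sketch of the de-duplicate-and-join step on an already tokenized message. It is an illustration under assumptions: the token list stands in for the output of `jieba.cut`, the function name `to_feature_string` is hypothetical, and unlike the project code it sorts the words and drops whitespace-only tokens so that the result is deterministic.

```python
def to_feature_string(tokens):
    """Collapse a token sequence into a space-separated string of unique words."""
    # set() removes repeated words; whitespace-only tokens are discarded
    unique_tokens = sorted(set(token for token in tokens if token.strip()))
    return ' '.join(unique_tokens)


# Stand-in for the tokens a word segmenter might return for one message
tokens = ['free', 'prize', 'free', ' ', 'call', 'now', 'prize']
print(to_feature_string(tokens))  # call free now prize
```

Because each word appears at most once per message after this step, the vectorizer effectively sees word presence rather than word frequency within a message.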

Classifier.py:

    # -*- coding:utf-8 -*-
    __author__ = 'Jz'

    from DataHandler import DataHandler
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    import time

    class Classifier:
        def __init__(self):
            start_time = time.time()
            self.data_handler = DataHandler()
            # get result sets; columns are listed explicitly so the tuple order is fixed
            self.train = self.data_handler.query('select sms_id, sms_type, content from train')
            self.test = self.data_handler.query('select sms_id, content from test')
            self.train_division, self.train_class, self.test_division = self.data_handler.resultSetTransformer(self.train, self.test)
            end_time = time.time()
            print 'Classifier finished initializing, time-consuming:' + str(end_time - start_time) + 's.'

        def getMatrices(self):
            start_time = time.time()
            # convert a collection of raw documents to a matrix of TF-IDF features
            # note: the default token_pattern only keeps tokens of two or more characters
            self.tfidf_vectorizer = TfidfVectorizer()
            # learn vocabulary and idf from the train set, return term-document matrix [sample, feature]
            self.train_tfidf_matrix = self.tfidf_vectorizer.fit_transform(self.train_division)
            # weight the test set with the same vocabulary and idf, so both sets share one feature space
            self.test_tfidf_matrix = self.tfidf_vectorizer.transform(self.test_division)
            end_time = time.time()
            print 'Classifier finished getting matrices, time-consuming:' + str(end_time - start_time) + 's.'

        def classify(self):
            self.getMatrices()
            start_time = time.time()
            # the multinomial Naive Bayes classifier is suitable for classification with discrete features
            # (e.g., word counts for text classification); fractional tf-idf values also work in practice
            naive_bayes = MultinomialNB(alpha = 0.65)
            naive_bayes.fit(self.train_tfidf_matrix, self.train_class)
            prediction = naive_bayes.predict(self.test_tfidf_matrix)
            # output result to a csv file, one "sms_id,sms_type" pair per line
            with open('result.csv', 'w') as result_file:
                for record, sms_type in zip(self.test, prediction):
                    result_file.write(str(record[0]) + ',' + str(sms_type) + '\n')
            end_time = time.time()
            print 'Classifier finished classifying, time-consuming: ' + str(end_time - start_time) + 's.'
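
The vectorize-then-classify flow can be tried on a few toy messages without the database. The sketch below is an illustration only: the English sentences and labels are made up, and only `alpha = 0.65` is taken from the project code. The important point is that the vectorizer is fitted on the training messages and then reused, via `transform`, for the test messages, so both share the same vocabulary and IDF weights. For Chinese text it may also be worth passing a `token_pattern` such as `r'(?u)\b\w+\b'`, since the default pattern discards single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already segmented messages: 1 = spam, 0 = normal
train_docs = [
    'free prize call now',
    'win cash prize now',
    'lunch at noon today',
    'see you at home tonight',
]
train_labels = [1, 1, 0, 0]
test_docs = ['free cash prize', 'lunch at home']

# Fit the vocabulary and IDF weights on the training messages only
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)
# Reuse the fitted vectorizer so the test features line up with the training features
test_matrix = vectorizer.transform(test_docs)

model = MultinomialNB(alpha=0.65)
model.fit(train_matrix, train_labels)
prediction = model.predict(test_matrix)

for doc, label in zip(test_docs, prediction):
    print('%s -> %d' % (doc, label))
```

On this toy data the first test message shares words only with the spam examples and the second only with the normal ones, so they are labeled 1 and 0 respectively.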

Main.py:

    # -*- coding:utf-8 -*-
    __author__ = 'Jz'

    import time
    from Classifier import Classifier

    start_time = time.time()
    classifier = Classifier()
    classifier.classify()
    end_time = time.time()
    print 'total time-consuming: ' + str(end_time - start_time) + 's.'
