Study Notes CB005: Keyword and Corpus Extraction

Keyword extraction. The pynlpir library provides keyword extraction.

# coding:utf-8

import pynlpir

pynlpir.open()  # initialize NLPIR before calling any other function
s = '怎么才能把电脑里的垃圾文件删除'

# weighted=True returns (keyword, weight) pairs instead of bare keywords
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    print(key_word[0], '\t', key_word[1])

pynlpir.close()

Baidu search endpoint: https://www.baidu.com/s?wd=机器学习数据挖掘信息检索
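
The extracted keywords can be turned into a query against this endpoint. The following is a minimal sketch, assuming input in the (word, weight) shape that pynlpir.get_key_words returns with weighted=True. The helper names top_keywords and build_search_url, the limit of three keywords, and the sample weights are all hypothetical and not part of the original notes.

```python
from urllib.parse import urlencode

BAIDU_SEARCH = "https://www.baidu.com/s"  # endpoint quoted in the notes


def top_keywords(weighted, limit=3):
    """Return up to `limit` keywords, highest weight first (ties keep input order)."""
    ranked = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:limit]]


def build_search_url(keywords):
    """Join keywords with spaces and percent-encode them into the wd parameter."""
    return BAIDU_SEARCH + "?" + urlencode({"wd": " ".join(keywords)})


# Hypothetical sample data; the weights are invented for illustration.
sample = [("电脑", 2.0), ("垃圾", 2.0), ("文件", 1.5), ("删除", 1.0)]
url = build_search_url(top_keywords(sample))
print(url)
```

urlencode handles the UTF-8 percent-encoding of the Chinese keywords, so the URL does not depend on the HTTP client encoding it.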

Install scrapy with pip install scrapy. Create a scrapy project with scrapy startproject baidu_search. To build the crawler, create the file baidu_search/baidu_search/spiders/baidu_search.py.

# coding:utf-8

import scrapy


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=电脑垃圾文件删除"
    ]

    def parse(self, response):
        # Save the raw search results page so it can be inspected by hand
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)

Edit settings.py: set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
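
Written out as they might appear in the file, the three settings look roughly like this. It is a fragment, not a complete settings.py: the values are the ones given above, and the comments are an interpretation of what each setting does.

```python
# baidu_search/baidu_search/settings.py (only the lines changed for this crawl)

# Do not fetch or honour robots.txt for this spider.
ROBOTSTXT_OBEY = False

# Present the crawler as a desktop browser.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/50.0.2661.102 Safari/537.36"
)

# Abandon a download that has not finished after 5 seconds.
DOWNLOAD_TIMEOUT = 5
```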

Change into the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. This writes result.html, which confirms the page was fetched correctly.

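Corpus extraction. The search results page is only an index; the real content is behind each link. Inspecting the fetched page shows that each link sits in the href attribute of the a tag under h3, inside a div with class=c-container. Each URL is added to the crawl queue and fetched; the body text is then extracted with the tags removed, and the abstract is saved. The title and abstract are extracted at the same time as the URL and passed through the scrapy.Request meta to the callback parse_url, which receives both values once the fetch completes and then extracts content. The complete record is: url, title, abstract, content.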

# coding:utf-8

import scrapy
# remove_tags comes from w3lib (a Scrapy dependency);
# scrapy.utils.markup was a deprecated alias of this module
from w3lib.html import remove_tags


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=电脑垃圾文件删除"
    ]

    def parse(self, response):
        # Each search hit is a div whose class contains "c-container"
        containers = response.selector.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            hrefs = container.xpath('h3/a/@href').extract()
            if not hrefs:
                continue  # skip containers that carry no result link
            href = hrefs[0]
            title = remove_tags(container.xpath('h3/a').extract()[0])
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = ""
            if len(c_abstract) > 0:
                abstract = remove_tags(c_abstract[0])
            # Carry title and abstract along to the callback via meta
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print(len(response.body))
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        # Strip all tags from the page body to get the text content
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        print("content_len:", len(content))
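
Stripping tags from the whole body keeps whatever sits inside script and style elements, so the extracted content can contain JavaScript and CSS. The sketch below shows one possible way to skip those elements using only the standard library, and to assemble the four fields listed above into one record. It is not the approach used in the spider: TextExtractor, extract_text and build_record are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Concatenate the text nodes, then collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def extract_text(html):
    """Return the text of an HTML string without markup, scripts or styles."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_record(url, title, abstract, html):
    """Assemble the four fields: url, title, abstract, content."""
    return {
        "url": url,
        "title": title,
        "abstract": abstract,
        "content": extract_text(html),
    }


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>")
record = build_record("http://example.com/", "demo title", "demo abstract", page)
print(record["content"])
```

In parse_url, a helper like this could take response.text in place of the remove_tags call; that substitution is a suggestion and has not been run against real Baidu results.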
