Outputting and saving Chinese text in Scrapy

1. Decoding Chinese text from a JSON file:

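#!/usr/bin/python
#coding=utf-8
#author=dahu
import json

# json.load turns the \uXXXX escapes in the file back into unicode strings
with open('huxiu.json','r') as f:
    data=json.load(f)

# Python 2 print statements: show the first item's title, then every field
print data[0]['title']
for key in data[0]:
    print '\"%s\":\"%s\",'%(key,data[0][key])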

read_from_json
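The same decoding step can be sketched without a crawl output on disk. This is a minimal Python 3 sketch (the post's own code is Python 2); the `raw` string is a shortened, hand-made stand-in for one record of `huxiu.json`, not the real file.

```python
import json

# Shortened stand-in for one record of huxiu.json (an assumption for illustration).
raw = '[{"title": "\\u62bc\\u91d18000\\u5143", "link": "/article/214877.html"}]'

# The JSON parser converts \uXXXX escapes into real characters.
data = json.loads(raw)
first = data[0]

print(first["title"])
for key, value in first.items():
    print('"%s":"%s",' % (key, value))
```

The escaped file is therefore still valid JSON; the escapes only affect how it looks in an editor or in `head`.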

Writing Chinese text to JSON:

#!/usr/bin/python
#coding=utf-8
#author=dahu
import json

data={
    "desc":"女友不是你想租想租就能租",
    "link":"/article/214877.html",
    "title":"押金8000元，共享女友门槛不低啊"
}

with open('tmp.json','w') as f:
    # ensure_ascii=False writes the Chinese characters as-is instead of \uXXXX escapes
    json.dump(data,f,ensure_ascii=False)

write_to_json
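To see what `ensure_ascii` actually changes, here is a small Python 3 sketch that serializes the same dict both ways. The shortened `data` dict is made up for the example.

```python
import json

data = {"title": "押金8000元"}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)
print(readable)

# Both strings decode to the same object, so this is purely a display choice.
assert json.loads(escaped) == json.loads(readable) == data
```

When writing the readable form to a file in Python 3, passing `encoding='utf-8'` to `open()` avoids depending on the platform's default encoding.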

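2. When Scrapy saves a JSON file, Chinese text easily ends up unreadable: it is written as \uXXXX escape sequences.

For example: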

$ scrapy crawl huxiu --nolog -o huxiu.json
$ head huxiu.json

[

{"title": "\u62bc\u91d18000\u5143\uff0c\u5171\u4eab\u5973\u53cb\u95e8\u69db\u4e0d\u4f4e\u554a", "link": "/article/214877.html", "desc": "\u5973\u53cb\u4e0d\u662f\u4f60\u60f3\u79df\u60f3\u79df\u5c31\u80fd\u79df"},

{"title": "\u5f20\u5634\uff0c\u817e\u8baf\u8981\u5582\u4f60\u5403\u836f\u4e86", "link": "/article/214879.html", "desc": "\u201c\u8033\u65c1\u56de\u8361\u7740Pony\u9a6c\u7684\u6559\u8bf2\uff1a\u597d\u597d\u7528\u8111\u5b50\u60f3\u60f3\uff0c\u4e0d\u5145\u94b1\uff0c\u4f60\u4eec\u4f1a\u53d8\u5f3a\u5417\uff1f\u201d"},

We can combine this with the trick above for saving JSON as readable Chinese.

Changes to settings.py:

ITEM_PIPELINES = {
   'coolscrapy.pipelines.CoolscrapyPipeline': 300,
}

Remove the comment markers from these lines so the pipeline is enabled.

Change pipelines.py to the following:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs

class CoolscrapyPipeline(object):
    # def __init__(self):
        # self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item),ensure_ascii=False) + '\n'
        # self.file.write(line)
        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item

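The commented-out lines are an alternative way to write it. The key point is that once the pipeline is enabled in settings, Scrapy calls process_item automatically for every item, so we can save the data in whatever format we want.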
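The "open the file once" idea behind that alternative can be sketched as follows. This is a Python 3 sketch, not the post's code: the class name `JsonLinesPipeline` and its `path` argument are made up for the example, while `open_spider` and `close_spider` are the hooks Scrapy calls on a pipeline when a spider starts and finishes.

```python
import io
import json


class JsonLinesPipeline(object):
    """Hypothetical pipeline: opens the output once, writes one JSON object per line."""

    def __init__(self, path='data_cn.json'):
        # 'path' is an illustrative parameter; Scrapy would use the default.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the file a single time.
        self.file = io.open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese characters readable.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item
```

Compared with reopening the file in append mode for every item, this keeps one handle for the whole crawl and starts from an empty file on each run.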

Now run this command in the terminal:

scrapy crawl huxiu --nolog

If you still add -o file.json, both file.json and the file defined in the pipeline are generated, but the JSON in file.json is still written as \uXXXX escapes.
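As a side note that goes beyond the steps in this post: in Scrapy 1.2 and later, the encoding used by the `-o` feed export can be set in settings.py through the `FEED_EXPORT_ENCODING` setting. A minimal sketch, assuming such a Scrapy version is installed:

```python
# settings.py
# Assumes Scrapy 1.2 or later, where this setting is available.
# With it, the JSON written by "-o file.json" keeps non-ASCII characters
# instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = 'utf-8'
```

This only affects the built-in feed export; a custom pipeline still writes whatever its own code produces.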

3. Going further

The analysis above leads to another conclusion: ITEM_PIPELINES in settings controls which pipelines run. So what happens if we enable several of them?

ITEM_PIPELINES = {
   'coolscrapy.pipelines.CoolscrapyPipeline': 300,
   'coolscrapy.pipelines.CoolscrapyPipeline1': 300,
}

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs

class CoolscrapyPipeline(object):
    # def __init__(self):
        # self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item),ensure_ascii=False) + '\n'
        # self.file.write(line)
        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item

class CoolscrapyPipeline1(object):
    def process_item(self, item, spider):
        with open('data_cn2.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',hehe\n')
        return item

pipelines.py
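Both pipelines are registered with the same value, 300. Scrapy uses these integers to order the pipelines: items pass through them from the lowest value to the highest, and values are customarily chosen in the 0-1000 range. Giving the two classes distinct values makes the order explicit. A small sketch, where 400 is an arbitrary choice and `run_order` exists only to illustrate the ordering (it is not a Scrapy setting):

```python
# settings.py (the value 400 is illustrative)
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.CoolscrapyPipeline1': 400,
}

# Illustration only: items go through the pipelines in ascending order of value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Each pipeline must return the item from process_item, as both classes do, so that the next pipeline in the chain receives it.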

Run:

$ scrapy crawl huxiu --nolog
$ head -n 2 data_cn*

==> data_cn1.json <==

{"title": "押金8000元，共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},

{"title": "张嘴，腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲：好好用脑子想想，不充钱，你们会变强吗？”"},

==> data_cn2.json <==

{"title": "押金8000元，共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},hehe

{"title": "张嘴，腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲：好好用脑子想想，不充钱，你们会变强吗？”"},hehe

As you can see, both files were generated, and in exactly the format we wanted!
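One thing to keep in mind: a file whose lines each end in a comma is not a single valid JSON document, so `json.load` cannot read it back directly. A possible way to parse the `data_cn1.json` layout, sketched in Python 3 with a made-up helper name (`parse_comma_lines`) and a shortened sample in place of the real file:

```python
import json


def parse_comma_lines(text):
    """Parse lines of the form '{...},' into a list of dicts."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(','):
            line = line[:-1]  # drop the trailing separator written by the pipeline
        items.append(json.loads(line))
    return items


# Shortened sample in the same layout as data_cn1.json (fields trimmed for brevity).
sample = (
    '{"title": "押金8000元，共享女友门槛不低啊", "link": "/article/214877.html"},\n'
    '{"title": "张嘴，腾讯要喂你吃药了", "link": "/article/214879.html"},\n'
)

items = parse_comma_lines(sample)
print(len(items), items[0]["link"])
```

The sketch only handles the plain trailing comma; the `,hehe` suffix in `data_cn2.json` would need its own stripping rule.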
