多线程爬取 threading.Thread 文件名支持gbk编码

# - *- coding:utf-8-*-
import urllib2
import re
import os
import threading
import sys
reload(sys)
sys.setdefaultencoding('utf-8') #编码
from bs4 import BeautifulSoup
os.mkdir(u'小说0')
os.chdir(u'小说0')
def get_url():
    User_Agent= 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0'
    url="http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1&orderId=&update=-1&page=1&month=-1&style=1&action=1"
    headers={'User-Agent':User_Agent}
    request=urllib2.Request(url,headers=headers)
    html=urllib2.urlopen(request).read()
    soup = BeautifulSoup(html, 'html.parser')
    l = soup.find_all('div', class_ = 'book-mid-info')
    print #

    for htmltile in l:
        name = htmltile.find('h4').encode('utf-8')
        reg=r'<h4><a data-bid=".*?" data-eid=".*?" href="(.*?)" target="_blank">(.*?)</a></h4>'
        text=re.findall(reg,name)

        return text
def get_content(curl,title):
    os.mkdir(title.encode('gbk'))  #创建目录
    #os.chdir(title.encode('gbk'))   #在当前目录下操作
    html1 = urllib2.urlopen('http:'+curl+'#Catalog').read()
    reg=re.compile(r'<li data-rid=".*?"><a href="(.*?)" target="_blank" data-eid="qd_G55" data-cid=".*?" title=".*?">(.*?)</a>')
    titles=re.finditer(reg,html1)

    for n in titles:
        curl_=n.group(1)
        names=n.group(2)

        fd=open(title.encode('gbk')+'/'+names.encode('gbk')+'.txt','wb') #在指定目录下创建文件
        #fd=open(names.encode('gbk')+'.txt','wb')
        print "正在爬取%s本"%names
        htmlll=urllib2.urlopen('http:'+curl_).read()
        regs=re.compile(r'<div class="read-content j_readContent">\s*([\s\S]*?)\s*</div>') #正则多行时注意用\s*
        content=re.findall(regs,htmlll)
        for m in content:
            contents=m.replace('<p>','\r\n')
            fd.write(names+'\r\n'+contents)
            print "已完成%s"%names
            fd.close()

threads=[]
def main():
    for i in get_url():
        th=threading.Thread(target= get_content,args=(i[0],i[1]))
        threads.append(th)
    for t in threads:
        t.start()
        while True:
            if len(threading.enumerate())<10:#控制线程数量
                break
if __name__=='__main__':
    main()

多线程爬取 threading.Thread 文件名支持gbk编码的更多相关文章

Python3 多线程爬取梨视频
多线程爬取梨视频 from threading import Thread import requests import re # 访问链接 def access_page(url): respons ...
Python爬虫入门教程 14-100 All IT eBooks多线程爬取
All IT eBooks多线程爬取-写在前面对一个爬虫爱好者来说,或多或少都有这么一点点的收集癖 ~ 发现好的图片,发现好的书籍,发现各种能存放在电脑上的东西,都喜欢把它批量的爬取下来. 然后放着 ...
实现多线程爬取数据并保存到mongodb
多线程爬取二手房网页并将数据保存到mongodb的代码: import pymongo import threading import time from lxml import etree impo ...
Python爬虫入门教程： All IT eBooks多线程爬取
All IT eBooks多线程爬取-写在前面对一个爬虫爱好者来说,或多或少都有这么一点点的收集癖 ~ 发现好的图片,发现好的书籍,发现各种能存放在电脑上的东西,都喜欢把它批量的爬取下来. 然后放着 ...
python多线程爬取斗图啦数据
python多线程爬取斗图啦网的表情数据使用到的技术点 requests请求库 re 正则表达式 pyquery解析库,python实现的jquery threading 线程 queue 队列 ' ...
使用selenium 多线程爬取爱奇艺电影信息
使用selenium 多线程爬取爱奇艺电影信息转载请注明出处. 爬取目标:每个电影的评分.名称.时长.主演.和类型爬取思路: 源文件:(有注释) from selenium import webd ...
Python爬虫入门教程 13-100 斗图啦表情包多线程爬取
斗图啦表情包多线程爬取-写在前面今天在CSDN博客,发现好多人写爬虫都在爬取一个叫做斗图啦的网站,里面很多表情包,然后瞅了瞅,各种实现方式都有,今天我给你实现一个多线程版本的.关键技术点 aioht ...
Python爬虫入门教程 11-100 行行网电子书多线程爬取
行行网电子书多线程爬取-写在前面最近想找几本电子书看看,就翻啊翻,然后呢,找到了一个叫做周读的网站 ,网站特别好,简单清爽,书籍很多,而且打开都是百度网盘可以直接下载,更新速度也还可以,于是乎, ...
ubuntu中eclipse 不支持gbk编码问题解决办法
今天在ubuntu 下, 把Windows下工程导入Linux下Eclipse中,由于工程代码,是GBK编码,而Ubuntu默认不支持GBK编码,所以,要让Ubuntu支持GBK. 方法如下: 1.修 ...

随机推荐

ses_cations 值顺序
16个位置的字符所代表的操作依次如下: 1. ALTER 2. AUDIT 3.COMMENT 4.DELETE 5.GRANT 6.INDEX 7.INSERT 8.LOCK 9.RENAME 10 ...
Python爬虫学习1
#coding=utf-8 from urllib2 import urlopen from bs4 import BeautifulSoup import urllib2 url="htt ...
stickUp让页面元素“固定”位置
stickUp能让页面目标元素“固定”在浏览器窗口的顶部,即便页面在滚动,目标元素仍然能出现在设定的位置. http://www.bootcss.com/p/stickup/
SQL Server 2014 SP2发布下载：数十项更新修复
微软发布了数据库工具SQL Server 2014 SP2服务包下载,本次更新集合了数十项更新修复,涉及安全和功能性补丁,使用SQL Server 2014的用户应该及时安装该服务包. 文件内容版本 ...
关于Unity游戏开发方向找工作方面的一些个人看法
这是个老生常谈,却又是谁绕不过去的话题,而对于每个人来说,所遇到的情况又不尽相同,别人的求职方式和路线不一定适合你,即使是背景很相似的两个人,有时候机遇也很重要. 我本人的工作经验只有一年,就业方式 ...
ubuntu 14.04 compiz的ALT + TAB切换程序
安装完ubuntu,发现不能使用ALT + TAB切换应用程序,翻遍所有百度结果,没有可行,都是拷这个拷那个...真实无语...FQgoogle,看的第一个就完美解决.记录下来,方便国人少走弯路. 首 ...
centos 7 + mono + jexus 环境安装
1.安装 mlocate yum list|grep locate yum install mlocate.x86_64 updatedb 2.安装 yum-utils yum list|grep y ...
C#设计模式——解释器模式(Interpreter Pattern)
一.概述在软件开发特别是DSL开发中常常需要使用一些相对较复杂的业务语言,如果业务语言使用频率足够高,且使用普通的编程模式来实现会导致非常复杂的变化,那么就可以考虑使用解释器模式构建一个解释器对复杂 ...
zookeeper入门学习
1.基本概念 ZooKeeper是一个分布式的,开放源码的分布式应用程序协调服务,是Google的Chubby一个开源的实现,是Hadoop和Hbase的重要组件.它是一个为分布式应用提供一致性服务的 ...
echarts学习网站
echarts : http://echarts.baidu.com/echarts2/doc/example.html 相关脚本学习网站:http://www.jb51.net/html/list/ ...

多线程爬取 threading.Thread 文件名支持gbk编码

多线程爬取 threading.Thread 文件名支持gbk编码的更多相关文章

随机推荐

热门专题