Python Web Crawler (3): Problems Encountered When Writing Python Crawlers (Python Versions, Processes, etc.)

import urllib2

In Python 3 (3.3 here), use urllib.request in place of urllib2:

import urllib.request as urllib2

import cookielib


In Python 3, replace import cookielib with import http.cookiejar:

import http.cookiejar as cookielib

from urlparse import urlparse


from urllib.parse import urlparse
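
If one script has to run under both Python 2 and Python 3, the three renames above can be gathered into a single fallback import. This is a minimal sketch; the cookie-aware opener and the sample URL are illustrative assumptions, not part of the original code.

```python
try:
    # Python 3 module names, aliased to the Python 2 names used in older code
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlparse
except ImportError:
    # Python 2 fallback
    import urllib2
    import cookielib
    from urlparse import urlparse

# Build an opener that keeps cookies between requests (no request is sent here).
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Split a URL into its components (hypothetical sample URL).
parts = urlparse('https://baike.baidu.com/item/example?fr=aladdin')
print(parts.netloc)
```

With the aliases in place, code written against urllib2 and cookielib can usually stay unchanged, although some Python 2 functions live in other urllib submodules in Python 3.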

PermissionError: [WinError 5] Access is denied

This error appears on Windows when processes communicate with each other. See: https://blog.csdn.net/m0_37422289/article/details/80186288

Original code:

import queue
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support

task_number = 1
task_queue = queue.Queue(task_number)
result_queue = queue.Queue(task_number)

def win_run():
    # The lambdas below are what breaks on Windows: they cannot be pickled.
    BaseManager.register('task', callable=lambda: task_queue)
    BaseManager.register('result', callable=lambda: result_queue)
    # authkey must be bytes in Python 3
    manager = BaseManager(address=('127.0.0.1', 8001), authkey=b'123')
    manager.start()

if __name__ == "__main__":
    freeze_support()
    win_run()

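Discussion:

On Unix/Linux, the multiprocessing module wraps the fork() call.

Windows has no fork call, so multiprocessing has to "simulate" the effect of fork: the Python objects the child process needs from the parent must be serialized with pickle and sent over to the child.

pickle cannot serialize anonymous (lambda) functions, so creating the process fails.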
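
The pickle limitation can be checked without starting any process. This is a minimal sketch; can_pickle and get_task_queue are hypothetical helper names, not part of the original code.

```python
import pickle
import queue

task1 = queue.Queue(1)


def get_task_queue():
    """Module-level function: pickle stores it by its importable name."""
    return task1


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print(can_pickle(lambda: task1))   # the lambda has no importable name
    print(can_pickle(get_task_queue))  # the named function does
```

pickle serializes a function as a reference to its module and name, which is why a lambda, having no name that can be looked up, is rejected.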

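Solution:

Replace the anonymous functions with ordinary named functions.

To satisfy the requirements multiprocessing has on Windows, and to tell running the script directly apart from being imported by another module, add the if __name__ == "__main__" check. See: https://blog.csdn.net/qq_27017791/article/details/80212016

Fixed code: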

import queue
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support

task_number = 1
task1 = queue.Queue(task_number)
result1 = queue.Queue(task_number)

# Named module-level functions can be pickled; lambdas cannot.
def task_queue():
    return task1

def result_queue():
    return result1

def win_run():
    BaseManager.register('task', callable=task_queue)
    BaseManager.register('result', callable=result_queue)
    # authkey must be bytes in Python 3
    manager = BaseManager(address=('127.0.0.1', 8001), authkey=b'123')
    manager.start()

if __name__ == "__main__":
    freeze_support()
    win_run()

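PermissionError: [WinError 5] Access is denied

This error also appears on Windows when working with processes.

The code that triggers it is shown below; the failure comes from the last line.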

import time
import queue
from DistributedSpider.control.UrlManager import UrlManager
from multiprocessing import freeze_support, Process
from multiprocessing.managers import BaseManager
from BaseSpider import DataOutput

url_q = queue.Queue()
result_q = queue.Queue()
store_q = queue.Queue()
conn_q = queue.Queue()

def url_manager_proc(url_q, conn_q, root_url):
    url_manager = UrlManager()
    url_manager.add_new_url(root_url)
    while True:
        # Check with has_new_url() so that no URL is fetched and thrown away.
        while url_manager.has_new_url():
            new_url = url_manager.get_new_url()
            url_q.put(new_url)
            print(url_manager.old_url_size())
            if url_manager.old_url_size() > 2000 or not url_manager.has_new_url():
                url_q.put('end')
                print('end')
                url_manager.save_process('new_url.txt', url_manager.new_urls)
                url_manager.save_process('old_url.txt', url_manager.old_urls)
                return
        # Take newly found URLs from conn_q, if any.
        try:
            if not conn_q.empty():
                urls = conn_q.get()
                url_manager.add_new_urls(urls)
        except BaseException:
            time.sleep(0.1)

if __name__ == '__main__':
    freeze_support()
    url = 'https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711?fr=aladdin'
    url_manager = Process(target=url_manager_proc, args=(url_q, conn_q, url,))

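Fix: see https://blog.csdn.net/weixin_41935140/article/details/81153611

Arguments of the process's target function that involve custom classes are no longer passed as parameters; the function uses those objects directly instead. Here url_q and conn_q are dropped from the parameter list: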

def url_manager_proc(root_url):
    url_manager = UrlManager()
    url_manager.add_new_url(root_url)
    while True:
        # Check with has_new_url() so that no URL is fetched and thrown away.
        while url_manager.has_new_url():
            new_url = url_manager.get_new_url()
            url_q.put(new_url)  # module-level queue, no longer a parameter
            print(url_manager.old_url_size())
            if url_manager.old_url_size() > 2000 or not url_manager.has_new_url():
                url_q.put('end')
                print('end')
                url_manager.save_process('new_url.txt', url_manager.new_urls)
                url_manager.save_process('old_url.txt', url_manager.old_urls)
                return
        # Take newly found URLs from conn_q, if any.
        try:
            if not conn_q.empty():
                urls = conn_q.get()
                url_manager.add_new_urls(urls)
        except BaseException:
            time.sleep(0.1)

if __name__ == '__main__':
    freeze_support()
    url = 'https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711?fr=aladdin'
    url_manager = Process(target=url_manager_proc, args=(url,))
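
One thing to keep in mind with this fix: a queue.Queue is not shared between processes. On Windows the child process imports the module afresh and gets its own module-level queues, so items put there by the child never reach the parent. If the parent needs to read what the child produces, one alternative (not from the original post) is multiprocessing.Queue, which can be passed to Process as an argument. A minimal sketch, in which feed_urls is a hypothetical stand-in for url_manager_proc and the URL is a placeholder:

```python
from multiprocessing import Process, Queue, freeze_support


def feed_urls(url_q, root_url):
    """Hypothetical stand-in for url_manager_proc: push one URL, then the 'end' marker."""
    url_q.put(root_url)
    url_q.put('end')


if __name__ == '__main__':
    freeze_support()
    url_q = Queue()  # multiprocessing.Queue, created in the parent
    worker = Process(target=feed_urls, args=(url_q, 'https://example.com/'))
    worker.start()
    # Read until the 'end' marker; items arrive through a pipe from the child.
    for item in iter(url_q.get, 'end'):
        print(item)
    worker.join()
```

The target function stays at module level, so it can be pickled by name when the child process is created.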

import cPickle

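Python 3 has no cPickle module; import pickle instead, which picks up the C implementation automatically. Source: https://blog.csdn.net/zcf1784266476/article/details/70655192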

import pickle
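
As a quick check that plain pickle covers the same use, the sketch below saves a set of URLs to a file and loads it back. The sample URLs and the temporary path are hypothetical; the post does not show how save_process is implemented.

```python
import os
import pickle
import tempfile

old_urls = {'https://example.com/a', 'https://example.com/b'}  # hypothetical data

path = os.path.join(tempfile.mkdtemp(), 'old_url.txt')

# pickle writes bytes, so the file must be opened in binary mode
with open(path, 'wb') as f:
    pickle.dump(old_urls, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == old_urls)
```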

TypeError: a bytes-like object is required, not 'str'

Convert the string to bytes with str.encode() before storing it.

Source: https://www.fujieace.com/python/str-bytes.html
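
A typical way to hit this error is writing a str to a file opened in binary mode. The sketch below reproduces the error and then applies the encode() fix; the sample text and the temporary file are assumptions made for illustration.

```python
import os
import tempfile

text = 'https://example.com/page'  # hypothetical data to store
path = os.path.join(tempfile.mkdtemp(), 'new_url.txt')

with open(path, 'wb') as f:
    try:
        f.write(text)  # str into a binary file: raises TypeError
    except TypeError as exc:
        error_message = str(exc)
    data = text.encode('utf-8')  # str -> bytes
    f.write(data)

with open(path, 'rb') as f:
    restored = f.read().decode('utf-8')  # bytes -> str

print(error_message)
print(restored == text)
```

Reading the data back needs the matching decode() call with the same encoding.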
