Using builtwith with Python 3 (a detailed walkthrough)

1. First, install builtwith with pip install builtwith

C:\Users\Administrator>pip install builtwith

Collecting builtwith

  Downloading builtwith-1.3.2.tar.gz

Installing collected packages: builtwith

  Running setup.py install for builtwith ... done

Successfully installed builtwith-1.3.2

2. Create a new project in PyCharm and enter the following test code

import builtwith

tech_used = builtwith.parse('http://www.baidu.com')

print(tech_used)

Running it produces the following error:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy

Traceback (most recent call last):

  File "F:/python/first/FirstPy", line 1, in <module>

    import builtwith

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43

    except Exception, e:

                    ^

SyntaxError: invalid syntax  

Process finished with exit code 1

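The cause is that builtwith 1.3.2 was written for Python 2.x, so a few places need changing. Double-click the file path in PyCharm's error output to open the file, then make the following three kinds of change:
1. The Python 2 form "except Exception, e" is no longer supported and must be changed to "except Exception as e".
2. In Python 3, print is a function, so the expression after every Python 2 print statement must be wrapped in parentheses.
3. builtwith uses Python 2's urllib2 package, which does not exist in Python 3 (its functionality moved to urllib.request and urllib.error), so the urllib2-related code must be changed.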
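For items 1 and 2, the Python 3 spelling can be sketched as below. The function safe_divide is a hypothetical stand-in used only for illustration; it is not part of builtwith.

```python
def safe_divide(a, b):
    """Illustrates the Python 3 forms of 'except' and 'print'."""
    try:
        return a / b
    except Exception as e:      # Python 2 wrote: except Exception, e:
        print('Error:', e)      # Python 2 wrote: print 'Error:', e
        return None


print(safe_divide(6, 3))
print(safe_divide(1, 0))
```

The same two substitutions apply wherever builtwith's __init__.py uses the old forms.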
Items 1 and 2 are easy to fix, so the rest of this section concentrates on item 3.
First, replace import urllib2 with the following code:

import urllib.request

import urllib.error

Then replace the urllib2 calls as follows:

request = urllib.request.Request(url, None, {'User-Agent': user_agent})

response = urllib.request.urlopen(request)
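Put together, the Python 3 request code might look like the sketch below. The helper names make_request and fetch, the default user agent string and the timeout value are assumptions made for this example and do not come from builtwith itself.

```python
import urllib.error
import urllib.request


def make_request(url, user_agent='builtwith'):
    """Build a Request carrying a custom User-Agent header."""
    return urllib.request.Request(url, None, {'User-Agent': user_agent})


def fetch(url, user_agent='builtwith', timeout=10):
    """Return (headers, body) for url, or (None, None) if the request fails."""
    request = make_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # In Python 3, read() returns bytes, not str.
            return response.headers, response.read()
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print('Error:', e)
        return None, None
```

Calling fetch('http://www.baidu.com') would then return the response headers and the undecoded page body.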

Run the project again. This time the following error appears:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy

Traceback (most recent call last):

  File "F:/python/first/FirstPy", line 3, in <module>

    builtwith.parse('http://www.baidu.com')

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62,

in builtwith

    if contains(html, snippet):

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105,

in contains

    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)

TypeError: cannot use a string pattern on a bytes-like object  

Process finished with exit code 1

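This happens because in Python 3 response.read() returns bytes rather than str, so the data has to be decoded before a string regular expression can search it. Change the following code: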

if html is None:

    html = response.read()

to:

if html is None:

    html = response.read()

    html = html.decode('utf-8')
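The TypeError and its fix can be reproduced in isolation. In this minimal sketch the HTML snippet and the pattern are made up for the example; only the bytes-versus-str behaviour is the point.

```python
import re

pattern = re.compile('jquery', flags=re.IGNORECASE)  # a str pattern
raw = b'<script src="jQuery.min.js"></script>'       # bytes, as returned by read()

try:
    pattern.search(raw)          # str pattern on bytes raises TypeError
    raised = False
except TypeError as e:
    raised = True
    print('TypeError:', e)

text = raw.decode('utf-8')       # decode bytes to str first
match = pattern.search(text)     # now the search works
print(match.group(0))
```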

Run it again to get the final result:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy

{'javascript-frameworks': ['jQuery']}  

Process finished with exit code 0

However, if the site is changed to 'www.163.com', running it fails again:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy

Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte

Traceback (most recent call last):

  File "F:/python/first/FirstPy", line 2, in <module>

    tech_used = builtwith.parse('http://www.163.com')

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63,

in builtwith

    if contains(html, snippet):

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106,

in contains

    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)

TypeError: cannot use a string pattern on a bytes-like object  

Process finished with exit code 1

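The first line of the output suggests this is still an encoding problem: the UTF-8 decode failed, so html stayed as bytes and the same TypeError followed. Setting the encoding to 'GBK' makes it run successfully: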

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy

{'web-servers': ['Nginx']}  

Process finished with exit code 0
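The difference between the two sites can be illustrated without any network access. This sketch uses a short sample string chosen for the example and shows GBK-encoded bytes failing to decode as UTF-8 but decoding correctly as GBK.

```python
raw = '网易'.encode('gbk')   # bytes as a GBK-encoded page would send them

try:
    raw.decode('utf-8')      # GBK bytes are generally not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print('Error:', e)

text = raw.decode('gbk')     # decoding with the right codec succeeds
print(text)
```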

So do different websites need different decodings? Below is one way to detect a site's encoding.
We need to install a package called chardet:

C:\Users\Administrator>pip install chardet

Collecting chardet

  Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)

    100% |████████████████████████████████| 184kB 616kB/s

Installing collected packages: chardet

Successfully installed chardet-2.3.0  

C:\Users\Administrator>

Passing the bytes to chardet's detect function returns a dict with two values, the detected encoding and a confidence score:

{'encoding': 'utf-8', 'confidence': 0.99}

Modify the corresponding builtwith code as follows:

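# Detect the encoding of the raw bytes, then decode accordingly

encode_type = chardet.detect(html)

if encode_type['encoding'] == 'utf-8':

    html = html.decode('utf-8')

else:

    # Anything not detected as UTF-8 is assumed to be GBK

    html = html.decode('gbk')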

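Remember to import chardet!
With chardet detecting the character encoding, the code can handle both UTF-8 and GBK sites. Note that it treats every page not detected as UTF-8 as GBK.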
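A slightly more defensive variant is sketched below. It is not the original author's code: the helper decode_html, its detect parameter and the fallback list are assumptions introduced here. It uses whatever encoding the detector reports (chardet.detect can report None), and tries a fixed list of encodings if that fails.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keep the sketch usable without chardet
    chardet = None


def decode_html(raw, detect=None, fallbacks=('utf-8', 'gbk')):
    """Decode bytes with the detected encoding, then with each fallback."""
    if detect is None and chardet is not None:
        detect = chardet.detect
    candidates = []
    if detect is not None:
        guess = detect(raw).get('encoding')  # may be None if detection fails
        if guess:
            candidates.append(guess)
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown codec, try the next candidate
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode('utf-8', errors='replace')
```

Inside builtwith, html = decode_html(response.read()) would then replace the if/else block. The detect parameter exists only so the detector can be swapped out, for example in tests.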

http://blog.csdn.net/fengzhizi76506/article/details/61617067
