Performance comparison of four parsing approaches (BeautifulSoup, PyQuery, lxml, regex)

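This post parses a web page with the four approaches named in the title and compares their speed. Each test parses the page once and then runs the same query for `p` elements 10,000 times. The absolute numbers depend on the machine and the Python version, but the overall picture should not differ much.

Here are my results: lxml with XPath is the fastest and bs4 is the slowest.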

==== Python version: 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] =====

==== Total trials: 10000 =====

bs4 total time: 5.5

pq total time: 0.9

lxml (cssselect) total time: 0.8

lxml (xpath) total time: 0.5

regex total time: 1.1 (doesn't find all p)

Here is the test code:

# -*- coding: utf-8 -*-

"""

@Datetime: 2019/3/13

@Author: Zhang Yafei

"""

import re

import sys

import time

import requests

from lxml.html import fromstring

from pyquery import PyQuery as pq

from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

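def Timer():

    """Generator that yields the seconds elapsed since the previous next() call."""

    a = time.time()

    while True:

        c = time.time()

        yield c - a  # close to 0 on the first next() call

        a = c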

# ################# start request #################

timer = Timer()

url = "https://www.python.org/"

html = requests.get(url, headers=headers).text

num = 10000

print('\n==== Python version: %s =====' % sys.version)

print('\n==== Total trials: %s =====' % num)

next(timer)

# ################# bs4 #########################

soup = bs(html, 'lxml')

for x in range(num):

    paragraphs = soup.find_all('p')  # find_all is the current name; findAll is a legacy alias

t = next(timer)

print('bs4 total time: %.1f' % t)

# ################ pyquery #######################

d = pq(html)

for x in range(num):

    paragraphs = d('p')

t = next(timer)

print('pq total time: %.1f' % t)

# ############### lxml css #########################

tree = fromstring(html)

for x in range(num):

    paragraphs = tree.cssselect('p')

t = next(timer)

print('lxml (cssselect) total time: %.1f' % t)

# ############## lxml xpath #######################

tree = fromstring(html)

for x in range(num):

    paragraphs = tree.xpath('.//p')

t = next(timer)

print('lxml (xpath) total time: %.1f' % t)

# ############### re ##########################

for x in range(num):

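    # Note: '<[p ]>' only matches a bare '<p>' (or '< >'), and '.' does not cross newlines,

    # so <p> tags with attributes or multi-line content are missed.

    paragraphs = re.findall('<[p ]>.*?</p>', html)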

t = next(timer)

print('regex total time: %.1f (doesn\'t find all p)\n' % t)
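The note "doesn't find all p" in the regex result follows from the pattern itself. The character class `[p ]` matches exactly one character, so `<[p ]>` only matches a bare `<p>` (or `< >`) and never `<p class="...">`. Without `re.DOTALL`, `.` also stops at a newline. The sketch below illustrates this on a small made-up HTML snippet. `SAMPLE_HTML` and `BETTER_PATTERN` are assumptions for illustration and are not part of the original test; the adjusted pattern is still only a rough approximation, because regular expressions cannot handle HTML in general.

```python
import re

# Hypothetical sample page, made up for illustration.
SAMPLE_HTML = """<div>
<p>plain paragraph</p>
<p class="intro">paragraph with an attribute</p>
<p>paragraph that spans
two lines</p>
</div>"""

# Pattern used in the test code above.
ORIGINAL_PATTERN = r'<[p ]>.*?</p>'

# Assumed alternative: allow attributes after "<p" and let "." cross newlines.
BETTER_PATTERN = r'<p\b[^>]*>.*?</p>'

original_matches = re.findall(ORIGINAL_PATTERN, SAMPLE_HTML)
better_matches = re.findall(BETTER_PATTERN, SAMPLE_HTML,
                            flags=re.DOTALL | re.IGNORECASE)

print(len(original_matches))  # only the bare single-line <p> is found
print(len(better_matches))    # all three paragraphs are found
```

A pattern that matches more text may also take a different amount of time, so the regex timing above is not directly comparable with the other four, which do return every `p` element.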

Test code 2

# -*- coding: utf-8 -*-

"""

@Datetime: 2019/3/13

@Author: Zhang Yafei

"""

import functools

import re

import sys

import time

import requests

from bs4 import BeautifulSoup as bs

from lxml.html import fromstring

from pyquery import PyQuery as pq

def timeit(fun):

    @functools.wraps(fun)

    def wrapper(*args, **kwargs):

        start_time = time.time()

        res = fun(*args, **kwargs)

        print('Elapsed time: %.6f s' % (time.time() - start_time))

        return res

    return wrapper

@timeit  # equivalent to: time1 = timeit(time1)

def time1(n):

    return [i * 2 for i in range(n)]

# ################# start request #################

url = "https://www.taobao.com/"

html = requests.get(url).text

num = 10000

print('\n==== Python version: %s =====' % sys.version)

print('\n==== Total trials: %s =====' % num)

@timeit

def bs4_test():

    soup = bs(html, 'lxml')

    for x in range(num):

        paragraphs = soup.find_all('p')  # find_all is the current name; findAll is a legacy alias

    print('bs4 total time:')

@timeit

def pq_test():

    d = pq(html)

    for x in range(num):

        paragraphs = d('p')

    print('pq total time:')

@timeit

def lxml_css():

    tree = fromstring(html)

    for x in range(num):

        paragraphs = tree.cssselect('p')

    print('lxml (cssselect) total time:')

@timeit

def lxml_xpath():

    tree = fromstring(html)

    for x in range(num):

        paragraphs = tree.xpath('.//p')

    print('lxml (xpath) total time:')

@timeit

def re_test():

    for x in range(num):

        # Note: '<[p ]>' only matches a bare '<p>' (or '< >'), so this does not find every <p>.

        paragraphs = re.findall('<[p ]>.*?</p>', html)

    print('regex total time:')

if __name__ == '__main__':

    bs4_test()

    pq_test()

    lxml_css()

    lxml_xpath()

    re_test()
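Both scripts measure with `time.time()`, and the decorator prints the duration instead of returning it. For short intervals, `time.perf_counter()` from the standard library is generally the better-suited clock. Here is a minimal sketch of a decorator variant that returns the elapsed time so it can be compared or stored. `timed` and `count_paragraph_tags` are hypothetical names, and the workload is a standard-library stand-in rather than one of the four parsers.

```python
import functools
import re
import time


def timed(fun):
    """Decorator (hypothetical name) that returns (result, elapsed_seconds)."""
    @functools.wraps(fun)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fun(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper


@timed
def count_paragraph_tags(html, trials):
    """Stand-in workload: run the same regex query `trials` times."""
    pattern = re.compile(r'<p\b[^>]*>')
    matches = []
    for _ in range(trials):
        matches = pattern.findall(html)
    return len(matches)


# Made-up input: 50 copies of two <p> elements, queried 1000 times.
count, elapsed = count_paragraph_tags('<p>a</p><p class="x">b</p>' * 50, 1000)
print('found %d <p> tags, %.6f s for 1000 trials' % (count, elapsed))
```

Returning the elapsed time keeps the measurement separate from the reporting, so the same decorator could wrap `bs4_test`, `pq_test` and the others without each function printing its own label.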

Test results

==== Python version: 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] =====

==== Total trials: 10000 =====

bs4 total time:

Elapsed time: 9.049424 s

pq total time:

Elapsed time: 0.899639 s

lxml (cssselect) total time:

Elapsed time: 0.841596 s

lxml (xpath) total time:

Elapsed time: 0.619440 s

regex total time:

Elapsed time: 1.207861 s
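To put the two runs side by side, the totals can be converted into an amortized per-call cost and a ratio relative to lxml XPath. The numbers below are copied from the two outputs in this post (the first run is rounded to one decimal, as printed). The variable and function names are only for illustration.

```python
TRIALS = 10000

# Total seconds copied from the two outputs in this post.
run1 = {'bs4': 5.5, 'pq': 0.9, 'lxml css': 0.8, 'lxml xpath': 0.5, 'regex': 1.1}
run2 = {'bs4': 9.049424, 'pq': 0.899639, 'lxml css': 0.841596,
        'lxml xpath': 0.619440, 'regex': 1.207861}


def summarize(totals, trials=TRIALS):
    """Return {name: (microseconds per call, ratio to lxml xpath)}.

    The per-call figure is amortized: each library total also includes
    the single initial parse of the page.
    """
    base = totals['lxml xpath']
    return {name: (t / trials * 1e6, t / base) for name, t in totals.items()}


for label, totals in (('python.org', run1), ('taobao.com', run2)):
    print(label)
    for name, (per_call_us, ratio) in summarize(totals).items():
        print('  %-11s %7.1f us/call  %5.1fx xpath' % (name, per_call_us, ratio))
```

In both runs bs4 comes out more than ten times slower than lxml XPath (about 11x on python.org and about 14.6x on taobao.com), while PyQuery and lxml with cssselect stay within roughly a factor of two of XPath.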
