PuppeteerSharp + AngleSharp Crawler in Practice: Scraping Autohome Data
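I started from the DotNetSpider sample,
but DotNetSpider felt too heavy for this job: it is a fairly complete crawler framework.
After comparing the headless browsers listed below, I settled on PuppeteerSharp + AngleSharp for this crawler sample.
Like the post referenced above, it uses the Autohome page https://store.mall.autohome.com.cn/83106681.html as the scraping target.
PuppeteerSharp fetches the final page (that is, the page after its JavaScript has run), and AngleSharp parses the resulting HTML document.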
Headless Browsers
A list of (almost) all headless web browsers in existence
A web browser without a graphical user interface, controlled programmatically. Used for automation, testing, and other purposes.
Browser engines
These browser engines fully render web pages or run JavaScript in a virtual DOM
| Name | About | Supported Languages | License |
|---|---|---|---|
| Chromium Embedded Framework | CEF is an open source project based on the Google Chromium project. | JavaScript | BSD |
| Erik | Headless browser on top of Kanna and WebKit. | Swift | MIT |
| jBrowserDriver | A Selenium-compatible headless browser which is written in pure Java. WebKit-based. Works with any of the Selenium Server bindings. | Java | Apache License v2.0 |
| PhantomJS | [Unmaintained] PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R(via Selenium) | BSD 3-Clause |
| Splash | Splash is a javascript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and QT. | Any | BSD 3-Clause |
Multi drivers
These libraries can control multiple browser engines (typically using Selenium)
| Name | About | Supported Languages | License |
|---|---|---|---|
| CasperJS | CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). | JavaScript | MIT |
| Geb | Geb is a Groovy interface to WebDriver. | Groovy | Apache |
| Selenium | Selenium is a suite of tools to automate web browsers across many platforms. | JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R | Apache |
| Splinter | Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items. | Python | - |
| SST | SST (selenium-simple-test) is a web test framework that uses Python to generate functional browser-based tests. | Python | - |
| Watir | The most elegant way to use Selenium WebDriver with ruby. | Ruby | MIT |
PhantomJS drivers
These libraries control PhantomJS
| Name | About | Supported Languages | License |
|---|---|---|---|
| Ghostbuster | Automated browser testing via phantom.js, with all of the pain taken out! That means you get a real browser, with a real DOM, and can do real testing! | JavaScript | Not specified |
| jedi-crawler | Lightsabing Node/PhantomJS crawler; scrape dynamic content : without the hassle | JavaScript | Not specified |
| Lotte | Lotte is a headless, automated testing framework built on top of PhantomJS and inspired by Ghostbuster. | JavaScript | MIT |
| phantompy | Phantompy is a headless WebKit engine with a powerful Pythonic API built on top of Qt5 WebKit | Python | LGPL-2.1 |
| X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT |
| Horseman | Promise based Node.js module for PhantomJS. Features chainable API, understandable control-flow, support for multiple tabs, and built-in jQuery. | JavaScript | MIT |
Chromium drivers
These libraries control Chromium
| Name | About | Supported Languages | License |
|---|---|---|---|
| Awesomium | Chromium-based headless browser engine | C++, .NET | Free/Commercial |
| Headless Chromium | Chromium feature activated with the --headless flag, currently available in the nightly build of Chromium, not yet released | C++ | Open source |
| Puppeteer | Headless Chrome Node API from the Chrome DevTools team | JavaScript | Apache |
| PuppeteerSharp | PuppeteerSharp is a .NET port of the official Headless Chrome Node.js Puppeteer API | .NET | MIT |
| chrome-remote-interface | Chrome Debugging Protocol interface for Node.js | JavaScript | MIT |
| Chromy | Features chainable API, mobile emulation, fundamental API such as javascript evaluation. | JavaScript | MIT |
| chromedp | A faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol. | Go | MIT |
| Chromeless | Chrome automation made simple. Runs locally or headless on AWS Lambda. | JavaScript | MIT |
Webkit drivers
These drivers control an in-process instance of Webkit
| Name | About | Supported Languages | License |
|---|---|---|---|
| Browserjet | Runs a custom build of webkit, controlled by node.js interface. | JavaScript | Not specified |
| ghost.py | ghost.py is a webkit web client written in python. | Python | MIT |
| headless_browser | Headless browser based on WebKit written in C++. | C++ | Not Specified |
| Jabba-Webkit | Jabba's headless webkit browser for scraping AJAX-powered webpages. | Python | Not specified |
| Jasmine-Headless-Webkit | jasmine-headless-webkit uses the QtWebKit widget to run your specs without needing to render a pixel. | Python, JavaScript, Ruby | Free |
| Python-Webkit | Python-Webkit is a python extension to Webkit to add full, complete access to Webkit's DOM | Python | GNU |
| Spynner | Programmatic web browsing module with AJAX support for Python | Python | Not specified |
| Webloop | Scriptable, headless WebKit with a Go API. | Go | BSD 3-Clause |
| wkhtmltopdf wkhtmltox wkhtmltoimage | Command line tool rendering HTML into PDF and other image formats. | shell, C | LGPLv3 |
| WKZombie | Functional headless browser (with JSON support) for iOS using WebKit and hpple/libxml2. | Swift | MIT |
Other drivers
These libraries control lesser known browsers or OS-provided web libraries
| Name | About | Supported Languages | License |
|---|---|---|---|
| Nightmare | Nightmare is a high-level browser automation library built as an easier alternative to PhantomJS. It runs on the Electron engine. | JavaScript | MIT |
| grope | A RubyCocoa interface to the macOS WebKit Framework | RubyCocoa | MIT |
| SlimerJS | SlimerJS is similar to PhantomJS, except that it runs Gecko, the browser engine of Mozilla Firefox, instead of WebKit (and it is not yet truly headless). | JavaScript | Mozilla 2.0 |
| SpecterJS | A scriptable headless Internet Explorer port of PhantomJS. | JavaScript | MIT |
| trifleJS | A headless Internet Explorer browser using the WebBrowser Class with a Javascript API running on the V8 engine. | JavaScript | MIT |
Fake Browser Engine
These libraries are typically naive or HTML-only browsers
| Name | About | Supported Languages | License |
|---|---|---|---|
| AngleSharp | HTML parsing library | .NET | MIT |
| Guillotine | A headless browser, written in C# | .NET | LGPL-3.0 |
| benv | Stub a browser environment in node.js and headlessly test your client-side code. | JavaScript | MIT |
| browser.rb | Headless Ruby browser on top of Nokogiri and TheRubyRacer | Ruby | Not specified |
| BrowserKit | BrowserKit simulates the behavior of a web browser. | PHP | MIT |
| DamonJS | Bot navigating urls and doing tasks. | JavaScript | Apache |
| Headless | Headless browser support for fast web acceptance testing in .NET | .NET | MIT |
| HeadlessBrowser | A very miniature headless browser, for testing the DOM on Node.js | JavaScript | Not specified |
| HtmlUnit | HtmlUnit is a "GUI-Less browser for Java programs". | Java | Apache |
| Jaunt | Java Web Scraping & Automation API | Java | Not specified |
| JSDom | A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js. | JavaScript | MIT |
| MechanicalSoup | A Python library for automating interaction with websites. | Python | MIT |
| mechanize | Stateful programmatic web browsing. | Python | BSD 3-Clause, ZPL 2.1 |
| node-as-browser | Create a browser-like environment within Node.js | JavaScript | MIT |
| RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. | Python | BSD 3-Clause |
| SimpleBrowser | A flexible and intuitive web browser engine designed for automation tasks. Built on the .NET 4 framework. | .NET | BSD 3-Clause |
| stanislaw | Naive, mechanize-like HTML parser/form driver. | Python | Not specified |
| twill | Twill is a simple language that interacts with basic HTML pages (no JavaScript support). | Python | MIT |
| WeasyPrint | WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing. | Python | BSD 3-Clause |
| WWW::Mechanize | Headless browser for Perl with many plugins and extensions, notably Test::WWW:Mechanize for testing | Perl | Perl 5 |
| X-RAY | Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP) | JavaScript | MIT |
| Xidel (Internet Tools) | An XQuery-based cli web scraper for static X/HTML pages and JSON-APIs. | FreePascal, XQuery | GPL-2 |
| Zombie.js | Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required. | JavaScript | MIT |
Runs in a browser
| Name | About | Supported Languages | License |
|---|---|---|---|
| DalekJS | [unmaintained and recommend TestCafé] Automated cross browser testing with JavaScript. | JavaScript | MIT |
| TestCafé | Automated browser testing for the modern web development stack. | JavaScript | MIT |
| Sahi | Sahi is a cross-browser automation/testing tool with the facility to record and playback scripts. | JavaScript, Java, Ruby, PHP | Apache / Commercial |
| WatiN | Web Application Testing In .NET | .NET | Apache 2.0 |
Misc tools
| Name | About | Supported Languages | License |
|---|---|---|---|
| browser-launcher | Detect and launch browser versions, headlessly or otherwise | JavaScript | MIT |
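If the data does not depend on JavaScript to load, AngleSharp alone is enough.
Once the data is loaded by JavaScript, though, you need a real headless browser component.
AngleSharp currently executes only simple JavaScript; anything slightly more complex fails. Full JavaScript support is reportedly planned, so stay tuned!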
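As a minimal sketch of the "AngleSharp alone" case, the snippet below parses a static HTML string without any browser. The `StaticHtmlSample` class, its `SampleHtml` markup, and the `ExtractTitles` helper are hypothetical names made up for illustration; the markup only imitates the `li.carbox` structure used later in this post.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

internal static class StaticHtmlSample
{
    // Hypothetical markup that imitates the list structure scraped later in this post.
    public const string SampleHtml = @"
<ul class=""list"">
  <li class=""carbox""><a href=""/item/1.html""><div class=""carbox-title"">Car A</div></a></li>
  <li class=""carbox""><a href=""/item/2.html""><div class=""carbox-title"">Car B</div></a></li>
</ul>";

    // Parses an HTML string and returns the trimmed text of every title element.
    public static List<string> ExtractTitles(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);
        return document
            .QuerySelectorAll("li.carbox div.carbox-title")
            .Select(element => element.TextContent.Trim())
            .ToList();
    }
}
```

Calling `StaticHtmlSample.ExtractTitles(StaticHtmlSample.SampleHtml)` should yield the two titles `Car A` and `Car B`. No script runs here, so this approach only sees what is already present in the HTML source.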
Code
/*
 * A PuppeteerSharp + AngleSharp crawler console app sample.
 */
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
using Newtonsoft.Json;
using PuppeteerSharp;

namespace CrawlerSamples
{
    internal class Program
    {
        private const string Url = "https://store.mall.autohome.com.cn/83106681.html";
        private const int ChromiumRevision = BrowserFetcher.DefaultRevision;

        private static async Task Main(string[] args)
        {
            // Download the Chromium browser revision package.
            await new BrowserFetcher().DownloadAsync(ChromiumRevision);

            // Test AngleSharp.
            await TestAngleSharp();

            Console.ReadKey();
        }

        private static async Task TestAngleSharp()
        {
            /*
             * Alternative: load the HTML document with AngleSharp alone.
             * WithJavaScript() requires the AngleSharp.Scripting.Javascript NuGet package.
             * Note: JavaScript support is experimental and does not handle complex scripts.
             */
            //IConfiguration config = Configuration.Default.WithDefaultLoader().WithCss().WithCookies().WithJavaScript();
            //IBrowsingContext context = BrowsingContext.New(config);
            //IDocument document = await context.OpenAsync(Url);

            // Load the HTML document with PuppeteerSharp instead.
            var htmlString = await TestPuppeteerSharp();

            // Parse the HTML document string.
            var context = BrowsingContext.New(Configuration.Default);
            var parser = context.GetService<IHtmlParser>();
            var document = parser.ParseDocument(htmlString);

            // Select the list of carbox elements.
            var carboxList = document.QuerySelectorAll("div.shop-content div.content div.list li.carbox");

            var carModelList = new List<CarModel>();
            foreach (var carbox in carboxList)
            {
                // Parse the element and convert it to a car model object.
                var model = CreateModelWithAngleSharp(carbox);
                carModelList.Add(model);

                // Print it to the console window.
                var jsonString = JsonConvert.SerializeObject(model);
                Console.WriteLine(jsonString);
                Console.WriteLine();
            }

            Console.WriteLine("Total count:" + carModelList.Count);
        }

        private static async Task<string> TestPuppeteerSharp()
        {
            // Enable the headless option.
            var launchOptions = new LaunchOptions { Headless = true };

            // Start the headless browser.
            var browser = await Puppeteer.LaunchAsync(launchOptions);
            try
            {
                // Open a new tab page.
                var page = await browser.NewPageAsync();

                // Request the URL to load the page.
                await page.GoToAsync(Url);

                // Get and return the HTML content of the page.
                return await page.GetContentAsync();
            }
            finally
            {
                // Close the headless browser (and with it all pages), even if loading failed.
                await browser.CloseAsync();
            }
        }

        private static CarModel CreateModelWithAngleSharp(IParentNode node)
        {
            // QuerySelector returns null when nothing matches, so use ?. to avoid a NullReferenceException.
            var model = new CarModel
            {
                Title = node.QuerySelector("a div.carbox-title")?.TextContent,
                ImageUrl = node.QuerySelector("a div.carbox-carimg img")?.GetAttribute("src"),
                ProductUrl = node.QuerySelector("a")?.GetAttribute("href"),
                Tip = node.QuerySelector("a div.carbox-tip")?.TextContent,
                OrdersNumber = node.QuerySelector("a div.carbox-number span")?.TextContent
            };

            return model;
        }
    }

    // Plain data holder for one car listing (string properties inferred from the assignments above).
    internal class CarModel
    {
        public string Title { get; set; }
        public string ImageUrl { get; set; }
        public string ProductUrl { get; set; }
        public string Tip { get; set; }
        public string OrdersNumber { get; set; }
    }
}
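A page that fills in its list via JavaScript may not have finished rendering when `GoToAsync` returns. The following is a hedged sketch, not part of the original sample, that waits for a selector before reading the HTML. The `RenderedHtmlSample` class, the `GetRenderedHtmlAsync` helper, and the 30-second timeout are assumptions chosen for illustration; it also assumes Chromium has already been downloaded via `BrowserFetcher`.

```csharp
using System.Threading.Tasks;
using PuppeteerSharp;

internal static class RenderedHtmlSample
{
    // Loads a page in headless Chromium and returns its HTML once the
    // element matched by readySelector has appeared in the DOM.
    public static async Task<string> GetRenderedHtmlAsync(string url, string readySelector)
    {
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        try
        {
            var page = await browser.NewPageAsync();
            await page.GoToAsync(url);

            // Throws if the selector does not appear within the timeout (milliseconds).
            await page.WaitForSelectorAsync(
                readySelector,
                new WaitForSelectorOptions { Timeout = 30000 });

            return await page.GetContentAsync();
        }
        finally
        {
            // Closing the browser also closes every page it opened.
            await browser.CloseAsync();
        }
    }
}
```

For the page used in this post, a call might look like `await RenderedHtmlSample.GetRenderedHtmlAsync(Url, "li.carbox")`, with the selector taken from the query in the listing above.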
Result

Note
Note that on the first run, this line:
await new BrowserFetcher().DownloadAsync(ChromiumRevision);
downloads a portable browser package, download-Win64-536395.zip, to your local machine; it unpacks to a Chromium browser. This takes some time.
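Because the first download can take a while, it may help to print progress so the console does not look frozen. This is a sketch under the assumption that the PuppeteerSharp version in use matches the one in this post (integer revisions, and a `DownloadProgressChanged` event on `BrowserFetcher`); the `ChromiumDownloadSample` class and `EnsureChromiumAsync` helper are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

internal static class ChromiumDownloadSample
{
    // Downloads the given Chromium revision and returns the path of the
    // browser executable, printing download progress along the way.
    public static async Task<string> EnsureChromiumAsync(int revision)
    {
        var fetcher = new BrowserFetcher();

        // Assumed API: the event reports download progress as a percentage.
        fetcher.DownloadProgressChanged += (sender, e) =>
            Console.Write($"\rDownloading Chromium: {e.ProgressPercentage}%");

        var info = await fetcher.DownloadAsync(revision);
        Console.WriteLine();

        return info.ExecutablePath;
    }
}
```

In the sample above it could replace the bare download call, for example `await ChromiumDownloadSample.EnsureChromiumAsync(ChromiumRevision);`.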
Source
https://github.com/VAllens/CrawlerSamples