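I used the DotNetSpider sample as a reference.

DotNetSpider felt too heavy for my needs; it is a fairly complete crawler framework.

After comparing the headless browsers listed below, I settled on PuppeteerSharp + AngleSharp to write a crawler sample.

Like the DotNetSpider sample, this one uses the Autohome page https://store.mall.autohome.com.cn/83106681.html as the data-collection example.

This article uses PuppeteerSharp to obtain the final page (that is, the page after its JavaScript has run) and AngleSharp to parse the resulting HTML document.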

Headless Browsers

A list of (almost) all headless web browsers in existence

A web browser without a graphical user interface, controlled programmatically. Used for automation, testing, and other purposes.

Browser engines

These browser engines fully render web pages or run JavaScript in a virtual DOM

Name	About	Supported Languages	License
Chromium Embedded Framework	CEF is an open source project based on the Google Chromium project.	JavaScript	BSD
Erik	Headless browser on top of Kanna and WebKit.	Swift	MIT
jBrowserDriver	A Selenium-compatible headless browser which is written in pure Java. WebKit-based. Works with any of the Selenium Server bindings.	Java	Apache License v2.0
PhantomJS	[Unmaintained] PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.	JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R(via Selenium)	BSD 3-Clause
Splash	Splash is a javascript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and QT.	Any	BSD 3-Clause

Multi drivers

These libraries can control multiple browser engines (typically using Selenium)

Name	About	Supported Languages	License
CasperJS	CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko).	JavaScript	MIT
Geb	Geb is a Groovy interface to WebDriver.	Groovy	Apache
Selenium	Selenium is a suite of tools to automate web browsers across many platforms.	JavaScript, Python, Ruby, Java, C#, Haskell, Objective-C, Perl, PHP, R	Apache
Splinter	Splinter is an open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items.	Python	-
SST	SST (selenium-simple-test) is a web test framework that uses Python to generate functional browser-based tests.	Python	-
Watir	The most elegant way to use Selenium WebDriver with ruby.	Ruby	MIT

PhantomJS drivers

These libraries control PhantomJS

Name	About	Supported Languages	License
Ghostbuster	Automated browser testing via phantom.js, with all of the pain taken out! That means you get a real browser, with a real DOM, and can do real testing!	JavaScript	Not specified
jedi-crawler	Lightsabing Node/PhantomJS crawler; scrape dynamic content : without the hassle	JavaScript	Not specified
Lotte	Lotte is a headless, automated testing framework built on top of PhantomJS and inspired by Ghostbuster.	JavaScript	MIT
phantompy	Phantompy is a headless WebKit engine with a powerful Pythonic API built on top of Qt5 WebKit	Python	LGPL-2.1
X-RAY	Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP)	JavaScript	MIT
Horseman	Promise based Node.js module for PhantomJS. Features chainable API, understandable control-flow, support for multiple tabs, and built-in jQuery.	JavaScript	MIT

Chromium drivers

These libraries control Chromium

Name	About	Supported Languages	License
Awesomium	Chromium-based headless browser engine	C++, .NET	Free/Commercial
Headless Chromium	Chromium feature activated with the `--headless` flag, currently available in the nightly build of Chromium, not yet released	C++	Open source
Puppeteer	Headless Chrome Node API from the Chrome DevTools team	JavaScript	Apache
PuppeteerSharp	PuppeteerSharp is a port of the official Headless Chrome Node.JS Puppeteer API	.NET	MIT
chrome-remote-interface	Chrome Debugging Protocol interface for Node.js	JavaScript	MIT
Chromy	Features chainable API, mobile emulation, fundamental API such as javascript evaluation.	JavaScript	MIT
chromedp	A faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using the Chrome Debugging Protocol.	Go	MIT
Chromeless	Chrome automation made simple. Runs locally or headless on AWS Lambda.	JavaScript	MIT

Webkit drivers

These drivers control an in-process instance of Webkit

Name	About	Supported Languages	License
Browserjet	Runs a custom build of webkit, controlled by node.js interface.	JavaScript	Not specified
ghost.py	ghost.py is a webkit web client written in python.	Python	MIT
headless_browser	Headless browser based on WebKit written in C++.	C++	Not Specified
Jabba-Webkit	Jabba's headless webkit browser for scraping AJAX-powered webpages.	Python	Not specified
Jasmine-Headless-Webkit	jasmine-headless-webkit uses the QtWebKit widget to run your specs without needing to render a pixel.	Python, JavaScript, Ruby	Free
Python-Webkit	Python-Webkit is a python extension to Webkit to add full, complete access to Webkit's DOM	Python	GNU
Spynner	Programmatic web browsing module with AJAX support for Python	Python	Not specified
Webloop	Scriptable, headless WebKit with a Go API.	Go	BSD 3-Clause
wkhtmltopdf wkhtmltox wkhtmltoimage	Command line tool rendering HTML into PDF and other image formats.	shell, C	LGPLv3
WKZombie	Functional headless browser (with JSON support) for iOS using WebKit and hpple/libxml2.	Swift	MIT

Other drivers

These libraries control lesser known browsers or OS-provided web libraries

Name	About	Supported Languages	License
Nightmare	Nightmare is a high-level browser automation library built as an easier alternative to PhantomJS. It runs on the Electron engine.	JavaScript	MIT
grope	A RubyCocoa interface to the macOS WebKit Framework	RubyCocoa	MIT
SlimerJS	SlimerJS is similar to PhantomJS, except that it runs Gecko, the browser engine of Mozilla Firefox, instead of WebKit (and it is not yet truly headless).	JavaScript	Mozilla 2.0
SpecterJS	A scriptable headless Internet Explorer port of PhantomJS.	JavaScript	MIT
trifleJS	A headless Internet Explorer browser using the WebBrowser Class with a Javascript API running on the V8 engine.	JavaScript	MIT

Fake Browser Engine

These libraries are typically naive or HTML-only browsers

Name	About	Supported Languages	License
AngleSharp	HTML parsing library	.NET	MIT
Guillotine	A headless browser, written in C#	.NET	LGPL-3.0
benv	Stub a browser environment in node.js and headlessly test your client-side code.	JavaScript	MIT
browser.rb	Headless Ruby browser on top of Nokogiri and TheRubyRacer	Ruby	Not specified
BrowserKit	BrowserKit simulates the behavior of a web browser.	PHP	MIT
DamonJS	Bot navigating urls and doing tasks.	JavaScript	Apache
Headless	Headless browser support for fast web acceptance testing in .NET	.NET	MIT
HeadlessBrowser	A very miniature headless browser, for testing the DOM on Node.js	JavaScript	Not specified
HtmlUnit	HtmlUnit is a "GUI-Less browser for Java programs".	Java	Apache
Jaunt	Java Web Scraping & Automation API	Java	Not specified
JSDom	A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.	JavaScript	MIT
MechanicalSoup	A Python library for automating interaction with websites.	Python	MIT
mechanize	Stateful programmatic web browsing.	Python	BSD 3-Clause, ZPL 2.1
node-as-browser	Create a browser-like environment within Node.js	JavaScript	MIT
RoboBrowser	A simple, Pythonic library for browsing the web without a standalone web browser.	Python	BSD 3-Clause
SimpleBrowser	A flexible and intuitive web browser engine designed for automation tasks. Built on the .NET 4 framework.	.NET	BSD 3-Clause
stanislaw	Naive, mechanize-like HTML parser/form driver.	Python	Not specified
twill	Twill is a simple language that interacts with basic HTML pages (no JavaScript support).	Python	MIT
WeasyPrint	WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing.	Python	BSD 3-Clause
WWW::Mechanize	Headless browser for Perl with many plugins and extensions, notably Test::WWW:Mechanize for testing	Perl	Perl 5
X-RAY	Supports strings, arrays, arrays of objects, nested object structures, selector API, pagination, crawler, concurrency, throttles, delays, timeouts, and pluggable drivers (PhantomJS, HTTP)	JavaScript	MIT
Xidel (Internet Tools)	An XQuery-based cli web scraper for static X/HTML pages and JSON-APIs.	FreePascal, XQuery	GPL-2
Zombie.js	Zombie.js is a lightweight framework for testing client-side JavaScript code in a simulated environment. No browser required.	JavaScript	MIT

Runs in a browser

Name	About	Supported Languages	License
DalekJS	[Unmaintained; TestCafé is recommended instead] Automated cross browser testing with JavaScript.	JavaScript	MIT
TestCafé	Automated browser testing for the modern web development stack.	JavaScript	MIT
Sahi	Sahi is a cross-browser automation/testing tool with the facility to record and playback scripts.	JavaScript, Java, Ruby, PHP	Apache / Commercial
WatiN	Web Application Testing In .NET	.NET	Apache 2.0

Misc tools

Name	About	Supported Languages	License
browser-launcher	Detect and launch browser versions, headlessly or otherwise	JavaScript	MIT

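In fact, if the page does not rely on JavaScript to load its data, AngleSharp alone is enough.

Once the data is loaded by JavaScript, though, you need a real headless browser component.

AngleSharp currently supports executing only simple JavaScript; anything slightly more complex fails. Full JavaScript support is reportedly planned, so stay tuned!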
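
As a rough illustration of the static case, the sketch below parses markup that is already complete using AngleSharp alone, with no browser involved. The HTML fragment and the StaticPageSample class are made up for this example; only the carbox class names are borrowed from the Autohome page used in this article.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

internal static class StaticPageSample
{
    // Hypothetical stand-in for a page whose markup needs no JavaScript to be complete.
    private const string Html = @"<ul class='list'>
  <li class='carbox'><a href='/car/1.html'><div class='carbox-title'>Car A</div></a></li>
  <li class='carbox'><a href='/car/2.html'><div class='carbox-title'>Car B</div></a></li>
</ul>";

    // Parses the given HTML with AngleSharp and returns the trimmed text of every carbox title.
    public static async Task<string[]> GetTitlesAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);

        // Feed the markup to the browsing context as a virtual response (no network access).
        var document = await context.OpenAsync(req => req.Content(html));

        return document.QuerySelectorAll("li.carbox div.carbox-title")
                       .Select(element => element.TextContent.Trim())
                       .ToArray();
    }

    private static async Task Main()
    {
        foreach (var title in await GetTitlesAsync(Html))
        {
            Console.WriteLine(title);
        }
    }
}
```

To fetch a live static page instead, the configuration would need a loader (Configuration.Default.WithDefaultLoader()) and the URL would be passed to context.OpenAsync, as in the commented-out lines of the listing below.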

Code

/*
 * A PuppeteerSharp + AngleSharp crawler console app sample
 */
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
using Newtonsoft.Json;
using PuppeteerSharp;

namespace CrawlerSamples
{
    // Not shown in the original listing; inferred from the properties
    // assigned in CreateModelWithAngleSharp below.
    internal class CarModel
    {
        public string Title { get; set; }
        public string ImageUrl { get; set; }
        public string ProductUrl { get; set; }
        public string Tip { get; set; }
        public string OrdersNumber { get; set; }
    }

    internal class Program
    {
        private const string Url = "https://store.mall.autohome.com.cn/83106681.html";
        private const int ChromiumRevision = BrowserFetcher.DefaultRevision;

        private static async Task Main(string[] args)
        {
            // Download the Chromium revision package (only needed on the first run)
            await new BrowserFetcher().DownloadAsync(ChromiumRevision);

            // Fetch the page and parse it with AngleSharp
            await TestAngleSharp();

            Console.ReadKey();
        }

        private static async Task TestAngleSharp()
        {
            /*
             * Alternative: load the HTML document with AngleSharp alone.
             * WithJavaScript requires the AngleSharp.Scripting.Javascript NuGet package.
             * Note: JavaScript support is experimental and does not handle complex scripts.
             */
            //IConfiguration config = Configuration.Default.WithDefaultLoader().WithCss().WithCookies().WithJavaScript();
            //IBrowsingContext context = BrowsingContext.New(config);
            //IDocument document = await context.OpenAsync(Url);

            // Load the rendered HTML with PuppeteerSharp instead
            var htmlString = await TestPuppeteerSharp();

            // Parse the HTML string into a document
            var context = BrowsingContext.New(Configuration.Default);
            var parser = context.GetService<IHtmlParser>();
            var document = parser.ParseDocument(htmlString);

            // Select the list of carbox elements
            var carboxList = document.QuerySelectorAll("div.shop-content div.content div.list li.carbox");

            var carModelList = new List<CarModel>();
            foreach (var carbox in carboxList)
            {
                // Parse the element and convert it to a CarModel object
                var model = CreateModelWithAngleSharp(carbox);
                carModelList.Add(model);

                // Print the model to the console as JSON
                var jsonString = JsonConvert.SerializeObject(model);
                Console.WriteLine(jsonString);
                Console.WriteLine();
            }

            Console.WriteLine("Total count:" + carModelList.Count);
        }

        private static async Task<string> TestPuppeteerSharp()
        {
            // Enable the headless option
            var launchOptions = new LaunchOptions { Headless = true };

            // Start the headless browser
            var browser = await Puppeteer.LaunchAsync(launchOptions);
            try
            {
                // Open a new tab
                var page = await browser.NewPageAsync();

                // Navigate to the URL
                await page.GoToAsync(Url);

                // Get and return the HTML content of the page
                return await page.GetContentAsync();
            }
            finally
            {
                // Close the headless browser even if navigation fails; this also closes all of its pages.
                await browser.CloseAsync();
            }
        }

        private static CarModel CreateModelWithAngleSharp(IParentNode node)
        {
            // QuerySelector returns null when nothing matches, so use ?. to avoid a NullReferenceException
            var model = new CarModel
            {
                Title = node.QuerySelector("a div.carbox-title")?.TextContent,
                ImageUrl = node.QuerySelector("a div.carbox-carimg img")?.GetAttribute("src"),
                ProductUrl = node.QuerySelector("a")?.GetAttribute("href"),
                Tip = node.QuerySelector("a div.carbox-tip")?.TextContent,
                OrdersNumber = node.QuerySelector("a div.carbox-number span")?.TextContent
            };

            return model;
        }
    }
}

Result

Note

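On the first run, this line of code:

await new BrowserFetcher().DownloadAsync(ChromiumRevision);

downloads the portable browser package download-Win64-536395.zip to your machine; once unzipped, it contains a Chromium browser. Expect to wait a while for this to finish.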
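
If a Chrome or Chromium build is already installed locally, one way to avoid that wait is to point PuppeteerSharp at it through LaunchOptions.ExecutablePath and download only as a fallback. The sketch below is one possible arrangement, not part of the original sample: the CHROME_PATH environment variable and the LaunchSample class are assumptions made for illustration, and the DownloadAsync call mirrors the PuppeteerSharp version used in this article (newer releases have changed that API).

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using PuppeteerSharp;

internal static class LaunchSample
{
    // Builds headless launch options; a local browser is used only when the given path exists.
    public static LaunchOptions BuildLaunchOptions(string chromePath)
    {
        var options = new LaunchOptions { Headless = true };
        if (!string.IsNullOrEmpty(chromePath) && File.Exists(chromePath))
        {
            options.ExecutablePath = chromePath;
        }

        return options;
    }

    private static async Task Main()
    {
        // CHROME_PATH is a hypothetical environment variable chosen for this sketch.
        var options = BuildLaunchOptions(Environment.GetEnvironmentVariable("CHROME_PATH"));

        if (options.ExecutablePath == null)
        {
            // No usable local browser: fall back to the download used in the article.
            await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
        }

        var browser = await Puppeteer.LaunchAsync(options);
        try
        {
            var page = await browser.NewPageAsync();
            await page.GoToAsync("https://store.mall.autohome.com.cn/83106681.html");

            // Print the size of the rendered HTML as a quick sanity check.
            Console.WriteLine((await page.GetContentAsync()).Length);
        }
        finally
        {
            await browser.CloseAsync();
        }
    }
}
```

A locally installed browser may differ from the revision PuppeteerSharp was built against, so behavior could vary slightly from the downloaded Chromium.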

Source

https://github.com/VAllens/CrawlerSamples

A PuppeteerSharp + AngleSharp crawler in practice: scraping Autohome data