AngleSharp Web Data Collection -- Parsing HTML with AngleSharp
AngleSharp
AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The included parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code and ensures compatibility with results in evergreen browsers. Also standard DOM features such as querySelector or querySelectorAll work for tree traversal.
⚡️⚡️ Migrating from AngleSharp 0.9 to AngleSharp 0.10 or later? Look at our migration documentation. ⚡️⚡️
Key Features
- Portable (using .NET Standard 1.3)
- Standards conformant (behaves the way evergreen browsers do)
- Great performance (outperforms similar parsers in most scenarios)
- Extensible (extend with your own services)
- Useful abstractions (type helpers, jQuery like construction)
- Fully functional DOM (all the lists, iterators, and events you know)
- Form submission (easily log in everywhere)
- Navigation (a BrowsingContext is like a browser tab - control it from .NET!)
- LINQ enhanced (use LINQ with DOM elements, naturally without wrappers)
The advantage over similar libraries like HtmlAgilityPack is that the exposed DOM uses the official W3C-specified API, i.e., even things like querySelectorAll are available in AngleSharp. The parser also follows the HTML 5.1 specification, which defines error handling and element correction. The AngleSharp library focuses on standards compliance, interactivity, and extensibility. It therefore gives web developers working with C# all the possibilities they know from using the DOM in any modern browser.
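To make the point about error handling and element correction concrete, here is a minimal sketch. It assumes AngleSharp 0.10 or later (where `HtmlParser` lives in `AngleSharp.Html.Parser` and exposes `ParseDocument`); the class name and the malformed sample markup are made up for illustration.

```csharp
using System;
using AngleSharp.Html.Parser;

public static class ErrorCorrectionSketch
{
    // Hypothetical malformed input: no <html>/<body>, and neither <p> is closed.
    private const string BrokenHtml = "<p>one<p>two";

    public static int CountParagraphs()
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(BrokenHtml);

        // Per the HTML parsing rules, a new <p> start tag closes an open <p>,
        // so the tree should contain two sibling paragraphs.
        return document.QuerySelectorAll("p").Length;
    }

    public static void Main()
    {
        Console.WriteLine(CountParagraphs());
    }
}
```

The same `QuerySelectorAll` call works on any document or element, which is the W3C-style API surface the paragraph above refers to.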
The performance of AngleSharp is quite close to the performance of browsers. Even very large pages can be processed within milliseconds. AngleSharp tries to minimize memory allocations and reuses elements internally to avoid unnecessary object creation.
Simple Demo
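This simple example retrieves data from Wikipedia.
using System.Linq;
using AngleSharp;
// Enable the default loader so that documents can be fetched over the network
var config = Configuration.Default.WithDefaultLoader();
var address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
var context = BrowsingContext.New(config);
// Download and parse the page asynchronously
var document = await context.OpenAsync(address);
// Select the third cell of every row marked with the "vevent" class
var cellSelector = "tr.vevent td:nth-child(3)";
var cells = document.QuerySelectorAll(cellSelector);
// Project the matched cells to their text with LINQ
var titles = cells.Select(m => m.TextContent);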
In the example we see:
- How to set up the configuration to support document loading
- How to asynchronously open the document in a new context using that configuration
- How to run a query to get all cells with the content of interest
- That the whole DOM supports LINQ queries
Every collection in AngleSharp supports LINQ statements. AngleSharp also provides many useful extension methods for element collections that cannot be found in the official DOM.
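As a rough illustration of LINQ over a DOM collection without any network access, the sketch below opens a document from an in-memory string. The class name, the sample markup, and the `li.ep` selector are assumptions invented for this example; `OpenAsync(req => req.Content(...))` is the overload for supplying a virtual response.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class LinqOverDomSketch
{
    // Hypothetical sample markup, made up for this sketch.
    private const string Html =
        "<ul><li class='ep'>Pilot</li><li class='ep'> The Big Bran Hypothesis </li><li>Other</li></ul>";

    public static async Task<string[]> GetTitlesAsync()
    {
        // No loader is needed because the content is provided directly.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(Html));

        // The result of QuerySelectorAll is enumerable, so LINQ operators chain directly.
        return document.QuerySelectorAll("li.ep")
            .Select(m => m.TextContent.Trim())
            .Where(t => t.Length > 0)
            .ToArray();
    }

    public static async Task Main()
    {
        foreach (var title in await GetTitlesAsync())
        {
            Console.WriteLine(title);
        }
    }
}
```

Only the two items carrying the `ep` class are returned; the third list item is filtered out by the selector rather than by LINQ.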
Supported Platforms
AngleSharp has been created as a .NET Standard 1.3 (and 2.0) compatible library. This includes, but is not limited to:
- .NET Core (1.0 and 2.0)
- .NET Framework (4.6)
- Xamarin.Android (7.0 and 8.0)
- Xamarin.iOS (10.0 and 10.14)
- Xamarin.Mac (3.0 and 3.8)
- Mono (4.6 and 5.4)
- UWP (10.0 and 10.0.16299)
- Unity (2018.1)
Documentation
The documentation of AngleSharp is located in the docs folder. More examples, best-practices, and general information can be found there. The documentation also contains a list of frequently asked questions.
More information is also available by following some of the links mentioned in the Wiki. In-depth articles will be published on CodeProject, with links being placed in the Wiki at GitHub.
Use-Cases
- Parsing HTML (incl. fragments)
- Parsing CSS (incl. selectors, declarations, ...)
- Constructing HTML (e.g., view-engine)
- Minifying CSS, HTML, ...
- Querying document elements
- Crawling information
- Gathering statistics
- Web automation
- Tools with HTML / CSS / ... support
- Connection to page analytics
- HTML / DOM unit tests
- Automated JavaScript interaction
- Testing other concepts, e.g., script engines
- ...
Vision
The project aims to bring a solid implementation of the W3C DOM for HTML, SVG, MathML, and CSS to the CLR - all written in C#. The idea is that you can basically do everything with the DOM in C# that you can do in JavaScript (plus, of course, more).
Most parts of the DOM are included, even though some may still miss their (fully specified / correct) implementation. The goal for v1.0 is to have all practically relevant parts implemented according to the official W3C specification (with useful extensions by the WHATWG).
The API is close to the DOM4 specification; however, the naming has been adjusted to comply with .NET conventions. Nevertheless, to make AngleSharp really useful for, e.g., a JavaScript engine, attributes have been placed on the corresponding interfaces (and methods, properties, ...) to indicate the status of the field in the official specification. This allows automatic generation of DOM objects with the official API.
This is a long-term project which will eventually result in a state of the art parser for the most important angle bracket based hyper-texts.
Our hope is to build a community around web parsing and libraries from this project. So far we have had great contributions, but that goal has not been fully achieved. Want to help? Get in touch with us!
Participating in the Project
If you know some feature that AngleSharp is currently missing, and you are willing to implement the feature, then your contribution is more than welcome! Also if you have a really cool idea - do not be shy, we'd like to hear it.
If you have an idea how to improve the API (or what is missing) then posts / messages are also welcome. For instance there have been ongoing discussions about some styles that have been used by AngleSharp (e.g., HTMLDocument or HtmlDocument) in the past. In the end AngleSharp stopped using HTMLDocument (at least visible outside of the library). Now AngleSharp uses names like IDocument, IHtmlElement and so on. This change would not have been possible without such fruitful discussions.
The project is always searching for additional contributors, even if you do not have any code to contribute but rather an idea for improvement, a bug report, or a note about a mistake in the documentation. These are the contributions that keep this project active.
Live discussions can take place in our Gitter chat, which supports using GitHub accounts.
More information is found in the contribution guidelines. All contributors can be found in the CONTRIBUTORS file.
This project has also adopted the code of conduct defined by the Contributor Covenant to clarify expected behavior in our community.
For more information see the .NET Foundation Code of Conduct.
Funding / Support
If you use AngleSharp frequently but do not have the time to support the project through active participation, you may still be interested in ensuring that the AngleSharp project keeps the lights on.
Therefore we created a backing model via Bountysource. Any donation is welcome and much appreciated. We will mostly spend the money on dedicated development time to improve AngleSharp where it needs to be improved, plus invest in the web utility eco-system in .NET (e.g., in JavaScript engines, other parsers, or a renderer for AngleSharp to mention some outstanding projects).
Visit Bountysource for more details.
Development
AngleSharp is written in C# 7.1 and thus requires Roslyn as a compiler. Using an IDE like Visual Studio 2017+ is recommended on Windows. Alternatively, VSCode (with OmniSharp or another suitable Language Server Protocol implementation) should be the tool of choice on other platforms.
The code tries to be as clean as possible. Notably the following rules are used:
- Use braces for any conditional / loop body
- Use the -Async suffixed methods when available
- Use VIP ("Var If Possible") style (in C++ called AAA: Almost Always Auto) to place types on the right
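The three rules above might look like this in practice. This is only an illustrative sketch, not code from the AngleSharp repository; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class StyleSketch
{
    public static async Task<int> CountNonEmptyLinesAsync(TextReader reader)
    {
        // "Var If Possible": the type is evident from the right-hand side.
        var count = 0;

        // Prefer the -Async suffixed method over the blocking ReadLine.
        var line = await reader.ReadLineAsync();

        // Braces around every loop and conditional body, even one-liners.
        while (line != null)
        {
            if (line.Trim().Length > 0)
            {
                count++;
            }

            line = await reader.ReadLineAsync();
        }

        return count;
    }
}
```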
More important, however, is the proper usage of tests. Any new feature should come with a set of tests to cover the functionality and prevent regression.
Changelog
A very detailed changelog exists. If you are just interested in major releases then have a look at our own releases document.
.NET Foundation
This project is supported by the .NET Foundation.
License
The MIT License (MIT)
Copyright (c) 2013 - 2019 AngleSharp
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.