Accurately Extracting Domains and Suffixes with the Python tldextract Module

The Python tldextract Module: What It Does

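tldextract accurately separates the generic top-level domain (gTLD) or country-code top-level domain (ccTLD) from the registered domain and subdomains of a URL. For example, given http://www.google.com, suppose you only want the 'google' part. Most people would split on '.' and take the last two elements as the domain and suffix, but that is only accurate for simple domains such as .com. Consider parsing http://forums.bbc.co.uk: the naive splitting method yields 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk'. tldextract instead looks up the Public Suffix List, which records the suffixes under which domains are registered. So, given a URL, it can tell the subdomain from the domain, and the domain from the country code.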

>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')

ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom

ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')

>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan

ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')
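To make the difference between the naive split and a suffix-list lookup concrete, here is a minimal sketch using only the standard library. It is not how tldextract is implemented: `SAMPLE_SUFFIXES`, `naive_split`, and `suffix_aware_split` are hypothetical names, the suffix set is a tiny hand-written sample rather than the real Public Suffix List, and the sketch ignores IP addresses, wildcard rules, and exception rules.

```python
# Hypothetical, hand-written sample; the real Public Suffix List is far larger.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "kg", "org.kg"}


def naive_split(hostname):
    """Take the last two dot-separated labels as (domain, suffix)."""
    labels = hostname.split(".")
    return labels[-2], labels[-1]


def suffix_aware_split(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = hostname.split(".")
    # Try candidates from longest to shortest, so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain, the rest as subdomain.
    return ".".join(labels[:-1]), labels[-1], ""


print(naive_split("forums.bbc.co.uk"))         # ('co', 'uk')
print(suffix_aware_split("forums.bbc.co.uk"))  # ('forums', 'bbc', 'co.uk')
```

The key design choice is matching the longest known suffix first; without it, 'uk' would match before 'co.uk' and the result would be as wrong as the naive split.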

ExtractResult is a namedtuple, so it is simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')

>>> (ext.subdomain, ext.domain, ext.suffix)

('forums', 'bbc', 'co.uk')

>>> # rejoin subdomain and domain

>>> '.'.join(ext[:2])

'forums.bbc'

>>> # a common alias

>>> ext.registered_domain

'bbc.co.uk'

The subdomain and suffix are optional. Not every URL-like input has a subdomain or a valid suffix.

>>> tldextract.extract('google.com')

ExtractResult(subdomain='', domain='google', suffix='com')

>>> tldextract.extract('google.notavalidsuffix')

ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='')

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')

ExtractResult(subdomain='', domain='127.0.0.1', suffix='')

If you want to rejoin the whole namedtuple, regardless of whether a subdomain or suffix was found:

>>> ext = tldextract.extract('http://127.0.0.1:8080/deployed/')

>>> # this has unwanted dots

>>> '.'.join(ext)

'.127.0.0.1.'

>>> # join each part only if it's truthy

>>> '.'.join(part for part in ext if part)

'127.0.0.1'

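This module started out by implementing the accepted answer to a StackOverflow question on getting the "domain name" from a URL. However, the regex solution proposed there does not handle many country codes such as com.au, nor exceptions to country codes such as the registered domain parliament.uk. The Public Suffix List does, and so does this module.

Installing tldextract

Latest release on PyPI: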

pip install tldextract

Or the latest development version:

pip install -e 'git+https://github.com/john-kurkowski/tldextract.git#egg=tldextract'

Command-line usage, which prints the URL's components separated by spaces:

tldextract http://forums.bbc.co.uk

# forums bbc co.uk

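A Note on Caching

The first time the module runs, it updates its suffix list with a live HTTP request. The updated suffix set is then cached indefinitely in /path/to/tldextract/.tld_set. (Arguably, runtime bootstrapping like this should not be the default behavior, for example on production systems. But I want you to have the latest suffixes, especially when I have not kept this code up to date.) To avoid this fetch or to control where the cache lives, use your own extract callable, either by setting the TLDEXTRACT_CACHE environment variable or by passing a cache_file path when initializing TLDExtract.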

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching

no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)

no_fetch_extract('http://www.google.com')

# extract callable that reads/writes the updated TLD set to a different path

custom_cache_extract = tldextract.TLDExtract(cache_file='/path/to/your/cache/file')

custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching

no_cache_extract = tldextract.TLDExtract(cache_file=False)

no_cache_extract('http://www.google.com')

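If you want to stay current with the suffix definitions (though they do not change often), delete the cache file occasionally, or run the update command: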

tldextract --update

or:

env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update

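It is also recommended to delete the cache file after upgrading this library.

Advanced Usage

Specifying your own URL or file for the suffix list data

You can specify your own input data in place of the default Mozilla Public Suffix List: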
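If you would rather clear the cache from a script than by hand, a minimal standard-library sketch might look like the following. The helper name `remove_cache_file` is hypothetical, and the path is only the example location used with TLDEXTRACT_CACHE above; substitute wherever your cache file actually lives.

```python
from pathlib import Path


def remove_cache_file(cache_path):
    """Delete the cache file if it exists; return True if a file was removed."""
    # expanduser() turns a leading '~' into the current user's home directory.
    path = Path(cache_path).expanduser()
    if path.is_file():
        path.unlink()
        return True
    return False


# Assumed cache location, taken from the TLDEXTRACT_CACHE example.
print(remove_cache_file("~/tldextract.cache"))
```

The function only removes the file; a fresh suffix list is then fetched and cached the next time it is needed.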

extract = tldextract.TLDExtract(

 suffix_list_urls=["http://foo.bar.baz"],

 # Recommended: Specify your own cache file, to minimize ambiguities about where

 # tldextract is getting its data, or cached data, from.

 cache_file='/path/to/your/cache/file')

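The snippet above fetches from the URL you specified the first time the suffix list needs to be downloaded (that is, if cache_file does not exist). If you want to use input data from your local filesystem, just use the file:// protocol: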

extract = tldextract.TLDExtract(

 suffix_list_urls=["file:///absolute/path/to/your/local/suffix/list/file"],

 cache_file='/path/to/your/cache/file')

Use an absolute path when specifying the suffix_list_urls keyword argument. os.path is your friend here.

If I pass an invalid URL, I still get a result and no error. What gives?
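One way to build a well-formed file:// URL from an absolute path is the standard library's pathlib. This sketch assumes a POSIX-style filesystem, and the path itself is only a placeholder.

```python
from pathlib import Path

# Placeholder location of a local copy of the suffix list (an assumption).
local_list = Path("/absolute/path/to/your/local/suffix/list/file")

# as_uri() raises ValueError for relative paths, which catches mistakes early.
suffix_list_url = local_list.as_uri()
print(suffix_list_url)  # file:///absolute/path/to/your/local/suffix/list/file
```

The resulting string can then be placed in the suffix_list_urls list. Note the three slashes: two belong to the scheme separator and the third starts the absolute path.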

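To keep tldextract light in lines of code and overhead, and because there are plenty of URL validators out there, this library is very lenient about its input. If valid URLs matter to you, validate them before calling tldextract. This lenient stance lowers the learning curve of the library, at the cost of desensitizing users to the nuances of URLs. Who knows by how much. In the future I would consider an overhaul; for example, users could opt in to validation and receive exceptions or error metadata on results.

tldextract on GitHub: https://github.com/john-kurkowski/tldextract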
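As the FAQ answer above suggests, validation is left to the caller. The following is a minimal sketch of one possible pre-check using only the standard library; `looks_like_http_url` is a hypothetical helper, and its rule (an http or https scheme plus a non-empty hostname) is an assumption about what "valid" means for your application, not a full URL validator.

```python
from urllib.parse import urlsplit


def looks_like_http_url(url):
    """Hypothetical pre-check: require an http(s) scheme and a hostname."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. an unclosed IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


candidates = ["http://forums.bbc.co.uk/", "google.notavalidsuffix", "http://"]
valid = [url for url in candidates if looks_like_http_url(url)]
print(valid)  # ['http://forums.bbc.co.uk/']
```

Only the URLs that pass the check would then be handed to tldextract.extract; stricter or looser rules can be swapped in without touching the extraction step.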