A look at the youtube-dl source code, using a web page download as the example

1. Getting it running

Download youtube-dl and pair it with a VS Code launch.json:

# In this article, himala is a stand-in name; see the GitHub repo mentioned at the end for details

"configurations": [
    {
        "name": "audio",
        "type": "python",
        "request": "launch",
        "program": "${workspaceFolder}/youtube_dl",
        "console": "integratedTerminal",
        "args": ["-F", "http://www.himala.com/61425525/sound/47740352/"]
    }
]

Then run __main__.py directly.

To make that work, in __main__.py replace

if __package__ is None and not hasattr(sys, 'frozen'):
    import os.path
    path = os.path.realpath(os.path.abspath(__file__))
    sys.path.insert(0, os.path.dirname(os.path.dirname(path)))

with the unguarded version, so that the package's parent directory is always added to sys.path:

import os.path
path = os.path.realpath(os.path.abspath(__file__))
sys.path.insert(0, os.path.dirname(os.path.dirname(path)))

2. Outline of the flow

2.1 Program entry point: read the command-line arguments and act on them

In __main__.py:

if __name__ == '__main__':
    youtube_dl.main()

In __init__.py, control passes to main():

def main(argv=None):
    try:
        _real_main(argv)
    except DownloadError:
        sys.exit(1)
    # ...

Still in __init__.py, _real_main() takes over:

def _real_main(argv=None):
    # The options are parsed and configured in here
    # ...
        try:
            if opts.load_info_filename is not None:
                retcode = ydl.download_with_info_file(expand_path(opts.load_info_filename))
            else:
                retcode = ydl.download(all_urls)
        except MaxDownloadsReached:
    # ...
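The entry-point chain above can be modeled in a few lines. This is only a minimal sketch, not youtube-dl's actual code: `DownloadError` is a local stand-in class, the option parser knows nothing but `-F`, the download step is replaced by a `print`, and `main()` returns the exit code instead of calling `sys.exit()` so that it is easy to try out.

```python
import optparse
import sys


class DownloadError(Exception):
    """Local stand-in for youtube-dl's DownloadError."""


def _real_main(argv):
    # Parse the command line; only -F is modeled here.
    parser = optparse.OptionParser(usage='%prog [OPTIONS] URL [URL...]')
    parser.add_option('-F', '--list-formats', action='store_true',
                      dest='listformats', default=False)
    opts, all_urls = parser.parse_args(argv)
    if not all_urls:
        raise DownloadError('no URL given')
    # Assumption: the real code builds a YoutubeDL object here and
    # calls ydl.download(all_urls); the sketch just reports what it got.
    print('list formats:', opts.listformats, 'urls:', all_urls)
    return 0


def main(argv=None):
    """Thin wrapper that turns a DownloadError into exit code 1."""
    try:
        return _real_main(sys.argv[1:] if argv is None else argv)
    except DownloadError:
        return 1
```

A script would end with `sys.exit(main())`. The point of the two-layer split is that all real work lives in `_real_main()`, while `main()` only decides how a failure is reported to the shell.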

2.2 In YoutubeDL.py, take the URLs and download the audio/video

class YoutubeDL(object):
    def download(self, url_list):
        # ...
        for url in url_list:
            try:
                # It also downloads the videos
                res = self.extract_info(
                    url, force_generic_extractor=self.params.get('force_generic_extractor', False))
            except UnavailableVideoError:
        # ...

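extract_info() does more than pull out the metadata: it also downloads the web page and the audio/video.

Everything gets done inside this one function.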

    def extract_info(self, url, download=True, ie_key=None, extra_info={},
                     process=True, force_generic_extractor=False):
        if not ie_key and force_generic_extractor:
            ie_key = 'Generic'
        if ie_key:
            ies = [self.get_info_extractor(ie_key)]
        else:
            ies = self._ies
        for ie in ies:
            if not ie.suitable(url):
                continue
            ie = self.get_info_extractor(ie.ie_key())
            # ...
            try:
                ie_result = ie.extract(url)
            # ...

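The ie in the code above stands for Info Extractor.

youtube-dl can handle many websites, and each website has its own Info Extractor file.

Inside youtube-dl, audio and video are both referred to as "video".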

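3. Finding the IE

Given a URL, how does youtube-dl find the corresponding IE?

By regex matching.

Regexes are what make youtube-dl's site support extensible.

3.1 Initialization of `self._ies` from the code above

3.1.1 Populating `self._ies`

In YoutubeDL.py,

the entry point where self._ies gets initialized: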

class YoutubeDL(object):
    def __init__(self, params=None, auto_init=True):
        # ...
        if auto_init:
            self.print_debug_header()
            self.add_default_info_extractors()
        # ...

Everything returned by gen_extractor_classes()

is appended to self._ies:

    def add_default_info_extractors(self):
        """
        Add the InfoExtractors returned by gen_extractors to the end of the list
        """
        for ie in gen_extractor_classes():
            self.add_info_extractor(ie)

    def add_info_extractor(self, ie):
        """Add an InfoExtractor object to the end of the list."""
        self._ies.append(ie)
        # ...

3.1.2 What gets added to `self._ies`

In extractor/__init__.py,

_ALL_CLASSES collects every class whose name ends in IE and that is imported in extractors.py under the extractor folder:



# ...
except ImportError:
    _LAZY_LOADER = False
    from .extractors import *
    _ALL_CLASSES = [
        klass
        for name, klass in globals().items()
        if name.endswith('IE') and name != 'GenericIE'
    ]
    _ALL_CLASSES.append(GenericIE)

def gen_extractor_classes():
    return _ALL_CLASSES

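The order of the _ALL_CLASSES list matters: the first IE whose regex matches the URL

is the one that gets used. GenericIE is appended last, so it only serves as the fallback.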
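To see why the order matters, here is a small sketch of the same name filter plus a first-match lookup. It is an illustration under assumptions, not the library code: `BaseExtractor`, `collect_extractors`, `first_suitable` and the `namespace` dict are made-up names, and the explicit dict stands in for the module globals that `from .extractors import *` fills in.

```python
import re


class BaseExtractor:
    """Toy base class; only the URL test is modeled."""
    _VALID_URL = r'(?!)'  # never matches; subclasses override it

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None


class HimalaIE(BaseExtractor):
    _VALID_URL = r'https?://(?:www\.|m\.)?himala\.com/(?P<uid>[0-9]+)/sound/(?P<id>[0-9]+)'


class GenericIE(BaseExtractor):
    _VALID_URL = r'.*'  # accepts any URL, so it has to be tried last


def collect_extractors(namespace):
    """Apply the name filter from the excerpt to a name -> object mapping."""
    classes = [
        klass
        for name, klass in namespace.items()
        if name.endswith('IE') and name != 'GenericIE'
    ]
    classes.append(namespace['GenericIE'])  # fallback always goes last
    return classes


def first_suitable(classes, url):
    """Return the first class whose pattern matches, like the loop in extract_info."""
    for klass in classes:
        if klass.suitable(url):
            return klass
    return None


# Hypothetical stand-in for the module globals; GenericIE is listed first on purpose.
namespace = {'GenericIE': GenericIE, 'HimalaIE': HimalaIE, 're': re}
ordered = collect_extractors(namespace)

url = 'http://www.himala.com/61425525/sound/47740352/'
print([klass.__name__ for klass in ordered])
print(first_suitable(ordered, url).__name__)
# With the catch-all in front, the specific extractor is never reached:
print(first_suitable([GenericIE, HimalaIE], url).__name__)
```

With the filtered order the specific extractor wins; with the catch-all in front, every URL lands on GenericIE.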

3.1.3 Adding a site

To support a new site, create an IE file for it,

then add the import to extractors.py like this:

from .youtube import (
    YoutubeIE,
    YoutubeChannelIE,
    # ...
)

3.2 Finding the matching IE

As mentioned above, in YoutubeDL.py:

    def extract_info(self, url, download=True, ie_key=None, extra_info={},
                     process=True, force_generic_extractor=False):
        # ...
        for ie in ies:
            if not ie.suitable(url):
                continue
            ie = self.get_info_extractor(ie.ie_key())
        # ...

Every ie has a class method, def suitable(cls, url).

Each site's ie inherits from

class InfoExtractor(object) in common.py:

class InfoExtractor(object):
    @classmethod
    def suitable(cls, url):
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        return cls._VALID_URL_RE.match(url) is not None

If a site's ie does not implement its own suitable,

the suitable from the InfoExtractor class is used.
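One plausible reason for testing `cls.__dict__` rather than using `hasattr()` is that the compiled pattern is then cached per class, so a subclass never picks up the pattern its parent already compiled. The sketch below tries that out; `InfoExtractorSketch`, `SiteAIE`, `SiteBIE` and the example.com-style URLs are made up for the illustration.

```python
import re


class InfoExtractorSketch:
    """Toy copy of the inherited suitable() shown above."""
    _VALID_URL = r'https?://base\.example/.+'

    @classmethod
    def suitable(cls, url):
        # cls.__dict__ only contains attributes set on this exact class,
        # so an inherited cache entry does not count.
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        return cls._VALID_URL_RE.match(url) is not None


class SiteAIE(InfoExtractorSketch):
    _VALID_URL = r'https?://a\.example/(?P<id>[0-9]+)'


class SiteBIE(SiteAIE):
    # Inherits suitable() and, through SiteAIE, a possibly cached pattern.
    _VALID_URL = r'https?://b\.example/(?P<id>[0-9]+)'


# Calling the parent first caches SiteAIE's compiled pattern on SiteAIE.
print(SiteAIE.suitable('https://a.example/1'))
# SiteBIE still compiles its own pattern instead of reusing the parent's.
print(SiteBIE.suitable('https://b.example/7'))
print(SiteBIE.suitable('https://a.example/1'))
```

Had the check been `hasattr(cls, '_VALID_URL_RE')`, the second call would have found SiteAIE's cached pattern through inheritance and rejected the b.example URL.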

A site's IE looks like this:

class XimalayaIE(XimalayaBaseIE):
    # In this article, himala is a stand-in name; see the GitHub repo mentioned at the end for details
    IE_NAME = 'himala'
    IE_DESC = 'himala 网站'
    _VALID_URL = r'https?://(?:www\.|m\.)?himala\.com/(?P<uid>[0-9]+)/sound/(?P<id>[0-9]+)'

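The InfoExtractor class checks cls.__dict__ for an already compiled pattern, compiles the _VALID_URL we configured if there is none,

and matches the URL against it to recognize the site.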
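The named groups in that _VALID_URL are also what let an extractor pull IDs out of the URL. A minimal sketch with the standard `re` module, using the URL from the launch.json above; `parse_sound_url` is a hypothetical helper name, not something youtube-dl defines.

```python
import re

# Pattern copied from the IE above (himala is the article's stand-in name).
_VALID_URL = r'https?://(?:www\.|m\.)?himala\.com/(?P<uid>[0-9]+)/sound/(?P<id>[0-9]+)'


def parse_sound_url(url):
    """Return the uid and id captured from a sound URL, or None if it does not match."""
    mobj = re.match(_VALID_URL, url)
    if mobj is None:
        return None
    return {'uid': mobj.group('uid'), 'id': mobj.group('id')}


print(parse_sound_url('http://www.himala.com/61425525/sound/47740352/'))
print(parse_sound_url('https://m.himala.com/1/sound/2'))
print(parse_sound_url('http://www.himala.com/61425525/album/1'))
```

Because `re.match` only anchors at the start of the string, the trailing slash in the first URL does not get in the way.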

4. Extracting the page information

Back in the code above, in YoutubeDL.py, control enters the ie to do the work:

    def extract_info(self, url, download=True, ie_key=None, extra_info={},
                     process=True, force_generic_extractor=False):
        # ...
            try:
                ie_result = ie.extract(url)
            # ...

It first lands in the InfoExtractor class in common.py:

    def extract(self, url):
        """Extracts URL information and returns it in list of dicts."""
        try:
            for _ in range(2):
                try:
                    self.initialize()
                    ie_result = self._real_extract(url)
        # ...

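Then it enters the class that does the actual work, in himala.py,

which downloads the web page and extracts the information with regexes:

# In this article, himala is a stand-in name; see the GitHub repo mentioned at the end for details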

class HimalaIE(InfoExtractor):
    def _real_extract(self, url):
        # ...
        webpage = self._download_webpage(url, audio_id,
                                         note='Download sound page for %s' % audio_id,
                                         errnote='Unable to get sound page')
        # ...
        if is_m:
            audio_description = self._html_search_regex(r'(?s)<section\s+class=["\']content[^>]+>(.+?)</section>',
                                                        webpage, 'audio_description', fatal=False)
        else:
            audio_description = self._html_search_regex(r'(?s)<div\s+class=["\']rich_intro[^>]*>(.+?</article>)',
                                                        webpage, 'audio_description', fatal=False)
        # ...
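The "download the page, then regex out the fields" step can be tried without any network access. The sketch below is a simplified stand-in, not the library helper: `html_search_regex` here is a hypothetical function that only mimics the search, the `fatal` switch and a rough tag clean-up, and the page fragment is made up to fit the desktop pattern.

```python
import re
from html import unescape


def html_search_regex(pattern, webpage, name, fatal=True):
    """Simplified stand-in: search, strip tags, unescape entities."""
    mobj = re.search(pattern, webpage)
    if mobj is None:
        if fatal:
            raise ValueError('Unable to extract %s' % name)
        return None  # fatal=False: a missing field is tolerated
    text = re.sub(r'<[^>]+>', '', mobj.group(1))  # drop nested tags
    return unescape(text).strip()


# Made-up page fragment, shaped only like what the desktop pattern expects.
webpage = '''
<div class="rich_intro" id="x">
  <article><p>Episode 12 &amp; notes</p></article>
</div>
'''

# Both patterns are copied from the excerpt above.
desktop = r'(?s)<div\s+class=["\']rich_intro[^>]*>(.+?</article>)'
mobile = r'(?s)<section\s+class=["\']content[^>]+>(.+?)</section>'

print(html_search_regex(desktop, webpage, 'audio_description', fatal=False))
print(html_search_regex(mobile, webpage, 'audio_description', fatal=False))
```

Passing `fatal=False`, as the extractor does for the description, means a page without that block yields `None` instead of aborting the whole extraction.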

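5. Application

On the Himala site a single host can have a large number of tracks,

yet there is no way to search within that host's tracks.

With a small extension to youtube-dl, it is possible to find out

how many "fallout" episodes the host has, and which page each one is on.

The code is simple; see

the GitHub repo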
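The repo's code is not reproduced here, so the following is only a guess at the shape of such a search, under the assumption that the track titles of each listing page have already been extracted. `find_episodes` and the sample `pages` data are hypothetical.

```python
def find_episodes(pages, keyword):
    """Return (page_number, title) pairs whose title contains keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (page_no, title)
        for page_no, titles in enumerate(pages, start=1)
        for title in titles
        if needle in title.lower()
    ]


# Made-up listing: one inner list of track titles per page of the host's profile.
pages = [
    ['Intro', 'Fallout 1: the vault'],
    ['Some other show'],
    ['fallout 2: wasteland', 'Q&A'],
]

hits = find_episodes(pages, 'fallout')
print(len(hits), 'episodes found')
for page_no, title in hits:
    print('page', page_no, '-', title)
```

In a real extension the `pages` list would come from an extractor walking the host's paginated track list, one `_download_webpage` call per page.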

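Author: 邓轻舟
Link: https://juejin.im/post/6889989103958360077
Source: Juejin
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.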