我拿这个站点作为案例:https://91mjw.com/  其他站点方法都是差不多的。

第一步:获得整站所有的视频连接

html = requests.get("https://91mjw.com",headers=gHeads).text
xmlcontent = etree.HTML(html)
UrlList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/a/@href")
NameList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/h2/a/text()")

  

第二步 :是进入选择的电影的页面 去获得视频的链接

UrlList = xmlContent.xpath("//div[@id='video_list_li']/a/@href")

第三步 构造下载视频用到的参数

第四步 下载视频 保存到本地

直接上实现代码  
   使用的多线程 加信号量实现  默认开启5条线程开始操作 每条线程去下载一套视频  是一套 一套 一套     
    也可以自己去修改同时开启几条线程
    实现代码

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import re
import requests
from threading import *
from bs4 import BeautifulSoup
from lxml import etree
from contextlib import closing
 
nMaxThread = 5
connectlock = BoundedSemaphore(nMaxThread)
gHeads = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
 
class MovieThread(Thread):
    def __init__(self,url,movieName):
        Thread.__init__(self)
        self.url = url
        self.movieName = movieName
 
    def run(self):
        try:
            urlList = self.GetMovieUrl(self.url)
            for i in range(len(urlList)):
                type,vkey = self.GetVkeyParam(self.url,urlList[i])
                if type != None and vkey !=None:
                    payload,DownloadUrl = self.GetOtherParam(self.url,urlList[i],type,vkey)
                    if DownloadUrl :
                        videoUrl = self.GetDownloadUrl(payload,DownloadUrl)
                        if videoUrl :
                            self.DownloadVideo(videoUrl,self.movieName,i+1)
        finally:
            connectlock.release()
 
    def GetMovieUrl(self,url):
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host":"91mjw.com",
            "Referer":"https://91mjw.com/"
        }
        html = requests.get(url,headers=heads).text
        xmlContent = etree.HTML(html)
        UrlList = xmlContent.xpath("//div[@id='video_list_li']/a/@href")
        if  len(UrlList) > 0:
            return UrlList
        else:
            return None
 
    def GetVkeyParam(self,firstUrl,secUrl):
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host": "91mjw.com",
            "Referer": firstUrl
        }
        try :
            html = requests.get(firstUrl+secUrl,headers=heads).text
            bs = BeautifulSoup(html,"html.parser")
            content = bs.find("body").find("script")
            reContent = re.findall('"(.*?)"',content.text)
            return reContent[0],reContent[1]
        except:
            return None,None
 
    def GetOtherParam(self,firstUrl,SecUrl,type,vKey):
        url = "https://api.1suplayer.me/player/?userID=&type=%s&vkey=%s"%(type,vKey)
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host": "api.1suplayer.me",
            "Referer": firstUrl+SecUrl
        }
        try:
            html = requests.get(url,headers=heads).text
            bs = BeautifulSoup(html,"html.parser")
            content = bs.find("body").find("script").text
            recontent = re.findall(" = '(.+?)'",content)
            payload = {
                    "type":recontent[3],
                    "vkey":recontent[4],
                    "ckey":recontent[2],
                    "userID":"",
                    "userIP":recontent[0],
                    "refres":1,
                    "my_url":recontent[1]
                }
            return payload,url
        except:
            return None,None
 
    def GetDownloadUrl(self,payload,refereUrl):
        heads = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host": "api.1suplayer.me",
            "Referer": refereUrl,
            "Origin": "https://api.1suplayer.me",
            "X-Requested-With": "XMLHttpRequest"
        }
        while True:
            retData = requests.post("https://api.1suplayer.me/player/api.php",data=payload,headers=heads).json()
            if  retData["code"] == 200:
                return retData["url"]
            elif retData["code"] == 404:
                payload["refres"] += 1;
                continue
            else:
                return None
 
    def DownloadVideo(self,url,videoName,videoNum):
        CurrentSize = 0
        heads = {
            "chrome-proxy":"frfr",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Host":"sh-yun-ftn.weiyun.com",
            "Range":"bytes=0-"
        }
        with closing(requests.get(url,headers=heads)) as response:
            retSize = int(response.headers['Content-Length'])
            chunkSize = 10240
            if response.status_code == 206:
                print '[File Size]: %0.2f MB\n' % (retSize/1024/1024)
                with open("./video/%s/%02d.mp4"%(videoName,videoNum),"wb") as f:
                    for data in response.iter_content(chunk_size=chunkSize):
                        f.write(data)
                        CurrentSize += len(data)
                        f.flush()
                        print '[Progress]: %0.2f%%' % float(CurrentSize*100/retSize) + '\r'
 
def main():
    html = requests.get("https://91mjw.com",headers=gHeads).text
    xmlcontent = etree.HTML(html)
    UrlList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/a/@href")
    NameList = xmlcontent.xpath("//div[@class='m-movies clearfix']/article/h2/a/text()")
    for i in range(len(UrlList)):
        connectlock.acquire()
        url = UrlList[i]
        name = NameList[i].encode("utf-8")
        t = MovieThread(url,name)
        t.start()
 
if __name__ == '__main__':
    main()

  

用python实现多线程爬取影视网站全部视频方法【笔记】的更多相关文章

  1. from appium import webdriver 使用python爬虫,批量爬取抖音app视频(requests+Fiddler+appium)

    使用python爬虫,批量爬取抖音app视频(requests+Fiddler+appium) - 北平吴彦祖 - 博客园 https://www.cnblogs.com/stevenshushu/p ...

  2. python之简单爬取一个网站信息

    requests库是一个简介且简单的处理HTTP请求的第三方库 get()是获取网页最常用的方式,其基本使用方式如下 使用requests库获取HTML页面并将其转换成字符串后,需要进一步解析HTML ...

  3. 初次尝试python爬虫,爬取小说网站的小说。

    本次是小阿鹏,第一次通过python爬虫去爬一个小说网站的小说. 下面直接上菜. 1.首先我需要导入相应的包,这里我采用了第三方模块的架包,requests.requests是python实现的简单易 ...

  4. 用Python爬取影视网站,直接解析播放地址。

    记录时刻! 写这个爬虫主要是想让自己的爬虫实用,把脚本放到了服务器,成为可随时调用的接口. 思路算是没思路吧!把影视名带上去请求影视网站,然后解析出我们需要的播放地址. 我也把自己的接口分享出来.接口 ...

  5. Python多线程爬取某网站表情包

    # 爬取网络图片import requestsfrom lxml import etreefrom urllib import requestfrom queue import Queue # 导入队 ...

  6. python协程爬取某网站的老赖数据

    import re import json import aiohttp import asyncio import time import pymysql from asyncio.locks im ...

  7. 使用python爬虫,批量爬取抖音app视频(requests+Fiddler+appium)

    抖音很火,楼主使用python随机爬取抖音视频,并且无水印下载,人家都说天下没有爬不到的数据,so,楼主决定试试水,纯属技术爱好,分享给大家.. 1.楼主首先使用Fiddler4来抓取手机抖音app这 ...

  8. Python爬虫一爬取B站小视频源码

    如果要爬取多页的话 在最下方循环中 填写好循环的次数就可以了 项目源码 from fake_useragent import UserAgent import requests import time ...

  9. Python多进程方式抓取基金网站内容的方法分析

    因为进程也不是越多越好,我们计划分3个进程执行.意思就是 :把总共要抓取的28页分成三部分. 怎么分呢? # 初始range r = range(1,29) # 步长 step = 10 myList ...

随机推荐

  1. 最新 学霸君java校招面经 (含整理过的面试题大全)

    从6月到10月,经过4个月努力和坚持,自己有幸拿到了网易雷火.京东.去哪儿.学霸君等10家互联网公司的校招Offer,因为某些自身原因最终选择了学霸君.6.7月主要是做系统复习.项目复盘.LeetCo ...

  2. Android Capabilities讲解

    1.Capabilities介绍 可以看下之前代码里面设置的capabilities DesiredCapabilities capabilities =newDesiredCapabilities( ...

  3. IP通信学习心得01

    一.物理拓扑 1. 1) 总线拓扑 特点:所有设备都处于同一个冲突域与广播域,共享相同的带宽 一次只能有一个设备传输,且两端要安装端接器. 传输介质:同轴电缆.(注:10Base5:容量10M 传输5 ...

  4. 第I位是0/1

    int a; scanf("%d",&a); ;i<;i++) { ;//从右往左第i位是x ,i==0,就是第一位 printf("%d ",x ...

  5. JVM之java并发 ——线程安全与锁优化

    概述 人们很难想象现实中的对象在一项工作进行期间,会被不停地中断和切换,对象的属性(数据)可能会在中断期间被修改和变“脏”,而这些事情在计算机世界中则是很正常的事情.有时候,良好的设计原则不得不向现实 ...

  6. 嵌入式02 STM32 实验02 端口输入输出各4种模式

    GPIO(General-purpose input/output 通用目的输入/输出端口) 电压(A模拟量)与电平(D数字量) GPIO 8种工作模式(输入四种.输出四种) 1.GPIO_Mode_ ...

  7. laravel使用辅助函数url()引入js和css静态文件

    使用laravel框架时可以将静态文件如,js文件,css文件,放到resources文件夹下的js下,当然也可以放到public文件夹下的js文件夹下,publi文件夹下默认情况下是没有css,js ...

  8. nginx 简单介绍

    Nginx同Apache一样都是一种WEB服务器.基于REST架构风格,以统一资源描述符(Uniform Resources Identifier)URI或者统一资源定位符(Uniform Resou ...

  9. spring源码解析前瞻

    很多人有疑问:为什么要读源码?读源码有什么用?我也一直问自己这些问题,读源码非常枯燥,工作中又用不到,慢慢的自己读源码越发现自己知识的不足,无法把知识串起来,形成知识体系.从单系统中常用的Spring ...

  10. jmeter用什么查看结果报告

    JMeter查看测试结果的方法很多,最常用的几种是:察看结果树.聚合报告.图形报表.邮件观察仪等.