A Serious Python Tumblr Spider

Reposted from: https://github.com/facert/tumblr_spider

install

pip install -r requirements.txt

run

python tumblr.py username (where username is the username of any popular blogger)

snapshot

crawl results

user.txt holds the usernames of the bloggers that were crawled; source.txt holds the collected video URLs.
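Both files are opened in append mode, so repeated runs can write the same entry more than once. Below is a minimal sketch for reading a result file back without duplicates. `unique_lines` is a hypothetical helper and is not part of the original script.

```python
import os


def unique_lines(path):
    """Return the non-empty lines of a text file, de-duplicated in first-seen order."""
    seen = set()
    result = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            item = line.strip()
            if item and item not in seen:
                seen.add(item)
                result.append(item)
    return result


if __name__ == '__main__':
    # Report how many distinct entries each result file holds, if it exists.
    for name in ('user.txt', 'source.txt'):
        if os.path.exists(name):
            print(name, len(unique_lines(name)))
```

Preserving first-seen order keeps the output close to the order in which the spider discovered each entry.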

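how it works

Starting from the username of one popular blogger, the script automatically collects the usernames of the other bloggers whose posts that blogger has reblogged, puts them into the crawl queue, and crawls them recursively.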
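The queue-driven traversal described here can be sketched without any network access. This is an illustration only: `crawl`, `get_reblog_sources`, and the toy graph are hypothetical stand-ins for fetching and parsing real blog pages, and the sketch is single-threaded where the script uses worker threads.

```python
from collections import deque


def crawl(seed, get_reblog_sources, limit=100):
    """Visit blogs breadth-first from `seed`, following reblog sources.

    `get_reblog_sources` is a hypothetical callable that returns the
    usernames a given blog has reblogged from.
    """
    seen = {seed}            # usernames already queued, to avoid repeats
    pending = deque([seed])  # the crawl queue
    visited = []
    while pending and len(visited) < limit:
        user = pending.popleft()
        visited.append(user)
        for other in get_reblog_sources(user):
            if other not in seen:
                seen.add(other)
                pending.append(other)
    return visited


# Toy reblog graph with made-up usernames, standing in for real pages.
graph = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['d'], 'd': []}
print(crawl('a', lambda user: graph.get(user, [])))  # ['a', 'b', 'c', 'd']
```

The `seen` set is what stops the crawl from looping when two bloggers reblog each other, and `limit` is an added safeguard, since the set of reachable blogs can be very large.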

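disclaimer

This is a serious spider (straight face). What it crawls depends heavily on the first username you enter. Also, for certain reasons Tumblr is blocked by the Great Firewall, so the simplest approach is to run the script on an overseas VPS.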

# -*- coding:utf-8 -*-
import queue
import signal
import sys
import threading
import time

import requests
from bs4 import BeautifulSoup

# Serializes writes to user.txt and source.txt across worker threads.
mutex = threading.Lock()
# Set by the signal handler to ask the workers to stop.
is_exit = False


class Tumblr(threading.Thread):

    def __init__(self, user_queue):
        threading.Thread.__init__(self)
        self.user_queue = user_queue
        # These lists are per thread, so de-duplication is per worker.
        self.total_user = []
        self.total_url = []
        self.f_user = open('user.txt', 'a+')
        self.f_source = open('source.txt', 'a+')

    def download(self, url):
        # A timeout keeps a stalled connection from hanging the worker.
        res = requests.get(url, timeout=10)
        # Name the parser explicitly; bs4 warns when it has to guess.
        soup = BeautifulSoup(res.text, 'html.parser')

        # Collect embedded Tumblr video players this worker has not seen yet.
        tmp_source = []
        for i in soup.find_all('iframe'):
            source = i.get('src', '').strip()
            if (source and 'https://www.tumblr.com/video' in source
                    and source not in self.total_url):
                # The original never recorded the URL, so the check above
                # could not filter duplicates.
                self.total_url.append(source)
                tmp_source.append(source)
                print('New link: ' + source)

        # Queue the bloggers that this page reblogged from.
        tmp_user = []
        for user in soup.find_all(class_='reblog-link'):
            username = user.text.strip()
            if username and username not in self.total_user:
                self.user_queue.put(username)
                self.total_user.append(username)
                tmp_user.append(username)
                print('New user: ' + username)

        # Write results under the lock so lines from different threads
        # do not interleave.
        with mutex:
            if tmp_user:
                self.f_user.write('\n'.join(tmp_user) + '\n')
                self.f_user.flush()
            if tmp_source:
                self.f_source.write('\n'.join(tmp_source) + '\n')
                self.f_source.flush()

    def run(self):
        while not is_exit:
            user = self.user_queue.get()
            url = 'http://%s.tumblr.com/' % user
            try:
                self.download(url)
            except requests.RequestException as e:
                # Skip blogs that fail to load instead of killing the worker.
                print('request failed: %s (%s)' % (url, e))
            time.sleep(2)
        self.f_user.close()
        self.f_source.close()


def handler(signum, frame):
    global is_exit
    is_exit = True
    print("receive a signal %d, is_exit = %d" % (signum, is_exit))
    sys.exit(0)


def main():
    if len(sys.argv) < 2:
        print('usage: python tumblr.py username')
        sys.exit()
    username = sys.argv[1]

    NUM_WORKERS = 10
    q = queue.Queue()
    # Seed the queue with the username given on the command line.
    q.put(username)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

    threads = []
    for i in range(NUM_WORKERS):
        tumblr = Tumblr(q)
        # Daemon threads do not block interpreter exit.
        tumblr.daemon = True
        tumblr.start()
        threads.append(tumblr)

    # Keep the main thread alive so it can receive signals. The original
    # loop's break only left the inner for loop; here the loop ends once
    # every worker has exited.
    while any(t.is_alive() for t in threads):
        time.sleep(1)


if __name__ == '__main__':
    main()
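To see which links `download` keeps without making any network requests, the same iframe filter can be run on an HTML string. This is a sketch under assumptions: it uses the standard library's `html.parser` instead of BeautifulSoup, the sample markup is invented, and `VideoFrameParser` is a hypothetical name.

```python
from html.parser import HTMLParser

VIDEO_PREFIX = 'https://www.tumblr.com/video'


class VideoFrameParser(HTMLParser):
    """Collect the src of every <iframe> whose URL contains VIDEO_PREFIX."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != 'iframe':
            return
        # attrs is a list of (name, value) pairs; value may be None.
        src = (dict(attrs).get('src') or '').strip()
        if VIDEO_PREFIX in src and src not in self.sources:
            self.sources.append(src)


# Invented sample markup: one video player, one unrelated frame, one empty frame.
sample = (
    '<iframe src="https://www.tumblr.com/video/example/1/500/"></iframe>'
    '<iframe src="https://example.com/ad"></iframe>'
    '<iframe></iframe>'
)
parser = VideoFrameParser()
parser.feed(sample)
print(parser.sources)  # ['https://www.tumblr.com/video/example/1/500/']
```

Only the first frame survives the filter; the other two are dropped because their `src` is missing or does not contain the video URL.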
