A Serious Python Tumblr Crawler

Reposted from: https://github.com/facert/tumblr_spider

install

pip install -r requirements.txt

run

python tumblr.py username (username is the username of any popular blogger)

snapshot

Crawl results

user.txt holds the usernames of the bloggers found during the crawl; source.txt holds the collected video URLs.
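
As a minimal sketch of how the two result files might be consumed afterwards, the hypothetical helper `read_unique_lines` below (not part of the original script) loads a file and drops blank lines and duplicates while keeping first-seen order. It assumes only that each file holds one entry per line, which is how the script writes them.

```python
from pathlib import Path


def read_unique_lines(path):
    """Return the non-empty lines of `path`, de-duplicated in first-seen order."""
    p = Path(path)
    if not p.exists():
        # Nothing has been crawled yet, so there is nothing to report.
        return []
    seen = set()
    result = []
    for line in p.read_text(encoding='utf-8').splitlines():
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result


if __name__ == '__main__':
    users = read_unique_lines('user.txt')
    sources = read_unique_lines('source.txt')
    print('%d bloggers, %d video URLs' % (len(users), len(sources)))
```

De-duplicating on read is useful here because the files are opened in append mode, so repeated runs can record the same entry more than once.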

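How it works

Starting from the username of one popular blogger, the script automatically collects the usernames of the other bloggers whose posts that blogger has reblogged, puts them into the crawl queue, and crawls them in turn, so the crawl keeps expanding from blogger to blogger.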
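
The traversal described above can be sketched without any network access. In this illustrative version the hypothetical `crawl` function takes a `get_reblogged` callback (an assumption standing in for the real page fetch and parse) and walks the blogger graph breadth-first, using a `seen` set so that each username is queued only once. The `max_users` cap is also an addition for the sketch; the original script runs until it is interrupted.

```python
from collections import deque


def crawl(seed, get_reblogged, max_users=100):
    """Breadth-first walk over bloggers, starting from `seed`.

    `get_reblogged(username)` must return the usernames that blogger has
    reblogged from. It is a stand-in for fetching and parsing a blog page.
    """
    seen = {seed}
    pending = deque([seed])
    visited = []
    while pending and len(visited) < max_users:
        user = pending.popleft()
        visited.append(user)
        for other in get_reblogged(user):
            if other not in seen:
                seen.add(other)
                pending.append(other)
    return visited


if __name__ == '__main__':
    # Made-up reblog graph used only to exercise the traversal.
    graph = {
        'alice': ['bob', 'carol'],
        'bob': ['alice', 'dave'],
        'carol': ['dave'],
        'dave': [],
    }
    print(crawl('alice', lambda u: graph.get(u, [])))
```

A set is used for `seen` because membership checks stay fast as the number of discovered bloggers grows, whereas a list lookup slows down linearly.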

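Disclaimer

This is a serious crawler (straight face). What it collects depends heavily on the first username you give it. Also, for certain reasons Tumblr is blocked by the Great Firewall, so the simplest approach is to run the script on an overseas VPS.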

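# -*- coding: utf-8 -*-
import queue
import signal
import sys
import threading
import time

import requests
from bs4 import BeautifulSoup

# Guards writes to user.txt and source.txt across worker threads.
mutex = threading.Lock()
# Set by the signal handler to ask the workers to stop.
is_exit = False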

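class Tumblr(threading.Thread):
    # Shared by every worker, so a user or URL already seen by one thread
    # is not recorded again by another.
    total_user = []
    total_url = []

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.user_queue = queue
        # Every worker appends to the same two result files.
        self.f_user = open('user.txt', 'a+')
        self.f_source = open('source.txt', 'a+')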

    def download(self, url):
        try:
            res = requests.get(url, timeout=10)
        except requests.RequestException as e:
            # A failed page should not kill the worker thread.
            print('Request failed: %s (%s)' % (url, e))
            return
        # Name the parser explicitly; bs4 warns when it has to guess.
        soup = BeautifulSoup(res.text, 'html.parser')

        # Collect Tumblr video iframes that have not been seen yet.
        tmp_source = []
        for i in soup.find_all('iframe'):
            source = i.get('src', '').strip()
            if source and 'https://www.tumblr.com/video' in source and source not in self.total_url:
                self.total_url.append(source)
                tmp_source.append(source)
                print('New link: ' + source)

        # Queue every reblogged blogger that has not been seen yet.
        tmp_user = []
        for user in soup.find_all(class_='reblog-link'):
            username = user.text.strip()
            if username and username not in self.total_user:
                self.user_queue.put(username)
                self.total_user.append(username)
                tmp_user.append(username)
                print('New user: ' + username)

        # Write under the lock so lines from different threads do not interleave.
        with mutex:
            if tmp_user:
                self.f_user.write('\n'.join(tmp_user) + '\n')
            if tmp_source:
                self.f_source.write('\n'.join(tmp_source) + '\n')

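    def run(self):
        while not is_exit:
            try:
                # Time out every second so the exit flag is rechecked
                # even when the queue is empty.
                user = self.user_queue.get(timeout=1)
            except queue.Empty:
                continue
            url = 'http://%s.tumblr.com/' % user
            self.download(url)
            time.sleep(2)
        self.f_user.close()
        self.f_source.close()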

def handler(signum, frame):
    global is_exit
    is_exit = True
    print('receive a signal %d, is_exit = %d' % (signum, is_exit))
    sys.exit(0)

def main():
    if len(sys.argv) < 2:
        print('usage: python tumblr.py username')
        sys.exit()
    username = sys.argv[1]

    NUM_WORKERS = 10
    q = queue.Queue()
    # Seed the queue with the username given on the command line.
    q.put(username)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

    threads = []
    for i in range(NUM_WORKERS):
        tumblr = Tumblr(q)
        # setDaemon() is deprecated; set the attribute instead.
        tumblr.daemon = True
        tumblr.start()
        threads.append(tumblr)

    # Keep the main thread alive so it can receive signals; isAlive() was
    # removed in Python 3.9, so use is_alive().
    while any(t.is_alive() for t in threads):
        time.sleep(1)

if __name__ == '__main__':
    main()
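
One design note on the script above: a worker that is blocked in `Queue.get()` cannot notice a stop flag. A common pattern, sketched here as an assumption rather than as the original author's approach, is to pair a `threading.Event` with a short `get` timeout. The names `worker` and `run_workers` and the `handle` callback are hypothetical and stand in for the crawler's own thread class and `download` method.

```python
import queue
import threading


def worker(q, stop, handle):
    """Process items from `q` until `stop` is set."""
    while not stop.is_set():
        try:
            item = q.get(timeout=0.1)  # short timeout so `stop` is rechecked
        except queue.Empty:
            continue
        try:
            handle(item)
        finally:
            # Mark the item done even if `handle` raised, so q.join() returns.
            q.task_done()


def run_workers(items, handle, num_workers=4):
    """Run `handle` over `items` on worker threads, then shut them down."""
    q = queue.Queue()
    for item in items:
        q.put(item)
    stop = threading.Event()
    threads = [threading.Thread(target=worker, args=(q, stop, handle))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()    # wait until every queued item has been handled
    stop.set()  # then ask the workers to exit
    for t in threads:
        t.join()
    return [t.is_alive() for t in threads]
```

In a crawler the queue never drains on its own, so `stop.set()` would be called from the signal handler instead of after `q.join()`; the worker loop stays the same.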
