python crawler

Crawling a blog website: www.apress.com

# -*- coding: utf-8 -*-

"""

Created on Wed May 10 18:01:41 2017

@author: Raghav Bali

"""

"""

This script crawls apress.com's blog page to:

    + extract list of recent blog post titles and their URLs

    + extract content related to each blog post in plain text

using requests and BeautifulSoup packages

``Execute``

        $ python crawl_bs.py

"""

import requests

from time import sleep

from bs4 import BeautifulSoup

def get_post_mapping(content):

    """This function extracts blog post title and url from response object

    Args:

        content (bytes): Raw response body (response.content) returned by requests.get

    Returns:

        list: a list of dictionaries with keys title and url

    """

    post_detail_list = []

    post_soup = BeautifulSoup(content,"lxml")

    h3_content = post_soup.find_all("h3")

    for h3 in h3_content:

        # skip headings that do not wrap a link

        if h3.a is None:

            continue

        post_detail_list.append(

            {'title':h3.a.get_text(),'url':h3.a.attrs.get('href')}

            )

    return post_detail_list

def get_post_content(content):

    """This function extracts blog post content from response object

    Args:

        content (bytes): Raw response body (response.content) returned by requests.get

    Returns:

        str: blog's content in plain text

    """

    plain_text = ""

    text_soup = BeautifulSoup(content,"lxml")

    para_list = text_soup.find_all("div",

                                   {'class':'cms-richtext'})

    # use the first matching container; return "" if the page has none

    if para_list:

        plain_text = para_list[0].get_text()

    return plain_text

if __name__ =='__main__':

    crawl_url = "http://www.apress.com/in/blog/all-blog-posts"

    post_url_prefix = "http://www.apress.com"

    print("Crawling Apress.com for recent blog posts...\n\n")    

    response = requests.get(crawl_url)

    # default to an empty list so the check below is safe if the request fails

    blog_post_details = []

    if response.status_code == 200:

        blog_post_details = get_post_mapping(response.content)

    if blog_post_details:

        print("Blog posts found:{}".format(len(blog_post_details)))

        for post in blog_post_details:

            print("Crawling content for post titled:",post.get('title'))

            post_response = requests.get(post_url_prefix+post.get('url'))

            if post_response.status_code == 200:

                post['content'] = get_post_content(post_response.content)

            print("Waiting for 10 secs before crawling next post...\n\n")

            sleep(10)

        print("Content crawled for all posts")

        # print/write content to file

        for post in blog_post_details:

            print(post)
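
The script above builds each post URL by plain string concatenation and leaves the "print/write content to file" step as a comment. The following is a minimal sketch of how those two steps might be handled, using only the standard library. The helper names absolute_post_url and save_posts, the output file name, and the sample post are assumptions made for illustration; they are not part of the original script.

```python
import json
from urllib.parse import urljoin


def absolute_post_url(base_url, href):
    """Resolve a post link against the site root.

    urljoin handles both relative links ("/in/blog/...") and links that
    are already absolute, which plain string concatenation does not.
    """
    return urljoin(base_url, href)


def save_posts(posts, path):
    """Write the list of post dictionaries to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(posts, fh, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    post_url_prefix = "http://www.apress.com"

    # Hypothetical sample entry, shaped like the output of get_post_mapping
    posts = [{"title": "Sample post", "url": "/in/blog/all-blog-posts/sample"}]

    for post in posts:
        post["url"] = absolute_post_url(post_url_prefix, post["url"])

    # Hypothetical output file name
    save_posts(posts, "apress_posts.json")
    print("Saved {} post(s)".format(len(posts)))
```

In the main loop of the crawler, absolute_post_url(post_url_prefix, post.get('url')) could stand in for post_url_prefix+post.get('url'), and save_posts(blog_post_details, ...) could replace the final print loop.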
