阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href

1.查找以<a>开头的所有文本，然后判断href是否在<a>里面，如果<a>里面有href,就像<a href=" " >,然后提取href的值。

from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")

bsObj = BeautifulSoup(html)

for link in bsObj.findAll("a"):

    if 'href' in link.attrs:

        print(link.attrs['href'])

运行结果：

在网页源代码的定位：

2.提取以 /wiki/开头的文本

from urllib.request import urlopen

from bs4 import BeautifulSoup

import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")

bsObj = BeautifulSoup(html,"lxml")

for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$")):

    if 'href' in link.attrs:

        print(link.attrs['href'])

运行结果：

3.连环着提取不同网页以/wiki开头的文本

from urllib.request import urlopen

from bs4 import BeautifulSoup

import datetime

import random

import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):

    html = urlopen("http://en.wikipedia.org"+articleUrl)

    bsObj = BeautifulSoup(html,"lxml")

    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")

while len(links) > 0:

    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]

    print(newArticle)

    links = getLinks(newArticle)

运行结果：

运行一段时间之后，会报错：远程主机强迫关闭了一个现有的连接，这是网站拒绝程序的连接吗？

阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href的更多相关文章

阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl 1.函数调用它自身,这样就形成了一个循环,一环套一环: from urllib.request ...
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...
首部讲Python爬虫电子书 Web Scraping with Python
首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...
Web Scraping with Python读书笔记及思考
Web Scraping with Python读书笔记标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用一般的数据 ...
<Web Scraping with Python>:Chapter 1 & 2
<Web Scraping with Python> Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsi ...
Web scraping with Python (part II) « Jean, aka Sig(gg)
Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)
python笔记之提取网页中的超链接
python笔记之提取网页中的超链接对于提取网页中的超链接,先把网页内容读取出来,然后用beautifulsoup来解析是比较方便的.但是我发现一个问题,如果直接提取a标签的href,就会包含jav ...
python找出数组中第二大的数
#!usr/bin/env python #encoding:utf-8 ''''' __Author__:沂水寒城功能:找出数组中第2大的数字 ''' def find_Second_large_ ...
通过Web安全工具Burp suite找出网站中的XSS漏洞实战(二)
一.背景笔者6月份在慕课网录制视频教程XSS跨站漏洞加强Web安全,里面需要讲到很多实战案例,在漏洞挖掘案例中分为了手工挖掘.工具挖掘.代码审计三部分内容,手工挖掘篇参考地址为快速找出网站中可能存 ...

随机推荐

HDU 6098 17多校6 Inversion（思维+优化）
Problem Description Give an array A, the index starts from 1.Now we want to know Bi=maxi∤jAj , i≥2. ...
2.14 加载Firefox配置
2.14 加载Firefox配置(略,已在2.1.8讲过,请查阅2.1.8节课) 回到顶部 2.14-1 加载Chrome配置一.加载Chrome配置chrome加载配置方法,只需改下面一个地方,u ...
【c++】函数默认参数
c++ Prime Plus sixth edition page274 如何设置默认值呢?必须通过函数原型. 只有原型指定了默认值.函数定义与没有默认参数时完全相同. 参考 1. http://ww ...
AI之路，第二篇：python数学知识2
第二篇:python数学知识2 线性代数导入相应的模块: >>> import numpy as np (数值处理模块)>>> import scipy ...
【算法】QuickSort
快速排序,时间复杂度O(N*logN),要能熟练掌握! 以下主要参考http://blog.csdn.net/morewindows/article/details/6684558, 感谢原博主! 该 ...
牛客OI赛制测试赛2
A题: https://www.nowcoder.com/acm/contest/185/A 链接:https://www.nowcoder.com/acm/contest/185/A来源:牛客网题 ...
selenium和PhantomJS的使用
利用selenium来进行爬取数据 import time from selenium import webdriver # 创建phantomjs浏览器对象 driver = webdriver.P ...
《DSP using MATLAB》Problem 6.1
今早不知道怎么5点就醒了,起来喝了口水,走到阳台,看看窗外,远处高楼上也有灯亮着,也许已经开始新的一天. 今天开始第6章了,继续努力.
Spring源码学习（总）
前文: ------------------------------------------------------------------------------------------------ ...
CentOS下如何根据Dump文件分析线上问题
https://blog.csdn.net/lixin03080/article/details/79711296 一.保存现场 1.记录系统整体资源使用情况,进程信息.线程信息 top -b -n ...

阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href

阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href的更多相关文章

随机推荐

热门专题