<Web Scraping with Python>

Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing

  • BeautifulSoup

Key:

    P5:

  • urlib or urlib2?

 If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and url lib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib. 

在学习这本书之前,使用过此package(我一开始学习Python就用的是3.x,Mac自带Python2.x),当时出错了,上Stackoverflow找到了答案,现在这本书提到了这点,重新回顾一下。如果你用过 Python 2.x 里的 urllib2 库,可能会发现 urllib2 与 urllib 有些不同。在 Python 3.x 里,urllib2 改名为 urllib,被分成一些子模块:urllib.requesturllib.parse 和 urllib.error。尽管函数名称大多和原来一样,但是在用新的 urllib 库时需要注意哪些函数被移动到子模块里了。

 

    P15:

  • When to get_text() and When to Preserve Tags?

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Call‐ ing .get_text() should always be the last thing you do, immedi‐ ately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible. 

    P16:

  • find() and findAll() with BeautifulSoup?

<Web Scraping with Python>:Chapter 1 & 2的更多相关文章

  1. Web Scraping with Python读书笔记及思考

    Web Scraping with Python读书笔记 标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用 一般的数据 ...

  2. Web scraping with Python (part II) « Jean, aka Sig(gg)

    Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)

  3. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl 1.函数调用它自身,这样就形成了一个循环,一环套一环: from urllib.request ...

  4. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href 1.查找以<a>开头的所有文本,然后判断href是否在<a> ...

  5. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...

  6. 首部讲Python爬虫电子书 Web Scraping with Python

    首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...

  7. 《Web Scraping With Python》Chapter 2的学习笔记

    You Don't Always Need a Hammer When Michelangelo was asked how he could sculpt a work of art as mast ...

  8. Web Scraping with Python

    Python爬虫视频教程零基础小白到scrapy爬虫高手-轻松入门 https://item.taobao.com/item.htm?spm=a1z38n.10677092.0.0.482434a6E ...

  9. Web Scraping using Python Scrapy_BS4 - using BeautifulSoup and Python

    Use BeautifulSoup and Python to scrap a website Lib: urllib Parsing HTML Data Web scraping script fr ...

随机推荐

  1. jQuery 改变Form 指向的 Action

    var path = "shiftCancelAction"; $('#queryForm').attr("action",path).submit();

  2. atom写文档技巧

    1. 段落和标题大纲 标题大纲(类似于HTML的H1, H2, …) 简单得很,一级标题用# 标题, 二级标题用## 标题,三级标题用### 标题,以此类推. 段落(类似HTML的<p>) ...

  3. [Head First Python]2. python of comment

    1- 多行注释 ''' ''' 或 """ """ '''this is the standard way to include a mul ...

  4. git merge 分支

    把master merge到apple_campus1.git stash2.git checkout master3.git pull4.git checkout apple_campus5.git ...

  5. MyEclipse6.5安装SVN插件的三种方法

    MyEclipse6.5安装SVN插件的三种方法 方法一.如果可以上网可在线安装 1. 打开Myeclipse,在菜单栏中选择Help→Software Updates→Find and Instal ...

  6. Win32 API中的user32.dll中的ShowWindow方法参数整理

    在使用ShowWindow方法来设置窗体的状态时,由于不知道参数值,用起来非常容易混乱,所以整理了以下其参数的枚举值,方便以后的的使用.   public class User32API { #reg ...

  7. 关于link, visited, hover, active

    LoVe/HAte 如果只是希望点击的时候显示背景色,那么只需要设置 :active,无需设置:hover #navbar:active, #backbtn:active { background-c ...

  8. Mac Outlook数据文件的位置

    ****/Documents/Microsoft User Data/Office 2011 Identities/Main Identity 在这里 如果是中文版的,在这里: /Users/×××× ...

  9. socket基础(二)

    Microsoft.Net Framework为应用程序访问Internet提供了分层的.可扩展的以及受管辖的网络服务,其名字空间System.Net和System.Net.Sockets包含丰富的类 ...

  10. 经验:Ubuntu 登陆 L2TP VPN

    Ubuntu Linux 操作系统默认支持PPTP协议的VPN登陆,但是随着网络环境的复杂化,我们需要使用L2TP协议的VPN登陆,下面,我们只需要简单的几条命令即可登陆L2TP协议的VPN.     ...