<Web Scraping with Python>

Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing

  • BeautifulSoup

Key:

    P5:

  • urlib or urlib2?

 If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and url lib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib. 

在学习这本书之前,使用过此package(我一开始学习Python就用的是3.x,Mac自带Python2.x),当时出错了,上Stackoverflow找到了答案,现在这本书提到了这点,重新回顾一下。如果你用过 Python 2.x 里的 urllib2 库,可能会发现 urllib2 与 urllib 有些不同。在 Python 3.x 里,urllib2 改名为 urllib,被分成一些子模块:urllib.requesturllib.parse 和 urllib.error。尽管函数名称大多和原来一样,但是在用新的 urllib 库时需要注意哪些函数被移动到子模块里了。

 

    P15:

  • When to get_text() and When to Preserve Tags?

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Call‐ ing .get_text() should always be the last thing you do, immedi‐ ately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible. 

    P16:

  • find() and findAll() with BeautifulSoup?

<Web Scraping with Python>:Chapter 1 & 2的更多相关文章

  1. Web Scraping with Python读书笔记及思考

    Web Scraping with Python读书笔记 标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用 一般的数据 ...

  2. Web scraping with Python (part II) « Jean, aka Sig(gg)

    Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)

  3. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl 1.函数调用它自身,这样就形成了一个循环,一环套一环: from urllib.request ...

  4. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href 1.查找以<a>开头的所有文本,然后判断href是否在<a> ...

  5. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...

  6. 首部讲Python爬虫电子书 Web Scraping with Python

    首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...

  7. 《Web Scraping With Python》Chapter 2的学习笔记

    You Don't Always Need a Hammer When Michelangelo was asked how he could sculpt a work of art as mast ...

  8. Web Scraping with Python

    Python爬虫视频教程零基础小白到scrapy爬虫高手-轻松入门 https://item.taobao.com/item.htm?spm=a1z38n.10677092.0.0.482434a6E ...

  9. Web Scraping using Python Scrapy_BS4 - using BeautifulSoup and Python

    Use BeautifulSoup and Python to scrap a website Lib: urllib Parsing HTML Data Web scraping script fr ...

随机推荐

  1. jquery之stop()的用法

    // 为了看效果,随意写的动画 $('#animater').animate({ 'right':-800 }, 3000).animate({'font-size':'16px'},'normal' ...

  2. javascript:自定义事件初探

    javascript:自定义事件初探   http://www.cnblogs.com/jeffwongishandsome/archive/2008/10/27/1317148.html

  3. jQuery模拟瀑布流布局

    <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title> ...

  4. 关于阿里云ESC上go语言项目编译6l: running gcc failed: Cannot allocate memory

    (1)前段时间将自己的阿里云服务器上的系统由centos 6.5换为了ubuntu 14,其他的硬件配置都没有发生改变,将服务器上的数据恢复并且重新安装了golang的编译环境后,发现使用go bui ...

  5. information_schema.character_sets 学习

    information_schema.character_sets 表用于查看字符集的详细信息 1.character_sets 常用列说明: 1.character_set_name: 字符集名 2 ...

  6. How to use Request js (Node js Module) pools

    Can someone explain how to use the request.js pool hash? The github notes say this about pools: pool ...

  7. 一次oracle大量数据删除经历

    oracle有个数据表现在已经有2500万条数据了,软件用到这个表的数据时就变的特别慢,所以准备把一个月以前的数据全部清除. 我的步骤是(下边操作都是在plsql中运行的) 1.首先 将这个月的数据导 ...

  8. DEDECMS织梦全站动态化访问(包括自由列表freelist)及发布内容时自动动态化设置

    DEDECMS织梦 - 全站已有内容全部设置为动态化访问(包括自由列表freelist),以及发布内容时自动为动态化,设置分为三个步骤: 1.将所有文档设置为“仅动态”:执行以下mysql语句:upd ...

  9. Apache监控

    Apache性能监控 http://www.cnblogs.com/fnng/archive/2012/11/11/2765463.html 要监控apache的性能,我们需要修改配置文件,允许查看a ...

  10. Linux下的命令行上网

    对于网页浏览器现在大多数人用links/elinks,对了,还有个老牌一点的文本浏览器Lynx,links/elinks也是从Lynx中fork出来的. 以上所说的虽然能字符界面来浏览网页,但是不能显 ...