<Web Scraping with Python>:Chapter 1 & 2

<Web Scraping with Python>

Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing

BeautifulSoup

Key：

P5:

urlib or urlib2?

If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and url lib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib.

在学习这本书之前，使用过此package（我一开始学习Python就用的是3.x，Mac自带Python2.x），当时出错了，上Stackoverflow找到了答案，现在这本书提到了这点，重新回顾一下。如果你用过 Python 2.x 里的 urllib2 库，可能会发现 urllib2 与 urllib 有些不同。在 Python 3.x 里，urllib2 改名为 urllib，被分成一些子模块：urllib.request、urllib.parse 和 urllib.error。尽管函数名称大多和原来一样，但是在用新的 urllib 库时需要注意哪些函数被移动到子模块里了。

P15:

When to get_text() and When to Preserve Tags?

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Call‐ ing .get_text() should always be the last thing you do, immedi‐ ately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.

P16:

find() and findAll() with BeautifulSoup?

<Web Scraping with Python>:Chapter 1 & 2的更多相关文章

Web Scraping with Python读书笔记及思考
Web Scraping with Python读书笔记标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用一般的数据 ...
Web scraping with Python (part II) « Jean, aka Sig(gg)
Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl 1.函数调用它自身,这样就形成了一个循环,一环套一环: from urllib.request ...
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href 1.查找以<a>开头的所有文本,然后判断href是否在<a> ...
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...
首部讲Python爬虫电子书 Web Scraping with Python
首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...
《Web Scraping With Python》Chapter 2的学习笔记
You Don't Always Need a Hammer When Michelangelo was asked how he could sculpt a work of art as mast ...
Web Scraping with Python
Python爬虫视频教程零基础小白到scrapy爬虫高手-轻松入门 https://item.taobao.com/item.htm?spm=a1z38n.10677092.0.0.482434a6E ...
Web Scraping using Python Scrapy_BS4 - using BeautifulSoup and Python
Use BeautifulSoup and Python to scrap a website Lib: urllib Parsing HTML Data Web scraping script fr ...

随机推荐

javascript事件捕获与冒泡
对“捕获”和“冒泡”这两个概念,我想我们对冒泡更熟悉一些,因为在我们使用的所有浏览器中,都支持事件冒泡,即事件由子元素向祖先元素传播的,就像气泡从水底向水面上浮一样.而在像firefox,chrom ...
Pet（hdu 4707 BFS）
Pet Time Limit: 4000/2000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)Total Submiss ...
关于set和map的用法
1.set 定义:每个元素最多只出现一次,并且默认的是从小到大排序. set 遍历: 题目http://www.cnblogs.com/ZP-Better/p/4700218.html for(set ...
UGUI Scrollbar控件
如题就是Scrollbar控件,它简单可以看成 Scrollbar 和 Image组件组成它基本上不单独使用多数是制作滚动视图.我们来看看他独特的属性,重复的属性就不在介绍了! 属性讲解: Hand ...
wso2esb源码编译总结
最近花了两周的空闲时间帮朋友把wso2esb的4.0.3.4.6.0.4.7.0三个版本从源码编译出来了.以下是大概的一些体会. wso2esb是基于carbon的.carbon是个基于eclipse ...
Eclipse SVN插件的帐号、password改动
问题描写叙述: Eclipse的SVN插件Subclipse做得非常好,在svn操作方面提供了非常强大丰富的功能.但到眼下为止,该插件对svn用户的概念极为淡薄,不但不能方便地切换用户,并且一旦用户的 ...
如何对应用服务性能问题诊断（Tomcat、Weblogic中间件）
在我们web项目中,我们常见的web应用服务器有Tomcat.Weblogic.WebSphere.它们是互联网应用系统的基础架构软件,也叫“中间件”,负责处理动态在页面请求,并为应用提供了名字.事务 ...
tail
tail用于显示指定文件末尾内容,不指定文件时,作为输入信息进行处理.常用查看日志文件. -f 循环读取 -q 不显示处理信息 -v 显示详细的处理信息 -c<数目> 显示的字节数 -n& ...
ECShop2.7.2详细文件结构及模板结构目录名称
┣plugins目录┣templates目录┃ ┣backup目录┃ ┃ ┣index.htm┃ ┃ ┗ibrary目录┃ ┃ ┗index.htm┃ ┣cac ...
C盘扩容，超详细，史上最简单的扩容技术贴！
http://ideapad.zol.com.cn/55/160_549015.html 很多朋友跟我一样,转到windows 7 64bit后,发现以前所谓的35GB理论不够用了,哪怕你不把任何程序 ...

<Web Scraping with Python>:Chapter 1 & 2

<Web Scraping with Python>:Chapter 1 & 2的更多相关文章

随机推荐

热门专题