<Web Scraping with Python>:Chapter 1 & 2

<Web Scraping with Python>

Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsing

BeautifulSoup

Key：

P5:

urlib or urlib2?

If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and url lib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib.

在学习这本书之前，使用过此package（我一开始学习Python就用的是3.x，Mac自带Python2.x），当时出错了，上Stackoverflow找到了答案，现在这本书提到了这点，重新回顾一下。如果你用过 Python 2.x 里的 urllib2 库，可能会发现 urllib2 与 urllib 有些不同。在 Python 3.x 里，urllib2 改名为 urllib，被分成一些子模块：urllib.request、urllib.parse 和 urllib.error。尽管函数名称大多和原来一样，但是在用新的 urllib 库时需要注意哪些函数被移动到子模块里了。

P15:

When to get_text() and When to Preserve Tags?

.get_text() strips all tags from the document you are working with and returns a string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Call‐ ing .get_text() should always be the last thing you do, immedi‐ ately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.

P16:

find() and findAll() with BeautifulSoup?

<Web Scraping with Python>:Chapter 1 & 2的更多相关文章

Web Scraping with Python读书笔记及思考
Web Scraping with Python读书笔记标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用一般的数据 ...
Web scraping with Python (part II) « Jean, aka Sig(gg)
Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl 1.函数调用它自身,这样就形成了一个循环,一环套一环: from urllib.request ...
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href 1.查找以<a>开头的所有文本,然后判断href是否在<a> ...
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...
首部讲Python爬虫电子书 Web Scraping with Python
首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...
《Web Scraping With Python》Chapter 2的学习笔记
You Don't Always Need a Hammer When Michelangelo was asked how he could sculpt a work of art as mast ...
Web Scraping with Python
Python爬虫视频教程零基础小白到scrapy爬虫高手-轻松入门 https://item.taobao.com/item.htm?spm=a1z38n.10677092.0.0.482434a6E ...
Web Scraping using Python Scrapy_BS4 - using BeautifulSoup and Python
Use BeautifulSoup and Python to scrap a website Lib: urllib Parsing HTML Data Web scraping script fr ...

随机推荐

Response 关于浏览器header的方法
Response.AddHeader Response.AddHeader使用实例 1.文件下载,指定默认名 Response.AddHeader("content-type" ...
zabbix之3触发器/action及模板
1.触发器: {server_name:item_name.func.operator.condition} 一旦condition(条件)触发,则item状态改变触发器之间可以存在依赖关系,即it ...
Eclipse工程乱码解决
eclipse之所以会出现乱码问题是因为eclipse编辑器选择的编码规则是可变的.一般默认都是UTF-8或者GBK,当从外部导入的一个工程时,如果该工程的编码方式与eclipse中设置的编码方式不同 ...
Lost Cows(BIT poj2182)
Lost Cows Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 10609 Accepted: 6797 Descri ...
更有效率的使用 Visual Studio - 快捷键
工欲善其事,必先利其器.虽然说Vim和Emacs是神器,但是对于使用Visual Studio的程序员来说,我们也可以通过一些快捷键和潜在的一些功能实现脱离鼠标写代码,提高工作效率,像使用Vim一样使 ...
C语言IO操作总结
C语言IO操作总结C程序将输入看做字节流,流的来源是文件.输入设备.或者另一程序的输入:C程序将输出也看做字节流:流的目的是文件.视频显示等: 文件处理:1 :fopen("filename ...
logstash 处理nginx 错误日志
2016/08/30 14:52:02 [error] 11325#0: *346 open() "/var/www/zjzc-web-frontEnd/%27%22%2f%3E%3C%2f ...
cf471B MUH and Important Things
B. MUH and Important Things time limit per test 1 second memory limit per test 256 megabytes input s ...
如何提升app开发效率
无论在什么行业,用户永远都是不可替代的“上帝”,一切的服务,开发都得按照用户的意愿来进行.然而在app开发领域中,专业的技术操作却并不像逛街淘货一样清晰可见,更多的需要app开发人员一行行代码敲出来, ...
CDH 2、Cloudera Manager的安装
1.Cloudera Manager • Cloudera Manager是一个管理CDH的端到端的应用. • 作用: – 管理 – 监控 – 诊断 – 集成 • 架构 • Server – 管理控制 ...

<Web Scraping with Python>:Chapter 1 & 2

<Web Scraping with Python>:Chapter 1 & 2的更多相关文章

随机推荐

热门专题