QQ:231469242

欢迎交流

Parsing HTML with the BeautifulSoup Module

Beautiful Soup是用于提取HTML网页信息的模板,BeautifulSoup模板名字是bs4。
 
bs4.BeautifulSoup()函数需要调用时,携带包含HTML的一个字符串。这个字符串将被复制。
bs4.BeautifulSoup()返回一个BeautifulSoup对象。
 
Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The BeautifulSoupmodule’s name is bs4 (for Beautiful Soup, version 4). To install it, you will need to run pip install beautifulsoup4 from the command line. (Check out Appendix A for instructions on installing third-party modules.) While beautifulsoup4 is the name used for installation, to import Beautiful Soup you run import bs4.

For this chapter, the Beautiful Soup examples will parse (that is, analyze and identify the parts of) an HTML file on the hard drive. Open a new file editor window in IDLE, enter the following, and save it as example.html. Alternatively, download it from http://nostarch.com/automatestuff/.

<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body><p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p><p class="slogan">Learn Python the easy way!</p><p>By <span id="author">Al Sweigart</span></p>
</body></html>

As you can see, even a simple HTML file involves many different tags and attributes, and matters quickly get confusing with complex websites. Thankfully, Beautiful Soup makes working with HTML much easier.

Creating a BeautifulSoup Object from HTML

bs4.BeautifulSoup()函数需要调用时,携带包含HTML的一个字符串。这个字符串将被复制。
bs4.BeautifulSoup()返回一个BeautifulSoup对象。
The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns is a BeautifulSoupobject. Enter the following into the interactive shell while your computer is connected to the Internet:

>>> import requests, bs4
>>> res = requests.get('http://nostarch.com')
>>> res.raise_for_status()
>>> noStarchSoup = bs4.BeautifulSoup(res.text) #返回文字属性给bs4.BeautifulSoup函数
>>> type(noStarchSoup)
 
<class 'bs4.BeautifulSoup'>

 

This code uses requests.get() to download the main page from the No Starch Press website and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named noStarchSoup.

You can also load an HTML file from your hard drive by passing a File object tobs4.BeautifulSoup(). Enter the following into the interactive shell (make sure theexample.html file is in the working directory):

>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile)
>>> type(exampleSoup)
 
<class 'bs4.BeautifulSoup'>

Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.

如何创建一个example.html测试文件
打开一个idle文件,复制好下图HTML代码,用example.html文件名报存

bs4_2的更多相关文章

随机推荐

  1. java 计算地球上两点间距离

    /** * 计算地球上任意两点(经纬度)距离 * * @param long1 * 第一点经度 * @param lat1 * 第一点纬度 * @param long2 * 第二点经度 * @para ...

  2. JS模块规范 前端模块管理器

    一:JS模块规范(为了将js文件像java类一样被import和使用而定义为模块, 组织js文件,实现良好的文件层次结构.调用结构) A:CommonJS就是为JS的表现来制定规范,因为js没有模块的 ...

  3. jquery渐渐的显示、隐藏效果

    <!DOCTYPE html> <html> <head> <meta charset="gb2312" /> <title& ...

  4. Java设计模式(一) 策略模式

    策略模式:定义了算法族,分别封装起来,让它们之间可以互相替换,此模式让算法的变化独立于使用算法的客户. 1,定义算法接口 package com.pattern.strategy.test; publ ...

  5. 【BZOJ 2555】SubString

    http://www.lydsy.com/JudgeOnline/problem.php?id=2555 一个字符串在原串中的出现次数就是这个字符串对应后缀自动机上的状态的\(|Right|\),要求 ...

  6. 【NOIP 2015 & SDOI 2016 Round1 & CTSC 2016 & SDOI2016 Round2】游记

    我第一次写游记,,,, 正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪里?正文在哪 ...

  7. 10G整数文件中寻找中位数或者第K大数

    来源:http://hxraid.iteye.com/blog/649831 题目:在一个文件中有 10G 个整数,乱序排列,要求找出中位数.内存限制为 2G.只写出思路即可(内存限制为 2G的意思就 ...

  8. js-函数eval

    eval函数接收一个参数s,如果s不是字符串,则直接返回s.否则执行s语句.如果s语句执行结果是一个值,则返回此值,否则返回undefined. 需要特别注意的是对象声明语法“{}”并不能返回一个值, ...

  9. 【凯子哥带你学Framework】Activity界面显示全解析

    前几天凯子哥写的Framework层的解析文章<Activity启动过程全解析>,反响还不错,这说明“写让大家都能看懂的Framework解析文章”的思想是基本正确的. 我个人觉得,深入分 ...

  10. git 最基本的使用方法

    1. git init    ----初始化git仓库 2.git add .   ----把代码添加到仓库 3.git commit -m '注释'  ---commit:提交 -m:全部提交  ‘ ...