The sixth day of Crawler learning

Scraping a large amount of data from the 我爱竞赛网 competition-listing site

First, get the category link for each type of competition listing.

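import requests
from bs4 import BeautifulSoup

def get_type_url(url):
    # Fetch the landing page and parse it.
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    # Each link in the navigation menu is one competition category.
    categories = soup.select("#mn_P1_menu li a")
    for category in categories:
        print(category.get_text())
        get_num(category.get("href"))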
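The hrefs read from the menu are passed straight to the next request, which only works if the site emits absolute URLs. As a minimal sketch, assuming some hrefs could be relative, they can be resolved against the page URL with the standard library's `urljoin`. The helper name `absolute_links` and the `example.com` URLs are hypothetical and only for illustration.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]

# Hypothetical base URL and hrefs, not taken from the real site.
links = absolute_links(
    "https://example.com/forum/",
    ["list-1.html", "/news/", "https://example.com/other/", None],
)
for link in links:
    print(link)
```

Already-absolute hrefs pass through unchanged, so the helper is safe to apply whether or not the site uses relative links.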

Then get the total number of pages for each category link.

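def get_num(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    num = soup.select(".pg span")
    # Some categories have a single page and no pager, so handle them separately.
    if num:
        # The pager text is split on spaces; the third piece is the page count.
        total = int(num[0].get_text().split(" ")[2])
        # range() excludes its upper bound, so add 1 to reach the last page.
        for page in range(1, total + 1):
            print("Page " + str(page))
            page_url = url + "index.php?page={}".format(page)
            get_message_url(page_url)
    else:
        get_message_url(url)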
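The page-count step can be tried out without any network access. The sketch below assumes the pager label looks something like `" / 20 页"` (inferred from the `split(" ")[2]` call above, not confirmed against the site) and pulls the number out with a regular expression, which is less sensitive to spacing than splitting on spaces. The function names and the `example.com` URL are hypothetical.

```python
import re

def parse_total_pages(label):
    """Return the first integer in the pager label, or 1 if there is none."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 1

def build_page_urls(base_url, total_pages):
    """Build one URL per page, from page 1 through total_pages inclusive."""
    return [
        "{}index.php?page={}".format(base_url, n)
        for n in range(1, total_pages + 1)
    ]

# Hypothetical label and base URL, for illustration only.
total = parse_total_pages(" / 3 页")
for page_url in build_page_urls("https://example.com/contest/", total):
    print(page_url)
```

Falling back to 1 when no number is found covers the categories that have no pager at all.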

Finally, get the details of each competition on every page.

def get_message_url(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    titles = soup.select(".xld .xs2_tit a")
    views = soup.select("span.chakan")
    post_times = soup.select("div.list_info")
    # Pair up the three lists position by position, one record per listing.
    for title, view, post_time in zip(titles, views, post_times):
        data = {
            "title": title.get_text(),
            "views": view.get_text().strip(),
            # Keep only the first space-separated piece of the info line.
            "posted": post_time.get_text().strip().split(" ")[0],
            "link": title.get("href")
        }
        print(data)
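Printing each record is fine for checking the selectors, but for a large crawl it may be more useful to collect the records and write them out. The sketch below is one possible approach rather than part of the original code: it serializes a list of dictionaries to CSV text with the standard library. The helper name `records_to_csv`, the English field names, and the sample record are all hypothetical.

```python
import csv
import io

def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Made-up sample record, for illustration only.
sample = [
    {
        "title": "Example contest",
        "views": "120",
        "posted": "2019-05-01",
        "link": "https://example.com/1",
    },
]
csv_text = records_to_csv(sample, ["title", "views", "posted", "link"])
print(csv_text)
```

To write to disk instead, the same `DictWriter` can be pointed at a file opened with `newline=""` and an explicit encoding such as `utf-8`.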
