Individual Assignment 2-6.4: Scraping Top-Conference Paper Information with Python

1. Individual Assignment 2

Data scraping stage

import requests
from lxml import etree
import pymysql
from urllib.parse import urljoin

BASE_URL = "https://openaccess.thecvf.com/"


def connect():
    # Open a new MySQL connection; save() and save_keywords() close it when done
    return pymysql.connect(host="127.0.0.1", user="root", password="lin0613",
                           database="users")


def getdata(url):
    # Request the CVPR listing page for one conference day
    page_text = requests.get(url).text
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.HTML(page_text, parser=parser)

    # Scrape the links to the individual paper pages
    hrefs = tree.xpath('//dt[@class="ptitle"]/a/@href')
    print(len(hrefs))

    # Scraped paper information
    titles = []
    pdfs = []
    abstracts = []
    authors = []
    keywords = []

    for href in hrefs:
        # Resolve the relative link against the site root
        href = urljoin(BASE_URL, href)
        page_text = requests.get(href).text
        tree_link = etree.HTML(page_text, parser=parser)

        title = tree_link.xpath('/html/body/div/dl/dd/div[@id="papertitle"]/text()')
        pdf = tree_link.xpath('/html/body/div/dl/dd/a[contains(text(),"pdf")]/@href')
        abstract = tree_link.xpath('/html/body/div/dl/dd/div[@id="abstract"]/text()')
        author = tree_link.xpath('/html/body/div/dl/dd/div/b/i/text()')

        # Skip pages where an expected element is missing instead of raising IndexError
        if not (title and pdf and abstract and author):
            print("skipped", href)
            continue

        title[0] = title[0].strip()
        titles += title

        # Build the keyword string from the title words that are not stop words;
        # the colon-free title is also the one stored in the database
        title[0] = title[0].replace(":", "")
        keyword = ""
        for word in title[0].split():
            if checkword(word):
                save_keywords(connect(), word)
                keyword += word + " "
        keywords.append(keyword)

        # Turn the relative PDF link into an absolute URL
        pdf[0] = urljoin(href, pdf[0])
        pdfs += pdf

        abstract[0] = abstract[0].strip()
        abstracts += abstract

        authors += author

        save(connect(), title[0], author[0], abstract[0], href, keyword)

    print(titles)
    print(hrefs)
    print(authors)
    print(abstracts)
    print(pdfs)

def save(db, title, author, abstract, link, keyword):
    # Get a cursor from the connection
    cursor = db.cursor()

    # Parameterized INSERT: the driver escapes the values, so quotes in a
    # title or abstract cannot break the statement
    sql = "INSERT INTO papers(title, authors, abstract_text, original_link, keywords) " \
          "VALUES (%s, %s, %s, %s, %s)"
    try:
        # Execute the statement
        cursor.execute(sql, (title, author, abstract, link, keyword))
        print("true")
        # Commit the transaction
        db.commit()
    except Exception as e:
        print("error saving paper:", e)
        # Roll back on error
        db.rollback()
    finally:
        # Close the database connection
        db.close()

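def save_keywords(db, keyword):
    # Get a cursor from the connection
    cursor = db.cursor()

    # Parameterized INSERT: the driver escapes the keyword
    sql = "INSERT INTO keywords(keyword) VALUES (%s)"
    try:
        # Execute the statement
        cursor.execute(sql, (keyword,))
        print("true")
        # Commit the transaction
        db.commit()
    except Exception as e:
        print("error saving keyword:", e)
        # Roll back on error
        db.rollback()
    finally:
        # Close the database connection
        db.close()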

# Stop words that should not be stored as keywords (all lowercase, because
# checkword() lowercases the word before the lookup)
INVALID_WORDS = {
    "the", "a", "an", "and", "by", "of", "in", "on", "is", "to", "as", "from", "for", "with", "that",
    "have", "upon", "about", "above", "across", "among", "ahead", "after", "although", "at", "also",
    "along", "around", "always", "away", "any", "up", "under", "until", "before", "between", "beyond",
    "behind", "because", "what", "when", "would", "could", "who", "whom", "whose", "which", "where",
    "why", "without", "whether", "down", "during", "despite", "over", "off", "only", "other", "out",
    "than", "then", "through", "throughout", "these", "this", "those", "there", "therefore", "some",
    "such", "since", "so", "can", "many", "much", "more", "may", "might", "must", "ever", "even",
    "every", "each",
}


def checkword(word):
    # True if the word is worth keeping as a keyword
    return word.lower() not in INVALID_WORDS

if __name__ == '__main__':
    #getdata("https://openaccess.thecvf.com/CVPR2018?day=2018-06-20")
    getdata("https://openaccess.thecvf.com/CVPR2018?day=2018-06-21")
    getdata("https://openaccess.thecvf.com/CVPR2019?day=2019-06-18")
    #getdata("https://openaccess.thecvf.com/CVPR2019?day=2019-06-19")
    #getdata("https://openaccess.thecvf.com/CVPR2019?day=2019-06-20")
    getdata("https://openaccess.thecvf.com/CVPR2020?day=2020-06-16")
    #getdata("https://openaccess.thecvf.com/CVPR2020?day=2020-06-17")
    #getdata("https://openaccess.thecvf.com/CVPR2020?day=2020-06-18")
    #getdata("https://openaccess.thecvf.com/CVPR2018?day=2018-06-19")
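The keyword step in getdata can also be read as a small pure function, which makes it easy to try without a network connection or a database. The sketch below is only an illustration: the helper name extract_keywords and the shortened STOP_WORDS set are assumptions, not part of the script above. It strips colons, splits the title on whitespace, and keeps the words that are not stop words.

```python
# Hypothetical helper; the name and the shortened stop-word set are illustrative only.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "for", "with", "from", "to", "by"}


def extract_keywords(title, stop_words=STOP_WORDS):
    """Return the words of a title that are not stop words, in their original order."""
    words = title.replace(":", "").split()
    return [w for w in words if w.lower() not in stop_words]


if __name__ == "__main__":
    # prints ['Learning', 'See', 'Dark']
    print(extract_keywords("Learning to See in the Dark"))
```

Returning a list instead of a space-joined string leaves the caller free to store the words one by one or to join them with " ".join(...).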

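Paper titles and abstracts often contain apostrophes, so it is safer to hand the values to the database driver as query parameters than to format them into the SQL text. The following is a minimal sketch under stated assumptions: it uses the standard-library sqlite3 module with an in-memory database as a stand-in, so that it runs without a MySQL server, and the insert_keyword helper is hypothetical. sqlite3 uses ? as its placeholder; with pymysql the placeholder is %s and the values are passed as the second argument to cursor.execute.

```python
import sqlite3


def insert_keyword(conn, keyword):
    """Insert one keyword using a bound parameter (hypothetical helper)."""
    # The driver binds the value, so quotes inside it need no manual escaping.
    conn.execute("INSERT INTO keywords(keyword) VALUES (?)", (keyword,))
    conn.commit()


def demo():
    # In-memory SQLite database standing in for the MySQL "users" database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords(keyword TEXT)")
    for word in ["Segmentation", "What's", "Dark"]:
        insert_keyword(conn, word)
    rows = [row[0] for row in conn.execute("SELECT keyword FROM keywords ORDER BY rowid")]
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

The word "What's" would terminate the string literal early if it were pasted into a quoted SQL value; as a bound parameter it is stored unchanged.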