Scraping an Automotive Q&A Corpus with Python (for Personal Use)

# coding=utf-8
# Python 3 version. Uses Selenium 4 and PyMongo calls in place of the removed
# find_element_by_xpath, db.authenticate and collection.insert.
import time

from lxml import etree
from pymongo import MongoClient
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Replace "IP" and the empty credentials with real values.
client = MongoClient("IP", 27017, username="", password="", authSource="Automobile")
db = client["Automobile"]
collection = db["wenda_autohome"]

driver = webdriver.Chrome(service=Service(r"D:\chromedriver_win32\chromedriver.exe"))


def splist(l, s):
    """Split list l into consecutive chunks of length s."""
    return [l[i: i + s] for i in range(0, len(l), s)]


for i in range(36726, 40202):
    # Example: url = 'https://wenda.autohome.com.cn/topic/detail/40195'
    url = 'https://wenda.autohome.com.cn/topic/detail/' + str(i)
    time.sleep(1)
    driver.get(url)
    tree = etree.HTML(driver.page_source)

    question = tree.xpath("//h1[@class='card-title']/text()")
    answer_list = tree.xpath("//a[@class='text']/text()")
    # Skip pages that have no question or no answers.
    if not question or not answer_list:
        continue

    for n, j in enumerate(answer_list, start=1):
        # Cut the fixed-length padding around each answer preview.
        answer_list[n - 1] = j[41:-37]
        # Only previews ending in '...' are truncated and need the full text.
        if answer_list[n - 1][-3:] != '...':
            continue
        s = "//div[@class='card-reply-wrap'][" + str(n) + "]//a[@class='more']"
        try:
            # Open the full answer, read it, then return to the question page.
            driver.find_element(By.XPATH, s).click()
            tree_answer = etree.HTML(driver.page_source)
            answer_part = tree_answer.xpath("//div[@class='answer-content']/div/div[@class='ahe__area ahe__block ahe__text']/p/text()")
            answer_list[n - 1] = ''.join(answer_part)
            time.sleep(1)
            driver.get(url)
        except Exception as e:
            print(e)
            continue

    keywords = tree.xpath("//ul[@class='card-tag-list']/li/text()")
    discription_list = tree.xpath("//div[@class='ahe__area ahe__block ahe__text']/p/text()")
    discription = ''.join(discription_list)
    # zancai: praise counts, grouped two per answer.
    zancai = tree.xpath("//span[@class='js-praise-count']/text()")
    zancai_list = splist(zancai, 2)

    # Field names (including 'discription') are kept as in the original collection.
    dc = {
        'keywords': keywords,
        'question': question[0],
        'discription': discription,
        'answer': answer_list,
        'zancai': zancai_list,
        'url': url,
    }
    collection.insert_one(dc)

driver.quit()
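
Two small steps in the script above can be tried without a browser or database: grouping the flat list of praise counts into fixed-size chunks (what `splist` does), and trimming an answer preview before checking whether it ends in `...`. The sketch below is only an illustration. The names `chunk` and `clean_answer` are hypothetical, and it assumes that the padding removed by the hard-coded slice `j[41:-37]` is plain whitespace, which the original does not confirm.

```python
def chunk(items, size):
    """Split items into consecutive sublists of length size (the last may be shorter)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def clean_answer(raw):
    """Strip surrounding whitespace and report whether the preview looks truncated.

    Assumption: the padding around the text is whitespace only.
    """
    text = raw.strip()
    return text, text.endswith("...")


if __name__ == "__main__":
    counts = ["12", "0", "3", "1", "7"]
    print(chunk(counts, 2))  # [['12', '0'], ['3', '1'], ['7']]
    print(clean_answer("\n      preview text...\n    "))  # ('preview text...', True)
```

If the page layout changes, a `strip()`-based cleanup like this keeps working where fixed offsets would silently cut into the text, but it is worth checking against a real page first.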
