Python: Scraping Guazi Used Cars (瓜子二手车)

# coding:utf8

# author:Jery

# datetime:2019/5/1 5:16

# software:PyCharm

# function: scrape used-car listings from Guazi (瓜子二手车)

import requests

from lxml import etree

import re

import csv

start_url = 'https://www.guazi.com/www/buy/o1c-1'

headers = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',

    'Cookie': 'uuid=6032a689-d79a-4060-c8b0-f57d8db4e245; antipas=16I7A500578955101K231sG39E; clueSourceCode=10103000312%2300; user_city_id=49; ganji_uuid=3434204287155953305008; sessionid=405f3fb6-fb90-409d-c7ed-32874a157920; lg=1; cainfo=%7B%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22tbmkbturl%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22-%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22scode%22%3A%2210103000312%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22ca_i%22%3A%22-%22%2C%22ca_b%22%3A%22-%22%2C%22ca_a%22%3A%22-%22%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%226032a689-d79a-4060-c8b0-f57d8db4e245%22%2C%22sessionid%22%3A%22405f3fb6-fb90-409d-c7ed-32874a157920%22%7D; cityDomain=mianyang; _gl_tracker=%7B%22ca_source%22%3A%22-%22%2C%22ca_name%22%3A%22-%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_id%22%3A%22-%22%2C%22ca_s%22%3A%22self%22%2C%22ca_n%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22sid%22%3A20570070983%7D; preTime=%7B%22last%22%3A1556660763%2C%22this%22%3A1556659891%2C%22pre%22%3A1556659891%7D'

}

# Get the detail-page URLs, the current page number, and the next-page link from one listing page

def get_detail_urls(url):

    response = requests.get(url, headers=headers, timeout=10)  # time out instead of hanging on a stalled connection

    text = response.content.decode('utf-8')

    html = etree.HTML(text)

    index = html.xpath('//ul[@class="pageLink clearfix"]/li[@class="link-on"]/a/span/text()')

    next_url = html.xpath('//ul[@class="pageLink clearfix"]/li/a/@href')[-1]

    ul = html.xpath('//ul[@class="carlist clearfix js-top"]')[0]

    lis = ul.xpath('./li')

    urls = []

    for li in lis:

        detail_url = li.xpath('./a/@href')

        detail_url = 'https://www.guazi.com' + detail_url[0]

        urls.append(detail_url)

    return urls, index, next_url

# Parse one detail page into a dict of listing fields

def get_info(url):

    response = requests.get(url, headers=headers, timeout=10)  # time out instead of hanging on a stalled connection

    text = response.content.decode('utf-8')

    html = etree.HTML(text)

    infos_dict = {}

    city = html.xpath('//p[@class="city-curr"]/text()')[0]

    city = re.search(r'[\u4e00-\u9fa5]+', city).group(0)

    infos_dict['city'] = city

    title = html.xpath('//div[@class="product-textbox"]/h2/text()')[0]

    # '\r\n' must not be a raw string here, otherwise it matches a literal backslash sequence
    infos_dict['title'] = title.replace('\r\n', '').strip()

    infos = html.xpath('//div[@class="product-textbox"]/ul/li/span/text()')

    infos_dict['cardtime'] = infos[0]

    infos_dict['kms'] = infos[1]

    if len(infos) == 4:

        infos_dict['cardplace'] = ''

        infos_dict['displacement'] = infos[2]

        infos_dict['speedbox'] = infos[3]

    else:

        infos_dict['cardplace'] = infos[2]

        infos_dict['displacement'] = infos[3]

        infos_dict['speedbox'] = infos[4]

    price = html.xpath('//div[@class="product-textbox"]/div/span[@class="pricestype"]/text()')[0]

    # Escape the dot and make the decimal part optional so single-digit prices also match
    infos_dict['price'] = re.search(r'\d+(?:\.\d+)?', price).group(0)

    return infos_dict

def main():

    csv_path = r"C:\Users\Jery\Desktop\guazi.csv"

    page_url = start_url

    visited = set()

    # utf-8-sig writes a BOM so Excel detects the encoding; newline='' avoids blank rows on Windows

    with open(csv_path, 'w', newline='', encoding='utf-8-sig') as f:

        csvwriter = csv.writer(f, dialect='excel')

        # Write the header row

        csvwriter.writerow(['City', 'Model', 'Registration date', 'Registration place', 'Mileage', 'Displacement', 'Transmission', 'Price'])

        # Stop as soon as the next link leads back to a page already scraped, so the loop cannot run forever

        while page_url not in visited:

            visited.add(page_url)

            urls, index, next_url = get_detail_urls(page_url)

            print("Current page: {}*****************".format(index))

            for url in urls:

                print("Scraping: {}".format(url))

                infos = get_info(url)

                print(infos)

                csvwriter.writerow(

                    [infos['city'], infos['title'], infos['cardtime'], infos['cardplace'], infos['kms'],

                     infos['displacement'],

                     infos['speedbox'],

                     infos['price']])

            if not next_url:

                break

            page_url = 'https://www.guazi.com' + next_url

if __name__ == '__main__':

    main()
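
The field-parsing steps in get_info can be checked without touching the network. The sketch below pulls them out into two small helpers. The names parse_price and map_specs are hypothetical (they are not part of the script above), and the sample strings are invented, since the exact text the site returns is not shown here. One thing it illustrates: a pattern of the form `\d+.?\d+` needs at least two digits, so a single-digit price would not match, while `\d+(?:\.\d+)?` accepts both integers and decimals.

```python
import re


def parse_price(text):
    """Return the first integer or decimal number in text as a string, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return match.group(0) if match else None


def map_specs(infos):
    """Name the entries of the spec list; the registration place may be absent.

    Four entries: date, mileage, displacement, gearbox.
    Five or more: date, mileage, registration place, displacement, gearbox.
    Fewer than four entries raise ValueError from the unpacking.
    """
    if len(infos) == 4:
        cardtime, kms, displacement, speedbox = infos
        cardplace = ''
    else:
        cardtime, kms, cardplace, displacement, speedbox = infos[:5]
    return {
        'cardtime': cardtime,
        'kms': kms,
        'cardplace': cardplace,
        'displacement': displacement,
        'speedbox': speedbox,
    }


# Invented sample values, for illustration only
print(parse_price('8.50万'))
print(map_specs(['2015-06', '3.2万公里', '1.6', '自动']))
```

Returning None instead of calling .group(0) directly lets the caller decide what to do with a listing that has no readable price, rather than crashing the whole crawl.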

Data analysis of the scraped results will follow in a later post.
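
As one possible starting point for that analysis, the CSV can be read back with the standard library alone. This is only a sketch: the function name average_price is hypothetical, the column name is passed in as a parameter because it depends on the header row actually written, and the sample rows are invented rather than scraped.

```python
import csv
import io


def average_price(csv_file, price_column):
    """Average the numeric values in one column of a CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    prices = []
    for row in reader:
        try:
            prices.append(float(row[price_column]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric price
    return sum(prices) / len(prices) if prices else None


# Invented sample data standing in for the scraped file
sample = io.StringIO("City,Price\nMianyang,8.50\nMianyang,5\nMianyang,n/a\n")
print(average_price(sample, 'Price'))  # 6.75
```

For a real file, the same function would take an open file object, for example one opened with encoding='utf-8-sig' and newline='' if the file was written that way.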
