[Python Crawler] Part 31: Scraping 消费主张 program listings with Selenium + PhantomJS and pyquery

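1. Introduction

This example uses Selenium + PhantomJS to load the CCTV search results for the program 消费主张 (http://search.cctv.com/search.php?qtext=消费主张&type=video) and pyquery to extract the title and time of each video from the rendered page.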
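The script in this post passes the search keyword to the site in percent-encoded form. As a minimal sketch of how that URL can be derived from the readable one, the snippet below builds it with the standard library. The helper name `build_search_url` and the constant `SEARCH_URL` are illustrative and do not appear in the original code.

```python
from urllib.parse import urlencode

# Base address of the search page, taken from the URL in the introduction.
SEARCH_URL = "http://search.cctv.com/search.php"


def build_search_url(keyword, media_type="video"):
    """Return the search URL with the keyword percent-encoded as UTF-8.

    Hypothetical helper: only the two query parameters visible in the
    post's URL (qtext and type) are set.
    """
    query = urlencode({"qtext": keyword, "type": media_type})
    return "{0}?{1}".format(SEARCH_URL, query)


print(build_search_url("消费主张"))
```

The result matches the default `url` argument used by `CatchData`, so the keyword could be swapped for another program name without encoding it by hand.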

2. Site information

Python code

# coding=utf-8
import datetime
import os
import re
import time

from pyquery import PyQuery as pq
from selenium import webdriver

import mongoDB  # presumably a project-local helper module; its source is not shown in this post

class consumer:

    def __init__(self):
        # Alternative drivers (the IEDriverServer.exe path would come from a config file):
        # IEDriverServer = r'C:\Program Files\Internet Explorer\IEDriverServer.exe'
        # self.driver = webdriver.Ie(IEDriverServer)
        # self.driver = webdriver.Chrome()

        # Headless PhantomJS with image loading disabled.
        # PhantomJS support was removed in Selenium 4, so this needs Selenium 3.x.
        self.driver = webdriver.PhantomJS(service_args=['--load-images=false'])
        self.driver.set_page_load_timeout(10)
        self.driver.maximize_window()
        self.db = mongoDB.mongoDbBase()

    def WriteLog(self, message, date):
        # Append the message to consumer/<date>.txt under the current working directory.
        fileName = os.path.join(os.getcwd(), 'consumer', date + '.txt')
        os.makedirs(os.path.dirname(fileName), exist_ok=True)
        with open(fileName, 'a', encoding='utf-8') as f:
            f.write(message)

    # http://search.cctv.com/search.php?qtext=消费主张&type=video
    def CatchData(self, url='http://search.cctv.com/search.php?qtext=%E6%B6%88%E8%B4%B9%E4%B8%BB%E5%BC%A0&type=video'):
        try:
            self.driver.get(url)
            time.sleep(1)

            # One output file per day, starting with a header row.
            filename = datetime.datetime.now().strftime('%Y-%m-%d')
            self.WriteLog('{0},{1}'.format('Title', 'Time'), filename)

            # Timestamp embedded in each title attribute, e.g. 2018-06-05 00:12:21
            pattern = re.compile(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}")

            # Load result pages 1 to 5 by calling the page's own get_data() function.
            for page in range(1, 6):
                script = "get_data('{0}', '消费主张', 'relevance', 'video', '-1', '1', '', '20', '1')".format(page)
                self.driver.execute_script(script)
                # Short pause in case get_data() updates the page asynchronously;
                # the delay actually needed may differ.
                time.sleep(1)

                selenium_html = self.driver.execute_script("return document.documentElement.outerHTML")
                doc = pq(selenium_html)

                for sub in doc("div[class='jvedio']").find("a").items():
                    title = sub.attr('title') or ''
                    ts = pattern.findall(title)
                    # Accept the timestamp only when exactly one match is found.
                    strtime = ts[0] if len(ts) == 1 else ''
                    if strtime:
                        # Keep only the text in front of the timestamp.
                        title = title[:title.index(strtime)]
                    self.WriteLog('\n{0},{1}'.format(title, strtime), filename)
        except Exception as e1:
            # Report the failure instead of silently discarding it.
            print(e1)

    # Earlier version: instead of hard-coding page numbers, it runs the onclick
    # handler of every pagination link found on the first result page.
    # def CatchData(self, url='http://search.cctv.com/search.php?qtext=%E6%B6%88%E8%B4%B9%E4%B8%BB%E5%BC%A0&type=video'):
    #     try:
    #         self.driver.get(url)
    #         time.sleep(1)
    #         selenium_html = self.driver.execute_script("return document.documentElement.outerHTML")
    #         doc = pq(selenium_html)
    #         filename = datetime.datetime.now().strftime('%Y-%m-%d')
    #         pages = doc("div[class='page']").find("a")
    #         for element in pages.items():
    #             url = element.attr('onclick')
    #             # get_data('1','消费主张','relevance','video','-1','1','','20','1')
    #             # get_data('2', '消费主张', 'relevance', 'video', '-1', '1', '', '20', '1')
    #             print(url)
    #             self.driver.execute_script(url)
    #             selenium_html = self.driver.execute_script("return document.documentElement.outerHTML")
    #             doc = pq(selenium_html)
    #             Elements = doc("div[class='jvedio']").find("a")
    #             for sub in Elements.items():
    #                 title = sub.attr('title')
    #                 print(title)
    #                 self.WriteLog('\n{0}'.format(title), filename)
    #     except Exception as e1:
    #         print(e1)

obj = consumer()
try:
    obj.CatchData()
finally:
    obj.driver.quit()  # shut down the PhantomJS process even if scraping fails

# obj.CatchContent('')
# obj.export('')
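Two steps inside `CatchData` are plain string handling and can be tried without a browser: building the `get_data(...)` call for a given page, and splitting a link's title attribute into title and timestamp. The sketch below isolates both. The function names `build_page_script` and `split_title` and the sample title are made up for illustration; the `get_data` arguments other than the page number and keyword are copied unchanged from the script, and their meaning is not documented in this post.

```python
import re

# Same timestamp pattern as in CatchData, e.g. "2018-06-05 00:12:21".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}")


def build_page_script(page, keyword="消费主张"):
    """Return the JavaScript call that requests one page of results.

    Hypothetical helper; the trailing arguments are copied from the
    original script as opaque constants.
    """
    return ("get_data('{0}', '{1}', 'relevance', 'video', "
            "'-1', '1', '', '20', '1')").format(page, keyword)


def split_title(raw_title):
    """Split a title attribute into (title, timestamp).

    As in the original loop, the timestamp is accepted only when the
    pattern matches exactly once; otherwise the title is returned
    unchanged with an empty timestamp.
    """
    matches = TIMESTAMP.findall(raw_title)
    if len(matches) != 1:
        return raw_title, ""
    stamp = matches[0]
    return raw_title[:raw_title.index(stamp)], stamp


# Made-up sample value, not taken from the site.
sample = "Sample episode title 2018-06-05 00:12:21"
print(split_title(sample))
print(build_page_script(2))
```

Like the original loop, `split_title` does not strip whitespace, so any space between the title text and the timestamp stays at the end of the title.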
