[Python爬虫] 之十四：Selenium +phantomjs抓取媒介360数据

　　具体代码如下：

# coding=utf-8
import os
import re
from selenium import webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.select import Select
import IniFile
from selenium.webdriver.common.keys import Keys
from threading import Thread
import thread
import LogFile
import urllib
import mongoDB
#抓取数据线程类
class ScrapyData_Thread(Thread):
    #抓取数据线程类
    def __init__(self,webSearchUrl,inputTXTIDLabel,searchlinkIDLabel,htmlLable,htmltime,keyword,invalid_day,db):
        '''
        构造函数
        :param webSearchUrl: 搜索页url
        :param inputTXTIDLabel: 搜索输入框的标签
        :param searchlinkIDLabel: 搜索链接的标签
        :param htmlLable: 要搜索的标签
        :param htmltime: 要搜索的时间
        :param invalid_day: 要搜索的关键字，多个关键字中间用分号(;)隔开
        :param keywords: #资讯发布的有效时间，默认是3天以内
        :param db: 保存数据库引擎
        '''
        Thread.__init__(self)

        self.webSearchUrl = webSearchUrl
        self.inputTXTIDLabel = inputTXTIDLabel
        self.searchlinkIDLabel = searchlinkIDLabel
        self.htmlLable = htmlLable
        self.htmltime = htmltime
        self.keyword = keyword
        self.invalid_day = invalid_day
        self.db = db

        self.driver = webdriver.PhantomJS()
        self.wait = ui.WebDriverWait(self.driver, 20)
        self.driver.maximize_window()

    def Comapre_to_days(self,leftdate, rightdate):
        '''
        比较连个字符串日期，左边日期大于右边日期多少天
        :param leftdate: 格式：2017-04-15
        :param rightdate: 格式：2017-04-15
        :return: 天数
        '''
        l_time = time.mktime(time.strptime(leftdate, '%Y-%m-%d'))
        r_time = time.mktime(time.strptime(rightdate, '%Y-%m-%d'))
        result = int(l_time - r_time) / 86400
        return result

    def run(self):
        print '关键字：%s' % self.keyword

        self.driver.get(self.webSearchUrl)
        time.sleep(1)

        # js = "var obj = document.getElementById('ctl00_ContentPlaceHolder1_txtSearch');obj.value='" + self.keyword + "';"
        # self.driver.execute_script(js)
        # 点击搜索链接
        ss_elements = self.driver.find_element_by_id(self.inputTXTIDLabel)
        ss_elements.send_keys(unicode(self.keyword,'utf8'))

        search_elements = self.driver.find_element_by_id(self.searchlinkIDLabel)
        search_elements.click()
        time.sleep(4)

        self.wait.until(lambda driver: self.driver.find_elements_by_xpath(self.htmlLable))
        Elements = self.driver.find_elements_by_xpath(self.htmlLable)

        timeElements = self.driver.find_elements_by_xpath(self.htmltime)
        urlList = []
        for hrefe in Elements:
            urlList.append(hrefe.get_attribute('href').encode('utf8'))

        index = 0
        strMessage = ' '
        strsplit = '\n------------------------------------------------------------------------------------\n'
        index = 0
        # 每页中有用记录
        recordCount = 0
        usefulCount = 0
        meetingList = []
        kword = self.keyword
        currentDate = time.strftime('%Y-%m-%d')

        for element in Elements:

            strDate = timeElements[index].text.encode('utf8')
            # 日期在3天内并且搜索的关键字在标题中才认为是复合要求的数据
            #因为搜索的记录是安装时间倒序，所以如果当前记录的时间不在3天以内，那么剩下的记录肯定是小于当前日期的，所以就退出
            if self.Comapre_to_days(currentDate,strDate) < self.invalid_day:
                usefulCount =  1
                txt = element.text.encode('utf8')
                if txt.find(kword) > -1:
                    dictM = {'title': txt, 'date': strDate,
                             'url': urlList[index], 'keyword': kword, 'info': txt}
                    meetingList.append(dictM)

                    # print ' '
                    # print '标题：%s' % txt
                    # print '日期：%s' % strDate
                    # print '资讯链接：' + urlList[index]
                    # print strsplit

                    # # log.WriteLog(strMessage)
                    recordCount = recordCount + 1
            else:
                usefulCount = 0

            if usefulCount:
                break
            index = index + 1

        self.db.SaveMeetings(meetingList)
        print "共抓取了: %d 个符合条件的资讯记录" % recordCount

        self.driver.close()
        self.driver.quit()

if __name__ == '__main__':

    configfile = os.path.join(os.getcwd(), 'chinaMedia.conf')
    cf = IniFile.ConfigFile(configfile)
    webSearchUrl = cf.GetValue("section", "webSearchUrl")
    inputTXTIDLabel = cf.GetValue("section", "inputTXTIDLabel")
    searchlinkIDLabel = cf.GetValue("section", "searchlinkIDLabel")
    htmlLable = cf.GetValue("section", "htmlLable")
    htmltime = cf.GetValue("section", "htmltime")

    invalid_day = int(cf.GetValue("section", "invalid_day"))

    keywords= cf.GetValue("section", "keywords")
    keywordlist = keywords.split(';')
    start = time.clock()
    db = mongoDB.mongoDbBase()
    for keyword in keywordlist:
        if len(keyword) > 0:
            t = ScrapyData_Thread(webSearchUrl,inputTXTIDLabel,searchlinkIDLabel,htmlLable,htmltime,keyword,invalid_day,db)
            t.setDaemon(True)
            t.start()
            t.join()

    end = time.clock()
    print "整个过程用时间: %f 秒" % (end - start)

[Python爬虫] 之十四：Selenium +phantomjs抓取媒介360数据的更多相关文章

[Python爬虫] 之十：Selenium +phantomjs抓取活动行中会议活动
一.介绍本例子用Selenium +phantomjs爬取活动树(http://www.huodongshu.com/html/find_search.html?search_keyword=数字) ...
[Python爬虫] 之九：Selenium +phantomjs抓取活动行中会议活动（单线程抓取）
思路是这样的,给一系列关键字:互联网电视:智能电视:数字:影音:家庭娱乐:节目:视听:版权:数据等.在活动行网站搜索页(http://www.huodongxing.com/search?city=% ...
[Python爬虫] 之十一：Selenium +phantomjs抓取活动行中会议活动信息
一.介绍本例子用Selenium +phantomjs爬取活动行(http://www.huodongxing.com/search?qs=数字&city=全国&pi=1)的资讯信息 ...
[Python爬虫] 之十三：Selenium +phantomjs抓取活动树会议活动数据
抓取活动树网站中会议活动数据(http://www.huodongshu.com/html/index.html) 具体的思路是[Python爬虫] 之十一中抓取活动行网站的类似,都是用多线程来抓取, ...
[Python爬虫] 之三十：Selenium +phantomjs 利用 pyquery抓取栏目
一.介绍本例子用Selenium +phantomjs爬取栏目(http://tv.cctv.com/lm/)的信息二.网站信息三.数据抓取首先抓取所有要抓取网页链接,共39页,保存到数据库里 ...
PYTHON 爬虫笔记十:利用selenium+PyQuery实现淘宝美食数据搜集并保存至MongeDB（实战项目三）
利用selenium+PyQuery实现淘宝美食数据搜集并保存至MongeDB 目标站点分析淘宝页面信息很复杂的,含有各种请求参数和加密参数,如果直接请求或者分析Ajax请求的话会很繁琐.所以我们可 ...
C#使用Selenium+PhantomJS抓取数据
本文主要介绍了C#使用Selenium+PhantomJS抓取数据的方法步骤,具有很好的参考价值,下面跟着小编一起来看下吧手头项目需要抓取一个用js渲染出来的网站中的数据.使用常用的httpclie ...
selenium+PhantomJS 抓取淘宝搜索商品
最近项目有些需求,抓取淘宝的搜索商品,抓取的品类还多.直接用selenium+PhantomJS 抓取淘宝搜索商品,快速完成. #-*- coding:utf-8 -*-__author__ =''i ...
[Python爬虫] 之十二：Selenium +phantomjs抓取中的url编码问题
最近在抓取活动树网站 (http://www.huodongshu.com/html/find.html) 上数据时发现,在用搜索框输入中文后,点击搜索,phantomjs抓取数据怎么也抓取不到,但是 ...

随机推荐

OpenCL与CUDA，CPU与GPU
OpenCL OpenCL(全称Open Computing Language,开放运算语言)是第一个面向异构系统通用目的并行编程的开放式.免费标准,也是一个统一的编程环境,便于软件开发人员为高性能计 ...
Windows7 + OSG3.6 + VS2017 + Qt5.11
一.准备工作下载需要的材料: 1. OSG稳定版源代码, 3.6.3版本 2. 第三方库,选择VS2017对应的版本 https://download.osgvisual.org/3rdParty ...
mybatis 报错： Invalid bound statement (not found)
错误: org.apache.ibatis.binding.BindingException: Invalid bound statement (not found): test.dao.Produc ...
Python 2.7.13安装
参考文章:安装Python 进入至Python官方网站,点击下载下载完成后直接进行安装选择安装的路径选择安装的组件,请注意选择安装pip和Add python.exe to Path这两个选项 ...
Centos的 mysql for python的下载与安装
mysql-python的安装包下载地址:http://sourceforge.net/projects/mysql-python/files/latest/download linux环境是 Cen ...
Codeforces Round #423 A Restaurant Tables（模拟）
A. Restaurant Tables time limit per test 1 second memory limit per test 256 megabytes input standard ...
腾讯QQ的聊天记录中的图片记录造假
前不久和朋友在群里聊天时,突然出现了一个BUG,就是一个群友发了A图片,但在我这边显示得却是B图片.当时就猜测,腾讯为了节省流量或者手机资源的原因,给每一张图片弄了个唯一ID,遇到相同ID的就直接从本 ...
Flask实战第56天：板块管理
cms布局编辑 cms_boards.html {% block main_content %} <div class="top-box"> <button c ...
Flask实战第43天：把图片验证码和短信验证码保存到memcached中
前面我们已经获取到图片验证码和短信验证码,但是我们还没有把它们保存起来.同样的,我们和之前的邮箱验证码一样,保存到memcached中编辑commom.vews.py .. from utils i ...
Linux基础系列-系统密码破解
无引导介质(光盘.iso)救援模式下root密码破解第一步: GRUB启动画面读秒时按上下方向键,进入GRUB界面第二步: 使用上下光标键选择要修改的操作系统启动内核(默认选择的即可),按e键进行 ...

[Python爬虫] 之十四：Selenium +phantomjs抓取媒介360数据

[Python爬虫] 之十四：Selenium +phantomjs抓取媒介360数据的更多相关文章

随机推荐

热门专题