Python example: scraping company information from a web page

Requirement: scrape the company information listed on a website.

The code is as follows:

import json
import os
import re
import time

import requests

# Requests below use verify=False, so silence urllib3's InsecureRequestWarning.
requests.packages.urllib3.disable_warnings()

# Request the URL and return the response body, or None if the request fails.
def getPage(url, headers):
    try:
        # A timeout keeps the script from hanging on an unresponsive server.
        response = requests.get(url=url, headers=headers, verify=False, timeout=10)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        print('Request failed: {} status: {}'.format(url, response.status_code))
    except requests.RequestException as e:
        print('Request failed: {} error: {}'.format(url, e))
    return None

# Remove duplicate lines from a file, using destpath as a temporary file.
def file2uniq(file, destpath):
    count_before = 0
    count_after = 0
    addrs = set()
    with open(file, 'r', encoding='utf8') as scan_file:
        for line in scan_file:
            count_before += 1
            addrs.add(line)
    with open(destpath, 'w', encoding='utf8') as outfile:
        # A set is unordered, so the lines are written back in arbitrary order.
        for line in addrs:
            count_after += 1
            outfile.write(line)
    try:
        # os.replace overwrites the original file with the deduplicated copy.
        os.replace(destpath, file)
    except OSError as e:
        print(e)
        print('rename file fail')
    else:
        print('rename file success')
    print("Lines before deduplication: " + str(count_before))
    print("Lines after deduplication: " + str(count_after))
    return count_before, count_after

# Extract the page content with a regular expression.
def parseHtml(html):
    #pattern = re.compile(r'<tr> <td class="tx">.+\s(.+)', re.I)  # case-insensitive; matches the stock name
    # Case-insensitive; captures the full company name.
    pattern = re.compile(r'<td class="text-center">.+</td> <td> <a href="/firm_.+">\s(.+)', re.I)
    # Alternative patterns that capture the securities company:
    #pattern = re.compile(r'\t(.+)[\s]+</a> </td> <td class="text-center">.+</td> <td class="text-center">.+</td> </tr>', re.I)
    #pattern = re.compile(r'\t(.+)\s\t\t\t\t\t\t\t  </a> </td> <td class="text-center">.+</td> <td class="text-center">.+</td> </tr> <tr> <td class="tx">', re.I)  # case-insensitive
    #pattern = re.compile(r'</a>\s</td>\s<td class="text-center">.+</td> <td> <a href="/firm_.+.html">\s(.+)[\s]+</a> </td> <td> <a href="/firm_.+.html">\s(.+)', re.I)  # case-insensitive; matches the stock name
    items = pattern.findall(html)
    for item in items:
        yield {
            'orgName': item.strip(),
        }

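# Append one record to the output file as a JSON line.
# Note: `file` is the module-level name assigned in the __main__ block.
def write2txt(content):
    with open(file, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')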

def removeStr(old_str, new_str):
    """Replace every occurrence of old_str with new_str in the output file."""
    with open(file, 'r', encoding='utf-8') as f:
        file_data = f.read().replace(old_str, new_str)
    with open(file, 'w', encoding='utf-8') as f:
        f.write(file_data)

def main(page):
    #url = 'https://www.qichacha.com/elib_sanban.html?p=' + str(page)
    url = 'https://www.qichacha.com/elib_ipo.html?p=' + str(page)  # e.g. https://www.qichacha.com/elib_ipo.html?p=2
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }
    print(url)
    html = getPage(url, headers)
    if html is None:
        # The request failed, so there is nothing to parse for this page.
        return
    for item in parseHtml(html):
        print(item)
        write2txt(item)
    if not os.path.exists(file):
        # Nothing was extracted, so the output file was never created.
        return
    # Strip the JSON wrapper so that each line holds only the company name.
    removeStr(r'{"orgName": "', '')
    removeStr(r'"}', '')
    file2uniq(file, destpath)

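if __name__ == '__main__':
    file = r'orgName.txt'
    #file = r'midOrg.txt'
    #sourcepath = r'sanban.txt'
    destpath = r'temp.txt'
    # range(1, 2) fetches page 1 only; widen the range to crawl more pages.
    for page in range(1, 2):
        main(page)
        # Pause between requests to avoid hammering the server.
        time.sleep(1)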
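
The file2uniq function collects lines in a set, so the deduplicated file comes back in arbitrary order. If the original order of the company names matters, one possible alternative is sketched below. The helper names unique_lines and dedupe_file are hypothetical and not part of the script above; the sketch assumes Python 3.7+ (where dict preserves insertion order) and a file small enough to read into memory.

```python
import os
import tempfile


def unique_lines(lines):
    """Return the lines with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order, so this acts as an ordered set.
    return list(dict.fromkeys(lines))


def dedupe_file(path):
    """Rewrite the file at `path` without duplicate lines.

    Returns a (lines_before, lines_after) tuple.
    """
    with open(path, 'r', encoding='utf-8') as src:
        lines = src.read().splitlines()
    unique = unique_lines(lines)
    # Write to a temporary file in the same directory, then swap it into place.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as dst:
        dst.writelines(line + '\n' for line in unique)
    os.replace(tmp_path, path)
    return len(lines), len(unique)
```

Calling dedupe_file('orgName.txt') would then stand in for file2uniq(file, destpath); no separate temporary file name is needed because tempfile.mkstemp picks one.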
