A Python program that scrapes the latest province codes from the Ministry of Civil Affairs website, using requests, BeautifulSoup, and lxml

Written for Python 3.6.

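On the Ministry of Civil Affairs website, the page structure can differ from one year's data to the next. This caused a lot of trouble and is the reason the code kept growing longer.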

If this code stops working in the future, start by checking carefully whether the page structure has changed.
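One way to run that check is to count how many elements on a saved page still match the patterns the script depends on. The sketch below is a dependency-free approximation and not part of the original program: `TagCounter`, `PROBES`, and `check_structure` are hypothetical names, it uses the standard-library `html.parser` instead of BeautifulSoup, and it matches only a tag name plus one attribute value, so it ignores parent/child combinators such as `table > tr`. The `sample` string is made-up HTML, not real page content.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags that match (tag, attribute, value) probes."""

    def __init__(self, probes):
        super().__init__()
        # probes: {label: (tag_name, attr_name, attr_value)}
        self.probes = probes
        self.counts = {label: 0 for label in probes}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for label, (name, attr, value) in self.probes.items():
            if tag != name:
                continue
            # An attribute such as class may hold several space-separated tokens.
            tokens = (attr_map.get(attr) or "").split()
            if value in tokens:
                self.counts[label] += 1


# Hypothetical probes mirroring the selectors used by the script.
PROBES = {
    "td.arlisttd": ("td", "class", "arlisttd"),
    'tr[height="19"]': ("tr", "height", "19"),
    'tr[height="20"]': ("tr", "height", "20"),
}


def check_structure(html):
    """Return {selector_label: match_count} for one page of HTML."""
    counter = TagCounter(PROBES)
    counter.feed(html)
    return counter.counts


# Invented sample: one listing cell, no data rows.
sample = '<table><tr><td class="arlisttd"><a href="/x.shtml" title="2018">x</a></td></tr></table>'
report = check_structure(sample)
```

If every probe reports zero matches on a freshly downloaded page, the layout has probably changed and the selectors need revisiting.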

 # -*- coding: utf-8 -*-
 """
 Created on Wed Jul 10 14:40:41 2019

 @author: Administrator
 """
 import pandas as pd
 import requests
 from bs4 import BeautifulSoup
 import time

 url1 = 'http://www.mca.gov.cn/article/sj/xzqh//1980/'
 headers = {'content-type': 'application/json',
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}

 # 1. Collect all links ================================================================
 def f1(url1):
     'Return [url, year] pairs for every "Administrative division codes of the PRC" page, 2018 back to 1980.'
     # Send the request with requests, passing the URL and the shared headers.
     response = requests.get(url1, headers=headers, timeout=200, verify=False)
     soup = BeautifulSoup(response.text, 'lxml')  # parse the page source into a BeautifulSoup object
     _tmp1 = soup.select('td.arlisttd')
     end_1 = []
     for i in _tmp1:
         _a = i.select('a')[0].get('href')
         _b = i.select('a')[0].get('title')[:4]  # first 4 characters of the title (the year)
         end_1.append(['http://www.mca.gov.cn' + _a, _b])
     return end_1

 # Fetch the listing at url1 and at its '?2' and '?3' variants.
 end_2 = []
 for i in ['', '?2', '?3']:
     end_2 = end_2 + f1(url1 + i)

 def f2(url1='http://www.mca.gov.cn/article/sj/xzqh/2019/'):
     'Return [url, label] pairs for the 2019 "Administrative division codes of the PRC" pages.'
     response = requests.get(url1, headers=headers, timeout=200, verify=False)
     soup = BeautifulSoup(response.text, 'lxml')
     _tmp1 = soup.select('td.arlisttd')
     end_1 = []
     for i in _tmp1:
         _a = i.select('a')[0].get('href')
         _b = i.select('a')[0].get('title')[:7]  # first 7 characters of the title, e.g. '2019年5月'
         end_1.append(['http://www.mca.gov.cn' + _a, _b])
     return end_1

 end_2 = end_2 + f2()

 # 2. Fetch the data ===================================================================
 def f3(url1='http://www.mca.gov.cn/article/sj/xzqh/1980/201903/20190300014989.shtml'):
     'Return [code, name] rows scraped from the data page that url1 leads to.'
     # Alternative values of url1:
     #url1='http://www.mca.gov.cn/article/sj/xzqh/1980/201507/20150715854922.shtml'
     #url1='http://www.mca.gov.cn/article/sj/xzqh/1980/201507/20150715854918.shtml'
     response = requests.get(url1, headers=headers, timeout=200, verify=False)
     soup = BeautifulSoup(response.text, 'lxml')
     # Some pages redirect to the data page via window.location.href in the fifth <script>.
     _scripts = soup.select('script')
     _txt = ''
     if len(_scripts) > 4:  # guard against IndexError on pages with fewer scripts
         _txt = _scripts[4].get_text().strip().replace('window.location.href="', '').strip('";')
     if _txt[-4:] == 'html':
         print('script!')
         url2 = _txt
     else:
         # Otherwise the data page is linked from the article body.
         _tmp1 = soup.select('div.artext > div > p > a')
         if len(_tmp1) == 0:
             _tmp1 = soup.select('div#zoom > a')
         url2 = _tmp1[0].get('href')
     print(url2)
     #return url2
     #url2='http://www.mca.gov.cn/article/sj/tjbz/a/201713/201708220856.html'
     time.sleep(0.5)
     response = requests.get(url2, headers=headers, timeout=200, verify=False)
     soup = BeautifulSoup(response.text, 'lxml')  # parse the page source into a BeautifulSoup object
     _tmp1 = soup.select('table > tr[height="19"]')
     end_1 = []
     if len(_tmp1) > 5:
         # Layout with height="19" rows: first value in the second cell, second value in the third.
         for i in _tmp1:
             _a = i.select('td')[1].get_text().strip()
             if len(_a) > 15:  # on some data pages the last row is a note
                 continue
             else:
                 _b = i.select('td')[2].get_text().strip()
                 end_1.append([_a, _b])
     else:
         # Layout with height="20" rows: first value in the first cell, second value in the second.
         _tmp1 = soup.select('table > tr[height="20"]')
         for i in _tmp1:
             _a = i.select('td')[0].get_text().strip()
             # On some data pages the last row is a note; also skip the header row
             # ('行政区划代码' means "administrative division code").
             if len(_a) > 15 or _a == '行政区划代码':
                 continue
             else:
                 _b = i.select('td')[1].get_text().strip()
                 end_1.append([_a, _b])
     return end_1

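 # Loop over every link and fetch its data.
 end_3 = []  #end_4=[]
 for j in range(len(end_2)):
     item = end_2[j]
     # Both filter strings are empty in the published code, so this test is always True.
     if ''  in item[1] or ''  in item[1]:
         print(j, item[0], item[1])
         tmp2 = f3(item[0])
         print('.')
         end_3.extend([[item[1]] + i for i in tmp2])  # prefix each row with its year label
         #end_4.append(tmp2)
         time.sleep(0.1)

 df_result = pd.DataFrame(end_3)
 #pd.DataFrame(end_4).to_excel('所有连接.xlsx',index=False)
 df_result.to_excel('地区编码.xlsx', index=False)  # file name means "region codes"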

 '''
 #\3 2019年5月份县以上行政区划代码_3852 > table > tbody > tr:nth-child(4)
 #list_content > div.list_right > div > ul > table > tbody > tr:nth-child(1) > td.arlisttd > a
 '''
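The redirect lookup in `f3` depends on the target sitting in the fifth `<script>` tag. A less position-dependent variant, sketched here as an assumption and not as the author's method, searches the whole page source for the `window.location.href` assignment with a regular expression. `REDIRECT_RE` and `find_redirect` are hypothetical names, and the sample HTML and `example.com` URL are invented for illustration.

```python
import re

# Matches: window.location.href = "..." (double-quoted target only).
REDIRECT_RE = re.compile(r'window\.location\.href\s*=\s*"([^"]+)"')


def find_redirect(html):
    """Return the URL assigned to window.location.href, or None if absent."""
    match = REDIRECT_RE.search(html)
    return match.group(1) if match else None


# Invented sample pages: one with a script redirect, one without.
sample_page = '<html><script>window.location.href="http://example.com/data.html";</script></html>'
redirect = find_redirect(sample_page)
no_redirect = find_redirect("<html><body>plain article</body></html>")
```

In `f3`, a `None` result would then fall through to the link-based lookup (`div.artext > div > p > a`, then `div#zoom > a`).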
