Scraping Gitee's Trending Open-Source Projects with Python and BeautifulSoup

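I. Installation

1. requests sends the HTTP request and gives access to the response content; requests.get() returns a Response object.

pip install requests

2. BeautifulSoup parses web pages; it is flexible, efficient, and convenient to use, and it supports several parsers.

pip install beautifulsoup4

3. pymongo is the Python toolkit for working with MongoDB.

pip install pymongo

4. Install MongoDB.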
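
Before moving on, it can help to confirm that the three packages really are installed. The sketch below is one way to check, using the standard library's importlib.metadata (Python 3.8+). The helper name installed_versions is made up for this illustration; it only looks up version metadata and does not import the packages or touch MongoDB itself.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip in the steps above.
PACKAGES = ("requests", "beautifulsoup4", "pymongo")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs its pip command run again in the same Python environment that will run the scraper.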

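II. Analyzing the Page and Its Source

1. Pick the target: first decide which section of which page to scrape.

2. Analyze the target: once the target is fixed, work out the URL format and what each query parameter means, then read the page source to determine how the data is structured.

3. Write the scraper and run it.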
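
As a small illustration of step 2, the listing URL used by the script in this post is a fixed path plus a lang query parameter. The sketch below builds that URL with the standard library and then reads the parameter back out. build_explore_url is a hypothetical helper name, and the path is the one that appears in the script; whether Gitee still serves it is not something this sketch checks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

DOMAIN = "https://gitee.com"
PATH = "/explore/starred"  # path used by the script in this post


def build_explore_url(language):
    """Return the listing URL for one language (hypothetical helper)."""
    # urlencode escapes characters that are unsafe in a query string.
    return f"{DOMAIN}{PATH}?{urlencode({'lang': language})}"


url = build_explore_url("python")
print(url)

# Going the other way: pull the parameters back out of a URL.
params = parse_qs(urlsplit(url).query)
print(params["lang"])
```

Using urlencode rather than plain string formatting matters for language names with special characters, such as c++, where the plus signs have to be escaped.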

III. Writing the Code

# -*- coding: utf-8 -*-
# __author__ : "初一丶" (WeChat official account: 程序员共成长)
# __time__ : 2018/8/22 18:51
# __file__ : spider_mayun.py

# Import the required libraries
import requests
from bs4 import BeautifulSoup
import pymongo

"""
Inspecting the page URL shows that the `lang` query parameter
decides which language's trending projects are listed.
"""
# language = 'java'
language = 'python'

domain = 'https://gitee.com'
uri = '/explore/starred?lang=%s' % language
url = domain + uri

# User agent
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/67.0.3396.99 Safari/537.36')

# Build the request headers (the header name is hyphenated: User-Agent)
header = {'User-Agent': user_agent}

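# Fetch the page source
html = requests.get(url, headers=header).text

# Build the BeautifulSoup object; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html, 'html.parser')

# Trending categories: the data-tab attribute distinguishes today's trending from this week's
hot_type = ['today-trending', 'week-trending']
# divs = soup.find_all('div', class_='ui tab active')

# List that collects the trending projects
hot_gitee = []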

for i in hot_type:
    # Find the block that holds the projects for this trending tab
    divs = soup.find_all('div', attrs={'data-tab': i})
    divs = divs[0].select('div.row')

    for div in divs:
        gitee = {}
        a_content = div.select('div.sixteen > h3 > a')
        div_content = div.select('div.project-desc')

        # Project description (.string returns None unless the tag contains a single string)
        script = div_content[0].string

        # The title attribute has the form "author/project"
        title = a_content[0]['title']
        arr = title.split('/')
        # Author name
        author_name = arr[0]
        # Project name
        project_name = arr[1]

        # Project URL
        href = domain + a_content[0]['href']

        # Open the project's own page
        child_page = requests.get(href, headers=header).text
        child_soup = BeautifulSoup(child_page, 'html.parser')
        child_div = child_soup.find('div', class_='ui small secondary pointing menu')
        """
            <div class="ui small secondary pointing menu">
                <a class="item active" data-type="http" data-url="https://gitee.com/dlg_center/cms.git">HTTPS</a>
                <a class="item" data-type="ssh" data-url="git@gitee.com:dlg_center/cms.git">SSH</a>
            </div>
        """
        a_arr = child_div.find_all('a')
        # Git HTTPS clone URL
        http_url = a_arr[0]['data-url']
        # Git SSH clone URL
        ssh_url = a_arr[1]['data-url']

        gitee['project_name'] = project_name
        gitee['author_name'] = author_name
        gitee['href'] = href
        gitee['script'] = script
        gitee['http_url'] = http_url
        gitee['ssh_url'] = ssh_url
        gitee['hot_type'] = i

        # Add the record to the list
        hot_gitee.append(gitee)

print(hot_gitee)

# MongoDB connection parameters
HOST, PORT, DB, TABLE = '127.0.0.1', 27017, 'spider', 'gitee'

# Create the client
client = pymongo.MongoClient(host=HOST, port=PORT)

# Select the database and the collection
db = client[DB]
tables = db[TABLE]

# Insert the records into MongoDB
tables.insert_many(hot_gitee)
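
The script picks the HTTPS and SSH links by position (a_arr[0] and a_arr[1]), which depends on the order of the tabs on the project page. A slightly more defensive variant, sketched here against the sample markup quoted in the script's docstring, looks the links up by their data-type attribute instead. The helper name clone_urls is hypothetical, and the live page markup may differ from this sample.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the docstring in the script above.
SAMPLE = """
<div class="ui small secondary pointing menu">
    <a class="item active" data-type="http" data-url="https://gitee.com/dlg_center/cms.git">HTTPS</a>
    <a class="item" data-type="ssh" data-url="git@gitee.com:dlg_center/cms.git">SSH</a>
</div>
"""


def clone_urls(html):
    """Return {data-type: data-url} for every menu link that carries both attributes."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("div.menu a[data-type][data-url]")
    return {a["data-type"]: a["data-url"] for a in links}


urls = clone_urls(SAMPLE)
print(urls.get("http"))  # .get() yields None for a missing tab instead of raising
print(urls.get("ssh"))
```

With a dictionary keyed by data-type, a page that lacks one of the tabs produces a None value rather than an IndexError that stops the whole crawl.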

IV. Results

[{
 'project_name': 'IncetOps',
 'author_name': 'staugur',
 'href': 'https://gitee.com/staugur/IncetOps',
 'script': '基于Inception，一个审计、执行、回滚、统计sql的开源系统',
 'http_url': 'https://gitee.com/staugur/IncetOps.git',
 'ssh_url': 'git@gitee.com:staugur/IncetOps.git',
 'hot_type': 'today-trending'
}, {
 'project_name': 'cms',
 'author_name': 'dlg_center',
 'href': 'https://gitee.com/dlg_center/cms',
 'script': None,
 'http_url': 'https://gitee.com/dlg_center/cms.git',
 'ssh_url': 'git@gitee.com:dlg_center/cms.git',
 'hot_type': 'today-trending'
}, {
 'project_name': 'WebsiteAccount',
 'author_name': '张聪',
 'href': 'https://gitee.com/crazy_zhangcong/WebsiteAccount',
 'script': '各种问答平台账号注册',
 'http_url': 'https://gitee.com/crazy_zhangcong/WebsiteAccount.git',
 'ssh_url': 'git@gitee.com:crazy_zhangcong/WebsiteAccount.git',
 'hot_type': 'today-trending'
}, {
 'project_name': 'chain',
 'author_name': '何全',
 'href': 'https://gitee.com/hequan2020/chain',
 'script': 'linux 云主机 管理系统，包含 CMDB,webssh登录、命令执行、异步执行shell/python/yml等。持续更...',
 'http_url': 'https://gitee.com/hequan2020/chain.git',
 'ssh_url': 'git@gitee.com:hequan2020/chain.git',
 'hot_type': 'today-trending'
}, {
 'project_name': 'Lepus',
 'author_name': '茹憶。',
 'href': 'https://gitee.com/ruzuojun/Lepus',
 'script': '简洁、直观、强大的开源企业级数据库监控系统，MySQL/Oracle/MongoDB/Redis一站式监控，让数据库监控更简...',
 'http_url': 'https://gitee.com/ruzuojun/Lepus.git',
 'ssh_url': 'git@gitee.com:ruzuojun/Lepus.git',
 'hot_type': 'today-trending'
}, {
 'project_name': 'AutoLink',
 'author_name': '苦叶子',
 'href': 'https://gitee.com/lym51/AutoLink',
 'script': 'AutoLink是一个开源Web IDE自动化测试集成解决方案',
 'http_url': 'https://gitee.com/lym51/AutoLink.git',
 'ssh_url': 'git@gitee.com:lym51/AutoLink.git',
 'hot_type': 'week-trending'
}, {
 'project_name': 'PornHubBot',
 'author_name': 'xiyouMc',
 'href': 'https://gitee.com/xiyouMc/pornhubbot',
 'script': '全球最大成人网站PornHub爬虫 （Scrapy、MongoDB） 一天500w的数据',
 'http_url': 'https://gitee.com/xiyouMc/pornhubbot.git',
 'ssh_url': 'git@gitee.com:xiyouMc/pornhubbot.git',
 'hot_type': 'week-trending'
}, {
 'project_name': 'wph_opc',
 'author_name': '万屏汇',
 'href': 'https://gitee.com/wph_it/wph_opc',
 'script': None,
 'http_url': 'https://gitee.com/wph_it/wph_opc.git',
 'ssh_url': 'git@gitee.com:wph_it/wph_opc.git',
 'hot_type': 'week-trending'
}, {
 'project_name': 'WebsiteAccount',
 'author_name': '张聪',
 'href': 'https://gitee.com/crazy_zhangcong/WebsiteAccount',
 'script': '各种问答平台账号注册',
 'http_url': 'https://gitee.com/crazy_zhangcong/WebsiteAccount.git',
 'ssh_url': 'git@gitee.com:crazy_zhangcong/WebsiteAccount.git',
 'hot_type': 'week-trending'
}, {
 'project_name': 'information27',
 'author_name': '印妈妈',
 'href': 'https://gitee.com/itcastyinqiaoyin/information27',
 'script': None,
 'http_url': 'https://gitee.com/itcastyinqiaoyin/information27.git',
 'ssh_url': 'git@gitee.com:itcastyinqiaoyin/information27.git',
 'hot_type': 'week-trending'
}]
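
In the output above, WebsiteAccount appears twice, once under today-trending and once under week-trending, so inserting the list as-is stores two documents for the same project. If one record per project is preferred, the list can be merged by href before it is written to MongoDB. This is a sketch of one possible approach and not part of the original script; merge_by_href and the hot_types list field are names invented here.

```python
def merge_by_href(records):
    """Collapse records sharing an href into one, collecting their hot_type values."""
    merged = {}
    for record in records:
        key = record["href"]
        if key not in merged:
            # Copy every field except hot_type, then start the list of tabs.
            entry = {k: v for k, v in record.items() if k != "hot_type"}
            entry["hot_types"] = []
            merged[key] = entry
        if record["hot_type"] not in merged[key]["hot_types"]:
            merged[key]["hot_types"].append(record["hot_type"])
    # Dicts preserve insertion order, so projects stay in first-seen order.
    return list(merged.values())


# Trimmed-down records taken from the output above.
sample = [
    {"project_name": "WebsiteAccount",
     "href": "https://gitee.com/crazy_zhangcong/WebsiteAccount",
     "hot_type": "today-trending"},
    {"project_name": "AutoLink",
     "href": "https://gitee.com/lym51/AutoLink",
     "hot_type": "week-trending"},
    {"project_name": "WebsiteAccount",
     "href": "https://gitee.com/crazy_zhangcong/WebsiteAccount",
     "hot_type": "week-trending"},
]

for project in merge_by_href(sample):
    print(project["project_name"], project["hot_types"])
```

Keying on href rather than project_name avoids merging two different authors' projects that happen to share a name.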
