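Spider (Web Crawler) Basics

A GET request retrieves a page's HTML. A POST request submits data to a site and returns whatever the site sends back.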

import urllib.request
import urllib.parse

# Send a GET request
def start1():
    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))

# Send a POST request
def start2():
    # urllib.parse.urlencode turns the form's key-value pairs into a URL-encoded string;
    # bytes(..., encoding='utf8') converts it into the bytes that urlopen expects as POST data
    data = bytes(urllib.parse.urlencode({'dsadasdas': '杀马特'}), encoding='utf8')
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read().decode('utf-8'))

start2()
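The encoding step in start2 can be checked without touching the network. The following is a minimal sketch; the helper name `encode_form` is hypothetical (it does not appear in the original), and only `urllib.parse.urlencode` and `urllib.parse.parse_qs` from the standard library are used.

```python
import urllib.parse

def encode_form(fields):
    """Return the form fields as URL-encoded UTF-8 bytes, ready for urlopen(data=...)."""
    return urllib.parse.urlencode(fields).encode('utf-8')

body = encode_form({'dsadasdas': '杀马特'})
print(body)  # non-ASCII characters are percent-encoded as UTF-8 bytes

# Decoding the body again recovers the original key-value pairs
decoded = urllib.parse.parse_qs(body.decode('utf-8'))
print(decoded)
```

Passing `data` to `urlopen` is what turns the request into a POST; without it the same call sends a GET.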

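Handling request timeouts

import socket
import urllib.error
import urllib.request

try:
    # timeout is in seconds; a deliberately tiny value forces a timeout here
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
    print(response.read().decode('utf-8'))
except (urllib.error.URLError, socket.timeout) as e:
    # A timeout while connecting is wrapped in URLError; one while reading raises socket.timeout
    print('time out')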
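The same pattern can be wrapped in a reusable function. This is a sketch under stated assumptions: the name `fetch` and its `opener` parameter are hypothetical, added only so the error path can be exercised without a real network connection.

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the decoded response body, or None if the request fails or times out.

    `opener` is a hypothetical hook (defaulting to urlopen) that makes the
    function testable without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except (urllib.error.URLError, socket.timeout) as e:
        print('request failed:', e)
        return None
```

A caller can then write `html = fetch('http://www.baidu.com', timeout=3)` and check for `None` instead of repeating the try/except block at every call site.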

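Getting the status code and response headers

import urllib.error
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getheaders())  # list of (name, value) response headers
print(response.status)        # HTTP status code of the response

# urlopen raises HTTPError for 4xx/5xx responses, so the code is read from the exception
try:
    response = urllib.request.urlopen('http://douban.com')
    print(response.status)
except urllib.error.HTTPError as e:
    print(e.code)  # a 418 here means the site has identified the request as a crawler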
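Because `urlopen` reports error statuses through an exception, a small helper can return the code on both paths. This is a sketch only: `status_of` and its `opener` parameter are hypothetical names introduced for illustration, not part of the original code.

```python
import urllib.error
import urllib.request

def status_of(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, including 4xx/5xx codes.

    `opener` is a hypothetical hook so the function can be tried without a network.
    """
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # HTTPError carries the status code of the rejected request
        return e.code
```

With this, `status_of('http://douban.com')` would return the rejection code as an integer rather than raising an exception.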

Sending headers to mimic a browser and get past Douban's check

import urllib.request

url1 = 'http://www.douban.com'

# A User-Agent copied from a real browser makes the request look like normal browsing
sendheader = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.67"
}

req = urllib.request.Request(url=url1, headers=sendheader)
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
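A Request object can be inspected before it is sent, which is a quick way to confirm the header was attached. The sketch below assumes a shortened, purely illustrative User-Agent value and a hypothetical helper name `build_request`; nothing is sent over the network.

```python
import urllib.request

# Shortened, illustrative value; a real browser string is longer
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

def build_request(url, user_agent=USER_AGENT):
    """Create a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url=url, headers={'User-Agent': user_agent})

req = build_request('http://www.douban.com')

# Request stores header names capitalized, so the key reads 'User-agent'
print(req.header_items())
print(req.get_method())  # GET, because no data was attached
```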

Using BeautifulSoup4

bs4 converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

-Tag
-NavigableString
-BeautifulSoup
-Comment
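The four kinds of object can be seen on a small inline document. This is a minimal sketch that assumes the bs4 package is installed; the HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up HTML for illustration
html = '<html><head><title>Demo</title></head><body><p><!--a note--></p></body></html>'
bs = BeautifulSoup(html, 'html.parser')

print(type(bs).__name__)               # the document as a whole
print(type(bs.title).__name__)         # a tag
print(type(bs.title.string).__name__)  # the text inside a tag
print(type(bs.p.string).__name__)      # a comment, which is a special kind of string
print(bs.p.string)                     # comment text, printed without the <!-- --> markers
```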

The code that follows covers traversing the document, searching the document, and CSS selectors.

import re
from bs4 import BeautifulSoup

# Read the saved page as bytes and decode it as UTF-8
with open('baidu.html', 'rb') as file:
    html = file.read().decode('utf8')

bs = BeautifulSoup(html, 'html.parser')

# Tag: a tag together with its contents
print(bs.title)         # the <title> tag
print(bs.title.string)  # the string inside <title>
print(bs.a.attrs)       # attributes of the first <a> tag, as a dict

# Comment: a comment node; printing it shows the text without the comment markers

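# ---------- Traversing the document ----------
print('contents lists the child nodes of a tag:', bs.head.contents)
print('a single child can be picked by index:', bs.head.contents[0])

# ---------- Searching the document ----------
# String filter: matches tags whose name is exactly 'a'
t_list = bs.find_all('a')
print('all <a> tags:', t_list)

# Regular-expression filter: matches tags whose name contains 'a'
t_list1 = bs.find_all(re.compile('a'))
print('tags whose name contains a:', t_list1)

# Function filter: the function receives a tag and returns True for a match
def name_exist(tag):
    return tag.has_attr('name')

t_list = bs.find_all(name_exist)
print('tags that have a name attribute:', t_list)

# Keyword filters
t_list = bs.find_all(id='head')
print('tags whose id is head:', t_list)

# string= is the current name of the older text= argument
t_list = bs.find_all(string='百度首页')
print('strings equal to 百度首页:', t_list)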

# ---------- CSS selectors ----------
t_list = bs.select('title')  # select by tag name
print(t_list)

t_list = bs.select('.bdsug')  # select by CSS class
print(t_list)

t_list = bs.select('#u1')  # select by CSS id
print(t_list)

t_list = bs.select('head > title')  # <title> tags directly under <head>
print(t_list)

t_list = bs.select('.sda ~ .mm')  # .mm siblings that follow .sda at the same level
print(t_list)

# select takes a CSS selector string, not a class_ keyword; div.item means a div with class "item"
t_list = bs.select('div.item')
print(t_list)
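The snippets above depend on a saved baidu.html. The following self-contained sketch runs the same kinds of search on a small inline document instead; the HTML, ids, and class names are made up for illustration, and bs4 is assumed to be installed.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = '''
<div id="u1">
  <a class="nav" href="/news" name="tj_news">News</a>
  <a class="nav" href="/map">Map</a>
  <div class="item">First</div>
  <div class="item">Second</div>
</div>
'''
bs = BeautifulSoup(html, 'html.parser')

links = bs.find_all('a')                               # string filter: exact tag name
named = bs.find_all(lambda tag: tag.has_attr('name'))  # function filter
by_regex = bs.find_all(href=re.compile('^/n'))         # regular expression on an attribute
items = bs.select('div.item')                          # CSS: div with class "item"
children = bs.select('#u1 > a')                        # CSS: <a> directly under #u1

print([a.get_text() for a in links])
print([t['name'] for t in named])
print([a['href'] for a in by_regex])
print([d.get_text() for d in items])
print(len(children))
```

`find_all` and `select` both return lists, so the results can be iterated or indexed in the same way.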

Saving data to an .xls file

import xlwt

workbook = xlwt.Workbook(encoding='utf8', style_compression=0)  # create the workbook object
# Create a worksheet; cell_overwrite_ok=True allows a later write to overwrite an earlier one in the same cell
worksheet = workbook.add_sheet('sheet1', cell_overwrite_ok=True)
worksheet.write(0, 0, 'hello')  # write data: row index, column index, content

col = ('Link', 'Image', 'Keyword')
for i in range(0, 3):
    worksheet.write(1, i, col[i])

workbook.save('student.xls')
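Crawled results usually arrive as a list of records, so the write calls are typically looped over rows and columns. This is a sketch under stated assumptions: the helper name `save_rows`, the sample row, and the output file name are all hypothetical, and the third-party xlwt package is assumed to be installed.

```python
import os
import tempfile
import xlwt

def save_rows(path, header, rows):
    """Write a header row followed by data rows into a single-sheet .xls file."""
    workbook = xlwt.Workbook(encoding='utf8')
    worksheet = workbook.add_sheet('sheet1', cell_overwrite_ok=True)
    for col, title in enumerate(header):
        worksheet.write(0, col, title)
    for row_index, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            worksheet.write(row_index, col, value)
    workbook.save(path)

# Hypothetical sample data, written to the system temp directory
path = os.path.join(tempfile.gettempdir(), 'crawl_demo.xls')
rows = [('http://example.com/1', 'http://example.com/1.jpg', 'demo')]
save_rows(path, ('Link', 'Image', 'Keyword'), rows)
print(os.path.exists(path))
```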

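Using an SQLite database

import sqlite3

conn = sqlite3.connect('test.db')
c = conn.cursor()

sql = '''
create table company
    (id int primary key not null,
    name text not null,
    age int not null,
    address char(50),
    salary real)
'''
# "create table company" creates a table named company.
# The parentheses list its columns: id is an integer and the primary key, and "not null" forbids empty values.
# address char(50) is declared as a 50-character string; salary is a real (floating-point) number.

sql1 = '''
insert into company (id,name,age,address,salary)
values(1,'十大',30,'sjsad',15000)
'''

# Running this script a second time fails, because the table and the row with id 1 already exist
c.execute(sql)   # run the SQL statement: create the table
c.execute(sql1)  # run the SQL statement: insert a row

# ------- query starts -------
sql3 = 'select id,name,address,salary from company'
cursor = c.execute(sql3)
for row in cursor:
    print('id', row[0])
    print('name', row[1])
    print('address', row[2])
    print('salary', row[3])
    print('\n')
# ------- query ends -------

conn.commit()  # commit the changes
conn.close()   # close the database connection
print('Table created successfully')

# Data types: text, int (integer), varchar (string), numeric (numbers with a decimal part)
# autoincrement: the value increases automatically for each new row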
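When the inserted values come from crawled data rather than literals, binding them through placeholders is safer than building the SQL string by hand. The sketch below is illustrative only: it uses an in-memory database, `create table if not exists`, an autoincrement id, and made-up rows (Alice, Bob), none of which appear in the original.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, discarded when closed
conn.execute('''
    create table if not exists company (
        id integer primary key autoincrement,
        name text not null,
        age int not null,
        address char(50),
        salary real
    )
''')

# Made-up rows for illustration; id is filled in by autoincrement
rows = [('Alice', 30, 'Beijing', 15000.0), ('Bob', 25, 'Shanghai', 12000.0)]

# "?" placeholders let sqlite3 bind the values instead of formatting them into the SQL text
conn.executemany(
    'insert into company (name, age, address, salary) values (?, ?, ?, ?)', rows)
conn.commit()

result = conn.execute(
    'select id, name, salary from company where age > ? order by id', (26,)).fetchall()
print(result)
conn.close()
```

Using `if not exists` also lets the script run repeatedly against a file-based database without failing on the create statement.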
