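Reading Web Images with Python and BeautifulSoup and Saving Them Locally

This example uses Python together with BeautifulSoup to fetch images from the web and save them to local disk. The code targets Python 2: it relies on urllib2 and the print statement.

BeautifulSoup can take the place of regular expressions for parsing HTML text, making it easier to extract specific content such as tags and their attributes.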
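
As a small illustration of that point, the sketch below parses a hand-written HTML fragment and reads the `lazysrc` attribute of the first `<img>` inside the first `<center>` tag, which is the same lookup the later functions in this post perform. It is written for Python 3 with `bs4` installed and selects the built-in `html.parser` explicitly. The sample markup, the example.com URL, and the helper name `first_lazy_src` are assumptions made for illustration only, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; not copied from any real page.
SAMPLE_HTML = """
<html><body>
<center>
  <a href="index_2.html"><img lazysrc="http://example.com/pics/001.jpg" alt="demo"></a>
</center>
</body></html>
"""


def first_lazy_src(html):
    """Return the lazysrc of the first <img> in the first <center>, or None."""
    page = BeautifulSoup(html, "html.parser")
    center = page.find("center")
    if center is None:
        return None
    img = center.find("img")
    if img is None:
        return None
    # Tag.get returns None when the attribute is absent.
    return img.get("lazysrc")


print(first_lazy_src(SAMPLE_HTML))
```

Compared with slicing the raw string between `src=` and `.jpg`, this lookup does not depend on attribute order or on the image file extension, and it returns `None` instead of raising when the expected tags are missing.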

# -*- coding: gbk -*-

import urllib

import urllib2

from bs4 import BeautifulSoup

import time

import re

import os,sys

import chardet

def req(url):

    #url='http://www.szu.edu.cn/2014/news/index_1.html'

    header = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

    req=urllib2.Request(url,headers=header)

    # urllib2.urlopen accepts a Request object; urllib.urlopen only takes a URL string
    data=urllib2.urlopen(req).read()

    print data

    return data

def reqImg():

    #url='http://www.junmeng.com/tj/22376_4.html'

    url=r'http://www.junmeng.com/tj/22376.html'

    header = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

    patnLink=r'<a href=".*/tj/22376_\d*.html"><img src.+</a>'

    patnImg=r'<img src=.+>'

    savedir=r'C:\Users\hp\Desktop\results'

    if not os.path.exists(savedir):

        os.mkdir(savedir)

    for i in range(1,20):

        if i==1:

            tempurl=url

        else:

            tempurl='http://www.junmeng.com/tj/22376_%d.html'%i

        print tempurl

        #req=Request(tempurl,headers=header)

        data=urllib.urlopen(tempurl).read()

        #print data

        if i==19:

            patnLink=r'<a href=.*><img src=.*</a>'

        imgLinks=re.findall(patnLink,data)

        #print results

        link=imgLinks[0]

        #print link

        imgLink=link[link.find('src=')+5:link.find('.jpg')+4]

        print imgLink

        fullLink=r'http://www.junmeng.com%s'%imgLink

        lct=time.strftime('%Y%m%d%H%M%S')

        urllib.urlretrieve(fullLink,'%s\%s%d.jpg'%(savedir,lct,i))

        #return data

def reqImg2():

    url=r'http://www.ik6.com/meinv/40569/index.html'

    header = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

    savedir=r'C:\Users\hp\Desktop\results'

    if not os.path.exists(savedir):

        os.mkdir(savedir)

    for i in range(1,10):

        if i==1:

            tempurl=url

        else:

            tempurl='http://www.ik6.com/meinv/40569/index_%d.html'%i

        print tempurl

        #req=Request(tempurl,headers=header)

        data=urllib.urlopen(tempurl).read()

        page=BeautifulSoup(data)

        imgsrc=page.find_all('center')[0].find_all('img')[0].get('lazysrc')

        print imgsrc

        lct=time.strftime('%Y%m%d%H%M%S')

        urllib.urlretrieve(imgsrc,'%s\%s%d.jpg'%(savedir,lct,i))

def reqImg3():

    url=r'http://www.ik6.com/meinv/40572/index.html'

    header = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

    savedir=r'C:\Users\hp\Desktop\results'

    if not os.path.exists(savedir):

        os.mkdir(savedir)

    for i in range(1,10):

        if i==1:

            tempurl=url

        else:

            tempurl='http://www.ik6.com/meinv/40572/index_%d.html'%i

        print tempurl

        #req=Request(tempurl,headers=header)

        data=urllib.urlopen(tempurl).read()

        page=BeautifulSoup(data)

        imgsrc=page.find_all('center')[0].find_all('img')[0].get('lazysrc')

        print imgsrc

        lct=time.strftime('%Y%m%d%H%M%S')

        urllib.urlretrieve(imgsrc,'%s\%s%d.jpg'%(savedir,lct,i))

def reqImg4(url,themecount,imgcount):

    #url=r'http://www.ik6.com/meinv/40572/index.html'

    header = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

    savedir=r'C:\Users\hp\Desktop\result0128'

    if not os.path.exists(savedir):

        os.mkdir(savedir)

    newUrl=(url[:url.rfind('.htm')]+'_%d.html')

    print newUrl

    for i in range(1,imgcount+1):

        if i==1:

            tempurl=url

        else:

            tempurl=newUrl%i

        print tempurl

        try:

            data=urllib.urlopen(tempurl).read()

            if not data:

                print 'no response,exit'

                return

            page=BeautifulSoup(data)

            centers=page.find_all('center')

            if len(centers)==0:

                print 'response has no contents,exit'

                return

            else:

                imgsrc=centers[0].find_all('img')[0].get('lazysrc')

                print imgsrc

                #lct=time.strftime('%Y%m%d%H%M%S')

                #urllib.urlretrieve(imgsrc,'%s\%s%d.jpg'%(savedir,lct,i))

                urllib.urlretrieve(imgsrc,'%s\%d_%d.jpg'%(savedir,themecount,i))

        except Exception,e:

            # report the failure instead of returning silently

            print 'request failed,exit:',e

            return
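
The page-URL and file-name logic in reqImg4 can be separated from the network calls, which makes it easy to check on its own. The sketch below is a Python 3 restatement under stated assumptions: the helper names `page_url` and `image_path` are hypothetical, the `results` directory and theme number 7 are placeholder values, and `os.path.join` is swapped in for the hand-written backslash in the original format string. It performs no downloads.

```python
import os


def page_url(first_url, page):
    """Page 1 is the gallery URL itself; page N inserts _N before the extension.

    Mirrors the original slicing and assumes first_url contains '.htm'.
    """
    if page == 1:
        return first_url
    stem = first_url[:first_url.rfind('.htm')]
    return '%s_%d.html' % (stem, page)


def image_path(savedir, themecount, page):
    """Build <savedir>/<themecount>_<page>.jpg with the platform's separator."""
    return os.path.join(savedir, '%d_%d.jpg' % (themecount, page))


first = 'http://www.ik6.com/meinv/40572/index.html'
for n in range(1, 4):
    # 'results' and 7 are placeholder values for illustration.
    print(page_url(first, n), image_path('results', 7, n))
```

Keeping these two steps as pure functions means the loop body only has to fetch the page, extract the image address, and call the download function with the computed path.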

Usage:

req('http://blog.csdn.net/suwei19870312/article/details/8148427')

req('http://www.taobao.com')

reqImg()

reqImg2()

reqImg3()

for i in range(1000):

    count=11170+i

    url=r'http://www.ik6.com/meinv/%d/index.html'%count

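    # reqImg4 takes (url,themecount,imgcount); the original call omitted one argument.
    # Passing count as themecount is an assumption; 8 is kept as the image count.
    reqImg4(url,count,8)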
