Scraping Images as a Python Beginner

I've recently been learning Python. I haven't even finished the basic syntax yet, but web scraping is what drew me in, so I just grabbed requests and BeautifulSoup and went for it. The target site was picked at random from a web search; here's the code.

Scraping phone wallpapers

#http://aladd.net/archives/category/shoujibizhi/page/2  phone wallpapers
#http://aladd.net/archives/category/picture/page/2  desktop backgrounds
import requests
from bs4 import BeautifulSoup
headers = {
'Host': 'aladd.net',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3226.400 QQBrowser/9.6.11681.400',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'http://aladd.net/archives/category/cat_ico19',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Cookie': 'Hm_lvt_295fb6522198c063fd8b98be9bd5e31f=1501646526; Hm_lpvt_295fb6522198c063fd8b98be9bd5e31f=1501646537'
}
# walk the paginated category pages (the cap is just an arbitrarily large number)
for page in range(1, 1000):
    resp = requests.get('http://aladd.net/archives/category/shoujibizhi/page/%d' % page, headers=headers)
    # check whether this page is reachable
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.text, 'html.parser')
        for item in soup.find_all('div', class_='thumbnail_bz'):
            # pull the img src and alt out of each matched block
            im = item.find('img').get('src')
            name = item.find('img').get('alt')
            # strip the thumbnail size suffix to get the full-size image URL
            im = im[:-7]
            print(im)
            # download the image
            ir = requests.get(im, stream=True)
            if ir.status_code == 200:
                with open("pictures/%s.jpg" % name, 'wb') as f:
                    # iterating over the response yields the body in small chunks
                    for chunk in ir:
                        f.write(chunk)
    else:
        print('Finished crawling')
        break
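
By the way, the 1–999 range is only there as an arbitrary upper bound; the loop really stops at the first page that doesn't return 200. If you'd rather not hard-code a limit at all, itertools.count gives an open-ended page counter. A minimal sketch of that idea, assuming the same headers dict and the same stop-on-non-200 logic as above:

import itertools
import requests

for page in itertools.count(1):
    resp = requests.get('http://aladd.net/archives/category/shoujibizhi/page/%d' % page, headers=headers)
    if resp.status_code != 200:
        # the first unreachable page means we have walked past the last one
        print('Finished crawling at page %d' % page)
        break
    # ... parse resp.text with BeautifulSoup and download images as above ...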

One script scrapes phone wallpapers and the other desktop wallpapers; I'm still very much a beginner~

Scraping desktop wallpapers

#http://aladd.net/archives/category/picture/page/2
import requests
from bs4 import BeautifulSoup
# shared request headers
headers = {
'Host': 'aladd.net',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3226.400 QQBrowser/9.6.11681.400',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'http://aladd.net/archives/category/picture',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Cookie': 'Hm_lvt_295fb6522198c063fd8b98be9bd5e31f=1501646526,1501658038; Hm_lpvt_295fb6522198c063fd8b98be9bd5e31f=1501658774'
}
# assume an upper bound on the number of pages
for page in range(1, 1000):
    resp = requests.get('http://aladd.net/archives/category/picture/page/%d' % page, headers=headers)
    # check whether this page is reachable
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.text, 'html.parser')
        for x in soup.find_all('div', class_='thumbnail'):
            img = x.find('img').get('src')
            # strip the thumbnail size suffix to get the full-size image URL
            img = img[:-7]
            print(img)
            name = x.find('img').get('alt')
            # when the image name contains a path, convert it: /a/a --> -a-a
            name = name.replace('/', '-')
            # download the image
            ir = requests.get(img, stream=True)
            if ir.status_code == 200:
                with open("pictures/bz/%s.jpg" % name, 'wb') as f:
                    for chunk in ir:
                        f.write(chunk)
    else:
        print('Request failed or crawling finished')
        break

I ran into one problem here: some image names look like /img/name/1.jpg, but the save path sits under pictures/, which has no such sub-directories, so writing /img/name/1.jpg raises a "directory does not exist" error. My fix was name.replace('/','-').

A simple string replacement that turns every "/" into "-" solves the problem neatly.
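
An alternative to flattening the name is to create the missing sub-directories before writing, so a name like img/name/1 keeps its folder structure under pictures/. A minimal sketch with os.makedirs (the name value and the image bytes here are just placeholders):

import os

name = 'img/name/1'  # placeholder alt text that contains a path
path = os.path.join('pictures', '%s.jpg' % name)
# create any intermediate directories that do not exist yet
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'wb') as f:
    f.write(b'...')  # the downloaded image bytes would go here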

The files are attached below; if you're also learning Python, feel free to get in touch. I'm currently going through ICP filing for another site, so the message board and comment sections are closed.

1.py

2.py