Crawling 1024 Images with Python 3

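Python has been hugely popular over the past two years, to the point that Python articles now show up on cnblogs (博客園) every few days. Open-source tools and crawler algorithms of every kind keep appearing, and as someone who works in the internet industry I was itching to try it myself. So on a smog-filled weekend I took a look, and as an old hand I figured the best way to learn is by practice. The 1024 site is well known among programmers, so it makes a natural first target.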

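Python is often described as reading close to natural language. It is quick to develop in, the syntax is concise, and it does not rely on many tricks. The well-known sites Douban and YouTube were built with Python, so it evidently holds up at large scale. Python has its drawbacks too: in raw performance it cannot match older languages such as Java and C++. In addition, the standard CPython interpreter runs only one thread of Python bytecode at a time, so a single process, much like a single Node.js process, effectively uses one CPU core for CPU-bound work, which is another factor in performance. Still, Python does very well in certain areas, especially scripting, crawlers, and scientific computing.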

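Now, on to the real topic: how to crawl the images on the 1024 site.

Analysis

List pages

First open the 1024 navigation site, click any of the addresses, and choose the image section; alternatively, append thread0806.php?fid=16&search=&page= to the site address. This is the image section of the 1024 site, and the crawler here mainly grabs all the images in this section. Inspecting the page with the browser's developer tools shows that it is essentially a list page, in the following format: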

[Figure: markup of a list page]

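Appending 1, 2, 3 and so on to http://xxxxxx.biz/thread0806.php?fid=16&search=&page= in the address bar opens the first, second, and third list page of the image section. From these list pages you can extract the address of each individual image page, such as htm_data/16/1611/2114702.html in the figure above; prefixing it with the main site address gives the full image page URL. So the analysis comes down to this: loop over the address to reach each list page, then use each list page to find the individual image pages.

Address bar -> image list -> image page address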
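
The flow above can be tried on a small sample before touching the network. This is only a sketch: the HTML fragment is a made-up stand-in for two rows of a list page (the real markup may differ), `BASE_URL` reuses the article's placeholder host, and `extract_page_urls` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://xxxx.biz/'  # placeholder host, as in the article

# Hypothetical sample of list-page markup, for illustration only.
SAMPLE_HTML = (
    '<h3><a href="htm_data/16/1611/2114702.html" target="_blank" id="">'
    '<font color=green>title one</font></a></h3>\n'
    '<h3><a href="htm_data/16/1611/2115193.html" target="_blank" id="">'
    '<font color=green>title two</font></a></h3>\n'
    '<a href="index.php">home</a>'
)

# Capture only the relative path that starts with htm_data/ and ends in .html.
LINK_RE = re.compile(r'<a href="(htm_data/[^"]+?\.html)"')


def extract_page_urls(html, base=BASE_URL):
    """Return the absolute image-page URLs found in one list page."""
    return [urljoin(base, rel) for rel in LINK_RE.findall(html)]


print(extract_page_urls(SAMPLE_HTML))
```

`urljoin` is used here instead of plain string concatenation so that the result stays correct whether or not the base address ends with a slash.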

The code that collects the image page addresses from a list page is as follows:

import re
import urllib.request

baseUrl='http://xxxx.biz/'

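def getContant(Weburl):
    # Send browser-like headers with the request.
    Webheader = {'Upgrade-Insecure-Requests': '1',
                 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'}
    req = urllib.request.Request(url=Weburl, headers=Webheader)
    with urllib.request.urlopen(req) as response:
        content = response.read()
    # Decode the bytes instead of calling str(), which would return the
    # "b'...'" representation. The links are plain ASCII, so any bytes that
    # are not valid UTF-8 can simply be replaced.
    return content.decode('utf-8', errors='replace')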

def getUrl(URL):
    pageIndex = 1  # number of list pages to scan
    urlList = []
    # The capture group pulls out the relative address of each image page.
    comp = re.compile(r'<a href="(htm_data.{0,30}html)" target="_blank" id=""><font color=g')
    for i in range(1, pageIndex + 1):
        Weburl = URL + str(i)
        contant = getContant(Weburl)
        for url1 in comp.findall(contant):
            urlList.append(baseUrl + url1)
    # Return only after every list page has been scanned.
    return urlList
        
URL = baseUrl+'thread0806.php?fid=16&search=&page='
UrlList = getUrl(URL) 
print(UrlList)

Appending a number from 1 to N to this address gives the corresponding list page.
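
As a small illustration of that numbering, the sketch below generates the list-page addresses for a range of pages. `list_page_urls` is a hypothetical helper, and the host is the article's placeholder.

```python
BASE_URL = 'http://xxxx.biz/'  # placeholder host, as in the article
LIST_URL = BASE_URL + 'thread0806.php?fid=16&search=&page='


def list_page_urls(first_page, last_page):
    """Yield the list-page URL for every page number from first to last."""
    for page in range(first_page, last_page + 1):
        yield LIST_URL + str(page)


for url in list_page_urls(1, 3):
    print(url)
```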

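Image pages

Inspecting an image page with the browser's developer tools shows that the images are almost all external links that start with http or https and end with jpg, png, or gif. Write a regular expression that matches these addresses, hand them to the program to download, and that is all there is to it.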
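
One way to write such a regular expression is sketched below. The sample markup and the `img.example.com` host are invented for the demonstration, and `find_image_urls` is a hypothetical helper; real pages may need a looser or stricter pattern.

```python
import re

# Match absolute http/https URLs that end in .jpg, .png or .gif.
# Whitespace, quotes and angle brackets cannot be part of the URL,
# which keeps each match inside a single attribute value.
IMG_RE = re.compile(r'https?://[^\s"\'<>]+?\.(?:jpg|png|gif)', re.IGNORECASE)

# Hypothetical sample of image-page markup, for illustration only.
SAMPLE_HTML = (
    '<input src="https://img.example.com/a/001.jpg" type="image">'
    '<img src="http://img.example.com/b/002.PNG">'
    '<img src="/local/relative.gif">'
    '<input src="https://img.example.com/a/001.jpg" type="image">'
)


def find_image_urls(html):
    """Return the unique image URLs in the order they first appear."""
    return list(dict.fromkeys(IMG_RE.findall(html)))


print(find_image_urls(SAMPLE_HTML))
```

The relative link and the repeated link are both left out of the result. Note that the match is lazy, so it stops at the first image extension it meets in a URL.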

The page markup looks like this:

[Figure: markup of an image page]

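A few problems came up during downloading: some pages returned 403 Forbidden, presumably because the site has some anti-crawler measures. After searching online, I found that adding a header parameter to imitate a browser request solved it.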
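
A minimal sketch of that workaround with `urllib.request` follows. `build_request` and `fetch` are hypothetical helper names, the User-Agent value is just one example of a browser string, and whether a given site accepts it is an assumption.

```python
import urllib.error
import urllib.request

# Example browser-like User-Agent string; other browser values work as well.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36')


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch(url, timeout=10):
    """Return the response body as bytes, or None on an HTTP error status."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as res:
            return res.read()
    except urllib.error.HTTPError as err:
        # A 403 here means the server still rejected the request.
        print('HTTP error', err.code, 'for', url)
        return None


# Inspect the header without sending anything over the network.
# urllib stores header names capitalized, hence 'User-agent'.
req = build_request('http://xxxx.biz/htm_data/16/1611/2115193.html')
print(req.get_header('User-agent'))
```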

The code for downloading a single page is as follows:

import os
import re
import urllib.request

# Directory where the downloaded files are saved
targetPath = "D:\\temp\\1024\\1"

def openUrl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/51.0.2704.63 Safari/537.36'
    }

    req = urllib.request.Request(url=url, headers=headers)
    with urllib.request.urlopen(req) as res:
        data = res.read()
    downImg(data)

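def downImg(data):
    # Decode the page; image URLs are ASCII, so bad bytes can be replaced.
    html = data.decode('utf-8', errors='replace')

    # Install the browser-like opener once; urlretrieve uses it afterwards.
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
    urllib.request.install_opener(opener)

    # https?:// matches both schemes; excluding quotes and angle brackets
    # keeps each match inside one attribute value.
    for link in set(re.findall(r'https?://[^\s"\'<>]*?\.(?:jpg|png|gif)', html)):
        print(link)
        try:
            urllib.request.urlretrieve(link, saveFile(link))
        except Exception as err:
            print('Failed:', err)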

def saveFile(path):
    # Create the target directory, including missing parents, if needed.
    os.makedirs(targetPath, exist_ok=True)

    # Name the local file after the last segment of the image URL.
    pos = path.rindex('/')
    t = os.path.join(targetPath, path[pos+1:])
    return t

url = "http://xxxx.biz/htm_data/16/1611/2115193.html"
openUrl(url)

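Batch crawling

Batch crawling needs two things. First, a for loop over all the target list pages. Second, to avoid crawling a page twice, a unique folder for each page, so that a later run skips any page whose folder already exists. To recap all the crawling steps:

Loop over the address -> find the list of image pages -> parse each image page for image addresses -> create a unique folder for the image page -> download the page's images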
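
The folder-per-page step can be sketched on its own as follows. `page_folder` and `claim_page` are hypothetical helper names, and a temporary directory stands in for the real download directory.

```python
import os
import tempfile
from urllib.parse import urlparse


def page_folder(page_url, root):
    """Map .../2115193.html to <root>/2115193."""
    name = os.path.basename(urlparse(page_url).path)
    return os.path.join(root, os.path.splitext(name)[0])


def claim_page(page_url, root):
    """Create the page's folder and return it, or return None if it exists."""
    folder = page_folder(page_url, root)
    if os.path.isdir(folder):
        return None  # handled on an earlier run, so skip it
    os.makedirs(folder)
    return folder


root = tempfile.mkdtemp()  # throwaway directory for the demonstration
url = 'http://xxxx.biz/htm_data/16/1611/2115193.html'
first = claim_page(url, root)   # creates the folder
second = claim_page(url, root)  # folder exists, so the page is skipped
print(first, second)
```

One limitation of this scheme is that the folder is created before anything is downloaded, so a page whose download fails partway will also be skipped on the next run unless its folder is removed.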

The complete code is as follows:

import os
import re
import urllib.request

baseUrl='http://xxxx.biz/'
targetPath = "D:\\temp\\1024\\"

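def getContant(Weburl):
    # Send browser-like headers with the request.
    Webheader = {'Upgrade-Insecure-Requests': '1',
                 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36'}
    req = urllib.request.Request(url=Weburl, headers=Webheader)
    with urllib.request.urlopen(req) as response:
        content = response.read()
    # Decode the bytes instead of calling str(), which would return the
    # "b'...'" representation. The links are plain ASCII, so any bytes that
    # are not valid UTF-8 can simply be replaced.
    return content.decode('utf-8', errors='replace')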

def getUrl(Weburl):
    # Weburl is the full address of one list page. The caller has already
    # appended the page number, so nothing more is added here.
    contant = getContant(Weburl)
    # The capture group pulls out the relative address of each image page.
    comp = re.compile(r'<a href="(htm_data.{0,30}html)" target="_blank" id=""><font color=g')
    # Prefix each relative address with the site root.
    return [baseUrl + url1 for url1 in comp.findall(contant)]

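def openUrl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/51.0.2704.63 Safari/537.36'
    }

    # Use the page ID (the file name without ".html") as the folder name.
    pageId = os.path.splitext(url.rsplit('/', 1)[-1])[0]
    filePath = os.path.join(targetPath, pageId)
    # An existing folder means this page was handled on an earlier run.
    if not os.path.isdir(filePath):
        os.makedirs(filePath)
        req = urllib.request.Request(url=url, headers=headers)
        with urllib.request.urlopen(req) as res:
            data = res.read()
        downImg(data, filePath)
    else:
        print("Skipping already-downloaded address: " + url)
        print("filePath  " + filePath)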

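def downImg(data, filePath):
    # Decode the page; image URLs are ASCII, so bad bytes can be replaced.
    html = data.decode('utf-8', errors='replace')

    # Install the browser-like opener once; urlretrieve uses it afterwards.
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
    urllib.request.install_opener(opener)

    # https?:// matches both schemes; excluding quotes and angle brackets
    # keeps each match inside one attribute value.
    for link in set(re.findall(r'https?://[^\s"\'<>]*?\.(?:jpg|png|gif)', html)):
        print(link)
        try:
            urllib.request.urlretrieve(link, saveFile(link, filePath))
        except Exception as err:
            print('Failed:', err)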

def saveFile(path,filePath):
    # Name the local file after the last segment of the image URL.
    pos = path.rindex('/')
    t = os.path.join(filePath,path[pos+1:])
    return t


def openPage(UrlList):
    for pageUrl in UrlList:
        try:
            print('Downloading page: ' + pageUrl)
            openUrl(pageUrl)
        except Exception as err:
            print('Failed to download page ' + pageUrl + ':', err)

URL = baseUrl+'thread0806.php?fid=16&search=&page='
for num in range(1, 21):  # list pages 1 to 20
    print("#######################################")
    print("######## List page being crawled ######")
    print(URL+str(num))
    print("#######################################")
    print("#######################################")
    UrlList = getUrl(URL+str(num))
    openPage(UrlList)