Python Web Scraping (10): A Targeted Stock Scraper

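def getHTMLText(url, code='utf-8'):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guessing the encoding from the content is slow, so we look it up
        # on the site beforehand and set it explicitly.
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ""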

getStockList(lst, stockUrl)

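First we get the list of stocks. Looking at the page source, we find that the codes of Shanghai and Shenzhen stocks can be matched by the following regular expression: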

r'[s][hz]\d{6}'
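
As a quick check of this pattern, the minimal sketch below runs it over a few href values and strips the two-letter exchange prefix the same way getStockList does. The URLs are made up for illustration and are not taken from the real page.

```python
import re

# Pattern from the article: "sh" or "sz" followed by six digits.
STOCK_CODE = re.compile(r'[s][hz]\d{6}')

# Hypothetical href values, invented for this example.
hrefs = [
    'http://quote.example.com/sh600000.html',
    'http://quote.example.com/sz000001.html',
    'http://quote.example.com/about.html',
]

codes = []
for href in hrefs:
    match = STOCK_CODE.search(href)
    if match:
        # Drop the leading "sh"/"sz" and keep only the numeric code.
        codes.append(match.group(0)[2:])

print(codes)
```

Links that do not contain a stock code, such as the last one, are simply skipped.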

Next we check which encoding the site uses:

[Figure: encoding type]
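
The encoding is usually declared in a meta tag near the top of the page source. As a rough sketch of reading it programmatically instead of by eye, the helper below searches the raw bytes for a charset declaration. The function name find_declared_charset and the sample markup are assumptions made for this example, not part of the original code.

```python
import re


def find_declared_charset(raw_html: bytes):
    """Return the charset named in the page's markup, or None if absent."""
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw_html, re.IGNORECASE)
    return match.group(1).decode('ascii') if match else None


# Hypothetical page header, invented for this example.
sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=gb2312"></head></html>')

print(find_declared_charset(sample))
```

The value found this way is what gets passed as the code argument of getHTMLText.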

Code:

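def getStockList(lst, stockUrl):
    html = getHTMLText(stockUrl, 'GB2312')  # the encoding declared in the page source
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            # Keep only the six digits, dropping the "sh"/"sz" prefix
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0][2:])
        except (KeyError, IndexError):
            # No href attribute, or the href holds no stock code
            continue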

getStockInfo(lst, stockUrl)

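Finally, we look up each stock in the list and extract its information. To do this we need to locate the information we want in the page source. We find that all of a stock's information sits inside a single table:

[Figure: all of the stock's information]

Looking more closely, the stock name is in a tag with class='Lfont':

[Figure: stock name]

The rest of the information is in a tag with class='Rlist':

[Figure: stock information]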

Once we know the name and attrs of the relevant tags, we use BeautifulSoup to extract the information and then store it in an Excel spreadsheet.
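
To illustrate the lookup by name and attrs, here is a minimal sketch that runs the same two find calls on a small piece of markup. The HTML is a hypothetical stand-in that only mimics the structure described above (a div with class Lfont and a td with class Rlist); the real page is larger and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
html = """
<table>
  <tr>
    <td><div class="Lfont">Example Corp</div></td>
    <td class="Rlist">
      <table>
        <tr><td>Open: 10.00</td><td>High: 10.50</td></tr>
      </table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# The stock name lives in the div with class "Lfont".
name = soup.find('div', attrs={'class': 'Lfont'}).text

# Every other field is a td nested inside the td with class "Rlist".
cells = soup.find('td', attrs={'class': 'Rlist'}).find_all('td')
info = '\n'.join(td.text for td in cells)

print(name)
print(info)
```

Note that find returns None when nothing matches, so on real pages it is worth checking the result before reading .text.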

Code:

def getStockInfo(lst, stockUrl):
    count = 0
    wb = xlwt.Workbook(encoding='utf-8')
    ws = wb.add_sheet('股票')  # sheet name: "Stocks"
    ws.write(0, 0, label='股票名稱')  # header: "Stock name"
    ws.write(0, 1, label='信息')  # header: "Information"
    for stock in lst:
        url = stockUrl+stock+'.html'  # page URL for this stock code
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('table', attrs={
                                  'style': 'width:550px;border:0;padding:0;border-collapse:collapse;'})
            if stockInfo:
                name = stockInfo.find('div', attrs={'class': 'Lfont'})
                infoDict.update({'股票名稱': name.text})
                table = stockInfo.find('td', attrs={'class': 'Rlist'})
                keyList = table.find_all('td')
                key = ''
                for td in keyList:
                    key = key+(td.text+'\n')
                infoDict['信息'] = key
                count += 1
                ws.write(count, 0, label=infoDict['股票名稱'])
                ws.write(count, 1, label=infoDict['信息'])
                # '當前進度' means "current progress"
                print('\r當前進度：{:.2f}%'.format(count*100/len(lst)),
                      end='')  # \r moves the cursor back to the start of the line
        except Exception:
            print('\r當前進度：{:.2f}%'.format(count*100/len(lst)), end='')
            traceback.print_exc()
            continue
    wb.save('股票.xls')
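
The progress display relies on the carriage return: printing '\r' with end='' moves the cursor back to the start of the line, so each update overwrites the previous one. A small standalone sketch of that idea follows; show_progress is a hypothetical helper written for this example, not a function from the scraper.

```python
import time


def show_progress(done, total):
    """Overwrite the current terminal line with a percentage."""
    line = '\rProgress: {:.2f}%'.format(done * 100 / total)
    # end='' keeps the cursor on the same line; flush forces the update out.
    print(line, end='', flush=True)
    return line


for i in range(1, 6):
    last = show_progress(i, 5)
    time.sleep(0.01)  # stand-in for the real work of fetching a page
print()  # move to a fresh line once the loop is done
```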

Full code

import requests
from bs4 import BeautifulSoup
import traceback
import re
import xlwt


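def getHTMLText(url, code='utf-8'):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guessing the encoding from the content is slow, so we look it up
        # on the site beforehand and set it explicitly.
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ""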


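def getStockList(lst, stockUrl):
    html = getHTMLText(stockUrl, 'GB2312')  # the encoding declared in the page source
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            # Keep only the six digits, dropping the "sh"/"sz" prefix
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0][2:])
        except (KeyError, IndexError):
            # No href attribute, or the href holds no stock code
            continue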


def getStockInfo(lst, stockUrl):
    count = 0
    wb = xlwt.Workbook(encoding='utf-8')  # create a workbook
    ws = wb.add_sheet('股票')  # create a worksheet named "Stocks"
    ws.write(0, 0, label='股票名稱')  # header: "Stock name"
    ws.write(0, 1, label='信息')  # header: "Information"
    for stock in lst:
        url = stockUrl+stock+'.html'  # page URL for this stock code
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('table', attrs={
                                  'style': 'width:550px;border:0;padding:0;border-collapse:collapse;'})
            if stockInfo:
                name = stockInfo.find('div', attrs={'class': 'Lfont'})
                infoDict.update({'股票名稱': name.text})
                table = stockInfo.find('td', attrs={'class': 'Rlist'})
                keyList = table.find_all('td')
                key = ''
                for td in keyList:
                    key = key+(td.text+'\n')
                infoDict['信息'] = key
                count += 1
                ws.write(count, 0, label=infoDict['股票名稱'])  # write the stock name
                ws.write(count, 1, label=infoDict['信息'])  # write the information
                # '當前進度' means "current progress"
                print('\r當前進度：{:.2f}%'.format(count*100/len(lst)),
                      end='')  # \r moves the cursor back to the start of the line
        except Exception:
            print('\r當前進度：{:.2f}%'.format(count*100/len(lst)), end='')
            traceback.print_exc()
            continue
    wb.save('股票.xls')  # save as 股票.xls ("Stocks.xls")


stockListUrl = 'http://quote.eastmoney.com/stock_list.html'
stockInfoUrl = 'http://quote.cfi.cn/quote_'
slist = []
getStockList(slist, stockListUrl)
getStockInfo(slist, stockInfoUrl)  # getStockInfo takes two arguments; the output file name is set inside it

Result:

[Figure: program output]

Article link: https://blog.csdn.net/qq_18543557/article/details/104203757
