Python 2.7: How do I scrape different pages with BeautifulSoup when the source format is inconsistent?


python-2.7 web-scraping


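I want to extract data from several pages whose structure differs: not completely different, but not identical either. My code only processes the pages where len(d) == 7 and skips all the others, where d = soup.findAll('span', class_='property__base-info__value'). How can I get all of the pages? Is it possible to introduce the variables that do not exist on a page and give them NA values? Here is my code: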

A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
H=[]
I=[]
J=[]
K=[]
L=[]

url = ['https://www.booli.se/annons/2272818','https://www.booli.se/annons/2082826']

import requests
from bs4 import BeautifulSoup

for page in url: 
    request = requests.get(page)
    soup = BeautifulSoup(request.text,'lxml')
    #
    d = soup.findAll('span', class_='property__base-info__value')
    if len(d)==7:
        #
        region=soup.findAll('span', itemprop='name')[1].text.strip().encode('utf-8')  ###region###
        #
        a = soup.findAll('span', class_='property__base-info__title__size')
        ar = a[0].text.strip().encode('utf-8').split()
        room=ar[0]  ######Rooms#####
        area=ar[2]  #####Area#####
        #
        temp=[]
        d = soup.findAll('span', class_='property__base-info__value')
        for i in d:
            i = i.text.strip()
            temp.append(i)
        #
        full_date=temp[0].encode('utf-8')
        import datetime as dt
        date=dt.datetime.strptime(full_date, '%d %b %Y').strftime('%Y-%m-%d')    
        #
        tempo = temp[1].split('\n')[0].encode('utf-8')
        Utropspris=tempo.replace('kr','')
        import re
        estimate=re.sub(r'(\d)\s+(\d)', r'\1\2', Utropspris) 
        #
        avgift=temp[2].encode('utf-8').replace('kr/m\xc3\xa5n','')
        fee=re.sub('(?<=\d) (?=\d)', '',avgift) ####avgift####
        #
        lag=temp[3].encode('utf-8')
        # Translate the common case; otherwise keep the Swedish type name
        apt='apartment' if lag=='L\xc3\xa4genhet' else lag   ######Property type##########
        #
        cost=temp[4].encode('utf-8').replace('kr/m\xc3\xa5n','') 
        #
        floor=temp[5].encode('utf-8').replace('tr','')  
        #
        year=temp[6].encode('utf-8')   ###Year built####
        #
        test=(soup.find('span', class_='property__base-info__sub-value')
              .text.strip().encode('utf-8').replace('kr/m\xc2\xb2',''))
        krm2=re.sub('(?<=\d) (?=\d)', '',test)   
        #
        main=(soup.find('span', class_='property__base-info__title__price')
              .text.strip().split('\n')[0].encode('utf-8').replace('kr',''))
        price=re.sub('(?<=\d) (?=\d)', '',main)  ####sold price####
        #
        A.append(region)
        B.append(room)
        C.append(area)
        D.append(date)
        E.append(estimate)
        F.append(fee)
        G.append(apt)
        H.append(cost)
        I.append(floor)
        J.append(year)
        K.append(krm2)
        L.append(price)

Another house:

room, area, fee (avgift), year built (Byggår), estimated price (Utropspris)

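If I understand correctly, you want to collect data that you are not sure is present on the page. Computers are the dumbest things in the universe.

You said your code only runs on pages where len(d) == 7. Can you add another condition?

Is it possible to introduce variables that do not exist on the page and then give them NA values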

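Yes. You can use a simple if variable == None: to check whether an element (field) is missing, and if variable: to check whether it exists (this evaluates to true if there is any data).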
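A minimal sketch of the two checks described above. It relies on the fact that soup.find returns None when nothing matches, while soup.findAll returns an empty list, so each result needs a slightly different guard. The helper names text_or_na and item_or_na and the 'NA' placeholder are illustrative assumptions, not part of the original code.

```python
def text_or_na(element, default='NA'):
    """Return the stripped text of a find() result, or default if it is None."""
    if element is None:
        return default
    return element.text.strip()


def item_or_na(items, index, default='NA'):
    """Return items[index], or default when the list is too short."""
    if 0 <= index < len(items):
        return items[index]
    return default


# Intended use inside the scraping loop (soup comes from BeautifulSoup):
#   krm2 = text_or_na(soup.find('span', class_='property__base-info__sub-value'))
#   temp = [s.text.strip() for s in
#           soup.findAll('span', class_='property__base-info__value')]
#   year = item_or_na(temp, 6)

print(text_or_na(None))                  # NA
print(item_or_na(['12 jan 2017'], 6))    # NA
```

Note that positions shift when a field is missing from a page, so an index-based lookup only prevents crashes; matching each value to its label is more robust.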
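Hope this answers your question. You need to go into more detail for me to answer properly; edit your question and I will answer.

variable == None does not work when the element does not exist. I think it should be variable == [].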
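One way to keep every page and fill the gaps with NA, sketched under the assumption that each value on the page can be scraped together with its label. The label texts in FIELDS (taken from the field names mentioned in the question), the function name build_record, and the sample values are all illustrative; the real labels have to be read from the page's HTML.

```python
# -*- coding: utf-8 -*-
# Assumed label texts; check the actual page markup before relying on them.
FIELDS = ['Utropspris', 'Avgift', 'Byggår']


def build_record(labels, values, fields=FIELDS, default='NA'):
    """Pair scraped labels with values and fill missing fields with default."""
    found = dict(zip(labels, values))
    return [found.get(name, default) for name in fields]


rows = []
# Made-up sample data: a page that has all three fields
rows.append(build_record(['Utropspris', 'Avgift', 'Byggår'],
                         ['2 995 000 kr', '3 200 kr/mån', '1965']))
# Made-up sample data: a page where 'Avgift' is missing
rows.append(build_record(['Utropspris', 'Byggår'],
                         ['1 500 000 kr', '1972']))

print(rows[1])  # ['1 500 000 kr', 'NA', '1972']
```

Because every row has the same length and column order, the len(d) == 7 test is no longer needed, and one list of rows can replace the twelve parallel lists.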