Python BeautifulSoup: extracting text from different spans that share the same class


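I'm new to data science and am trying to scrape a real estate website to build a dataset of listings. The problem is that the different elements (rooms, surface area, and number of toilets) share the same li class and span class, so I get the first element (rooms) for the other two as well. I tried to implement a solution, but I get the following error:
"'str' object has no attribute 'find_next'"
Website:

Code:

I would also prefer not to use Selenium.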
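Thanks for your help.

As a quick solution, you can find the element for the rooms, then use find_next to get the surface area, and find_next again to get the number of toilets.

For example (I also use get_text(strip=True) to strip whitespace from the text):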
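A minimal sketch of the find_next chain described above. The HTML fragment is hypothetical, built from the class names (lif__item, text-bold) that appear in the scraper code in this thread, so the real page may be structured differently. It also shows one common cause of the "'str' object has no attribute 'find_next'" error: find_next has to be called on the Tag returned by find, not on the plain string returned by .text or get_text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names used in this thread;
# the structure of the real page is an assumption.
html = """
<ul>
  <li class="lif__item"><span class="text-bold">2</span> locali</li>
  <li class="lif__item"><span class="text-bold">65</span> m2</li>
  <li class="lif__item"><span class="text-bold">1</span> bagni</li>
</ul>
"""

item = BeautifulSoup(html, "html.parser")

# Keep the Tag objects. Calling .text at this point would give a str,
# and a str has no find_next method.
rooms = item.find("span", class_="text-bold")
surface = rooms.find_next("span", class_="text-bold")
toilets = surface.find_next("span", class_="text-bold")

# Convert to text only at the end, once navigation is finished.
listing = {
    "Rooms": rooms.get_text(strip=True),
    "Surface": surface.get_text(strip=True),
    "Toilets": toilets.get_text(strip=True),
}
print(listing)
```

Because find_next walks forward through the whole document, this relies on the three spans always appearing in the same order for every listing.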

When I print the variable web_content_dict, it looks like this:

{'Title': "Bilocale via Fra' Giovanni Pantaleo 3, Bovisa, Milano", 'Price': '€ 187.000', 'Rooms': '2', 'Surface': '65', 'Toilets': '1'}
{'Title': 'Trilocale via Monte Rosa 15, Amendola - Buonarroti, Milano', 'Price': '€ 730.000', 'Rooms': '3', 'Surface': '140', 'Toilets': '2'}
{'Title': 'Trilocale via San Senatore, 2, Missori, Milano', 'Price': '€ 665.000', 'Rooms': '3', 'Surface': '109', 'Toilets': '2'}
{'Title': 'Quadrilocale viale Duilio 6, Sempione, Milano', 'Price': '€ 1.150.000', 'Rooms': '4', 'Surface': '165', 'Toilets': '2'}
{'Title': "Appartamento piazza Sant'agostino, 6, Corso Genova, Milano", 'Price': '€ 1.650.000', 'Rooms': '5', 'Surface': '275', 'Toilets': '3+'}
{'Title': 'Trilocale via Val Gardena 25, Precotto, Milano', 'Price': '€ 170.000', 'Rooms': '3', 'Surface': '91', 'Toilets': '1'}
{'Title': 'Appartamento corso Di Porta Nuova, Turati, Milano', 'Price': '€ 1.130.000', 'Rooms': '5+', 'Surface': '210', 'Toilets': '3'}
{'Title': 'Trilocale via Francesco Albani 58, Monte Rosa - Lotto, Milano', 'Price': '€ 380.000', 'Rooms': '3', 'Surface': '90', 'Toilets': '1'}
{'Title': 'Bilocale via Antonio Cesari 47, Niguarda, Milano', 'Price': '€ 115.000', 'Rooms': '2', 'Surface': '46', 'Toilets': '1'}
{'Title': 'Trilocale via mazzucotelli 15, Quartiere Forlanini, Milano', 'Price': '€ 215.000', 'Rooms': '3', 'Surface': '91', 'Toilets': '2'}
{'Title': 'Bilocale via Livorno, Palestro, Milano', 'Price': '€ 520.000', 'Rooms': '2', 'Surface': '57', 'Toilets': '1'}
{'Title': 'Bilocale via Maspero 28, Molise - Cuoco, Milano', 'Price': '€ 290.000', 'Rooms': '2', 'Surface': '70', 'Toilets': '1'}
{'Title': 'Trilocale largo Gemito, 3, Casoretto, Milano', 'Price': '€ 308.000', 'Rooms': '3', 'Surface': '93', 'Toilets': '1'}
{'Title': 'Quadrilocale via Pietro Paleocapa, Cadorna - Castello, Milano', 'Price': '€ 1.300.000', 'Rooms': '4', 'Surface': '180', 'Toilets': '3'}
{'Title': 'Bilocale via Renato Fucini, Città Studi, Milano', 'Price': '€ 511.000', 'Rooms': '2', 'Surface': '85', 'Toilets': '1'}
{'Title': 'Quadrilocale via Lucca, Bisceglie, Milano', 'Price': '€ 275.000', 'Rooms': '4', 'Surface': '100', 'Toilets': '1'}
{'Title': 'Trilocale via RIZZARDI 45, Trenno, Milano', 'Price': '€ 485.000', 'Rooms': '3', 'Surface': '127', 'Toilets': '1'}
{'Title': 'Bilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 220.000', 'Rooms': '2', 'Surface': '50', 'Toilets': '1'}
{'Title': 'Quadrilocale via Cadore, Cadore, Milano', 'Price': '€ 1.060.000', 'Rooms': '4', 'Surface': '210', 'Toilets': '2'}
{'Title': 'Bilocale via  bacchiglione, Corvetto, Milano', 'Price': '€ 195.000', 'Rooms': '2', 'Surface': '42', 'Toilets': '1'}
{'Title': 'Bilocale buono stato, primo piano, Brera, Milano', 'Price': '€ 800.000', 'Rooms': '2', 'Surface': '87', 'Toilets': '2'}
{'Title': 'Trilocale via  bacchiglione, Corvetto, Milano', 'Price': '€ 540.000', 'Rooms': '3', 'Surface': '120', 'Toilets': '2'}
{'Title': 'Bilocale via bacchiglione, Corvetto, Milano', 'Price': '€ 350.000', 'Rooms': '2', 'Surface': '81', 'Toilets': '1'}
{'Title': 'Bilocale via  bacchiglione, Corvetto, Milano', 'Price': '€ 265.000', 'Rooms': '2', 'Surface': '50', 'Toilets': '1'}
{'Title': 'Appartamento via Antonio Pianella, 4, San Siro, Milano', 'Price': '€ 649.000', 'Rooms': '5+', 'Surface': '150', 'Toilets': '3'}

I tried to improve your script by using find_all and working out the tag indexes by trial and error, but you may be able to use the attributes in bs4 instead.

import requests
from bs4 import BeautifulSoup
import pandas
base_url = "https://www.immobiliare.it/vendita-case/milano/"

r = requests.get(base_url)
c = r.content
soup = BeautifulSoup(c, "html.parser")


# To extract the first and last page numbers
paging = soup.find("div",{"id":"listing-pagination"}).find("ul",{"class":"pagination pagination__number"}).find_all("a")
start_page = paging[0].text
last_page = paging[len(paging)-1].text

#Empty list to append content
web_content_list = []
for page_number in range(int(start_page),2):
    # To form the url based on page numbers
    print(page_number)
    url = base_url + "?pag=" + str(page_number)
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    #Extract info
    listing_content = soup.find_all("div",{"class":"listing-item_body--content"})
    for item in listing_content:
        #Store info to a dictionary
        web_content_dict = {}
        web_content_dict["Title"] = item.find("p",{"class":"titolo text-primary"}).find("a").get("title")
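        # Store the price text rather than the list of tags that find_all returns
        price_tag = item.find("li",{"class":"lif__item lif__princing"})
        web_content_dict["Price"] = price_tag.get_text(strip=True) if price_tag else None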
        web_content_dict["Rooms"] = item.find_all("li",{"class":"lif__item"})[1].find("span",{"class":"text-bold"}).get_text(strip=True)
        web_content_dict["Surface"] = item.find_all("li",{"class":"lif__item"})[2].find("span",{"class":"text-bold"}).get_text(strip=True)
        web_content_dict["Bath"] = item.find_all("li",{"class":"lif__item"})[3].find("span",{"class":"text-bold"}).get_text(strip=True)
        try:
            web_content_dict["Floor"] = item.find_all("li",{"class":"lif__item"})[4].find("abbr",{"class":"text-bold"}).get_text(strip=True)
        except IndexError as e:
            web_content_dict["Floor"] = 1

        #Store dictionary into a list
        web_content_list.append(web_content_dict)

#Make a dataframe with the list
df = pandas.DataFrame(web_content_list)
print(df)
#Write dataframe to a csv file
df.to_csv("Output.csv")
print("Done")
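The index-based lookups in the script above can be sketched more compactly by calling find_all once per listing and guarding against missing entries. This is only an illustration under stated assumptions: the HTML fragment is hypothetical (its class names are copied from the script, its structure is guessed), and feature_text is a helper name introduced here, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical listing fragment; class names are copied from the script above,
# but the structure of the real page is an assumption.
html = """
<div class="listing-item_body--content">
  <ul>
    <li class="lif__item lif__princing">€ 187.000</li>
    <li class="lif__item"><span class="text-bold">2</span> locali</li>
    <li class="lif__item"><span class="text-bold">65</span> m2</li>
    <li class="lif__item"><span class="text-bold">1</span> bagni</li>
  </ul>
</div>
"""


def feature_text(features, index, tag="span"):
    """Return the stripped bold text of features[index], or None if it is absent."""
    if index >= len(features):
        return None
    bold = features[index].find(tag, {"class": "text-bold"})
    return bold.get_text(strip=True) if bold else None


item = BeautifulSoup(html, "html.parser")

# One lookup per listing. Matching on "lif__item" also returns the price <li>,
# because bs4 matches any tag that carries that class among its classes.
features = item.find_all("li", {"class": "lif__item"})

row = {
    "Rooms": feature_text(features, 1),
    "Surface": feature_text(features, 2),
    "Bath": feature_text(features, 3),
    "Floor": feature_text(features, 4, tag="abbr"),  # missing in this sample
}
print(row)
```

Returning None for a missing field keeps absent data distinguishable from a real value in the resulting DataFrame; the script's default of 1 for Floor could be substituted if that is the intended behavior.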

Please don't post comments, text, hints, or errors as images.