Scraping HTML with BeautifulSoup4 in Python and differentiating between identical tags

I'm trying to scrape HTML with bs4 in Python that contains repeated, identical tags holding the data I want. The data I want to collect sits in class="tip_date_time", class="tip_wave" and class="tip_train".

Here's what I have in Python so far:

from bs4 import BeautifulSoup

# "res" is the requests response for the forecast page
soup = BeautifulSoup(res.content, 'html.parser')
html = soup.find_all("div", {"class": "forecast_tip"})

dateCond = []
for date in html:
    for text in date.find_all("div", {"class": "tip_date_time"}):
        dateCond.append(text.getText())

waveCond = []
for wave in html:
    for text in wave.find_all("span", {"class": "tip_wave"}):
        waveCond.append(text.getText())

This creates a separate list for each scraped field, which I plan to line up by index, so dateCond[0] pairs with waveCond[0]. That works fine because each list ends up with the same number of items.

However, I've hit a problem with "tip_train", because it can vary from 1 to 3 entries depending on the date. If I use the same code for it, I can end up with a list of a different length than the others, which throws the alignment off.

So I'd like to select only the first 2 instances of "tip_train" within each block under a "tip_date_time" div. I can't simply take the first 2 scraped instances overall, because I want the first 2 for each day.
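
A minimal sketch of the per-block selection I have in mind, assuming find_all with limit=2 inside each "forecast_tip" block and padding with a placeholder so every day contributes exactly two entries (variable names here are just illustrative):

trainCond = []
for tip in html:
    # take at most the first 2 tip_train divs within this forecast block
    trains = tip.find_all("div", {"class": "tip_train"}, limit=2)
    values = [t.getText() for t in trains]
    while len(values) < 2:
        values.append("N/A")  # pad days that have fewer than 2 swell trains
    trainCond.append(values)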

The HTML looks like this:

    <div class="forecast_tip">
    <div class="tip_date_time">6am Mon 21 Sep</div>

      <div class="tip_surf">
      <span class="tip_wave">2ft ENE</span>
      <span class="tip_wind">7kt NNW</span>
    </div>
    <div class="tip_description">(Waist-Shoulder High)</div>
  
  
      <div class="tip_train">1.3m @ 7.5s ENE (64&deg;)</div>
        <div class="tip_train">0.4m @ 13.1s SSE (167&deg;)</div>
        <div class="tip_train">0.3m @ 13.8s SSW (194&deg;)</div>
  
  <div class="tip_tides">
          <div class="tip_tide">
        <span class="tip_tide_label">Low:</span>
        <span class="tip_tide_value">Sun 4:29pm (0.20m)</span>
      </div>
    
          <div class="tip_tide">
        <span class="tip_tide_label">High:</span>
        <span class="tip_tide_value">Sun 10:40pm (1.67m)</span>
      </div>
      </div>
</div>

The result is as follows:

Date             Primary Swell            Secondary Swell          Wave Height
---------------  -----------------------  -----------------------  -------------
6am Wed 23 Sep   0.6m @ 8.3s NE (54°)     0.2m @ 13s SSW (195°)    1ft NE
12pm Wed 23 Sep  0.5m @ 8.4s NE (54°)     0.2m @ 12.3s SSW (194°)  1ft NE
6pm Wed 23 Sep   0.4m @ 8.4s NE (56°)     0.2m @ 11.1s SSW (200°)  1ft NE
6am Thu 24 Sep   0.4m @ 10.1s SSW (204°)  0.2m @ 9.9s ENE (77°)    0.5ft SSW
12pm Thu 24 Sep  0.6m @ 10.1s SSW (205°)  0.3m @ 9.8s ENE (73°)    1ft SSW
6pm Thu 24 Sep   0.7m @ 9.9s SSW (203°)   0.2m @ 9.8s ENE (77°)    1ft SSW
6am Fri 25 Sep   0.6m @ 9.1s SSW (197°)   0.2m @ 12.5s SSE (165°)  1ft SSW
12pm Fri 25 Sep  0.3m @ 12.1s S (169°)    0.5m @ 8.9s SSW (192°)   0.5ft S
6pm Fri 25 Sep   0.5m @ 8.8s S (188°)     0.3m @ 11.6s S (169°)    0.5ft S

You don't actually have to use two tip_train instances. You can still scrape all of the data, substitute a placeholder for anything that's missing, and print what you get.

Here's one way to do it:

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate


url = "https://www.swellnet.com/reports/australia/new-south-wales/northern-beaches/forecast"
response = requests.get(url)
forecast = BeautifulSoup(response.content, 'html.parser').find_all("div", {"class": "forecast_tip"})


def get_data(html, attribute: str, _class: str) -> list:
    """Return the text of the first matching tag in each forecast block, or "N/A" if it's missing."""
    result = []

    for tag in html:
        item = tag.find(attribute, {"class": _class})
        if item is not None:
            result.append(item.getText())
        else:
            result.append("N/A")

    return result


date = get_data(forecast, "div", "tip_date_time")
train = get_data(forecast, "div", "tip_train")
wave = get_data(forecast, "span", "tip_wave")

forecast_data = list(zip(date, train, wave))
headers = ["Date", "Swell Train Data", "Wave Height"]

print(tabulate([*forecast_data], headers=headers))

This prints:

Date             Swell Train Data         Wave Height
---------------  -----------------------  -------------
6am Wed 23 Sep   0.6m @ 8.3s NE (54°)     1ft NE
12pm Wed 23 Sep  0.5m @ 8.4s NE (54°)     1ft NE
6pm Wed 23 Sep   0.4m @ 8.4s NE (56°)     1ft NE
6am Thu 24 Sep   0.4m @ 10.1s SSW (204°)  0.5ft SSW
12pm Thu 24 Sep  0.6m @ 10.1s SSW (205°)  1ft SSW
6pm Thu 24 Sep   0.7m @ 9.9s SSW (203°)   1ft SSW
6am Fri 25 Sep   0.6m @ 9.1s SSW (197°)   1ft SSW
12pm Fri 25 Sep  0.3m @ 12.1s S (169°)    0.5ft S
6pm Fri 25 Sep   0.5m @ 8.8s S (188°)     0.5ft S
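
If, like the desired output in the question, you also want the secondary swell train, here's a rough sketch of one way to extend this (not tested against the live page; it reuses forecast, date and wave from above and pads with "N/A" when a block has fewer than two trains):

def get_trains(html, limit: int = 2) -> list:
    # collect up to `limit` tip_train values per forecast block
    result = []
    for tag in html:
        values = [t.getText() for t in tag.find_all("div", {"class": "tip_train"}, limit=limit)]
        values += ["N/A"] * (limit - len(values))  # pad short days
        result.append(values)
    return result


primary, secondary = zip(*get_trains(forecast))
forecast_data = list(zip(date, primary, secondary, wave))
headers = ["Date", "Primary Swell", "Secondary Swell", "Wave Height"]
print(tabulate(forecast_data, headers=headers))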

Any chance you could share the URL?
Thank you very much! I ended up modifying your code, because I actually wanted the second set of tip_train data as well.
Glad I could help! If you found my answer useful, please accept it and/or upvote.