Determining whether HTML contains text using BS4 in Python

python web-scraping

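I'm trying to scrape Wikipedia's US COVID-19 data table, but I'm having trouble determining whether an HTML element contains text. I tried using

element.text is not None

as the if condition, but it lets through the HTML elements that contain nothing.

element.text != ''

gives the same result. Is there anything else I could check for? Here is all my code: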

def getCases(page):
    cases = []
    firstCaseChild = page.find(title='January 21, 2020')
    firstCaseChild2 = firstCaseChild.find_parent('th')
    row = 0
    column = 0
    firstRow = []
    for case in firstCaseChild2.find_next_siblings('td'):
        if column == 55:
            break
        if case.text is not None:
            firstRow.append(case.text)
            column = column+1
            print(case.text)
        else:
            firstRow.append('0')
            column = column+1
            print('0')

I wouldn't use BeautifulSoup to scrape a large table like this. pandas can read it directly:

import pandas as pd 
df = pd.read_html('https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases')

pd.read_html returns a list of DataFrames, one per table on the page. For example,

df[0]

prints the first table you see on Wikipedia. NaN marks missing data.

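You can replace the NaN values with pandas. Here we set them to 0. Because df is a list, fillna has to be called on one of the DataFrames in it:

df[0].fillna(0)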
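To try the NaN handling without hitting Wikipedia, here is a minimal sketch on a small made-up table. The HTML, the column names, and the variable names are placeholders I invented, not the real page, and it assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the Wikipedia table: the last row has an empty cell.
html = """
<table>
  <tr><th>Date</th><th>AK</th><th>AL</th></tr>
  <tr><td>Mar 1</td><td>1</td><td>2</td></tr>
  <tr><td>Mar 2</td><td></td><td>5</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(html))
first = tables[0]

# The empty cell is read as NaN; fillna(0) returns a new DataFrame with 0 in its place.
filled = first.fillna(0)
print(filled)
```

Note that fillna returns a new DataFrame rather than changing the original, so assign the result if you want to keep it.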

Another solution, without using pandas:

import requests
from bs4 import BeautifulSoup


url = 'https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tr in soup.tbody.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    tds = [int(td) if td else 0 for td in tds]  # replace empty text '' with 0
    print(('{:>5}'*len(tds)).format(*tds))

Prints:

    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    1    0    0    0    0    0    0    1    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    1    0    0    0    0    0    0    0    0    3    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    5    0    0    0    0    0    0    1    0    7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    2    0    0    0    0    0
    0    0    5    0    0    0    0    0    0    1    0    5    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0
    0    1   12    0    0    0    0    0    0    0    0   10    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    1    0    0    0    0    0    0    0    0    1    0    0    1    0    1    0    0    0    0    0    0    0
    0    0    4    0    0    0    0    0    0    0    0   11    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    1    9    0    0    0    0    0    0    0
    0    0    8    2    0    0    0    0    1    0    0   31    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    4    0    0    0    0    0    1    3    0    0    1   11    0    0    0    0    0    0    0
    0    1   11    6    1    0    0    0    1    0    1   10    0    0    1    1    0    0    1    0    0    1    0    1    0    0    0    0    3    1    1    0    0    1    0    0    3    0    0    0    0    0    5    0    0    0    2   22    2    1    0    0    0    0    0
    0    2    8    0    0    0    0    0    0    4    0   22    0    0    0    1    1    0    0    1    0    0    0    0    0    0    0    0    2    4    0    0    0    0    2    0    0    1    0    0    1    0    5    0    0    0    0   45    2    0    0    0    0    0    0
    0    0   26    0    1    0    0    0    2    7    0   34    0    3    1    2    0    0    1    0    0    2    0    0    0    0    0    0    4    4    3    0    0    0    4    2    4    1    0    1    1    0   15    2    0    2    2   17    2    0    1    0    0    0    0
    0    0   19    4    0    0    0    0    0    0    0   26    0    5    4    2    0    0    0    0    0    0    3    0    0    1    0    0    2    6    2    1    0    5    1    1    1    3    0    1    3    0   13    0    0    0    5   36    4    0    0    0    0    0    0
    0    1   24    5    0    0    0    0    1    1    1  105    0    5    8    4    0    2    1    0    0    2    1    1    5    1    0    0    9    5    2    2    0    0    2    3    4    4    0    0    0    0   51    1    0    1    4   31    2    2    0    0    0    0    0
    0    3   20   17    0    0    1    4    0    4    2   99    1    1    6    2    0    0    2    0    1    0    0    0    3    3    0    1    3    9    0   10    1    1    1    2    4    0    0    1    5    1    3    3    0    0    8   43    4    0    0    0    0    0    0
    1    0   21   15    0    0    0    2    0    5    1   91    0    2    7    0    4   10    4    1    0    5    1    1    0    2    0    5   18   11    3    6    0    7    2    9    2    8    0    3    0    3   13    3    1    1    6  112    6    0    1    0    0    0    0
    0    0   49   28    0    1    3    4    6    6    4  111    1    1   14    3    1   13    5    2    0    4    8    1    1   11    6    6   33    0    3   17    5    0    1    8   16   13    0    5    0    0   15    5    2    1   21   93   19   15    0    0    0    3    1

...and so on.
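As for why the two checks in the question never reach the else branch: .text always returns a string, so it is never None, and a cell that looks empty often still holds whitespace such as a newline, so it is not equal to '' either. A minimal sketch on a made-up row (the HTML and the cell_value helper are hypothetical, only there to illustrate):

```python
from bs4 import BeautifulSoup

# Hypothetical row: the first cell holds a number, the second only a newline.
html = "<table><tr><td>3\n</td><td>\n</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
filled, blank = soup.find_all("td")

# .text is always a str, so this is True even for the blank cell.
text_is_not_none = blank.text is not None
# The blank cell's text is "\n", which is not equal to "".
text_not_empty_string = blank.text != ""
# Stripping whitespace first leaves "", which is falsy.
has_text_after_strip = bool(blank.get_text(strip=True))


def cell_value(td):
    """Return the cell's integer value, or 0 when it holds only whitespace."""
    text = td.get_text(strip=True)
    return int(text) if text else 0


values = [cell_value(td) for td in (filled, blank)]
print(text_is_not_none, text_not_empty_string, has_text_after_strip, values)
```

In the question's loop, testing case.get_text(strip=True) instead of case.text is not None should send the whitespace-only cells to the else branch.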

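Thanks for the answer, but why wouldn't you use BeautifulSoup for large tables?

It gets very messy and often needs a lot of for loops, especially since there is so much missing data here. Finding the cells that hold actual numbers would be a nightmare. The code above is concise, and pandas is a great package for data handling.

I didn't use your answer exactly, but I think the strip option is just what I needed!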