Determining whether an HTML element contains text using BS4 in Python
I'm trying to scrape Wikipedia's US COVID-19 data table (), but I'm having trouble determining whether an HTML element contains text. I tried using
element.text is not None
as the if condition, but that still lets through the HTML elements that output nothing.
element.text != ''
gave the same result. Is there anything else I can check?
Here is all of my code:
def getCases(page):
    cases = []
    firstCaseChild = page.find(title='January 21, 2020')
    firstCaseChild2 = firstCaseChild.find_parent('th')
    row = 0
    column = 0
    firstRow = []
    for case in firstCaseChild2.find_next_siblings('td'):
        if column == 55:
            break
        if case.text is not None:
            firstRow.append(case.text)
            column = column+1
            print(case.text)
        else:
            firstRow.append('0')
            column = column+1
            print('0')
I wouldn't use BeautifulSoup to scrape a large table like this.
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases')
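Each table on the page ends up in a list of DataFrames, e.g. df[0].
Printing df[0] shows the first table you see on the Wikipedia page. NaN indicates missing data.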
You can set the value of NaN with pandas. Here we set it to 0. Since df is a list, fillna is called on one of its tables:
df[0].fillna(0)
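One detail worth spelling out: fillna returns a new DataFrame rather than changing the original, so the result has to be assigned to be kept. Below is a minimal sketch of that step. The small hand-built table is a hypothetical stand-in for one of the tables returned by pd.read_html (the column names and values are made up for illustration), so it runs without fetching the page.

```python
import pandas as pd

# Hypothetical stand-in for one table from pd.read_html;
# NaN marks cells that were empty in the HTML.
nan = float("nan")
table = pd.DataFrame({"AK": [nan, 1, nan], "CA": [2, nan, 5]})

# fillna returns a new DataFrame, so assign the result.
# astype(int) turns the float columns back into whole-number counts.
filled = table.fillna(0).astype(int)

print(filled)
```

The original table is left untouched, which makes it easy to compare the raw and filled versions if needed.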
Another solution, without using pandas:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tr in soup.tbody.select('tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    tds = [int(td) if td else 0 for td in tds]  # replace empty text '' with 0
    print(('{:>5}'*len(tds)).format(*tds))
Prints:
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 5 0 0 0 0 0 0 1 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0
0 0 5 0 0 0 0 0 0 1 0 5 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 12 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0
0 0 4 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 9 0 0 0 0 0 0 0
0 0 8 2 0 0 0 0 1 0 0 31 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 0 0 0 0 0 1 3 0 0 1 11 0 0 0 0 0 0 0
0 1 11 6 1 0 0 0 1 0 1 10 0 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 3 1 1 0 0 1 0 0 3 0 0 0 0 0 5 0 0 0 2 22 2 1 0 0 0 0 0
0 2 8 0 0 0 0 0 0 4 0 22 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 2 4 0 0 0 0 2 0 0 1 0 0 1 0 5 0 0 0 0 45 2 0 0 0 0 0 0
0 0 26 0 1 0 0 0 2 7 0 34 0 3 1 2 0 0 1 0 0 2 0 0 0 0 0 0 4 4 3 0 0 0 4 2 4 1 0 1 1 0 15 2 0 2 2 17 2 0 1 0 0 0 0
0 0 19 4 0 0 0 0 0 0 0 26 0 5 4 2 0 0 0 0 0 0 3 0 0 1 0 0 2 6 2 1 0 5 1 1 1 3 0 1 3 0 13 0 0 0 5 36 4 0 0 0 0 0 0
0 1 24 5 0 0 0 0 1 1 1 105 0 5 8 4 0 2 1 0 0 2 1 1 5 1 0 0 9 5 2 2 0 0 2 3 4 4 0 0 0 0 51 1 0 1 4 31 2 2 0 0 0 0 0
0 3 20 17 0 0 1 4 0 4 2 99 1 1 6 2 0 0 2 0 1 0 0 0 3 3 0 1 3 9 0 10 1 1 1 2 4 0 0 1 5 1 3 3 0 0 8 43 4 0 0 0 0 0 0
1 0 21 15 0 0 0 2 0 5 1 91 0 2 7 0 4 10 4 1 0 5 1 1 0 2 0 5 18 11 3 6 0 7 2 9 2 8 0 3 0 3 13 3 1 1 6 112 6 0 1 0 0 0 0
0 0 49 28 0 1 3 4 6 6 4 111 1 1 14 3 1 13 5 2 0 4 8 1 1 11 6 6 33 0 3 17 5 0 1 8 16 13 0 5 0 0 15 5 2 1 21 93 19 15 0 0 0 3 1
...and so on.
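As for why the two checks in the question behave the same: in BeautifulSoup, .text always returns a string, so it is never None, and a cell that looks empty may still hold whitespace such as a newline, so it is not equal to '' either. The sketch below illustrates this on a tiny inline table. The HTML snippet and the helper name cell_values are assumptions made up for this example, and whether the Wikipedia cells contain exactly this kind of whitespace is a guess, not something confirmed in the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical row: one number, one whitespace-only cell, one truly empty cell.
html = "<table><tr><td>3</td><td>\n</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# .text is always a str, so "is not None" is true for every cell,
# and the whitespace-only cell also passes a "!= ''" check.
raw = [td.text for td in soup.find_all("td")]


def cell_values(row):
    """Return the row's cells as ints, treating blank cells as 0."""
    values = []
    for td in row.find_all("td"):
        text = td.get_text(strip=True)  # whitespace-only text becomes ''
        values.append(int(text) if text else 0)
    return values


row_values = cell_values(soup.tr)
print(raw)
print(row_values)
```

Testing the stripped string for truthiness handles both the empty and the whitespace-only case with a single condition.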
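Thanks for the answer, but why would you not want to use beautifulsoup on large tables?
It gets very messy and often needs a lot of for loops, especially since there is a lot of missing data here. Finding the cells that contain actual numbers would be a nightmare. The code above is concise, and pandas is a great package for data handling.
I didn't use your answer exactly, but I think the strip argument was just what I needed!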