Python table scraping: 4 cells missing
I'm seeing some strange behavior with BS4. I mirrored 20 pages of the site I'm going to scrape, and this code works perfectly against my private web server. When I run it against the real site, it randomly misses the 8th column of a row. I've never run into this before, and I can't find any other posts about it. The 8th column is "frequency rank". It only ever happens in that last column. What's going on, and how can I fix it?
import requests
import json
from bs4 import BeautifulSoup
base_url = 'http://hanzidb.org'
def soup_the_page(page_number):
    url = base_url + '/character-list/by-frequency?page=' + str(page_number)
    response = requests.get(url, timeout=5)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

def get_max_page(soup):
    paging = soup.find_all("p", {'class': 'rigi'})
    # Isolate the first paging link
    paging_link = paging[0].find_all('a')
    # Extract the last page number of the series
    max_page_num = int([item.get('href').split('=')[-1] for item in paging_link][-1])
    return max_page_num

def crawl_hanzidb():
    result = {}
    # Get the page scrape data
    page_content = soup_the_page(1)
    # Get the page number of the last page
    last_page = get_max_page(page_content)
    # Get the table data
    for p in range(1, last_page + 1):
        page_content = soup_the_page(p)
        for trow in page_content.find_all('tr')[1:]:
            char_dict = {}
            i = 0
            # Set the character as the dict key
            character = trow.contents[0].text
            # Initialize list on dict key
            result[character] = []
            # Walk trow.children to parse urls
            for tcell in trow.children:
                char_position = 0
                radical_position = 3
                if i == char_position or i == radical_position:
                    for content in tcell.children:
                        if type(content).__name__ == 'Tag':
                            if 'href' in content.attrs:
                                url = base_url + content.attrs.get('href')
                                if i == char_position:
                                    char_dict['char_url'] = url
                                if i == radical_position:
                                    char_dict['radical_url'] = url
                i += 1
            char_dict['radical'] = trow.contents[3].text[:1]
            char_dict['pinyin'] = trow.contents[1].text
            char_dict['definition'] = trow.contents[2].text
            char_dict['hsk_level'] = trow.contents[5].text[:1] if trow.contents[5].text[:1].isdigit() else ''
            char_dict['frequency_rank'] = trow.contents[7].text if trow.contents[7].text.isdigit() else ''
            result[character].append(char_dict)
        print('Progress: ' + str(p) + '%.')
    return result

crawl_data = crawl_hanzidb()
with open('hanzidb.json', 'w') as f:
    json.dump(crawl_data, f, indent=2, ensure_ascii=False)
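As an aside, indexing trow.contents by fixed positions is fragile: whitespace between tags becomes NavigableString children and can shift the indices. A more defensive sketch (the row_cells helper is hypothetical, not part of the code above) collects only the <td> cells of a row:

```python
from bs4 import BeautifulSoup

def row_cells(trow):
    """Return the stripped text of each direct <td> child, ignoring whitespace nodes."""
    return [td.get_text(strip=True) for td in trow.find_all('td', recursive=False)]

sample = BeautifulSoup(
    '<table><tr> <td>的</td> <td>de</td> <td>possessive</td> </tr></table>',
    'html.parser')
print(row_cells(sample.find('tr')))  # ['的', 'de', 'possessive']
```

This does not cure malformed markup by itself, but it keeps whitespace-only nodes from shifting column positions.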
The problem seems to be that the site's HTML is malformed. If you look at the source of the page you posted, there are two closing </td> tags in a row just before the frequency rank column. For example:
<tr>
<td><a href="/character/的">的</a></td>
<td>de</td><td><span class="smmr">possessive, adjectival suffix</span></td>
<td><a href="/character/白" title="Kangxi radical 106">白</a> 106.3</td>
<td>8</td><td>1</td>
<td>1155</td></td>
<td>1</td>
</tr>
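The difference can be demonstrated in isolation. The snippet below is a minimal sketch using a hard-coded row modeled on the markup above: lxml's error recovery drops the stray </td> and recovers all four cells, while html.parser may build a different tree from the same input.

```python
from bs4 import BeautifulSoup

# A row with a stray extra </td>, modeled on the malformed markup above.
broken_row = '<table><tr><td>8</td><td>1</td><td>1155</td></td><td>1</td></tr></table>'

def cell_texts(markup, parser):
    """Parse the markup with the given tree builder and return the <td> texts."""
    return [td.text for td in BeautifulSoup(markup, parser).find_all('td')]

print(cell_texts(broken_row, 'lxml'))         # ['8', '1', '1155', '1']
print(cell_texts(broken_row, 'html.parser'))  # tree may differ for the same input
```

This is why swapping the parser changes what trow.contents[7] points at: each builder applies its own error recovery to invalid HTML.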
Then change the parser in the soup_the_page() method:
soup = BeautifulSoup(response.content, 'lxml')
Then run the script. It seems to work:
print(trow.contents[7].text)
no longer raises an index-out-of-range error. I had noticed the extra </td>, but when I checked a few of the specific rows that caused the problem, the cell count inside each <tr> always looked consistent. You were right, thank you very much.