Python: parsing NBA season stats from basketball-reference.com, how to remove HTML comment tags

python regex

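I am trying to parse the Miscellaneous Stats table on basketball-reference.com. However, the table I want to parse is inside HTML comments.

Using the following code: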

html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").content
cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html))

This results in:

TypeError                                 Traceback (most recent call last)
<ipython-input-35-93508687bbc6> in <module>()
----> 1 cleaned_soup = BeautifulSoup(re.sub("<!--|-->","", html))

~/.pyenv/versions/3.7.0/lib/python3.7/re.py in sub(pattern, repl, string, count, flags)
    190     a callable, it's passed the Match object and must return
    191     a replacement string to be used."""
--> 192     return _compile(pattern, flags).sub(repl, string, count)
    193 
    194 def subn(pattern, repl, string, count=0, flags=0):

TypeError: cannot use a string pattern on a bytes-like object

I am using Python 3.7.

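Rather than using re to strip the comment markers so that the commented-out HTML becomes part of the page, use BeautifulSoup to return only the comments in the HTML. Each comment can then be parsed with BeautifulSoup as well, to extract whichever table elements you need, for example: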

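import requests
from bs4 import BeautifulSoup, Comment

# .content (bytes) is fine here: BeautifulSoup accepts bytes and detects the encoding itself.
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").content
soup = BeautifulSoup(html, "html.parser")

# Find every comment node; string= replaces the older text= argument in recent bs4 releases.
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    # Parse the comment body as its own HTML fragment.
    comment_html = BeautifulSoup(comment, "html.parser")

    for table in comment_html.find_all("table"):
        for tr in table.find_all("tr"):
            # Collect the text of every data cell in this row.
            row = [td.text for td in tr.find_all("td")]
            print(row)
        print()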

This gives you the rows from the tables, starting with:

['Finals', 'Cleveland Cavaliers\nover\nGolden State Warriors\n\xa0(4-3)\n', 'Series Stats']
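['\n\n\nGame 1\nThu, June 2\nCleveland Cavaliers\n89@Golden State Warriors\n104\n\nGame 2\nSun, June 5\nCleveland Cavaliers\n77@Golden State Warriors\n110\n\nGame 3\nWed, June 8\nGolden State Warriors\n90@Cleveland Cavaliers\n120\n\nGame 4\nFri, June 10\nGolden State Warriors\n108@Cleveland Cavaliers\n97\n\nGame 5\nMon, June 13\nCleveland Cavaliers\n112@Golden State Warriors\n97\n\nGame 6\nThu, June 16\nGolden State Warriors\n101@Cleveland Cavaliers\n115\n\nGame 7\nSun, June 19\nCleveland Cavaliers\n93@Golden State Warriors\n89\n\n\n', 'Game 1', 'Thu, June 2', 'Cleveland Cavaliers', '89', 'Golden State Warriors', '104', 'Game 2', 'Sun, June 5', 'Cleveland Cavaliers', '77', 'Golden State Warriors', '110', 'Game 3', 'Wed, June 8', 'Golden State Warriors', '90', 'Cleveland Cavaliers', '120', 'Game 4', 'Fri, June 10', 'Golden State Warriors', '108', 'Cleveland Cavaliers', '97', 'Game 5', 'Mon, June 13', 'Cleveland Cavaliers', '112', 'Golden State Warriors', '97', 'Game 6', 'Thu, June 16', 'Golden State Warriors', '101', '@Cleveland Cavaliers', '115', 'Game 7', 'Sun, June 19', 'Cleveland Cavaliers', '93', 'Golden State Warriors', '89']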
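['Game 1', 'Thu, June 2', 'Cleveland Cavaliers', '89', 'Golden State Warriors', '104']
['Game 2', 'Sun, June 5', 'Cleveland Cavaliers', '77', 'Golden State Warriors', '110']
['Game 3', 'Wed, June 8', 'Golden State Warriors', '90', 'Cleveland Cavaliers', '120']
['Game 4', 'Fri, June 10', 'Golden State Warriors', '108', 'Cleveland Cavaliers', '97']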
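To pull out one particular table rather than printing every row, the same comment-parsing idea can be narrowed with an id lookup. The sketch below runs on a small inline HTML sample instead of the live page. The id `misc_stats`, the column names, the sample values, and the helper `find_commented_table` are all illustrative assumptions, not values confirmed from basketball-reference.com.

```python
from bs4 import BeautifulSoup, Comment

# Inline stand-in for the downloaded page (assumed structure, made-up data).
SAMPLE_HTML = """
<div id="all_misc_stats">
<!--
<table id="misc_stats">
  <tr><th>Team</th><th>Pace</th></tr>
  <tr><td>Team A</td><td>95.6</td></tr>
  <tr><td>Team B</td><td>99.3</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the rows of the commented-out table with this id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Walk every comment node and parse its body as a separate HTML fragment.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is not None:
            # Include header cells (th) as well as data cells (td).
            return [
                [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")
            ]
    return None


rows = find_commented_table(SAMPLE_HTML, "misc_stats")
for row in rows:
    print(row)
```

Looking up the table by id avoids depending on the order of comments in the page, but the real id should be checked in the page source before relying on it.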

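Note: to avoid the "cannot use a string pattern on a bytes-like object" error, use .text instead of .content so that a string, not bytes, is passed to the regular expression.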
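The bytes-versus-string mismatch can be reproduced without a network request. In the minimal sketch below, `raw_bytes` is a made-up stand-in for what `.content` returns, and decoding it as UTF-8 is an assumption that only approximates what `.text` does (requests picks the encoding from the response).

```python
import re

# Stand-in for response.content (bytes); the markup is a made-up sample.
raw_bytes = b"<div><!--<table><tr><td>1</td></tr></table>--></div>"

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.sub("<!--|-->", "", raw_bytes)
except TypeError as exc:
    error_message = str(exc)

# Decoding first (roughly what response.text provides) lets the str pattern work.
cleaned_text = re.sub("<!--|-->", "", raw_bytes.decode("utf-8"))

# Alternatively, keep the bytes and use a bytes pattern and a bytes replacement.
cleaned_bytes = re.sub(b"<!--|-->", b"", raw_bytes)

print(error_message)
print(cleaned_text)
```

Either variant removes the comment markers; the pattern and the input just have to be the same type.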