Python: Accessing table data with BeautifulSoup

python web-scraping

The following code:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

def getDates(URL):
    dates = []
    # if page not found, HTTPError is thrown
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None

    bsObj = BeautifulSoup(html, "lxml")
    data = bsObj.find("table", {"class":"sortable wikitable"}).children
    for child in data:
        print(child)

produces the following sample output:

<tr>
<td><a href="/wiki/89th_Academy_Awards" title="89th Academy Awards">89th</a></td>
<td>February 26, 2017</td>
<td>2016</td>
<td><i><a href="/wiki/Moonlight_(2016_film)" title="Moonlight (2016 film)">Moonlight</a></i></td>
<td><span class="sortkey" style="display:none;">217 !</span><span class="sorttext">3 hours, 49 minutes</span></td>
<td>32.9 million</td>
<td>22.4</td>
<td rowspan="2"><a href="/wiki/Jimmy_Kimmel" title="Jimmy Kimmel">Jimmy Kimmel</a></td>
</tr>

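The only line I want to scrape is the one with the date, here February 26, 2017. There are 80-odd entries like this one. I tried asking for the sibling of the preceding td, but got a NavigableString error. Contrary to what other posts suggest, I could not catch it with except or loop around it: Spyder says NavigableString is undefined, cannot be imported, and is not a recognized error. Catching AttributeError instead just produces a blank screen. I know there is whitespace in there. I also tried finding every child with a td tag whose string matches a regular expression for dates. That did not work either. The error said I cannot pass that argument to the .find function, although the documentation in front of me does not say so.

Any thoughts on what is actually going wrong, and how I can get at this row?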

If you want to treat all the tags like a list, you can index into that list to get the second item.
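A minimal sketch of that indexing approach, assuming the date is always the second cell of its row. The sample markup is a trimmed-down, hypothetical stand-in for the Wikipedia table, the helper name `second_cells` is made up for illustration, and `html.parser` is used only so the snippet runs without lxml. Iterating with `find_all("tr")` instead of `.children` skips the whitespace strings between rows, which is the likely source of the NavigableString error described in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the table from the question.
SAMPLE_HTML = """
<table class="sortable wikitable">
<tr><th>Ceremony</th><th>Date</th><th>Year</th></tr>
<tr>
<td>88th</td>
<td>February 28, 2016</td>
<td>2015</td>
</tr>
<tr>
<td>89th</td>
<td>February 26, 2017</td>
<td>2016</td>
</tr>
</table>
"""

def second_cells(html):
    """Return the text of the second <td> in every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "sortable wikitable"})
    results = []
    for row in table.find_all("tr"):
        # Header rows contain only <th>, so this list is empty for them.
        cells = row.find_all("td")
        if len(cells) > 1:
            results.append(cells[1].get_text(strip=True))
    return results

print(second_cells(SAMPLE_HTML))
```

This only holds while the date stays in the second column; if the column order ever changes, the index has to change with it.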

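A regular expression is probably the right choice, and indexing is probably the wrong one.

The date cell could be in any column; do not assume it is the second one. Do you also generate the HTML yourself? If so, does a single variable control both generation and processing? Is there an extraction layer in between? Simple future changes, such as sorting or configurable table columns, could break your code. Consider the code below.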

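import re

re_months = '(January|February|March|April|May|June|July|August|September|October|November|December)'
re_int = '[0-9]+'
date_row_matcher = re.compile('<td>{months} {days}, {years}</td>'.format(months=re_months, days=re_int, years=re_int))

# bsObj is the BeautifulSoup object built in the question's getDates()
table = bsObj.find("table", {"class":"sortable wikitable"})
# find_all() returns only tags, so the whitespace strings that
# .children also yields (and that have no children of their own) are skipped
for row in table.find_all("tr"):
    for cell in row.find_all("td"):
        # match the serialized cell, e.g. <td>February 26, 2017</td>
        if date_row_matcher.match(str(cell)) is not None:
            print(cell)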
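One caveat with matching against `str(cell)`: the pattern only fits a bare `<td>` with no attributes or nested tags. A hedged variant is to match the cell's text instead. In the sketch below the sample row is hypothetical (including its `style` attribute), the helper name `find_date_cells` is made up for illustration, and `html.parser` is an assumption so the snippet does not need lxml.

```python
import re
from bs4 import BeautifulSoup

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Pattern for cell text such as "February 26, 2017"; used with fullmatch below.
DATE_TEXT = re.compile(r"(?:{}) [0-9]{{1,2}}, [0-9]{{4}}".format(MONTHS))

# Hypothetical sample row; the style attribute would defeat a '<td>...' pattern.
SAMPLE_HTML = """
<table class="sortable wikitable">
<tr>
<td>89th</td>
<td style="white-space:nowrap">February 26, 2017</td>
<td>2016</td>
<td>32.9 million</td>
</tr>
</table>
"""

def find_date_cells(html):
    """Return the text of every <td> whose whole text looks like a date."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "sortable wikitable"})
    return [cell.get_text(strip=True)
            for cell in table.find_all("td")
            if DATE_TEXT.fullmatch(cell.get_text(strip=True))]

print(find_date_cells(SAMPLE_HTML))
```

Because the whole cell text must match, cells like "2016" or "32.9 million" are ignored regardless of which column they sit in.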

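Thanks, everyone, for the clarifications about the need for a loop, the use of tags, and the usefulness of regular expressions. The following code produced the desired result: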

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

def getDates(URL):
    # if page not found, HTTPError is thrown
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None

    bsObj = BeautifulSoup(html, "lxml")
    data = bsObj.find("table", {"class":"sortable wikitable"})
    table_data = data.find_all("td", string=re.compile(r"^[A-Za-z]+ [0-9]+, [0-9]+"))
    print(table_data)

getDates("https://en.wikipedia.org/wiki/List_of_Academy_Awards_ceremonies")

The result set looks like this:

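[May 16, 1929, April 3, 1930, November 5, 1930, November 10, 1931, November 18, 1932, March 16, 1934, February 27, 1935, March 5, 1936, March 4, 1937, March 10, 1938, February 23, 1939, February 29, 1940, February 27, 1942, March 4, 1943, March 2, 1944, March 15, 1945, March 7, 1946, March 13, 1947, March 20, 1948, March 24, 1949, March 23, 1950, March 29, 1951, March 20, 1952, March 19, 1953, March 25, 1954, March 30, 1955, March 21, 1956, March 27, 1957, March 26, 1958, April 6, 1959, April 4, 1960, April 17, 1961, April 9, 1962, April 8, 1962, April 13, 1963, 1965, April 18, 1966, April 10, 1967, April 14, 1968, April 7, 1970, April 15, 1971, April 10, 1972, March 27, 1973, April 2, 1974, April 8, 1975, March 29, 1976, March 28, 1977, April 3, 1978, April 9, 1979, April 14, 1980, March 31, 1981, March 29, 1982, April 11, 1983, April 9, 1984, March 25, 1984, March 24, 1985, March 30, 1986, April 11, 1987, March 29, 1988, March 29, 1989, March 26, 1990, March 30, 1991, March 29, 1992, March 29, 1993, March 21, 1994, March 27, 1995, March 25, 1996, March 24, 1997, March 23, 1998, March 21, 1999, March 26, 2000, March 25, 2001, March 23, 2002, February 29, 2003, February 27, 2004, February 27, 2005, March 5, 2006, February 25, 2007, February 24, 2008, February 22, 2009, March 7, 2010, February 27, 2011, February 26, 2012, February 24, 2013, March 2, 2014, February 22, 2015, February 28, 2016, February 26, 2017]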
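Since the function is named `getDates`, a natural follow-up is to return real date values rather than print the matched tags. The following is a sketch under stated assumptions, not part of the original answer: the helper `parse_dates` is hypothetical, it takes cell strings that have already been extracted, and it assumes the English "Month day, year" form seen in the sample output.

```python
from datetime import datetime

def parse_dates(cell_texts):
    """Convert strings like 'February 26, 2017' to datetime.date objects.

    Strings that do not fit the expected format are skipped.
    Note: %B matches month names for the current locale, so this
    assumes an English (or default "C") locale.
    """
    dates = []
    for text in cell_texts:
        try:
            dates.append(datetime.strptime(text.strip(), "%B %d, %Y").date())
        except ValueError:
            # Not a date in the expected format, e.g. "32.9 million".
            continue
    return dates

print(parse_dates(["February 26, 2017", "32.9 million", "May 16, 1929"]))
```

In `getDates`, something like `parse_dates(td.get_text() for td in table_data)` could be returned in place of the `print` call.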

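If the date is always in the same position in the table, why not get an array of all the tds in the table and pick the one you want by index?

@bigbounty - Thanks for the reply! This is my first attempt at using BS4, and I can't help wondering why I can't simply loop over each child and say, "Give me the second td tag." That shouldn't be a big deal, should it? Or am I oversimplifying?

Or extract the table contents and put them in a variable (say, table). Use find_all again on that variable to select all the tds, then use an index to pick the td you want. You do not need to loop over a list to access its elements; using the index is enough.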
