Parsing HTML content with Python BeautifulSoup

python html web-scraping


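I need to get the date from each HTML file. I tried looking up the sibling ('p'), but it returns None.

The date is under the tags below (mostly in the third p tag), but sometimes it is in the first p tag under id="a-body".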

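From the code provided it is not clear whether this is really what happens, but I suppose you are searching relative to the root of the page. If that is the case, try: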

d_date = soup.find_all('div', {"id": "a-body"})[0].find_all("p")[0]
print(d_date.get_text(strip=True))
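The same idea, scoping the search to the container instead of the page root, can be sketched as a small helper. This is only an illustration: the function name first_nonempty_p, the container_id parameter, and the sample markup are made up here, and "html.parser" is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

def first_nonempty_p(html, container_id="a-body"):
    """Return the text of the first non-empty <p> inside the container, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        return None  # the page has no such container
    for p in container.find_all("p"):
        text = p.get_text(strip=True)
        if text:
            return text
    return None  # the container holds no <p> with text

# Hypothetical sample: one <p> outside the container, one empty <p> inside it.
sample = '<p>outside</p><div id="a-body"><p></p><p>January 6, 2009 8:00 pm ET</p></div>'
print(first_nonempty_p(sample))
```

Returning None instead of indexing with [0] avoids an IndexError on files where the container or its paragraphs are missing.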


Update:

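from bs4 import BeautifulSoup

# pages is assumed to hold the HTML text of each downloaded file.
for page in pages:
    soup = BeautifulSoup(page, 'html.parser')
    paragraphs = soup.find_all("p")
    if not paragraphs:
        continue  # no <p> tags in this file
    # Use the third <p> when it exists and has text; otherwise fall back to the first.
    if len(paragraphs) > 2 and paragraphs[2].get_text(strip=True):
        d_date = paragraphs[2]
    else:
        d_date = paragraphs[0]
    print(d_date.get_text(strip=True))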
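The loop above expects pages to already contain the HTML of each file. Since the pages were downloaded to disk, one possible way to build that sequence is sketched below. The helper name load_pages, the ".html" extension, and the UTF-8 encoding are assumptions, not something stated in the thread; the temporary directory only exists to make the demo self-contained.

```python
import tempfile
from pathlib import Path

def load_pages(directory):
    """Yield (file name, HTML text) for every .html file in directory, sorted by name."""
    for path in sorted(Path(directory).glob("*.html")):
        # errors="replace" keeps going if a file is not valid UTF-8
        yield path.name, path.read_text(encoding="utf-8", errors="replace")

# Demo with a throwaway directory standing in for the real download folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.html").write_text("<p>January 6, 2009</p>", encoding="utf-8")
    pages = list(load_pages(tmp))

print(pages)
```

Keeping the file name next to the text makes it easier to see which file produced an unexpected date.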

The problem is that you have to find the p element that contains the date; you can then use a list of month names, as follows:


from bs4 import BeautifulSoup

# Sample markup: the date can sit in any <p> inside the div with id="a-body".
div_test='<div class="sa-art article-width" id="a-body" itemprop="articleBody">\
<p class="p p1">text1</p>\
<p class="p p1">\
  participant text1 text2 text3 January  8, 2009  5:00 a.m. EST\
</p>\
<p class="p p1">text2</p>\
<p class="p p1">\
January 6, 2009  8:00 pm ET\
</p></div>'
soup = BeautifulSoup(div_test, "lxml")
month_list = ['January','February','March','April','May','June','July','August','September','October','November','December']

def first_date_p():
    # Return the text of the first <p> that mentions a month, starting at the month name.
    for p in soup.find_all('p', {"class": "p p1"}):
        text = p.get_text()
        for month in month_list:
            if month in text:
                return text[text.index(month):]
    return None  # no paragraph mentions a month

print(first_date_p())
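The month-list search returns everything from the month name to the end of the paragraph. If only the date itself is wanted, a regular expression is one alternative. The sketch below is not the answer's method: the pattern and the name extract_date are assumptions based on the two sample dates, and it deliberately allows the space between am/pm and the time zone to be missing.

```python
import re

DATE_RE = re.compile(
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
    r"\s+\d{1,2},\s+\d{4}"       # day and year, e.g. "8, 2009"
    r"\s+\d{1,2}:\d{2}"          # time, e.g. "5:00"
    r"\s*[aApP]\.?[mM]\.?"       # am/pm, with or without dots
    r"\s*[A-Z]{2,4}"             # time zone; the space before it is optional
)

def extract_date(text):
    """Return the first date-like substring of text, or None if there is none."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None

print(extract_date("participant text1 text2 text3 January  8, 2009  5:00 a.m. EST"))
print(extract_date("January 6, 2009  8:00 amET"))
print(extract_date("text1"))
```

Other date formats in the files would need additional alternatives in the pattern.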

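It would be very helpful if you could provide the website.

Please provide your Python code.

@SergeiZ I have updated the code.

@ElvirMuslic However, I have already downloaded all the web pages into individual HTML files, so I can only provide the tag contents.

It only finds the first tag (text1), even though the date is in the third tag.

It works if the date is in the third "p". If it is in the first "p", it returns empty.

When the date is in the first "p", where is the closing tag? In both examples the div element has no closing tag.

In the second case, if the p tag is not inside it, this will not work. Perhaps you just need to take the first p on the page, provided the date is always there when it is not in the third one...

The date is only sometimes in the first "p", so the condition should check .get_text().

Also check @Tiny.D's solution; it may work better for you.

My code makes sense: only if the third p tag is empty will it print other text that contains a month.

Can it print only the first match? Can it strip the surrounding text and keep only the date?

I found that in some files there is no space between am and ET, which raises an error. Is there any way to fix that?

Check the updated date_end line; if there are other formats, you can extend it with more conditions.

What does index('amET') + 4 or + 3 mean?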
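On the last question: the updated code with date_end is not shown here, so the following is only a guess at the idea behind it. str.index returns the offset where the marker starts; adding the marker's length (4 for 'amET', 3 for a three-character marker such as 'EST') gives the offset just past it, which can then be used as the end of a slice. The helper name cut_after_marker and the sample text are made up for illustration.

```python
def cut_after_marker(text, marker):
    """Return text up to and including the first occurrence of marker."""
    # index() gives the start of the marker; adding its length moves past it.
    end = text.index(marker) + len(marker)  # raises ValueError if marker is absent
    return text[:end]

text = "January 6, 2009  8:00 amET more text"
print(text.index("amET"))              # offset where "amET" starts
print(cut_after_marker(text, "amET"))  # everything up to the end of "amET"
```

Using len(marker) rather than a hard-coded 4 or 3 keeps the slice correct when further markers are added.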


Return value of first_date_p() for the sample HTML, as echoed by a Python 2 interactive session:

u'January  8, 2009  5:00 a.m. EST'