Python + BeautifulSoup: scraping New York Times web articles
I am trying to extract the content of any New York Times article and put it into a string so that I can count certain words. All of the article content is found in HTML "p" tags. I can get the paragraphs one at a time (commented out in the code), but I cannot iterate over the variable paragraphs because I keep getting the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-ccc2f7cf5763> in <module>()
     16
     17 for i in paragraphs:
---> 18     article = article + paragraphs[i].get_text()
     19
     20 print(article)
TypeError: list indices must be integers, not Tag
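Iterating over paragraphs already gives you each Tag object rather than an integer index, which is why paragraphs[i] fails. You want:
for p in paragraphs:
    article = article + p.get_text()
Or:
for i in range(len(paragraphs)):
    article = article + paragraphs[i].get_text()
Don't forget to check the New York Times' terms of service, especially if you are using their articles for more than just learning.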
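To see the difference between the failing loop and the working one without fetching a live page, here is a minimal sketch run against a small inline HTML snippet. The markup and the variable names html and error_message are made up for illustration, and it assumes bs4 is installed and uses Python's built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page (not real NYT markup).
html = "<html><body><p>First paragraph. </p><p>Second paragraph.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")  # list-like collection of Tag objects

# Reproduce the bug: the loop variable is a Tag, not an integer index.
try:
    for i in paragraphs:
        paragraphs[i].get_text()
except TypeError as exc:
    error_message = str(exc)

# Fix: use each Tag directly instead of indexing with it.
article = ""
for p in paragraphs:
    article = article + p.get_text()

print(error_message)
print(article)
```

The first print shows the same kind of TypeError as in the question; the second shows the concatenated paragraph text.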
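# soup is the BeautifulSoup object built from the article page
p_tags = soup.find_all(class_="story-body-text story-content")
# method 1: build the string in a loop
article = ''
for p_tag in p_tags:
    p_text = p_tag.get_text()
    article += p_text
print(article)
# method 2: join the paragraph texts in a single expression
article2 = ''.join(p_tag.get_text() for p_tag in p_tags)
print(article2)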
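Since the goal in the question is to count certain words once the text is in one string, a possible follow-up step is sketched below using only the standard library. The sample string, the list of words, and the count_words helper are hypothetical; in practice article would be the string built by either method above.

```python
import re
from collections import Counter


def count_words(text, words_of_interest):
    """Return how often each word of interest occurs in text, ignoring case."""
    # Split into lowercase word tokens; apostrophes are kept inside words.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return {word: counts[word.lower()] for word in words_of_interest}


# Hypothetical sample text standing in for the scraped article string.
article = "The city voted. The mayor said the city would rebuild."
result = count_words(article, ["the", "city", "mayor", "budget"])
print(result)
```

The simple regular expression only handles unaccented Latin letters, so it may need adjusting for real article text.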