Python + BeautifulSoup: scraping New York Times article text

python web-scraping

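I'm trying to extract the content of any New York Times article and put it into a string so I can count certain words. All of the article content is found in HTML "p" tags. I can get the paragraphs one at a time (commented out in the code), but I can't iterate over the paragraphs variable because I keep getting the following error: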

 ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-ccc2f7cf5763> in <module>()
     16 
     17 for i in paragraphs:
---> 18     article = article + paragraphs[i].get_text()
     19 
     20 print(article)

TypeError: list indices must be integers, not Tag

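In "for i in paragraphs", the loop variable i is already a Tag, not an integer index, so paragraphs[i] fails. You want: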

for p in paragraphs:
    article = article + p.get_text()

Or, if you prefer to index into the list:

for i in range(len(paragraphs)):
    article = article + paragraphs[i].get_text()

Don't forget to check The New York Times' terms of service, especially if you are using their articles for more than just learning.
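To see the difference between the two loops without fetching a real page, here is a minimal sketch. The html string is a made-up stand-in for a downloaded article, and the variable names by_element and by_index are only illustrative. It assumes the bs4 package is installed and uses Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page (not real NYT markup).
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")  # list-like collection of Tag objects

# Iterating directly yields the Tag objects themselves, not positions.
by_element = ""
for p in paragraphs:
    by_element += p.get_text()

# Iterating over range(len(...)) yields integer indices that can be used
# to look each Tag up in the list.
by_index = ""
for i in range(len(paragraphs)):
    by_index += paragraphs[i].get_text()

print(by_element == by_index)
```

Both loops build the same string. The first form is usually preferred because it never needs the index at all.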

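# Assumes soup is a BeautifulSoup object already built from the article's HTML.
# Select the elements whose class attribute is exactly "story-body-text story-content".
p_tags = soup.find_all(class_="story-body-text story-content")

# Method 1: build the string with an explicit loop
article = ''
for p_tag in p_tags:
    article += p_tag.get_text()
print(article)

# Method 2: join the text of every tag in one expression
article2 = ''.join(p_tag.get_text() for p_tag in p_tags)
print(article2)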
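Since the stated goal is to count certain words in the article, one possible next step is sketched below using only the standard library. The helper name count_words, the sample text, and the target words are all hypothetical. The tokenizing regex is a simplifying assumption that treats runs of letters and apostrophes as words.

```python
import re
from collections import Counter

def count_words(text, targets):
    """Return how often each target word occurs in text, ignoring case."""
    # Lowercase the text and split it into runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Counter returns 0 for words that never appear.
    return {t: counts[t.lower()] for t in targets}

# Hypothetical article text standing in for the scraped string.
article = "The city council met today. The council voted, and the mayor spoke."
print(count_words(article, ["council", "mayor", "budget"]))
# prints {'council': 2, 'mayor': 1, 'budget': 0}
```

When the string is going to be tokenized like this, joining the paragraphs with ' '.join(...) rather than ''.join(...) may be safer, because it keeps the last word of one paragraph from running into the first word of the next.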