Extracting text from embedded tweets with Python and BeautifulSoup


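I need to extract the text of each tweet embedded in a web page separately. The code below works, but I need to get rid of the leading and trailing lines such as "Skip Twitter post by ..." and "End of Twitter post by ...", as well as the date and "Report", so that only the tweet itself is left. I can't even see where these lines come from or which tag they use. Any help would be much appreciated.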

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('div', {'class': 'social-embed'})]
tweets = '\n'.join(article_soup)
print(tweets)

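If you also want the tweet's author, it is a bit trickier, because the author has no tag of its own. So I take the full text of the blockquote and cut off the tweet text at the front and the date at the end, which leaves the author, as follows: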

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bbc.co.uk/news/uk-44496876')
soup = BeautifulSoup(r.content, "html.parser")
articles_soup = [s for s in soup.find_all('blockquote', {'class': 'twitter-tweet'})]
tweets = []
for article_soup in articles_soup:
    tweet = article_soup.find('p').get_text()
    # The last <a href='...'></a> is the date, others are part of the tweet
    date = article_soup.find_all('a')[-1].get_text()
    tweet_author = article_soup.get_text()[len(tweet):-len(date)].strip()
    tweets.append((tweet_author, tweet))
print(tweets)

Note 1: If you only want part of tweet_author, you can take the first element of the tuple and tweak it to get what you need.
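As a rough illustration of the tweaking mentioned in Note 1, the sketch below splits the author string into a display name and a handle. It assumes the string looks like a dash followed by "Display Name (@handle)", which is how Twitter's embed markup usually renders it; the helper name `split_author` and the sample author are made up for this example.

```python
import re

# Assumed shape of tweet_author: an optional leading dash (em dash, en dash or hyphen),
# then "Display Name (@handle)". Real pages may differ.
AUTHOR_PATTERN = re.compile(r"^[\u2014\u2013-]?\s*(?P<name>.*?)\s*\(@(?P<handle>\w+)\)\s*$")


def split_author(tweet_author):
    """Return (display_name, handle), or (cleaned_string, None) if the shape is unexpected."""
    cleaned = tweet_author.strip()
    match = AUTHOR_PATTERN.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("name"), match.group("handle")


# Hypothetical author string, not taken from the real page.
print(split_author("\u2014 Jane Doe (@janedoe)"))
```

Returning `None` for the handle when the pattern does not match keeps the original string available instead of silently dropping it.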

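Note 2: The code in the question does not always return the tweets. The problem lies with the HTML page, because sometimes some elements are not returned. A quick workaround is to call requests.get again, but I suggest you look into the underlying issue.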
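The quick workaround from Note 2 could be sketched roughly like this. The function name `fetch_tweet_blocks` and its parameters are assumptions made for this example; the page is fetched through a callable so the retry logic can be tried without network access.

```python
import time

from bs4 import BeautifulSoup


def fetch_tweet_blocks(fetch_html, attempts=3, delay=1.0):
    """Call fetch_html() until the parsed page contains tweet blockquotes.

    fetch_html is any zero-argument callable returning the page HTML,
    for example: lambda: requests.get(url).content
    Returns a list of blockquote tags, or an empty list if every attempt fails.
    """
    for attempt in range(attempts):
        soup = BeautifulSoup(fetch_html(), "html.parser")
        blocks = soup.find_all("blockquote", {"class": "twitter-tweet"})
        if blocks:
            return blocks
        # Pause before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return []
```

This only papers over the missing elements; it does not explain why the page sometimes comes back without them.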

Once I got the tweets with the code from the original question, I found the tag. With my code I get only the tweets you expected, each tweet on its own line.
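A minimal sketch of that idea, assuming the tweet text sits in a `<p>` directly inside `blockquote.twitter-tweet`. The HTML below is a hypothetical stand-in for one embedded tweet, not the real page markup, and `tweet_texts` is a name chosen for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one embedded tweet; the real page markup may differ.
SAMPLE_HTML = """
<div class="social-embed">
  <a href="#end">Skip Twitter post by @example</a>
  <blockquote class="twitter-tweet">
    <p lang="en" dir="ltr">First line of the tweet</p>
    &mdash; Example User (@example)
    <a href="https://twitter.com/example/status/1">June 1, 2018</a>
  </blockquote>
  <p id="end">End of Twitter post by @example</p>
</div>
"""


def tweet_texts(html):
    """Return only the tweet bodies, skipping the wrapper's extra lines, author and date."""
    soup = BeautifulSoup(html, "html.parser")
    # The child combinator limits the match to <p> tags directly inside the blockquote.
    return [p.get_text(strip=True) for p in soup.select("blockquote.twitter-tweet > p")]


print("\n".join(tweet_texts(SAMPLE_HTML)))
```

Selecting the paragraph inside the blockquote, rather than the whole `social-embed` div, is what leaves out the "Skip ..." and "End of ..." lines in this sample.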

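Note that the tweets sometimes do not show up with your example. When they do show up, check the answer: it gives you only the tweets you expected. They are hard to find because of a problem with that particular page.

Yes, I did notice that. Do you know why? When the tweets are loaded, it works. But I still cannot find the tag "ltr". Can you explain where I should look?

Yes, the problem is that often not all DOM elements have been loaded. It is the same problem you ran into with the original code, and it is why you sometimes see the tweets and sometimes do not. In that case you cannot use requests.get; you need to use the Selenium package and wait for the page to load. Take a look at:

@aviss Yeah, I have used Selenium in the past, but it is deprecated now. I will have to look for an alternative. I noticed that the extracted text is missing the name of the Twitter user. It comes after the "ltr" tag but is not tagged itself. Is it possible to include it? Here is an example: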

Active shooter 888 Bestgate please help us

- Anthony Messenger @amescapgaz

@aviss Check my edited answer. It was hard to do because there is no tag, but it works.
