Python web scraping with bs4 not working with class pg-bodyCopy has-apos

Tags: python, web-scraping, beautifulsoup, request

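I am trying to scrape the following from the page requested in the code:

All the dates and the text on the left side

So far I have tried the following code. It retrieves only 17 results, and it also picks up some results from the text on the right: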

import requests
from bs4 import BeautifulSoup

r=requests.get('https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.4a34a0231c12')
html=BeautifulSoup(r.content,'html.parser')
results=html.find_all('p','pg-bodyCopy')

My question is:

How can I get one list containing all the left-side text and another list containing the corresponding dates?

Sample output:

[(Mar 3 2019,After more than two years of Presidential Harassment, the only things that have been proven is that Democrats and other broke the law. The hostile Cohen testimony, given by a liar to reduce his prison time, proved no Collusion! His just written book manuscript showed what he said was a total lie, but Fake Media won't show it. I am an innocent man being persecuted by some very bad, conflicted & corrupt people in a Witch Hunt that is illegal & should never have been allowed to start - And only because I won the Election!)]

Edit: I am also wondering whether it is possible to retrieve the source (Twitter, Facebook, etc.), which the page indicates with an image.


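Not all of the items you are looking for are directly available in the initial page. You can use Selenium to click the "Load more" button repeatedly until all the data is loaded, and then take the page source.

Code: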


The data you are looking for is here:

It is a JS array named "claims". Each entry looks like:

{
  id: "8920",
  date: "Mar 3 2019",
  location: "Twitter",
  claim: "“Presidential Harassment by 'crazed' Democrats at the highest level in the history of our Country. Likewise, the most vicious and corrupt Mainstream Media that any president has ever had to endure.”",
  analysis: 'The scrutiny of President Trump by the House of Representatives is little different than the probes launched by Republicans of Barack Obama, Democrats of George W. Bush or Republicans of Bill Clinton, just to name of few recent examples. President John Tyler was actually ousted by his party (the Whigs) while Andrew Johnson and Clinton were impeached. As for media coverage, Trump regularly appears to believe it should only be positive. He has offered little evidence the media is "corrupt."',
  pinocchios: null,
  category: "Miscellaneous",
  repeated: null,
  r_id: null,
  full_story_url: null,
  unixDate: "1551589200"
}
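
Because the array sits inside a much larger JavaScript bundle, it can be cut out by searching for the text just before it and just after it. The sketch below illustrates that slicing on a made-up miniature bundle. The marker strings `e.exports={claims:` and `lastUpdated` are the ones used in this answer; the sample string and the error message are assumptions for illustration only, and the real page may change its layout.

```python
# Made-up miniature bundle; the real downloaded file is far larger.
bundle = 'x=1;e.exports={claims:[{id:"1"},{id:"2"}],lastUpdated:"today"}'

start_str = 'e.exports={claims:'  # text immediately before the array
end_str = 'lastUpdated'           # key that follows the array

start = bundle.find(start_str)
end = bundle.rfind(end_str)
if start == -1 or end == -1:
    raise ValueError("markers not found; the page layout may have changed")

# Skip past the start marker and stop one character before the end marker,
# which drops the comma separating the array from `lastUpdated`.
claims_str = bundle[start + len(start_str):end - 1]
print(claims_str)
```

Checking for `-1` is worth the two extra lines: `str.find` and `str.rfind` return `-1` when the marker is missing, and slicing with that value would silently produce the wrong substring.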
Code (I downloaded the page content to my file system as claims.txt)

I am using demjson to turn the JSON string into a dict

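import demjson  # third-party library; decodes the JS-style literal into Python objects

# Markers that surround the claims array in the downloaded page source
start_str = 'e.exports={claims:'
end_str = 'lastUpdated'
with open('c:\\temp\\claims.txt','r',encoding="utf8") as claims_file:
    dirty_claims = claims_file.read()
    start_str_idx = dirty_claims.find(start_str)
    end_str_idx = dirty_claims.rfind(end_str)
    # Print both positions as a quick check that the markers were found
    print('{} {}'.format(start_str_idx,end_str_idx))
    # Keep the text after the start marker, up to the comma before the end marker
    claims_str = dirty_claims[start_str_idx + len(start_str):end_str_idx-1]
    claims = demjson.decode(claims_str)
    for claim in claims:
        print(claim)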
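
Once the entries are decoded, the two lists asked for in the question, and the source asked about in the edit, can be read straight off the dictionaries. This is only a sketch: the `claims` list here is a hypothetical stand-in with shortened placeholder values, and it assumes each decoded entry is a dict with the `date`, `claim` and `location` keys shown in the sample entry.

```python
# Hypothetical stand-in for the decoded `claims` list; the keys mirror the
# sample entry, the values are shortened placeholders.
claims = [
    {"date": "Mar 3 2019", "location": "Twitter", "claim": "First claim text"},
    {"date": "Mar 2 2019", "location": "Speech", "claim": "Second claim text"},
]

# One list per field, all in the same order as the entries.
dates = [c["date"] for c in claims]
texts = [c["claim"] for c in claims]
sources = [c["location"] for c in claims]

# (date, text) tuples, matching the shape of the sample output in the question.
pairs = list(zip(dates, texts))

for date, text in pairs:
    print(date, "|", text)
```

Keeping the three lists index-aligned means `sources[i]` always belongs to `pairs[i]`.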

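Comments:

Can you provide an example of the expected output?

@JackFleeting I have added an example of the expected output.

Thanks for taking the time to read my question. I am running this on Linux Mint 18.3 with Firefox 65.0 (64-bit). When I run the code after changing driver=webdriver.Chrome() to driver=webdriver.Firefox(), I get the error message: 'geckodriver' executable needs to be in PATH

@Moreno You have to add the path to geckodriver. See the instructions for Ubuntu.

@Moreno I added the path temporarily like this: export PATH=$PATH:/home/bitto/path/to/gekodriver_folder

I have tested it, and it works very well after downloading the gecko driver. One last question: is there a way to also retrieve the source (Twitter, Facebook, etc.)? I tried loc=row.find('div',class_='details expanded').text.strip() but got no results.

@Moreno Yes, that is possible. I have edited my answer to include it.

Is there a way to scrape directly from this source using Python? If so, please share it with us.

See the answer.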