Python Beautiful Soup: \u003b appears and messes everything up?

python html web-scraping

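I have been building a web scraper for top news websites. BeautifulSoup in Python is a great tool: it lets me get the full article with very simple code. But when I run the following: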

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'

# session that retries failed connections up to 3 times with backoff
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = session.get(article_url, headers=request_header).text

soup = BeautifulSoup(source, 'lxml')

# get all <p> paragraphs from article
paragraphs = soup.find_all('p')

# print each paragraph as a line
for paragraph in paragraphs:
    print(paragraph)

Clearly, this is not right.

Does anyone know why this happens, or what I can do to fix it? I am very confused.

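At a minimum, I had to use a regex to extract a JavaScript object containing the data, parse it with json into a Python object, take the value that holds the page HTML as you see it in the browser, and then extract the paragraphs from that. I removed the retry stuff; you can easily re-insert it.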

import json
import re

import requests
#from requests.adapters import HTTPAdapter
#from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = requests.get(article_url, headers=request_header).text

# pull the JSON assigned to window['titanium-state'] out of the page source
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
# the article entry sits under the first key of content
content = content[list(content.keys())[0]]
# storyHTML holds the article body as HTML; name the parser explicitly
soup = BeautifulSoup(content['storyHTML'], 'lxml')

for p in soup.select('p'):
    print(p.text.strip())
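If you want the retries back, a minimal sketch could look like the following. The helper name `make_session` and its parameter names are my own, not part of requests; the retry settings are simply the ones from the question.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(connect_retries=3, backoff_factor=0.5):
    """Hypothetical helper: a Session that retries failed connections."""
    retry = Retry(connect=connect_retries, backoff_factor=backoff_factor)
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # use the retrying adapter for both plain and TLS URLs
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
```

You would then replace `requests.get(...)` in the code above with `make_session().get(article_url, headers=request_header)`.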

Regex:
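As a rough illustration of what the pattern does, here is a small sketch run on a made-up page source. The sample string, the key `abc`, and the `other-state` line are hypothetical and much simpler than the real page. I am also assuming, without having confirmed it for this site, that the embedded JSON escapes characters like `<` as `\u003c`; `json.loads` turns such escapes back into the real characters.

```python
import json
import re

# Hypothetical stand-in for the page source (the real payload is far larger).
sample_source = "\n".join([
    "<script>",
    r"""window['titanium-state'] = {"content": {"data": {"abc": {"storyHTML": "\u003cp\u003eHi\u003c/p\u003e"}}}}""",
    "window['other-state'] = {}",
    "</script>",
])

# \[ and \] match literal brackets; (.*) captures the rest of that line,
# because "." does not match a newline by default.
# re.M only changes how ^ and $ behave, so it has no effect on this pattern.
pattern = r"window\['titanium-state'\] = (.*)"
state_text = re.search(pattern, sample_source, re.M).group(1)

# json.loads parses the captured text and decodes \uXXXX escapes.
data = json.loads(state_text)
story_html = data["content"]["data"]["abc"]["storyHTML"]
print(story_html)
```

The captured group stops at the end of the line, so the second `window[...]` assignment is not included in what gets passed to `json.loads`.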

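This absolutely works! Thank you so much. But do you know why this problem occurs? Is it something you have run into before? Also, I don't quite understand what the argument you passed to the regex (r"window\['titanium-state'\] = (.*)") means.

I have added a regex explanation. I couldn't reproduce your case where looping over the p tags returns unicode. For me, the content is loaded from a script tag that holds a JavaScript object containing the data. There is then a js file that gives directions on how to handle that object.

OK, I think I get it. So the code retrieves whatever window['titanium-state'] is set to in the source. Nice! I assume you chose it so that the program extracts only the relevant body text of the article. From a web-scraping point of view, how do you decide where to point the regex search? Yours worked really well and got exactly the text I wanted, with nothing extra.

I first searched the source for a word contained in your target text. When I found the script tag, I moved it into VS Code. I examined the structure, then wrote a regex to extract what I wanted. The actual final narrowing down is done by subsetting the dictionary. It is usually easier to find the JavaScript object to match, extract it with a regex, pass it to json to deserialize, and then access w…

Thanks for your help. Is this what you normally use for scraping web pages, or only when you run into a problem? What would you do for a site like The Hill (view source): it has no obvious JavaScript object with the relevant text, and no key like "storyHTML" that refers to the article body?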
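The first step of the workflow described in the comments (search the source for a word from the target text and see which script tag it lives in) can be sketched roughly like this. The function name `scripts_containing` and the sample HTML are made up for illustration; I use `html.parser` here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup


def scripts_containing(html, phrase):
    """Hypothetical helper: return the <script> tags whose text contains phrase."""
    soup = BeautifulSoup(html, "html.parser")
    # .string is None for scripts with no inline content (e.g. src-only tags)
    return [s for s in soup.find_all("script") if phrase in (s.string or "")]


# Made-up page: one external script, one analytics stub, one data payload.
sample_html = """
<html><head>
<script src="/static/app.js"></script>
<script>var analytics = {"enabled": true};</script>
<script>window['titanium-state'] = {"content": {"data": {"abc": {"storyHTML": "Hello world"}}}}</script>
</head><body><div id="root"></div></body></html>
"""

hits = scripts_containing(sample_html, "Hello world")
for script in hits:
    # print only the start of each match so large payloads stay readable
    print(script.string[:60])
```

Once you know which script tag holds the text, you can inspect its structure and decide what to match with the regex.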
