Python 3.x web scraping: extracting data from a page with Python

python-3.x web-scraping

This is the code I am using. It returns an empty list, and I can't figure out what I am doing wrong.

from urllib.request import urlopen
import re

url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'# example of a web page
html = urlopen(url).read().decode('utf-8')# decoding

cite_year='<span class="citation_year">(.+?)</span>'# extract citation year
pattern = re.compile(cite_year) #compile
citation_year = re.findall(pattern, html) #store data into a variable

print(citation_year)# and print

Add headers to your request. I use the requests and bs4 libraries:

import requests
import bs4
headers = {'User-Agent':'Mozilla/5.0'}
url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'# example of a web page
html = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(html.text, 'lxml')
year = soup.find(class_="citation_year").text
print(year)
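
If you would rather stay with urllib from the question, the same header can be attached through urllib.request.Request. This is only a minimal sketch: fetch_html is a hypothetical helper name, and whether this particular site accepts the request is an assumption that has not been verified here.

```python
from urllib.request import Request, urlopen

url = 'http://pubs.acs.org/doi/full/10.1021/jacs.6b10998'  # page from the question
headers = {'User-Agent': 'Mozilla/5.0'}  # same header as in the requests version

# Build the request object with the extra header attached
req = Request(url, headers=headers)


def fetch_html(request):
    """Hypothetical helper: download the page and decode it as UTF-8."""
    with urlopen(request) as resp:
        return resp.read().decode('utf-8')

# html = fetch_html(req)  # uncomment to actually hit the network
```

The request is built separately from the download so the header can be inspected before anything is sent.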

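Are you sure your regular expression is correct? I suggest replacing your first two lines with sample data (I tested with html = ""test ... bar ... three four ... bar"" and the rest of the code then worked as expected). That lets you narrow down where the problem is, and whether your data really has the same quotes etc. as you expect. Also note that parsing HTML with regexps is generally discouraged.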
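
The check suggested in the comment could look roughly like this. The sample markup and the year value are made up for illustration, so the real page may well be structured differently.

```python
import re

# Hypothetical sample markup standing in for the downloaded page
sample_html = (
    '<div><span class="citation_year">2016</span>'
    '<span class="citation_volume">138</span></div>'
)

# Same pattern as in the question
pattern = re.compile(r'<span class="citation_year">(.+?)</span>')

matches = pattern.findall(sample_html)
print(matches)  # ['2016']

# The pattern is literal: a different quote style is enough to break it
variant = "<span class='citation_year'>2016</span>"
print(pattern.findall(variant))  # []
```

If the pattern matches the sample but not the real download, the problem is in what the server returned (or in small markup differences such as quoting), not in the regex logic itself.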