Parsing HTML with lxml in Python
I am trying to parse the Oxford Dictionaries site to get the etymology of a given word.
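import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    # Download and parse the definition page for the word "good"
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''
if doc:
    table = []
    # Absolute XPath copied from the page; this is the part that does not return the text
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")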
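I can't seem to figure out how to get the text string I need. I know the code I copied is missing some lines, but I don't fully understand how HTML and lxml work. I would greatly appreciate it if someone could show me the correct way to solve this.

You don't want to do web scraping, especially when probably every dictionary has an API. In the case of Oxford, create an account for their API. Get the API credentials from your account and do the following: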
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
# Fill in the credentials from your account
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
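The nested lookups above could also be folded into a small helper. This is only a sketch: the names `first_item` and `first_short_definitions` are made up for illustration, and the sample dictionary is hand-written to match the shape the code above assumes, not a real API response.

```python
def first_item(mapping, key):
    """Return the first element of the list stored under key, or None."""
    items = (mapping or {}).get(key) or []
    return items[0] if items else None


def first_short_definitions(reply_dict):
    """Walk results -> lexicalEntries -> entries -> senses, tolerating gaps."""
    headword = first_item(reply_dict, 'results')
    lexical_entry = first_item(headword, 'lexicalEntries')
    entry = first_item(lexical_entry, 'entries')
    sense = first_item(entry, 'senses')
    return (sense or {}).get('short_definitions')


# Hand-written sample for illustration only; not a real API response.
sample = {
    'results': [{
        'lexicalEntries': [{
            'entries': [{
                'senses': [{'short_definitions': ['placeholder definition']}]
            }]
        }]
    }]
}

print(first_short_definitions(sample))
print(first_short_definitions({'results': []}))
```

Unlike the chained `.get(...)[0]` call, this version returns None instead of raising when a key such as 'lexicalEntries' is missing.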
Here is an example to get you started with scraping the Oxford Dictionaries page:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
# Select every span whose class attribute is exactly "ind"
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
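One thing to watch for with the loop above: in lxml, `element.text` holds only the text before the first child element, while `text_content()` collects the text of the element and all of its descendants. The snippet below tries this on an inline string, so it runs without network access; the markup is a made-up stand-in, and the real page may be structured differently.

```python
import lxml.html as lh

# Hypothetical markup for illustration; the real page may differ.
snippet = '<div><span class="ind">A cloth <em>canopy</em> that slows a fall.</span></div>'

root = lh.fromstring(snippet)
span = root.xpath("//span[@class='ind']")[0]

text_only = span.text            # text before the first child element
full_text = span.text_content()  # text of the element and all descendants

print(repr(text_only))
print(repr(full_text))
```

If a definition contains nested tags, `text_content()` is probably the safer choice.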
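To find the right search string, you need to format the HTML so that you can see its structure. I used an HTML formatter. Looking at the formatted HTML, I could see that the definitions are in span elements with the class attribute "ind".

The problem with this API is that it only lets you look up 60 words per minute and 10,000 words per month, which is not enough for what I want to do.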
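On the formatting step mentioned in the answer: lxml itself can re-serialize a parsed page with indentation, which may be enough to inspect the structure without a separate formatter. A minimal sketch, using a made-up one-line document in place of a downloaded page:

```python
import lxml.html as lh

# Hypothetical one-line markup standing in for a downloaded page.
raw = ('<html><body><div><section><p>'
       '<span class="ind">sample definition</span>'
       '</p></section></div></body></html>')

doc = lh.fromstring(raw)

# Serialize with indentation so the nesting is easier to read
pretty = lh.tostring(doc, pretty_print=True, encoding='unicode')
print(pretty)

# Collect the distinct class names used on span elements
span_classes = sorted(
    {cls for span in doc.iter('span') for cls in span.get('class', '').split()}
)
print(span_classes)
```

Listing the class names this way is one possible shortcut to spotting candidates such as "ind" for an XPath query.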