JavaScript/Python web scraping: incomplete information, hidden by a multi-section frame
Thanks in advance, everyone. I am new to web scraping and to Stack Overflow. I am trying to extract some biological data from a web page. The links I want to scrape come from a table; the outerHTML of one of them is:
<a href="http://identifiers.org/pubmed/7503987" target="_blank">7503987</a>
This method returns a list of links, but not the ones I am looking for.
Method 2:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("https://glytoucan.org/Structures/Glycans/G00055MO")
elem = driver.find_element_by_xpath("//*[@id='literature']/togostanza-literature//main/ul/li/ul/li[1]")
This method cannot find the XPath I entered.
Can anyone help me find another way to get the data? I would really appreciate it.
Thanks,
Bokan
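--Closed--
Thanks, everyone, for helping me reformat this question. This is my first post on Stack Overflow.
I tried the second method with both the PhantomJS and Firefox drivers. In the end, the Firefox WebDriver worked.
The page's JavaScript appears to be calling an internal API. Its input parameter is a URL-encoded query that looks like this: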
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?from ?partner_url ?description ?pubmed_id ?pubmed_url
WHERE{
VALUES ?accNum {"G00055MO"}
?saccharide glytoucan:has_primary_id ?accNum .
GRAPH ?graph {
?saccharide dcterms:references ?article .
?article a bibo:Article .
?article dcterms:identifier ?pubmed_id .
?article rdfs:seeAlso ?pubmed_url .
}
?graph rdfs:label ?from .
OPTIONAL {?graph rdfs:seeAlso ?partner_url.}
?graph dcterms:description ?description.
} ORDER by ?from
The following code will get your links:
import requests
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan: <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?from ?partner_url ?description ?pubmed_id ?pubmed_url
WHERE{
VALUES ?accNum {"G00055MO"}
?saccharide glytoucan:has_primary_id ?accNum .
GRAPH ?graph {
?saccharide dcterms:references ?article .
?article a bibo:Article .
?article dcterms:identifier ?pubmed_id .
?article rdfs:seeAlso ?pubmed_url .
}
?graph rdfs:label ?from .
OPTIONAL {?graph rdfs:seeAlso ?partner_url.}
?graph dcterms:description ?description.
} ORDER by ?from
"""
headers = {'Accept': 'application/sparql-results+json'}
payload = {'query': query}
r = requests.get('https://ts.glytoucan.org/sparql', params=payload, headers=headers)
print(r.status_code)
data = r.json()
links = [ t["pubmed_url"]["value"] for t in data["results"]["bindings"] ]
print(links)
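The list comprehension above reads only `pubmed_url`. Because `?partner_url` sits inside an OPTIONAL clause in the query, a row may omit that key entirely, so reading it the same way could raise a KeyError. Below is a minimal sketch of pulling several columns out of the result, assuming the same `results`/`bindings` layout the code above already relies on. The helper name `rows_from_sparql_json` and the sample response are made up for illustration; the sample is not real output from the endpoint.

```python
def rows_from_sparql_json(data, columns):
    """Flatten a SPARQL JSON result into a list of plain dicts.

    Variables that are unbound in a row (for example, from an OPTIONAL
    clause) come back as None instead of raising KeyError.
    """
    rows = []
    for binding in data["results"]["bindings"]:
        row = {}
        for col in columns:
            cell = binding.get(col)
            row[col] = cell["value"] if cell is not None else None
        rows.append(row)
    return rows


# Hypothetical sample shaped like a SPARQL JSON response (not real endpoint output).
sample = {
    "results": {
        "bindings": [
            {
                "pubmed_id": {"type": "literal", "value": "7503987"},
                "pubmed_url": {"type": "uri", "value": "http://identifiers.org/pubmed/7503987"},
            }
        ]
    }
}

rows = rows_from_sparql_json(sample, ["pubmed_id", "pubmed_url", "partner_url"])
print(rows)
```

With a live response, `rows_from_sparql_json(r.json(), [...])` could be used in place of the list comprehension.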
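You are such a pro! Could you tell me how you obtained the URL-encoded query shown above? I could only find the query information from the page itself, and I have not learned this kind of data querying before, so any background information would help.
@BokanBao The full query was too long for a comment; the URL-encoded query is already linked in the post above, under "internal API". Note that you can also see the full query in Chrome: open the developer console, go to the Network tab, refresh the page if necessary, and add the filter sparql to see all of these internal API calls.
Thank you for your help! I changed the driver to Firefox, which solved the problem. Still, I am very interested in learning how to query the database with the SPARQL language!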
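On where the URL-encoded query comes from: it is just the SPARQL text passed as a `query` parameter and percent-encoded, so a URL copied from the Network tab can be decoded back into readable SPARQL. Here is a small sketch using only the standard library. The short query is a stand-in to keep the example small, not the query the page actually sends, and the endpoint is the one used in the answer above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://ts.glytoucan.org/sparql"  # endpoint used in the answer above

# Shortened stand-in query, for illustration only.
query = 'SELECT ?s WHERE { ?s ?p "G00055MO" } LIMIT 1'

# Encoding: roughly what requests does with params={'query': query}.
url = ENDPOINT + "?" + urlencode({"query": query})
print(url)

# Decoding: recover readable SPARQL from a URL copied out of the Network tab.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Nothing here contacts the server; it only shows that the encoded URL and the readable query are two forms of the same text.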