Python: Can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape a URL, and I have the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

In the code above we can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, could someone provide example code?

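I have searched through their docs and there seems to be no XPath option. Also, as you can see in a similar question, the OP asks for an XPath expression to be translated to BeautifulSoup, so my conclusion is: no, there is no XPath parsing available.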

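No, BeautifulSoup by itself does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.

Once you have parsed your document into an lxml tree, you can use the .xpath() method to search for elements.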

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
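# xpathselector is a placeholder: substitute your own XPath expression here
xpathselector = '//td[@class="empformbody"]'
tree.xpath(xpathselector)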

There is also a dedicated lxml.html module with additional functionality.
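As a small offline illustration of the point about broken HTML, here is a minimal sketch that uses lxml.html.fromstring on a deliberately sloppy fragment. The markup and the names broken_html, cells and names are made up for this example; only the td.empformbody class comes from the question.

```python
import lxml.html

# Deliberately sloppy markup: the <td> and <tr> tags are never closed.
# This sample is hypothetical and only serves the illustration.
broken_html = """
<table>
  <tr><td class="empformbody">Alice
  <tr><td class="empformbody">Bob
  <tr><td class="other">ignored
</table>
"""

# The HTML parser repairs the missing end tags while building the tree.
root = lxml.html.fromstring(broken_html)

# XPath 1.0 expression selecting cells by their class attribute.
cells = root.xpath('//td[@class="empformbody"]')

# text_content() is one of the helpers lxml.html adds to its elements.
names = [cell.text_content().strip() for cell in cells]
print(names)
```

The same XPath expression works on a tree produced by etree.parse(); lxml.html mainly adds HTML-specific conveniences on top of it.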

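Note that in the urlopen example above I passed the response object directly to lxml, because having the parser read directly from the stream is more efficient than first reading the response into a large string. To do the same with the requests library, you need to set stream=True and pass in the response.raw object.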
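The requests variant could look roughly like the sketch below. The helper names parse_html_stream and fetch_tree are hypothetical, and requests is imported inside the function only so that the offline demo at the bottom runs without it. Setting decode_content on the raw stream asks urllib3 to undo gzip or deflate transport compression before the bytes reach the parser.

```python
import io

from lxml import etree


def parse_html_stream(stream):
    """Parse HTML from any file-like object that yields bytes."""
    return etree.parse(stream, etree.HTMLParser())


def fetch_tree(url):
    """Fetch a page with requests and let lxml read straight from the socket."""
    import requests  # imported lazily; assumed to be installed when this is called

    response = requests.get(url, stream=True)
    # Without this, lxml could receive still-compressed bytes.
    response.raw.decode_content = True
    return parse_html_stream(response.raw)


# Offline demonstration: a BytesIO object stands in for response.raw.
sample = io.BytesIO(b"<table><tr><td class='empformbody'>A</td></tr></table>")
tree = parse_html_stream(sample)
cells = tree.xpath('//td[@class="empformbody"]/text()')
print(cells)
```

Calling fetch_tree with the URL from the question would return the same kind of tree as the urlopen example.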

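You may also be interested in the CSS selector support: the CSSSelector class translates CSS statements into XPath expressions, which makes your search for td.empformbody that much easier: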

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print(elem.text)

Coming full circle: BeautifulSoup itself has very complete CSS selector support.
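A minimal sketch of that CSS selector support, assuming Beautiful Soup 4 (the bs4 package) and a made-up table whose id, foobar, is purely illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the empformbody class comes from the question.
html_doc = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector, not an XPath expression.
cells = [cell.get_text(strip=True)
         for cell in soup.select("table#foobar td.empformbody")]
print(cells)
```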

I can confirm that there is no XPath support within Beautiful Soup.

BeautifulSoup has a function named findNext that searches forward from the current element, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above can imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]
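To make the comparison concrete, here is a sketch that runs the chained lookup and an XPath query against the same made-up document. It assumes Beautiful Soup 4, where findNext and findAll are spelled find_next and find_all, and it appends /a/@href to the XPath so both sides return link targets. Note that find_next walks forward in document order rather than strictly through children, so the two only agree when the markup is nested as shown.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample markup for the comparison.
html_doc = """
<section>
  <div class="class_value">
    <div id="id_value"><a href="/one">one</a><a href="/two">two</a></div>
  </div>
</section>
"""

# Beautiful Soup: chain find_next() calls, then collect the links.
soup = BeautifulSoup(html_doc, "html.parser")
father = soup.section
links = (father.find_next("div", {"class": "class_value"})
               .find_next("div", {"id": "id_value"})
               .find_all("a"))
hrefs_bs = [a["href"] for a in links]

# lxml: a single XPath expression does the same job.
tree = etree.HTML(html_doc)
hrefs_xpath = tree.xpath('//div[@class="class_value"]/div[@id="id_value"]/a/@href')

print(hrefs_bs, hrefs_xpath)
```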

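As others have said, BeautifulSoup does not support XPath. There are probably several ways to get content from an XPath expression, including using Selenium. However, here is a solution that works in either Python 2 or Python 3: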

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

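I used this as a reference.

This is a pretty old thread, but there is now a workaround, which may not have been available in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and store its text content in a variable called "rss_text". I then run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It is not XPath in all its glory (wildcards, multiple paths, etc.), but if you only need to locate a basic path, it works.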

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
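The snippet above depends on a surrounding class (cls) and on a feed fetched elsewhere. A self-contained sketch of the same idea, with a made-up feed standing in for rss_text, might look like this. It assumes lxml is installed, since Beautiful Soup's "xml" parser relies on it. Keep in mind that dotted access returns the first matching descendant, so it approximates the path /rss/channel/title rather than evaluating it.

```python
from bs4 import BeautifulSoup

# Hypothetical feed content; in the answer above this text comes from requests.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
  </channel>
</rss>
"""

rss_obj = BeautifulSoup(rss_text, "xml")

# Walk rss -> channel -> title by attribute access.
title = rss_obj.rss.channel.title.get_text()
print(title)
```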

With lxml, everything is simple:

import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But it is also simple with BeautifulSoup (BS4):

first, remove the "//" and the "@"
second, add an asterisk before the "="

Try this magic:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, this does not support sub-tags, so I removed the "/@href" part.

Maybe you can try the following without using XPath:
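The href values can still be recovered after the select() call by reading the attribute from each matched tag. The sketch below uses made-up markup and also shows a difference worth knowing: a.shared-components (like class*=) matches any element carrying that class, while an exact attribute comparison behaves more like the XPath test @class="shared-components".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_doc = """
<a class="shared-components" href="/first">first</a>
<a class="shared-components extra" href="/second">second</a>
<a class="unrelated" href="/third">third</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# CSS selectors pick elements; attributes are read from the results.
hrefs = [a["href"] for a in soup.select("a.shared-components")]

# Exact comparison against the whole class attribute value.
exact_hrefs = [a["href"] for a in soup.select('a[class="shared-components"]')]

print(hrefs, exact_hrefs)
```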

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<html>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))

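Combining a Soup object with lxml makes it possible to extract values with XPath; see the etree.HTML(str(soup)) example at the end of this page.

Use soup.find(class_='myclass')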

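Yes, until now I have actually used scrapy, which uses XPath to fetch the data inside tags. It is very handy and makes fetching data easy, but I need to do the same with BeautifulSoup, so I am looking into it.

Thank you very much, Pieters. I got two things from your code: 1. a clarification that we cannot use XPath with BS; 2. a nice example of how to use lxml. Can we find it written in some documentation that "we can't implement XPath using BS"? We should be able to show some proof to anyone who asks for clarification, right?

It is hard to prove a negative; the documentation has a search function, and it finds no hits for "xpath".

I tried running the code above but got the error "name 'xpathselector' is not defined".

@Zvi the code does not define an XPath selector; I meant it to be read as "use your own XPath expression here".

Note: Leonard Richardson is the author of Beautiful Soup, as you will see if you click through to his user profile.

It would be very nice to be able to use XPath within BeautifulSoup.

So what is the alternative?

@Leonard Richardson It is 2021 now; are you still confirming that BeautifulSoup STILL does not support XPath?

One caveat: I have noticed that if there is something outside the root (such as a \n outside the outer tags), referencing XPaths from the root will not work; you have to use relative XPaths.

Martijn's code no longer works properly (it is more than 4 years old by now...); the etree.parse() line prints to the console and does not assign the value to the tree variable.

That is quite a claim. I certainly cannot reproduce it, and it would not make any sense either. Are you sure you are using Python 2 to test my code, or have translated the urllib2 library use to Python 3 urllib.request?

Yes, it may be that I used Python 3 when writing that and it did not work as expected. I just tested, and yours works with Python 2, but Python 3 is much preferred, as 2 is being retired (no longer officially supported) in 2020.

Absolutely agree, but the question here uses Python 2.

I believe this only finds child elements. XPath is another thing?

select() is for CSS selectors…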

from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))
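For trying the Soup-plus-lxml combination without a local file, here is a sketch that feeds an inline document through the same two steps. The markup is made up to fit the XPath used above; re-serialising the soup with str() and parsing it again costs an extra pass, which is the price of mixing the two libraries.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical document shaped to match the XPath from the snippet above.
html_doc = """
<html><body>
  <div id="BGINP01_S1"><section><div><font>hello</font></div></section></div>
</body></html>
"""

# Step 1: let Beautiful Soup parse (and, if needed, clean up) the markup.
soup = BeautifulSoup(html_doc, "html.parser")

# Step 2: hand the serialised soup to lxml, which provides .xpath().
dom = etree.HTML(str(soup))
texts = dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
print(texts)
```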