Python XPath expression can't extract mailto: attribute

python regex xpath

I'm using this XPath to get the link text of mailto anchors:

//a[starts-with(@href, 'mailto')]/text()

Now I would like to extract what comes after mailto: from an attribute like this:

<a href="mailto:info@info.com?subject=hello">here</a>

Solution: I guess I should use Selenium and write some JavaScript.

for $a in //a[starts-with(@href, 'mailto')]
    return substring-after(normalize-space($a/@href),'mailto:')

Update

//a[starts-with(@href, 'mailto')]/substring-after(normalize-space(./@href),'mailto:')

Consider the following sample XML, used to extract the text after mailto:

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <a href="mailto:info@info.com?subject=hello">here</a>
</div>

It returns

info@info.com?subject=hello
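For anyone who wants to check that result from Python without an XPath 2.0 engine, here is a minimal sketch using only the standard library. It is a swapped-in technique, not the XPath from the answer: xml.etree.ElementTree has no starts-with() or substring-after(), so the filtering and slicing are done in plain Python. The function name mailto_targets and the extra https://example.com/ link are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the answer, plus one hypothetical non-mail link
# to show that it gets filtered out.
SAMPLE = """<div>
    <a href="mailto:info@info.com?subject=hello">here</a>
    <a href="https://example.com/">not a mail link</a>
</div>"""

PREFIX = "mailto:"


def mailto_targets(xml_text):
    """Return the text after 'mailto:' for every <a> whose href starts with it."""
    root = ET.fromstring(xml_text)
    targets = []
    for a in root.iter("a"):
        # Rough equivalent of normalize-space(@href) for leading/trailing whitespace.
        href = (a.get("href") or "").strip()
        if href.startswith(PREFIX):
            # Rough equivalent of substring-after(@href, 'mailto:').
            targets.append(href[len(PREFIX):])
    return targets


print(mailto_targets(SAMPLE))
```

Because the loop visits every a element, this collects one entry per mailto link rather than only the first.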

In your case, the XPath would look like this:

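Why can't you get the full href first and then take the part after mailto from that string?

The first problem is: how do I get the value of the attribute?

I don't know, so I deleted my wrong answer.

Is it possible to do this in one line with XPath?

This doesn't work. I tried the expression you provided in the scrapy shell: the XPath //a[starts-with(@href, 'mailto')]/substring-after(normalize-space(./@href),'mailto:') is invalid.

Did you check it?

@user1537701: try this: hxs.select('//a[contains(@href, "mailto")]/@href').re(r'mailto:\s*(.*)')

That did the trick. Do you know how to get rid of what comes after the ?, so that I just get the plain email address without the subject (if there is one)?

This only returns the first result; I need to get all of them.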
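On the two follow-up questions (dropping the ?subject=... part and getting every match), here is a rough sketch with the standard re module. It assumes the href values have already been pulled out as a list of strings, for example with an XPath 1.0 query such as //a[starts-with(@href, 'mailto')]/@href. The function name addresses and the example.com entries are hypothetical. As background on the "invalid" report: Scrapy's selectors are built on lxml, which implements XPath 1.0, while a function call used as a path step and the for ... return form are XPath 2.0 syntax.

```python
import re

# Capture everything after 'mailto:' up to, but not including, the first '?'.
MAILTO_RE = re.compile(r"mailto:\s*([^?]*)")


def addresses(hrefs):
    """Return the bare address for every href that starts with 'mailto:'."""
    found = []
    for href in hrefs:
        match = MAILTO_RE.match(href.strip())
        if match:
            found.append(match.group(1).strip())
    return found


# Hypothetical input: hrefs already extracted from the page.
hrefs = [
    "mailto:info@info.com?subject=hello",
    "mailto:sales@example.com",
    "https://example.com/contact",
]
print(addresses(hrefs))
```

The character class [^?]* stops at the query string when there is one and otherwise runs to the end of the value, so addresses with and without a subject are handled the same way.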

For the sample XML:

substring-after(/div/a/@href,'mailto:')

For any a element whose href starts with mailto:

substring-after(//a[starts-with(@href, 'mailto')]/@href,'mailto:')