
Python/lxml/XPath: How do I find the row that contains specific text?

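Given the URL, how do I capture and print the contents of an entire row of data?

For example, what would it take to get output like this: "Cash & Short Term Investments 144841 169760 189252 86743 57379"? Or like "Property, Plant & Equipment - Gross 725104 632332 571467 538805 465493"?

I have worked through a website to learn the basics of XPath. However, the XPath syntax is still a mystery to me.

I have already done this successfully in BeautifulSoup. I like that BeautifulSoup does not require me to know the structure of the document: it just finds the element that contains the text I search for. Unfortunately, BeautifulSoup is too slow for a script that has to run thousands of times. The source code for my task in BeautifulSoup is (with the title input equal to "Cash & Short Term Investments"):

So what is the equivalent code in lxml?

Edit 1: The URL was hidden when I first posted. I have now fixed that.

Edit 2: I added my BeautifulSoup-based solution to clarify what I am trying to do.

Edit 3: +10 to root for your solution. For future developers with the same problem, I am posting here a quick-and-dirty script that worked for me: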

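    #!/usr/bin/env python
    import urllib
    import lxml.html

    url = 'balancesheet.html'

    # Fetch the page (Python 2's urllib.urlopen) and read the raw HTML
    result = urllib.urlopen(url)
    html = result.read()

    # Parse the HTML, then select the <td> cells that follow the <th>
    # whose <div> text matches the row title
    doc = lxml.html.document_fromstring(html)
    x = doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
    print x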
Alternatively, you can define a small function to get a row by its text:

    In [19]: def func(doc,txt):
        ...:     exp=u'.//th[div[text()="{0}"]]'\
        ...:         u'/following-sibling::td/text()'.format(txt)
        ...:     return [i.strip() for i in doc.xpath(exp)]

    In [20]: func(doc,u'Total Accounts Receivable')
    Out[20]: ['338,594', '270,133', '214,169', '244,940', '236,331']
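
The function above builds the expression with `str.format`, which breaks if the row title itself contains a double quote. As a minimal sketch (written for Python 3, and assuming a small inline table in place of the real page), the same lookup can pass the title as an XPath variable instead. `SAMPLE_HTML` and `row_values` are hypothetical names introduced only for this illustration:

```python
import lxml.html

# Hypothetical sample standing in for the balance-sheet page
SAMPLE_HTML = """
<html><body>
<table>
  <tbody>
    <tr><th><div>Cash &amp; Short Term Investments</div></th>
        <td> 144,841 </td><td> 169,760 </td><td> 189,252 </td></tr>
    <tr><th><div>Short-Term Investments</div></th>
        <td> 27,165 </td><td> 26,067 </td><td> 24,400 </td></tr>
  </tbody>
</table>
</body></html>
"""


def row_values(doc, label):
    """Return the stripped <td> texts of the row whose <th><div> equals label."""
    # $label is an XPath variable; lxml fills it from the keyword argument,
    # so the title never has to be quoted inside the expression.
    expr = './/th[div[text()=$label]]/following-sibling::td/text()'
    return [value.strip() for value in doc.xpath(expr, label=label)]


doc = lxml.html.document_fromstring(SAMPLE_HTML)
values = row_values(doc, 'Cash & Short Term Investments')
print(values)
```

A title that matches no row simply yields an empty list, so the caller can test the result for truthiness before using it.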
Or you can put all the rows into a dict:

    In [21]: d={}

    In [22]: for i in doc.xpath(u'.//tbody/tr'):
        ...:     if len(i.xpath(u'.//th/div/text()')):
        ...:         d[i.xpath(u'.//th/div/text()')[0]]=\
        ...:         [e.strip() for e in i.xpath(u'.//td/text()')]

    In [23]: d.items()[:3]
    Out[23]:
    [('Accounts Receivables, Gross',
         ['344,241', '274,894', '218,255', '247,600', '238,596']),
     ('Short-Term Investments',
         ['27,165', '26,067', '24,400', '851', '159']),
     ('Cash & Short Term Investments',
         ['144,841', '169,760', '189,252', '86,743', '57,379'])]
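
For readers on Python 3, where `d.items()` is a view and cannot be sliced directly, the dict approach might look like the sketch below. It assumes the same `th/div` row layout as the session above; the inline `SAMPLE_HTML` and the helper name `table_to_dict` are hypothetical and exist only for this illustration:

```python
import lxml.html

# Hypothetical sample; the first row has no <th>, like a section heading
SAMPLE_HTML = """
<html><body>
<table>
  <tbody>
    <tr><td colspan="3">Assets</td></tr>
    <tr><th><div>Cash &amp; Short Term Investments</div></th>
        <td> 144,841 </td><td> 169,760 </td></tr>
    <tr><th><div>Short-Term Investments</div></th>
        <td> 27,165 </td><td> 26,067 </td></tr>
  </tbody>
</table>
</body></html>
"""


def table_to_dict(doc):
    """Map each row title to the list of its stripped cell values."""
    rows = {}
    for tr in doc.xpath('.//tbody/tr'):
        labels = tr.xpath('./th/div/text()')
        if not labels:
            # Skip rows that carry no title cell
            continue
        rows[labels[0].strip()] = [v.strip() for v in tr.xpath('./td/text()')]
    return rows


doc = lxml.html.document_fromstring(SAMPLE_HTML)
table = table_to_dict(doc)
# Wrap the view in list() before slicing on Python 3
print(list(table.items())[:2])
```

Building the dict once pays off when several rows are needed from the same page, since each later lookup is a plain key access rather than another XPath query.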

Let html hold the HTML source:

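    import lxml.html
    doc = lxml.html.document_fromstring(html)
    # Absolute path to the table rows, as copied from a browser tool
    rows_element = doc.xpath('/html/body/div/div[2]/div/div[5]/div/div/table/tbody/tr')
    for row in rows_element:
        # text_content() returns all the text inside the row
        print row.text_content()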
Untested, but it should work.
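
An absolute path like the one above depends on the exact page layout, and browsers insert `tbody` elements that may not be present in the raw HTML that lxml parses. A minimal sketch of a layout-independent variant, written for Python 3 and closer to the "find whatever contains my text" behaviour the question asks for, is shown below. The inline `SAMPLE_HTML` and the helper `rows_containing` are hypothetical names used only here:

```python
import lxml.html

# Hypothetical sample; note that it has no <tbody> in the source
SAMPLE_HTML = """
<table>
  <tr><th><div>Cash &amp; Short Term Investments</div></th>
      <td>144,841</td><td>169,760</td></tr>
  <tr><th><div>Property, Plant &amp; Equipment - Gross</div></th>
      <td>725,104</td><td>632,332</td></tr>
</table>
"""


def rows_containing(doc, needle):
    """Return one text line per <tr> whose text contains needle."""
    lines = []
    # contains(., $needle) tests the row's full string value
    for row in doc.xpath('//tr[contains(., $needle)]', needle=needle):
        # Join cell by cell so that adjacent numbers do not run together
        cells = row.xpath('./th|./td')
        lines.append(' '.join(cell.text_content().strip() for cell in cells))
    return lines


doc = lxml.html.document_fromstring(SAMPLE_HTML)
lines = rows_containing(doc, 'Cash & Short Term')
print(lines)
```

Calling `text_content()` on the whole row concatenates the cell texts with no separator, which is why the sketch joins the cells explicitly with a space.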


Also, install XPath Checker or FireFinder in Firefox to help you work out the XPath expressions.
