Python: getting the text of an HTML element and all of its children with lxml xpath


I am using lxml's xpath in Python. If I give the full path to an HTML tag, I can extract its text. However, I cannot extract all of the text from a tag and its child elements into a list. For example, given this HTML, I want to get all of the text under the "example" class:
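A small made-up snippet of the kind of markup I mean, with the text spread across child elements (the element name, class, and strings here are placeholders, not the real page):

```python
from lxml import html

# Made-up HTML standing in for the real page: the visible text of
# the "example" div lives inside its child elements.
snippet = '<div class="example"><h2>Title</h2><p>First <b>bold</b> paragraph.</p></div>'

tree = html.fromstring(snippet)

# //text() at the end of the xpath collects every text node under
# the matched element, including those inside children.
texts = tree.xpath('//div[@class="example"]//text()')
print(texts)  # ['Title', 'First ', 'bold', ' paragraph.']
```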


mzjn's answer is correct. After some trial and error I finally got it working; this is what the final code looks like. You need to put //text() at the end of the xpath. It has not been refactored yet, so there are certainly some bugs and bad practices, but it works:

    import urllib.request

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    from bs4 import BeautifulSoup
    from lxml import html

    # Fetch the page with a session that retries failed connections.
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    page = session.get("The url you are webscraping")
    content = page.content

    # Note: this BeautifulSoup block is not used by the xpath
    # extraction below and could be removed.
    htmlsite = urllib.request.urlopen("The url you are webscraping")
    soup = BeautifulSoup(htmlsite, 'lxml')
    htmlsite.close()

    # Parse the fetched bytes; //text() at the end of the xpath is
    # what collects the text of the div and all of its descendants.
    tree = html.fromstring(content)
    scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')
I tried it on the team page of keeleyteton.com. It returned the correct list below (though it needs a lot of cleanup!), because the strings sit in different tags, some of them child tags. Thanks for the help.

['\r\n        ', '\r\n        ', 'Nicholas F. Galluccio', '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Scott R. Butler', '\r\n        ', '\r\n        ', 'Senior Vice President and Portfolio Manager ', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Thomas E. Browne, Jr., CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian P. Leonard, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Robert M. Goldsborough', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian R. Keeley, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Edward S. Borland', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Kevin M. Keeley', '\r\n        ', '\r\n        ', 'President', '\r\n      ', '\r\n        ', '\r\n        ', 'Deanna B. Marotz', '\r\n        ', '\r\n        ', 'Chief Compliance Officer', '\r\n      ']
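The "lots of cleanup" is mostly about the indentation-only entries like '\r\n        '. A small sketch of the cleanup step, using a slice of the list above as sample data:

```python
# The raw xpath result mixes real strings with whitespace-only
# entries; stripping each string and dropping the empty ones
# keeps only the meaningful text.
raw = ['\r\n        ', 'Nicholas F. Galluccio', '\r\n        ',
       'Managing Director and Portfolio Manager', '\r\n      ']
cleaned = [s.strip() for s in raw if s.strip()]
print(cleaned)
# ['Nicholas F. Galluccio', 'Managing Director and Portfolio Manager']
```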

Using text() should work.
Unfortunately, it did not work for me; I checked it again and it returns an empty list.
Please show us your code. Provide one.
I found my mistake. What you wrote works fine, and I have posted the answer above!
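The empty-list symptom usually comes from writing /text() at a level that has no direct text nodes. A sketch of the difference, using a made-up snippet (the class and strings are placeholders, not the real page):

```python
from lxml import html

# Made-up snippet: the div's visible text lives entirely in children.
snippet = '<div class="clearfix"><h4>Jane Doe</h4><p>Portfolio Manager</p></div>'
tree = html.fromstring(snippet)

# /text() returns only the div's *direct* text nodes; there are
# none here, so the result is empty.
direct = tree.xpath('//div[@class="clearfix"]/text()')

# //text() descends into the children and finds the real content.
all_text = tree.xpath('//div[@class="clearfix"]//text()')

print(direct)    # []
print(all_text)  # ['Jane Doe', 'Portfolio Manager']
```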