Python lxml: split at attribute?

python html xpath

I'm using lxml to scrape some HTML that looks like this:

<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>

So far I have:

results = []
# content[0] is assumed to be the element that contains the links
for a in content[0].xpath('./a'):
    results.append({'text': a.text})

But I can't figure out how to get the category names by splitting at the font-size elements while keeping the sibling tags. Any suggestions?

Thanks

I had success with the following code:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []
current_category = None

for element in body.xpath('./*'):
    if element.tag == 'div':
        current_category = element.xpath('./a')[0].text
    elif element.tag == 'a':
        results.append({ 'category' : current_category, 
            'title' : element.text })

print(results)

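Scraping is brittle. Here, for example, we depend explicitly on the order of the elements as well as on their nesting. Sometimes, though, a hard-wired approach like this is good enough.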
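A slightly more defensive version of the same loop might look like the sketch below. It is only an illustration under a few assumptions: the function name `scrape_categories` and the extra "Orphan" link are made up for the example, `body` is looked up by tag name rather than by position, and links that appear before any category heading are skipped instead of being stored with a `None` category.

```python
import lxml.html

snippet = """
<html><head></head><body>
<a href="">Orphan</a>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
</body></html>
"""


def scrape_categories(html_text):
    """Pair each top-level link with the most recent category heading."""
    root = lxml.html.fromstring(html_text)
    # Look the body up by name instead of relying on it being the second child.
    body = root.find('body')
    if body is None:
        body = root

    results = []
    current_category = None
    for element in body.iterchildren():
        if element.tag == 'div':
            heading = element.find('a')
            # Ignore divs that do not contain a heading link.
            if heading is not None:
                current_category = heading.text_content().strip()
        elif element.tag == 'a' and current_category is not None:
            results.append({'category': current_category,
                            'title': element.text_content().strip()})
    return results


results = scrape_categories(snippet)
print(results)
```

This still depends on headings and links being siblings in document order, so it only softens the brittleness rather than removing it.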

Here is another (more XPath-oriented) approach that uses the preceding-sibling axis:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []

for e in body.xpath('./a'):
    results.append(dict(
        category=e.xpath('preceding-sibling::div/a')[-1].text,
        title=e.text))

print(results)
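The `[-1]` above works because lxml returns XPath node-sets in document order, so the last preceding `div` is the nearest one. A possible variation, sketched here rather than taken from the answer, is to let XPath pick the nearest heading itself: on a reverse axis such as `preceding-sibling`, the positional predicate `[1]` selects the closest node. The "Orphan" link and the `None` fallback are assumptions added for illustration; they show one way to avoid an `IndexError` when a link has no heading before it.

```python
import lxml.html

snippet = """
<html><head></head><body>
<a href="">Orphan</a>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
</body></html>
"""

root = lxml.html.fromstring(snippet)

results = []
for link in root.xpath('//body/a'):
    # div[1] on the preceding-sibling axis is the nearest preceding div.
    headings = link.xpath('preceding-sibling::div[1]/a')
    # Fall back to None when no heading precedes the link.
    category = headings[0].text if headings else None
    results.append({'category': category, 'title': link.text})

print(results)
```

One difference worth noting: if the nearest preceding `div` contains no `a`, this variant yields `None`, whereas `preceding-sibling::div/a` with `[-1]` would fall back to an earlier heading.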

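Also, if you're looking for other ways to do this (just an option, don't beat me up too much), or you aren't able to import lxml, you can use the following weird code: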

text = """
<a href="">Team YYY</a>
<div align=center><a style="font-size: 1.1em">Polo</a></div>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
<a href="">Team X</a>
<div align=center><a style="font-size: 1.1em">Tennis</a></div>
"""

# next variables could be modified depending on what you really need
keyStartsWith = '<div align=center><a style="font-size: 1.1em">'
categoryStart = len(keyStartsWith)
categoryEnd = -len('</a></div>')
output = []
data = text.split('\n')
titleStart = len('<a href="">')
titleEnd = -len('</a>')
getdict = lambda category, title: {'category': category, 'title': title}

# main loop
for i, line in enumerate(data):
    line = line.strip()
    if keyStartsWith in line and len(data)-1 >= i+1:
        category = line[categoryStart: categoryEnd]
        (len(data)-1 == i and output.append(getdict(category, '')))
        if i+1 < len(data)-1 and keyStartsWith in data[i+1]:
            output.append(getdict(category, ''))
        else:
            while i+1 < len(data)-1 and keyStartsWith not in data[i+1]:
                title = data[i+1].strip()[titleStart: titleEnd]
                output.append(getdict(category, title))
                i += 1
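If lxml really is unavailable, the standard library's `html.parser` is another option that avoids slicing strings by hand. The sketch below is an illustration, not part of the answer above: the class name `CategoryParser` is made up, it treats any link inside a `div` as a category heading, and unlike the code above it skips categories with no links instead of emitting an empty title.

```python
from html.parser import HTMLParser

text = """
<a href="">Team YYY</a>
<div align=center><a style="font-size: 1.1em">Polo</a></div>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
"""


class CategoryParser(HTMLParser):
    """Collect {'category', 'title'} pairs from heading divs and plain links."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.category = None
        self.div_depth = 0    # > 0 while inside a <div>
        self.in_link = False  # True while inside an <a>
        self.buffer = []      # text pieces of the current link

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.div_depth += 1
        elif tag == 'a':
            self.in_link = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_link:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == 'a' and self.in_link:
            self.in_link = False
            link_text = ''.join(self.buffer).strip()
            if self.div_depth > 0:
                # A link inside a div is treated as a category heading.
                self.category = link_text
            elif self.category is not None:
                # A top-level link belongs to the most recent heading.
                self.results.append({'category': self.category,
                                     'title': link_text})


parser = CategoryParser()
parser.feed(text)
parser.close()
results = parser.results
print(results)
```

Because the parser tracks tags rather than character offsets, changes in whitespace or attribute order in the markup should not affect the result.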

I'm not sure what data you're missing - the results look fine to me.

It's missing the category - Football or Baseball.

Sorry, I hadn't noticed the form you wanted the data to end up in... Sanity check: do you control the HTML, so that you can be sure it is well-formed XML?

No offense - this may be correct, but it's too complicated.

@miku - yes, I know, your solution is simpler - that's why I upvoted it. I just posted mine here as an option for anyone who can't use yours for whatever local reason.

Sure, I won't downvote. But in general, if you're trying to do anything that resembles parsing HTML, you should work with a dedicated library - people have even tried to parse HTML with regular expressions, and then funny things happen - see:

@miku - I completely agree with you. I also believe HTML vs. regex is a war between evil and heaven :) As for the SO thread, it's probably the favorite thread at our company :)

Genius, thanks. Yes, on my actual page, preceding-sibling works much better! I now realize my mistake: trying to work out what to do from the lxml documentation instead of the XPath documentation!