
Parsing a table with Beautiful Soup in Python


So I have a table:

<table border="1" style="width: 100%">
  <caption></caption>
  <col>
  <col>
  <tbody>
<tr>
  <td>Pig</td>
  <td>House Type</td>
</tr>
<tr>
  <td>Pig A</td>
  <td>Straw</td>
</tr>
<tr>
  <td>Pig B</td>
  <td>Stick</td>
</tr>
<tr>
  <td>Pig C</td>
  <td>Brick</td>
</tr>
However, my code doesn't seem to strip out the HTML tags:

stable = soup.find('table')

cells = [ ]
rows = stable.findAll('tr')
for tr in rows[1:4]:
    # Process the body of the table
    row = []
    td = tr.findAll('td')
    #td = [el.text for el in soup.tr.finall('td')]
    row.append( td[0])
    row.append( td[1])
    cells.append( row )


return cells
# Eventually, I'd like to do this:
# h = json.dumps(cells)
# return h

Ideally, my output would be:


[[Pig A,Straw],[Pig B,Stick],[Pig C,Brick]]
Use the text attribute to get only the inner text of the element:

row.append(td[0].text)
row.append(td[1].text)
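
Putting that together with the loop from the question, a minimal end-to-end sketch (the function name table_to_json and the html argument are illustrative, not from the original post) that also passes the result through json.dumps as intended could look like this:

import json
from bs4 import BeautifulSoup

def table_to_json(html):
    soup = BeautifulSoup(html, "html.parser")
    cells = []
    for tr in soup.find("table").find_all("tr")[1:]:  # skip the header row
        td = tr.find_all("td")
        # .text drops the surrounding <td> tags and keeps only the inner text
        cells.append([td[0].text, td[1].text])
    return json.dumps(cells)  # '[["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]]'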

You could try using the lxml library:

from lxml.html import fromstring
import lxml.html as PARSER

#data = open('example.html').read() # You can read it from a html file.
#OR
data = """
<table border="1" style="width: 100%">
  <caption></caption>
  <col>
  <col>
  <tbody>
<tr>
  <td>Pig</td>
  <td>House Type</td>
</tr>
<tr>
  <td>Pig A</td>
  <td>Straw</td>
</tr>
<tr>
  <td>Pig B</td>
  <td>Stick</td>
</tr>
<tr>
  <td>Pig C</td>
  <td>Brick</td>
</tr>
"""
root = PARSER.fromstring(data)
main_list = []

for ele in root.iter():  # iter() replaces the deprecated getiterator()
    if ele.tag == "tr":
        text = ele.text_content().strip().split('\n')
        main_list.append(text)

print(main_list)
Output:

[['Pig', '  House Type'], ['Pig A', '  Straw'], ['Pig B', '  Stick'], ['Pig C', '  Brick']]
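
Note that text_content() keeps each cell's leading indentation, so the row entries can still carry whitespace. If clean per-cell strings are preferred, one variation (an adjustment of the snippet above, not part of the original answer) is to iterate the td elements and strip each one individually:

from lxml.html import fromstring

root = fromstring(data)  # data is the same HTML string as above
main_list = []
for tr in root.iter("tr"):
    # strip each cell on its own instead of splitting the row text on newlines
    main_list.append([td.text_content().strip() for td in tr.iter("td")])

print(main_list)  # [['Pig', 'House Type'], ['Pig A', 'Straw'], ['Pig B', 'Stick'], ['Pig C', 'Brick']]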
The first one worked. The second one gives me:

AttributeError: 'ResultSet' object has no attribute 'contents'
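
That AttributeError usually means .contents (or another Tag attribute) was read from the return value of findAll() itself. findAll()/find_all() returns a ResultSet, which is just a list of Tag objects, so the attribute has to be accessed on each element of the list rather than on the list. A small illustrative sketch (the markup here is shortened from the table above):

from bs4 import BeautifulSoup

html = "<table><tr><td>Pig A</td><td>Straw</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

tds = soup.find_all("td")  # a ResultSet: a list of Tag objects
# tds.contents             # raises AttributeError: 'ResultSet' object has no attribute 'contents'
texts = [td.text for td in tds]  # access .text / .contents per Tag instead
print(texts)  # ['Pig A', 'Straw']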