Python: extracting tag attributes with BeautifulSoup

python parsing

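I have been struggling with this all day. Basically, I cannot extract information from a tag. For example, looping over soup.findAll('REUTERS') and appending item['LEWISSPLIT'] to a list gives me nothing but [].

Why on earth is it so hard to extract the value of the LEWISSPLIT attribute from the REUTERS tag?

Thank you very much for reading this.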

Joel Cornett is right: 'reuters' and 'lewissplit' should be written in lowercase. The correct syntax is:

for item in soup.findAll('reuters'):
    tags.append(item['lewissplit'])
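
A small self-contained sketch of the same fix, assuming Python 3 and the newer bs4 package with its "html.parser" backend (the question itself uses BeautifulSoup 3 on Python 2, where the method is spelled findAll). The sample string is a trimmed-down stand-in for the Reuters data, not the real file:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for two records of the Reuters SGML file.
sample = """
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" NEWID="2">
<TOPICS></TOPICS>
</REUTERS>
"""

soup = BeautifulSoup(sample, "html.parser")

# Searching with the uppercase name finds nothing, because the parser
# has already normalized tag names to lowercase.
upper_matches = soup.find_all("REUTERS")

# Searching with the lowercase name finds both records.
lower_matches = soup.find_all("reuters")

# Attribute names are lowercased too; attribute values keep their case.
splits = [item["lewissplit"] for item in lower_matches]
topic_flags = [item.get("topics") for item in lower_matches]

print(len(upper_matches), splits, topic_flags)
```

With this backend, upper_matches should come back empty while splits holds the LEWISSPLIT values, which mirrors the [] seen in the question.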

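What happens when you call soup.findAll('REUTERS')? What kind of output do you get? Have you tried soup.findAll('reuters')? I noticed that BeautifulSoup converts all tag names to lowercase when it parses the XML you provided.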
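
The lowercasing can be observed one level below BeautifulSoup. As a rough illustration, assuming Python 3, the standard library's html.parser.HTMLParser (one of the backends the newer bs4 package can use) reports tag and attribute names in lowercase. NameCollector is a hypothetical helper written only for this demonstration:

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Record each start tag together with its attribute names."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the names.
        self.seen.append((tag, [name for name, _ in attrs]))


collector = NameCollector()
collector.feed(
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN">'
    "<TOPICS><D>cocoa</D></TOPICS>"
    "</REUTERS>"
)
collector.close()

for tag, attr_names in collector.seen:
    print(tag, attr_names)
```

Every name handed to handle_starttag is already lowercase, so any lookup done afterwards has to use lowercase names as well.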

import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO

import BeautifulSoup
from BeautifulSoup import BeautifulSoup

totstring=""

with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","", line)
        totstring+=string

soup = BeautifulSoup(totstring)

bodies = list()
topics = list()
tags = list()

for a in soup.findAll("body"):
    bodies.append(a)


for b in soup.findAll("topics"):
    topics.append(b)

for item in soup.findAll('REUTERS'):
    tags.append(item['TOPICS'])



outputstring=""

for x in range(0,len(bodies)):
    if topics[x].text=="":
        continue
    outputstring=outputstring+"<TOPICS>"+topics[x].text+"</TOPICS>\n"+"<BODY>"+bodies[x].text+"</BODY>\n"

outfile=open("output.sgm","w")
outfile.write(outputstring)

outfile.close()

print tags[0]


<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;C T
&#22;&#22;&#1;f0704&#31;reute
u f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>
<TEXT>&#2;
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
&#3;</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;F Y
&#22;&#22;&#1;f0708&#31;reute
d f BC-STANDARD-OIL-&lt;SRD>-TO   02-26 0082</UNKNOWN>
<TEXT>&#2;
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE>    CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
    BP North America is a subsidiary of British Petroleum Co
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil.
    The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.
&#3;</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5546" NEWID="3">
<DATE>26-FEB-1987 15:03:27.51</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;F A
&#22;&#22;&#1;f0714&#31;reute
d f BC-TEXAS-COMMERCE-BANCSH   02-26 0064</UNKNOWN>
<TEXT>&#2;
<TITLE>TEXAS COMMERCE BANCSHARES &lt;TCB> FILES PLAN</TITLE>
<DATELINE>    HOUSTON, Feb 26 - </DATELINE><BODY>Texas Commerce Bancshares Inc's Texas
Commerce Bank-Houston said it filed an application with the
Comptroller of the Currency in an effort to create the largest
banking network in Harris County.
    The bank said the network would link 31 banks having
13.5 billion dlrs in assets and 7.5 billion dlrs in deposits.

 Reuter
&#3;</BODY></TEXT>
</REUTERS>

<topic>oil</topic>
<body>asdsd</body>
<topic>grain</topic>
<body>asdsdds</body>
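
Putting the pieces together, here is one possible Python 3 sketch of what the script is trying to do, assuming the bs4 package and its "html.parser" backend rather than the BeautifulSoup 3 import used in the question. The function name extract_records and the dictionary keys are made up for this illustration, and the sample text is a shortened stand-in for the real file:

```python
from bs4 import BeautifulSoup


def extract_records(sgml_text):
    """Return one dict per <REUTERS> record that has at least one topic."""
    soup = BeautifulSoup(sgml_text, "html.parser")
    records = []
    for item in soup.find_all("reuters"):  # lowercase tag name
        topics_tag = item.find("topics")
        body_tag = item.find("body")
        topics = []
        if topics_tag is not None:
            topics = [d.get_text() for d in topics_tag.find_all("d")]
        if not topics:
            continue  # skip records without topics, like the original loop
        records.append({
            "lewissplit": item.get("lewissplit"),  # lowercase attribute name
            "topics": topics,
            "body": body_tag.get_text().strip() if body_tag is not None else "",
        })
    return records


# Shortened stand-in for the Reuters data shown above.
sample = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">'
    "<TOPICS><D>cocoa</D></TOPICS>"
    "<TEXT><BODY>Showers continued.</BODY></TEXT>"
    "</REUTERS>"
    '<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" NEWID="2">'
    "<TOPICS></TOPICS>"
    "<TEXT><BODY>Standard Oil.</BODY></TEXT>"
    "</REUTERS>"
)

records = extract_records(sample)
for record in records:
    print(record["lewissplit"], record["topics"], record["body"])
```

Reading the topics and body from inside each reuters element keeps them paired per record, instead of relying on two separate lists staying the same length.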

for item in soup.findAll('REUTERS'):
    tags.append(item['LEWISSPLIT'])

print tags[0]

for item in soup.findAll('reuters'):
    tags.append(item['lewissplit'])