Removing elements from a file in Python

python html

Here is my code snippet:

from HTMLParser import HTMLParser  # Python 2 module; in Python 3 it is html.parser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_endtag(self, tag):
        # end of a table row: terminate the output line
        if tag == 'tr':
            textFile.write('\n')

    def handle_data(self, data):
        # write every text node followed by a tab
        textFile.write(data + "\t")

textFile = open('instaQueryResult', 'w+')

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
fh = open('/data/aman/aggregate.html', 'r')
l = fh.readlines()
for line in l:
    parser.feed(line)

I parse an HTML file and get the following output:

plantype        count(distinct(SubscriberId))   sum(DownBytesNONE)      sum(UpBytesNONE)            sum(SessionCountNONE)
1006657 341175  36435436130     36472526498     694016
1013287 342280  36694005846     36533489363     697098
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618
plantype        count(distinct(SubscriberId))   sum(DownBytesNONE)      sum(UpBytesNONE)            sum(SessionCountNONE)
1013287 342280  36694005846     36533489363     697098
1006657 341175  36435436130     36472526498     694016
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618

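This output is correct, but I want to remove the headers. The lines starting with plantype are the header lines I want to delete from the file, leaving only the values.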

Expected output:

1006657 341175  36435436130     36472526498     694016
1013287 342280  36694005846     36533489363     697098
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618
1013287 342280  36694005846     36533489363     697098
1006657 341175  36435436130     36472526498     694016
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618

Since you are trying to remove anything that is not a number, you could try modifying the handle_data(self, data) method as follows:

def handle_data(self, data):
    text = data.strip()  # drop surrounding whitespace and newlines
    if text.isdigit():   # keep only purely numeric cells
        textFile.write(text + "\t")

I assume your HTML data has the following form:

<table>
    <tr>
        <td>plantype</td>
        <td>count(distinct(SubscriberId))</td>
        ...
    </tr>
    <tr>
        <td>1006657</td>
        <td>341175</td>
        ...
    </tr>
</table>
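
Building on the handle_data filter above, here is a slightly fuller sketch. It assumes Python 3 (where the module is html.parser rather than the HTMLParser module used in the question) and HTML shaped like the table just shown. The class name NumericRowParser and its rows attribute are made up for illustration. Instead of writing each cell immediately, it buffers the cells of a row and keeps the row only if every cell is numeric, so a header row leaves no empty line behind.

```python
from html.parser import HTMLParser  # Python 3; Python 2 uses "from HTMLParser import HTMLParser"


class NumericRowParser(HTMLParser):
    """Collect table rows whose cells are all digits (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished numeric rows
        self._cells = []  # cells of the row currently being parsed

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace between tags
            self._cells.append(text)

    def handle_endtag(self, tag):
        if tag == 'tr':
            # keep the row only if it is non-empty and purely numeric
            if self._cells and all(c.isdigit() for c in self._cells):
                self.rows.append(self._cells)
            self._cells = []


# Sample input in the assumed shape (not the asker's real file)
sample = """<table>
<tr><td>plantype</td><td>count(distinct(SubscriberId))</td></tr>
<tr><td>1006657</td><td>341175</td></tr>
<tr><td>1013287</td><td>342280</td></tr>
</table>"""

parser = NumericRowParser()
parser.feed(sample)
parser.close()

# One tab-separated line per numeric row
output = "\n".join("\t".join(row) for row in parser.rows)
print(output)
```

Writing output to a file such as instaQueryResult would then replace the print call. Note that this filter would also drop any data row containing a non-numeric cell, which may or may not be what you want.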

Try this:

fh = open('/data/aman/aggregate.html','r')
l = fh.readlines()
for line in l:
    if 'plantype' not in line:
        parser.feed(line)

You are reading the file line by line. With the check "if 'part of the string' not in line", the next block is executed only for the other lines (the ones you want).
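
A small sketch of how this line filter behaves, with made-up sample lines and a hypothetical helper named non_header_lines. The check runs on raw lines before they reach the parser, so it relies on each header row sitting on its own line. If the whole table happens to be on a single line, that line contains the marker and nothing is fed to the parser, which is one possible reason for an empty output file.

```python
def non_header_lines(lines, marker='plantype'):
    """Yield only the lines that do not contain the marker (hypothetical helper)."""
    for line in lines:
        if marker not in line:
            yield line


# Assumed layout: one table row per line
lines = [
    "<tr><td>plantype</td><td>count(distinct(SubscriberId))</td></tr>\n",
    "<tr><td>1006657</td><td>341175</td></tr>\n",
    "<tr><td>1013287</td><td>342280</td></tr>\n",
]
kept = list(non_header_lines(lines))
print(kept)  # only the two numeric rows remain

# Same rows joined into one line: the marker is present, so the line is skipped
single_line = ["".join(line.strip() for line in lines)]
print(list(non_header_lines(single_line)))  # []
```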

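I tried your code, but it doesn't write anything to the file. The file is empty.

Wait, let me show the HTML file format as well: plantypecount(distinct(SubscriberId))sum(DownBytesNONE)sum(UpBytesNONE)sum(SessionCountNONE)10066573411755364354361303647252649869401610132734228036694005846365389369709100661334866763692173755893252699976 So I want to remove the data in between.

Yes.

Can you tell me how to remove the '\n' from the file? Where should I apply the strip('\n') function? Thanks

Newlines are added in your handle_endtag method. Try not writing the newline there.

TypeError: handle_starttag() takes exactly 2 arguments (3 given). It gives me this error and I don't know why. I pasted the same starttag function as above, which has 2 parameters, but it says 3 were given.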
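
Regarding the TypeError: HTMLParser calls handle_starttag with the tag name and the list of attributes, so together with self the override needs three parameters. A minimal sketch, assuming Python 3 (module html.parser); the class names TagLogger and Broken and the seen list are illustrative only.

```python
from html.parser import HTMLParser  # Python 3 module name


class TagLogger(HTMLParser):
    """Record start tags; the class name and 'seen' list are illustrative."""

    def __init__(self):
        super().__init__()
        self.seen = []

    # The parser passes both the tag name and its attribute list,
    # so the override takes three parameters including self.
    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


logger = TagLogger()
logger.feed('<tr class="row"><td>1006657</td></tr>')
print(logger.seen)  # [('tr', [('class', 'row')]), ('td', [])]


class Broken(HTMLParser):
    # Missing the attrs parameter: the parser's call fails with TypeError
    def handle_starttag(self, tag):
        pass


raised = False
try:
    Broken().feed('<tr>')
except TypeError:
    raised = True
print(raised)  # True
```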
