Parsing every HTML file in a folder with BeautifulSoup

html python-2.7

My task is to read every HTML file in a directory. The condition is to determine whether each file contains the following tags:

(1) <strong>OO</strong>  
(2) <strong>QQ</strong>
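
One possible way to express that condition is sketched below. It assumes a file qualifies only when both labels appear as the text of a `<strong>` tag; the function name `contains_strong_labels` and its `labels` parameter are hypothetical and do not come from the original post.

```python
from bs4 import BeautifulSoup


def contains_strong_labels(html, labels=("OO", "QQ")):
    """Return True if every label is the text of some <strong> tag.

    Hypothetical helper: requiring all labels (rather than any one)
    is an assumption about the intended condition.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Collect the stripped text of every <strong> tag in the document.
    found = {tag.get_text(strip=True) for tag in soup.find_all("strong")}
    return all(label in found for label in labels)
```

If a single matching tag should be enough, replacing `all` with `any` gives that behaviour.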

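Your write call is nested inside the for loop, which is why several lines are written to index.txt for each file. Move the write out of the loop and collect all of the participant text in the variable parti_names, like this: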

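import re

# soup, find_participant, filename, title, ticker and d_date are
# defined earlier, in the code from the question.
participants = soup.find(find_participant)
parti_names = ""
for parti in participants.find_next_siblings("p"):
    # Stop at the paragraph that introduces the Operator section.
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    parti_names += parti.get_text(strip=True) + ","
    print(parti.get_text(strip=True))

# Write a single line per file, after the loop has collected every name.
with open('index.txt', 'a+') as indexFile:
    indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')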
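
For readers who want to try the collect-then-write-once pattern in isolation, here is a minimal self-contained sketch. The sample markup, the "Participants" heading, and the helper names `collect_participants` and `index_line` are all assumptions made for illustration; the real pages and the `find_participant` function from the question are not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real transcripts may be laid out differently.
SAMPLE = """
<p><strong>Participants</strong></p>
<p>Alice</p>
<p>Bob</p>
<p><strong>Operator</strong></p>
<p>Not a participant</p>
"""


def collect_participants(html):
    """Return the text of each <p> between the Participants and Operator headings."""
    soup = BeautifulSoup(html, "html.parser")
    header = None
    for strong in soup.find_all("strong"):
        if "Participants" in strong.get_text():
            header = strong.find_parent("p")
            break
    if header is None:
        return []
    names = []
    for p in header.find_next_siblings("p"):
        strong = p.find("strong")
        # Stop collecting once the Operator heading is reached.
        if strong is not None and "Operator" in strong.get_text():
            break
        names.append(p.get_text(strip=True))
    return names


def index_line(filename, names):
    """Build the single line written to the index for one file."""
    return filename + ", " + ",".join(names) + "\n"
```

The loop only gathers names; the caller writes `index_line(...)` to the index file once per HTML file, so each file contributes exactly one line.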

Update:

You can use basename to get just the file name:

from os.path import basename

# you can call it directly with basename
print(basename("C:/Users/.../output/100107-.html"))

Output:

100107-.html
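
The same call can be applied to every HTML file in a folder. The sketch below assumes the files are matched with `glob` and end in `.html`; the helper name `html_basenames` is hypothetical.

```python
import glob
import os


def html_basenames(directory):
    """Return the sorted names (without the directory part) of the .html files."""
    pattern = os.path.join(directory, "*.html")
    # glob returns full paths; basename drops everything before the file name.
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```

Keeping the full path for opening the file and calling `basename` only when writing the index line avoids having to rebuild the path later.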

I have one more question: I only want the file name, but the output gives me the path plus the file name. I have just updated the code.