How do I read all HTML files in a directory and write their contents to a CSV file using Python?

python csv file-io

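I'm trying to read all the HTML files in a directory and write them to a CSV file. Each row of the CSV file should contain the contents of one HTML file.

I seem to be able to read only one HTML file and write it to the CSV file.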

import os, csv
import fnmatch
from pathlib import Path

directory = "directory/"

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
            if 'apples and oranges' in html:
                with open('output.csv', 'w') as f:
                    writer = csv.writer(f)
                    lines = [[html]]
                    for l in lines:
                        writer.writerow(l)

At the moment I only see one HTML file written to a single CSV row.

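This is because you used

open('output.csv', 'w')

which truncates the file and overwrites its previous contents.

You should use

open('output.csv', 'a')

which opens the file for writing if it does not exist yet, and appends to the end of the file if it already exists.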
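To make the difference between the two modes concrete, here is a minimal sketch that reopens a file once per row, the way the loop in the question does. The helper name `write_rows` and the sample rows are made up for illustration, and the files live in a temporary directory so nothing on disk is touched.

```python
import os
import tempfile


def write_rows(path, rows, mode):
    """Reopen `path` with `mode` for every row, then return the lines the file ends up with."""
    for row in rows:
        with open(path, mode) as out:
            out.write(row + "\n")
    with open(path) as f:
        return f.read().splitlines()


rows = ["first", "second", "third"]
with tempfile.TemporaryDirectory() as tmp:
    w_result = write_rows(os.path.join(tmp, "w.csv"), rows, "w")
    a_result = write_rows(os.path.join(tmp, "a.csv"), rows, "a")

print(w_result)  # ['third']
print(a_result)  # ['first', 'second', 'third']
```

With 'w', each reopen truncates the file, so only the last row survives. With 'a', every row is kept.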

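When you use open('output.csv', 'w'), the file is rewritten every time that call runs, which here means on every loop iteration. It is like saving new material to a file with the same name: whatever was in the file before the save is gone afterwards. You need to use open('output.csv', 'a') so that the file is only appended to rather than overwritten. However, if you do use append mode, first delete any output file that already exists, otherwise you will be appending to the old results.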

Here is a working example. I added some extra formatting to produce the output you described in the question.

import os
import fnmatch
import re


directory = "directory/"

# Remove the output file if it exists, otherwise it will still contain output from the previous run
if os.path.exists('output.csv'):
    os.remove('output.csv')

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = [line.rstrip("\n") for line in f]    # Put each line of the HTML file into a list
            lines = "".join(html)                       # Concatenate that list into a single string
            line = re.sub(" +", " ", lines)             # Collapse each run of spaces " +" into a single space
            if re.search("apples and oranges", line):
                # Changed 'w' (write) to 'a' (append). With 'w' the file is rewritten every time it is opened;
                # with 'a' text is only added to the end. newline="\n" keeps the line ending a single byte on every platform.
                with open('output.csv', 'a', newline="\n") as out:
                    out.write(line + ",\n")

# Remove the comma and newline at the end of the file (skipped if no row was written)
if os.path.exists('output.csv'):
    with open("output.csv", 'rb+') as filehandle:
        filehandle.seek(-2, os.SEEK_END)
        filehandle.truncate()

I still get the same result.
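As a variation on the append approach, the output file can instead be opened once, before the loop, so that 'w' truncates it only a single time. The sketch below does that and also goes back to csv.writer, as in the question, so that commas, quotes and newlines inside the HTML are quoted and each file stays in one CSV field. The function name html_files_to_csv, its parameters, and the UTF-8 encoding are assumptions made for illustration; they are not from the original posts.

```python
import csv
from pathlib import Path


def html_files_to_csv(directory, output_path, needle="apples and oranges"):
    """Write one CSV row per HTML file containing `needle`; return the number of rows written."""
    count = 0
    # Open the output once, outside the loop, so "w" truncates it only one time.
    # newline="" is what the csv module documentation recommends for files given to csv.writer.
    # UTF-8 is an assumption here; adjust it to match the actual files.
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        # rglob walks subdirectories too, similar to os.walk plus fnmatch.filter.
        for path in sorted(Path(directory).rglob("*.html")):
            html = path.read_text(encoding="utf-8")
            if needle in html:
                writer.writerow([html])  # one row, one field: the whole file
                count += 1
    return count
```

It could be called as html_files_to_csv("directory/", "output.csv"). Because the file is opened in 'w' mode exactly once, there is no need to delete an old output file first or to trim a trailing comma afterwards.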
