How do I read all HTML files in a directory and write their contents to a CSV file using Python?


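I am trying to read all HTML files in a directory and write them to a CSV file. Each row in the CSV file should contain the contents of one HTML file.

I seem to be able to read one HTML file and write it to the CSV file.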

import os, csv
import fnmatch
from pathlib import Path

directory = "directory/"

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
            if 'apples and oranges' in html:
                with open('output.csv', 'w') as f:
                    writer = csv.writer(f)
                    lines = [[html]]
                    for l in lines:
                        writer.writerow(l)
Currently I only see one HTML file written to a single CSV row.

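This is because you are using

with open('output.csv', 'w') as f:

which truncates the file and overwrites its previous contents.

You should use

with open('output.csv', 'a') as f:

which opens the file for writing if it does not exist yet, and appends to the end of the file if it already exists.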
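The difference between the two modes is easy to check in isolation. The following is a minimal sketch and not part of the original answer; the helper name write_rows and the file name demo.csv are made up for illustration. It reopens the file once per row, as the loop in the question does, first in 'w' mode and then in 'a' mode.

```python
import os
import tempfile


def write_rows(path, rows, mode):
    """Reopen `path` in `mode` once per row, then return the lines the file ends up with."""
    for row in rows:
        with open(path, mode) as f:
            f.write(row + "\n")
    with open(path) as f:
        return f.read().splitlines()


rows = ["first", "second", "third"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")  # hypothetical file name

    # 'w' truncates on every open, so only the last row survives
    w_result = write_rows(path, rows, "w")

    os.remove(path)  # start from a clean file before appending

    # 'a' keeps what is already there and adds to the end
    a_result = write_rows(path, rows, "a")

print(w_result)  # ['third']
print(a_result)  # ['first', 'second', 'third']
```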

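When you use open('output.csv', 'w'), the file is rewritten every time that statement runs, which here means on every loop iteration. It is like saving new material to a file under the same name: whatever the file held before the save is gone afterwards. You need open('output.csv', 'a') so that what you write is appended to the file rather than overwriting it. If you do use append mode, however, delete any existing output file first, otherwise you will append to the results of a previous run.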

Here is a working example. I added some extra formatting to produce the output you described in your question.

import os
import fnmatch
import re


directory = "directory/"

# Remove the output file if it exists, otherwise you'll have output from the previous execution
if os.path.exists('output.csv'):
    os.remove('output.csv')

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = [line.rstrip("\n") for line in f]    # Puts each line from the HTML file into a list
            lines = "".join(html)                       # Concatenates that list into a single string
            line = re.sub(" +", " ", lines)             # Collapses each run of spaces " +" into one single space
            if re.search("apples and oranges", line):
                # Changed 'w' (write) to 'a' (append). With 'w' the file is rewritten every time it is opened;
                # with 'a' text is only added to the end. newline='' keeps "\n" as a single byte on every platform.
                with open('output.csv', 'a', newline='') as out:
                    out.write(line + ",\n")

# Removes the comma and newline at the end of the file (only if at least one row was written)
if os.path.exists('output.csv'):
    with open("output.csv", 'rb+') as filehandle:
        filehandle.seek(-2, os.SEEK_END)
        filehandle.truncate()
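
A possible alternative, which avoids both the delete-first step and the trailing-comma cleanup, is to open the output file once before the loop and let the csv module handle the quoting. This is only a sketch under assumptions: the function name collect_html_to_csv and its parameters are hypothetical, the files are assumed to be UTF-8, and unlike the example above it writes each file's content unmodified as a single quoted CSV field.

```python
import csv
import fnmatch
import os


def collect_html_to_csv(directory, output_path, needle="apples and oranges"):
    """Write one CSV row per matching .html file under `directory`.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    count = 0
    # Opened once in 'w' mode: old output is discarded a single time,
    # not on every loop iteration. newline='' is what the csv module
    # documentation recommends for files passed to csv.writer.
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for dirpath, dirs, files in os.walk(directory):
            for filename in fnmatch.filter(files, "*.html"):
                with open(os.path.join(dirpath, filename), encoding="utf-8") as f:
                    html = f.read()
                if needle in html:
                    # csv.writer quotes embedded commas, quotes and newlines
                    writer.writerow([html])
                    count += 1
    return count
```

Calling collect_html_to_csv("directory/", "output.csv") would then produce one row per matching file. Because the file's content may span several lines, a row can span several physical lines in the output; csv.reader reads such a quoted field back as one value.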

I am still getting the same result.