Python: Loop through multiple files with Beautiful Soup and append the text from the HTML
I have a directory of downloaded HTML files (46 of them) and I'm trying to loop through each of them, read their contents, strip out the HTML, and append only the text to a text file. However, I'm not sure where I'm messing up, because nothing gets written to my text file.
import os
import glob
from bs4 import BeautifulSoup

path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (path)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        f.close()
-----Update-----
I've updated my code as follows, but the text file still isn't being created.
import os
import glob
from bs4 import BeautifulSoup

path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
-----Update 2-----
Ah, I realized my directory was incorrect, so now I have:
import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
When I execute this, I get the following error:
Traceback (most recent call last):
  File "C:\Users\Me\Downloads\bsoup.py", line 11, in <module>
    myfile.write(soup)
TypeError: must be str, not BeautifulSoup
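That TypeError happens because file.write() only accepts a string, while soup is a BeautifulSoup object. A minimal standalone sketch of the distinction (parsing a small inline snippet with the stdlib "html.parser" rather than the question's files):

```python
from bs4 import BeautifulSoup

# Parse a small inline snippet instead of a downloaded file.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(type(soup).__name__)  # BeautifulSoup - not a str, so write(soup) fails

# get_text() returns the document's visible text as a plain str,
# which write() accepts; str(soup) would give the HTML markup itself.
print(soup.get_text())
```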
-----Update 3-----
It's working now. Here's the working code:
import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()
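For reference, the same loop can be written a bit more defensively: name the parser explicitly (Beautiful Soup warns when none is given), open each input with a context manager, and open the output file once rather than reopening it on every iteration. This is a sketch under the question's assumed download path, not the exact posted code:

```python
import glob
import os

from bs4 import BeautifulSoup


def append_visible_text(src_dir, out_path):
    """Append the plain text of every .html file in src_dir to out_path."""
    # Keep the output file open once rather than reopening it per input file.
    with open(out_path, "a", encoding="utf-8") as outfile:
        for infile in glob.glob(os.path.join(src_dir, "*.html")):
            with open(infile, "r", encoding="utf-8") as f:
                # "html.parser" is the stdlib parser; no lxml install needed.
                soup = BeautifulSoup(f.read(), "html.parser")
            outfile.write(soup.get_text())


# Assumed path from the question; adjust to your own download directory.
append_visible_text("c:\\users\\me\\downloads\\", "example.txt")
```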
You aren't actually reading the HTML file. This should work:
soup = BeautifulSoup(open(webpage, 'r').read(), 'lxml')
If you want to use lxml directly, here's a modified version of some code I've used in a project. If you want to grab all the text, just don't filter by tag. There may be a way to do it without iterating, but I don't know of one. It saves the data as unicode, so you'll have to take that into account when using it.
What is f? It looks like you opened the HTML file earlier (which you should do) but then changed the code. Also, you aren't stripping the HTML.
I meant to write "myfile.close()" - sorry. I can't seem to figure this out. Is my 'infile in glob.glob(os.path.join(path, "*.html"):' line correct? That iterates through the directory, right?
Apart from the missing closing parenthesis, that part looks correct. And soup = BeautifulSoup(markup) is what strips the HTML, I think? It should create a BeautifulSoup object, which holds the parsed HTML tree and convenient methods for accessing the data. But here it isn't created correctly: you need to open the file and pass in the file object, as shown in the answer below.
The lxml-based code from the answer above:
import os
import glob
import lxml.html

path = '/'
# Whatever tags you want to pull text from.
visible_text_tags = ['p', 'li', 'td', 'h1', 'h2', 'h3', 'h4',
                     'h5', 'h6', 'a', 'div', 'span']

for infile in glob.glob(os.path.join(path, "*.html")):
    doc = lxml.html.parse(infile)
    file_text = []

    for element in doc.iter():  # Iterate once through the entire document
        try:  # Grab tag name and text (+ tail text)
            tag = element.tag
            text = element.text
            tail = element.tail
        except:
            continue

        words = None  # text words split to list
        if tail:  # combine text and tail
            text = text + " " + tail if text else tail
        if text:  # lowercase and split to list
            words = text.lower().split()

        if tag in visible_text_tags:
            if words:
                file_text.append(' '.join(words))

    with open('example.txt', 'a') as myfile:
        myfile.write(' '.join(file_text).encode('utf8'))
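One caveat about the final line: .encode('utf8') returns bytes, and on Python 3 writing bytes to a file opened in text mode raises a TypeError. A Python 3 variant passes the encoding to open() and writes the str directly (a sketch with placeholder data, not the answer's actual word list):

```python
# Python 3 variant of the final write: let open() handle the encoding
# and write the joined str directly instead of encoding it to bytes.
file_text = ["some", "extracted", "words"]  # placeholder for the parsed words

with open("example.txt", "a", encoding="utf8") as myfile:
    myfile.write(' '.join(file_text))
```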