Python: append list elements "randomly, without repetition" to multiple HTML files
python, html, random, beautifulsoup, href
I am trying to use regex to replace the href URL with the resulting value. I also tried the BeautifulSoup module, but without success. I keep getting one and the same URL in all of the HTML files.
import glob
import random

from bs4 import BeautifulSoup
from googleapiclient.http import MediaFileUpload

# drive_service is assumed to be an authenticated Google Drive API client created elsewhere.

class RandomChoiceNoImmediateRepeat(object):
    def __init__(self, lst):
        self.lst = lst
        self.last = None

    def choice(self):
        if self.last is None:
            self.last = random.choice(self.lst)
            return self.last
        else:
            nxt = random.choice(self.lst)
            # make a new choice as long as it's equal to the last.
            while nxt == self.last:
                nxt = random.choice(self.lst)
            # Replace the last and return the choice
            self.last = nxt
            return nxt

# Upload every text file and build its download link.
for filename in glob.glob('/docs/*.txt'):
    file_metadata = { 'name': 'file.txt', 'mimeType': '*/*' }
    media = MediaFileUpload(filename, mimetype='*/*', resumable=True)
    file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    link = 'https://drive.google.com/uc?export=download&id=' + file.get('id')
    linkd = []
    linkd.append(link)

# Rewrite the href of every "downloadme" anchor in each HTML file.
for filename in glob.glob('/docs/htmlz/*.html'):
    with open(filename, "r") as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        for anchor in soup.findAll("a", attrs={ "class" : "downloadme" }):
            gen = RandomChoiceNoImmediateRepeat(linkd)
            i = gen.choice()
            anchor['href'] = str(i)
    with open(filename, "w") as html_file:
        html_file.write(str(soup))
        html_file.close()
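One thing worth checking in the code above: linkd = [] runs right before linkd.append(link), so the list never seems to hold more than one link, and a new RandomChoiceNoImmediateRepeat is created for every anchor, so self.last is always None. Either of these could explain why every file ends up with the same URL. The following is only a minimal sketch of the idea, not a tested fix for the original script: it leaves out the Drive upload and the HTML parsing, and the names NoImmediateRepeatChooser, assign_links, links and pages are made up for illustration. The list is built once and a single chooser is shared by all files.

```python
import random


class NoImmediateRepeatChooser:
    """Pick random items from a list, never returning the same item twice in a row."""

    def __init__(self, items):
        self.items = list(items)
        self.last = None

    def choice(self):
        # With fewer than two distinct items a repeat cannot be avoided,
        # so return a plain random pick instead of looping forever.
        if len(set(self.items)) < 2:
            self.last = random.choice(self.items)
            return self.last
        nxt = random.choice(self.items)
        # Draw again for as long as the pick equals the previous one.
        while nxt == self.last:
            nxt = random.choice(self.items)
        self.last = nxt
        return nxt


def assign_links(filenames, links):
    """Map each filename to a link; consecutive files never share a link."""
    chooser = NoImmediateRepeatChooser(links)  # created once, shared by all files
    return {name: chooser.choice() for name in filenames}


# Hypothetical data standing in for the uploaded Drive links and the HTML files.
links = [
    "https://drive.google.com/uc?export=download&id=AAA",
    "https://drive.google.com/uc?export=download&id=BBB",
    "https://drive.google.com/uc?export=download&id=CCC",
]
pages = ["a.html", "b.html", "c.html", "d.html"]

for name, link in assign_links(pages, links).items():
    print(name, "->", link)
```

In the original script this would roughly correspond to creating linkd before the upload loop and creating gen once before the loop over the HTML files. Note that this only prevents immediate repeats; if no link may ever be reused, random.sample or random.shuffle on the list would be the closer fit.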
First, the root cause is that re.sub expects a string or bytes-like object, but you passed it a different type.
Edit:
I created an example that shows how to access the elements of a bs4.element.ResultSet.
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<tr class="hello">first_elem</tr><tr>second_elem</tr>', "html.parser")
trs = soup.find_all("tr")
print("Content: {} --> Type: {}".format(trs, type(trs)))
print("Content: {} --> Type: {}".format(trs[0], type(trs[0])))
print("Content: {} --> Type: {}".format(trs[0]["class"], type(trs[0]["class"])))
print("Content: {} --> Type: {}".format(trs[0]["class"][0], type(trs[0]["class"][0])))
Output:
>>> python3 ci/common/python_utils/test_file.py
Content: [<tr class="hello">first_elem</tr>, <tr>second_elem</tr>] --> Type: <class 'bs4.element.ResultSet'>
Content: <tr class="hello">first_elem</tr> --> Type: <class 'bs4.element.Tag'>
Content: ['hello'] --> Type: <class 'list'>
Content: hello --> Type: <class 'str'>
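As shown above, .findAll returns a bs4.element.ResultSet, which contains bs4.element.Tag elements. If you select the class attribute of a tag, you get a list such as ['hello']. You have to use the right index, such as [0], to get a value of type str (as you can see in the last line of the output).

re.sub needs a string. So convert with str(anchor) and see if it works.

I did that and the error went away, but the href URL in the result did not change.

This time I always get a KeyError: "Traceback (most recent call last): File "newpost.py", line 95, in re.sub("https\:\/\/drive.google.com\/uc\?export\=download&id\=(.*)", r'\g'+e, anchor[0]) File "/.local/lib/python3.6/site-packages/bs4/element.py", line 971, in __getitem__ return self.attrs[key] KeyError: 0"

I see! You should print the anchor variable and check its contents and how to access its elements. The problem is that the anchor "container" has no 0 key. I have updated my answer. I hope it helps solve your problem. Let me know if you need more support.

That did not work for me. I tried another approach, but I have a question, if you can help me.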
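To make the KeyError: 0 discussion concrete: a Tag is indexed by attribute name, so anchor["href"] gives the href string, while anchor[0] looks for an attribute named 0. The sketch below shows the regex step on a plain string, under the assumption that the goal is to swap the id at the end of a Drive download URL. The function name replace_drive_id and the sample values are hypothetical, and the pattern is a simplified version of the one in the traceback.

```python
import re

# Group 1 captures the fixed part of the URL; everything after "id=" is replaced.
DRIVE_URL = re.compile(r"(https://drive\.google\.com/uc\?export=download&id=).*")


def replace_drive_id(href, new_id):
    """Return href with the Drive file id replaced by new_id.

    href must be a str. With BeautifulSoup that would be anchor["href"],
    not anchor[0] and not the Tag object itself.
    """
    # A function is used as the replacement so that backslashes in new_id
    # are never interpreted as group references.
    return DRIVE_URL.sub(lambda m: m.group(1) + new_id, href)


# Hypothetical values for illustration.
old_href = "https://drive.google.com/uc?export=download&id=OLD_ID"
print(replace_drive_id(old_href, "NEW_ID"))
# prints https://drive.google.com/uc?export=download&id=NEW_ID
```

With a Tag, the result would then be written back with anchor["href"] = replace_drive_id(anchor["href"], new_id). If the whole URL is being replaced anyway, assigning the new link directly to anchor["href"] avoids the regex altogether.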