Python: append list elements; "random without repeats"; copy to multiple HTML files

python html random

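I am trying to use regex to replace the href URL with the resulting value. I also tried the BeautifulSoup module, but without success: I keep getting one and the same URL in all of the HTML files.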

class RandomChoiceNoImmediateRepeat(object):
    def __init__(self, lst):
        self.lst = lst
        self.last = None
    def choice(self):
        if self.last is None:
            self.last = random.choice(self.lst)
            return self.last
        else:
            nxt = random.choice(self.lst)
            # make a new choice as long as it's equal to the last.
            while nxt == self.last:   
                nxt = random.choice(self.lst)
            # Replace the last and return the choice
            self.last = nxt
            return nxt

for filename in glob.glob('/docs/*.txt'):
    file_metadata = { 'name': 'file.txt', 'mimeType': '*/*' }
    media = MediaFileUpload(filename, mimetype='*/*', resumable=True)
    file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    link = 'https://drive.google.com/uc?export=download&id=' + file.get('id')
    linkd = []
    linkd.append(link)
    for filename in glob.glob('/docs/htmlz/*.html'):
        with open(filename, "r") as html_file:
            soup = BeautifulSoup(html_file,'html.parser')
            for anchor in soup.findAll("a", attrs={ "class" : "downloadme" }):
                gen = RandomChoiceNoImmediateRepeat(linkd)
                i = gen.choice()
                anchor['href'] = str(i)
                with open(filename, "w") as html_file:
                    html_file.write(str(soup))
                    html_file.close()

First of all, the root cause is that re.sub expects a string or bytes-like object, but you passed it something of a different type.
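To make that concrete, here is a minimal sketch using only the standard library. The link, the ids OLD_ID and NEW_ID, and the pattern are made up for illustration; they only mimic the shape of the URLs built in the question.

```python
import re

# Hypothetical link in the same shape as the ones built in the question.
href = "https://drive.google.com/uc?export=download&id=OLD_ID"
pattern = r"(https://drive\.google\.com/uc\?export=download&id=).*"

# A str works: keep the prefix (group 1) and append a new, made-up id.
new_href = re.sub(pattern, lambda m: m.group(1) + "NEW_ID", href)
print(new_href)

# A list (or any other non-string object) is rejected with TypeError.
try:
    re.sub(pattern, "x", [href])
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

Passing a function as the replacement avoids having to escape backslashes in a replacement string.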

Edit:

I created an example that shows how to access the elements of a bs4.element.ResultSet.

Code:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<tr class="hello">first_elem</tr><tr>second_elem</tr>', "html.parser")
trs = soup.find_all("tr")
print("Content: {}  -->  Type: {}".format(trs, type(trs)))
print("Content: {}  -->  Type: {}".format(trs[0], type(trs[0])))
print("Content: {}  -->  Type: {}".format(trs[0]["class"], type(trs[0]["class"])))
print("Content: {}  -->  Type: {}".format(trs[0]["class"][0], type(trs[0]["class"][0])))

Output:

>>> python3 ci/common/python_utils/test_file.py 
Content: [<tr class="hello">first_elem</tr>, <tr>second_elem</tr>]  -->  Type: <class 'bs4.element.ResultSet'>
Content: <tr class="hello">first_elem</tr>  -->  Type: <class 'bs4.element.Tag'>
Content: ['hello']  -->  Type: <class 'list'>
Content: hello  -->  Type: <class 'str'>

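As shown above, .findAll returns a bs4.element.ResultSet, which contains bs4.element.Tag elements. If you select the class attribute of a tag, you get a list such as ['hello']; you have to use the right index, such as [0], to get a value of type str (as you can see in the last line of the output).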
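Applied to the anchors in the question, the same access pattern might look like the sketch below. The HTML snippet and the example.com URLs are invented for illustration. Note that class is a multi-valued attribute, which is why it comes back as a list, whereas href is returned as a plain str and needs no extra index.

```python
from bs4 import BeautifulSoup

# Made-up markup: one anchor to rewrite, one to leave alone.
html = (
    '<a class="downloadme" href="https://example.com/old">get</a>'
    '<a class="other" href="https://example.com/keep">skip</a>'
)
soup = BeautifulSoup(html, "html.parser")

anchors = soup.find_all("a", attrs={"class": "downloadme"})  # ResultSet
for anchor in anchors:                                        # each item is a Tag
    old = anchor["href"]                          # already a str
    anchor["href"] = old.replace("/old", "/new")  # assign a str back

# Integer indexing on a Tag is an attribute lookup, so it fails.
try:
    anchors[0][0]
    key_error = False
except KeyError:
    key_error = True

result = str(soup)
print(result)
```

Any string operation, including re.sub, can be applied to anchor["href"] before assigning the result back to the tag.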

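re.sub needs a string. So convert it with str(anchor) and see whether that works.

I did that and the error disappeared, but the resulting href URL did not change.

This time I always get an error, a KeyError:

    Traceback (most recent call last):
      File "newpost.py", line 95, in
        re.sub("https\:\/\/drive.google.com\/uc\?export\=download&id\=(.*)", r'\g'+e, anchor[0])
      File "/.local/lib/python3.6/site-packages/bs4/element.py", line 971, in __getitem__
        return self.attrs[key]
    KeyError: 0

I see! You should print the anchor variable and check its contents and how its elements can be accessed. The problem is that the anchor "container" does not have the key. I have updated my answer. I hope it helps solve your problem. Let me know if you need more support.

I did not succeed. I tried another approach, but I have a problem with it, if you can help me.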
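A likely reason for the original symptom (the same URL in every HTML file) is visible in the question's code: linkd is reset to an empty list for every upload, a new RandomChoiceNoImmediateRepeat is built for every anchor, and all HTML files are rewritten on every upload, so each picker only ever sees one link and the last upload wins. A minimal sketch of the alternative, assuming the links are collected first and a single picker is shared, could look like this. The ids ID_A, ID_B and ID_C are placeholders, and the guard for a one-element list is an addition that is not in the original class.

```python
import random


class RandomChoiceNoImmediateRepeat:
    """Pick random items, never the same one twice in a row."""

    def __init__(self, items):
        self.items = list(items)
        self.last = None

    def choice(self):
        # With a single distinct value a repeat is unavoidable; return it
        # rather than looping forever.
        candidates = [x for x in self.items if x != self.last] or self.items
        self.last = random.choice(candidates)
        return self.last


# Placeholder ids standing in for the links returned by the uploads.
links = ["https://drive.google.com/uc?export=download&id=" + file_id
         for file_id in ("ID_A", "ID_B", "ID_C")]

# One picker, created once, then reused for every anchor in every file.
picker = RandomChoiceNoImmediateRepeat(links)
picks = [picker.choice() for _ in range(20)]
print(picks[:5])
```

In the real script, the upload loop would only append to the list, and the loop over the HTML files would run once afterwards, calling picker.choice() for each anchor and writing each file a single time after all of its anchors have been updated.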