Python: how to append the corresponding HTML ending to each piece of scraped text?

python regex loops web-scraping

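In short, I'm writing a program that extracts specific citations from a list of URLs. I also need each scraped citation to carry the MR number taken from the end of its URL.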

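from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import itertools

#DEFINING TWO FUNCTIONS TO HELP FINDING ONLY THE WANTED CITATIONS:
MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) == 0:
        raise ValueError("No matching citations were found")
    else:
        return matches

#DEFINING URL LIST:

base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
url_list = []

for i in range(len(mrn)):
    url = (base_URL + mrn[i] + '.html')
    url_list.append(url)

print(url_list)

all_content = []
all_matches = []

#THIS IS THE LOOP WHICH ITERATES THROUGH AND GATHER THE RESULTS:

for url in url_list:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    match = (find_by_text(soup, 'GAP', 'li'))
    all_matches.append(match)
print(all_matches)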
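Output:
[[
The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
], [
The GAP Group, $GAP$GAP groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.
], [
Distler, A., Mitchell, J. D. (2011). Smallsemi - A Library of Small Semigroups. http://url.com, Oct A GAP 4 package [5], Version 0.6.4.
,

The GAP Group, (2008). (http://www.gap-system.org). GAP–Groups, Algorithms, and Programming, Version 4.4.12.
], [
The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.
]]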

I need to add the corresponding MR number to the beginning of each result, for example:

MR4044696, The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018. 

MR2900886, Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://url.com, Oct A GAP 4 package [5], Version 0.6.4. 

MR3169623, The GAP Group, (2008). (http://www.gap-system.org). <span class="it">GAP–Groups, Algorithms, and Programming, Version 4.4.12.</span>

I tried to do this inside the find_by_text function, but wherever I add it, everything breaks and stops working.

For the love of God, I can't get this to work. Please help me; you have my heartfelt thanks.

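I would create a dictionary instead of a list, then iterate over that dictionary and insert the value into match. Another way would be to slice the URL and use the MR number you get from it.
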
from bs4 import BeautifulSoup, NavigableString
import requests
import re
import pandas as pd
import itertools

#DEFINING TWO FUNCTIONS TO HELP FINDING ONLY THE WANTED CITATIONS:
MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) == 0:
        raise ValueError("No matching citations were found")
    else:
        return matches

#DEFINING URL LIST:

base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
url_dict = {}

for i in range(len(mrn)):
    url = (base_URL + mrn[i] + '.html')
    url_dict[url] = mrn[i]
    
print(url_dict)

all_content = []
all_matches = []

#THIS IS THE LOOP WHICH ITERATES THROUGH AND GATHER THE RESULTS:

# The loop variable is named mr_number so it does not shadow the mrn list above.
for url, mr_number in url_dict.items():
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    match = (find_by_text(soup, 'GAP', 'li'))[0]
    match.insert(0, NavigableString("%s, " % mr_number))
    all_matches.append(match)
print(all_matches)

Output:
[<li>MR4044696, 
  The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018. 

</li>, <li>MR2900886, 
  The GAP Group, <span class="MathTeX">$GAP$</span><script type="math/tex">GAP</script> groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org. 

</li>, <li>MR3169623, 
  Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://blacklistshorteners.com/, Oct A GAP 4 package [5], Version 0.6.4. 

</li>, <li>MR4180136, 
  The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org. 

</li>]
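
The second option mentioned above, recovering the MR number from the URL itself, could look roughly like this. It is only a sketch: the helper name mr_number_from_url is made up for illustration, and it assumes every URL ends in a file name of the form MRxxxxxxx.html, as the ones built from base_URL do.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def mr_number_from_url(url):
    """
    Return the last path component of the URL without its extension,
    e.g. '.../GAP/MR4044696.html' gives 'MR4044696'.
    (Hypothetical helper, not part of the original code.)
    """
    return PurePosixPath(urlparse(url).path).stem


base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
url = base_URL + "MR4044696" + ".html"
print(mr_number_from_url(url))
```

With this helper the loop can keep iterating over a plain list of URLs and compute the prefix per URL, so no dictionary is needed.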

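
Note that taking [0] keeps only the first matching citation per page, while the question's output shows one page with two matches. If every match should get the prefix, something along these lines might work. It is a sketch under assumptions: SAMPLE_HTML is a shortened, made-up stand-in for a downloaded page, prefixed_citations is a hypothetical helper, and it returns plain strings rather than modified li tags.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one downloaded page with two matching citations.
SAMPLE_HTML = """
<ul>
  <li>Distler, A., Mitchell, J. D. (2011). Smallsemi. A GAP 4 package.</li>
  <li>The GAP Group, (2008). GAP, Version 4.4.12.</li>
  <li>An unrelated citation.</li>
</ul>
"""


def prefixed_citations(html, mr_number, text="GAP", tag="li"):
    """Return 'MR number, citation' for every tag whose text contains text."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for element in soup.find_all(tag):
        # Collapse runs of whitespace and newlines into single spaces.
        citation = " ".join(element.get_text().split())
        if text in citation:
            results.append(f"{mr_number}, {citation}")
    return results


for line in prefixed_citations(SAMPLE_HTML, "MR3169623"):
    print(line)
```

In the real loop, html would be page.content and mr_number the value paired with the URL.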