Python: How to append the corresponding HTML ending to each piece of scraped text?

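In short, I am writing a program that scrapes specific citations from a list of URLs. I also need each scraped citation to carry the MR number taken from the end of its corresponding URL.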

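    from bs4 import BeautifulSoup
    import requests
    import re
    import pandas as pd
    import itertools

    #DEFINING TWO FUNCTIONS TO HELP FINDING ONLY THE WANTED CITATIONS:
    MATCH_ALL = r'.*'


    def like(string):
        """
        Return a compiled regular expression that matches the given
        string with any prefix and postfix, e.g. if string = "hello",
        the returned regex matches r".*hello.*"
        """
        string_ = string
        if not isinstance(string_, str):
            string_ = str(string_)
        regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
        return re.compile(regex, flags=re.DOTALL)


    def find_by_text(soup, text, tag, **kwargs):
        """
        Find the tag in soup that matches all provided kwargs, and contains the
        text.

        If no match is found, raise ValueError.
        """
        elements = soup.find_all(tag, **kwargs)
        matches = []
        for element in elements:
            if element.find(text=like(text)):
                matches.append(element)
        if len(matches) == 0:
            raise ValueError("No matching citations were found")
        else:
            return matches

    #DEFINING URL LIST:

    base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
    mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
    url_list = []

    for i in range(len(mrn)):
        url = (base_URL + mrn[i] + '.html')
        url_list.append(url)

    print(url_list)

    all_content = []
    all_matches = []

    #THIS IS THE LOOP WHICH ITERATES THROUGH AND GATHERS THE RESULTS:

    for url in url_list:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        matches = (find_by_text(soup, 'GAP', 'li'))
        all_matches.append(matches)
    print(all_matches)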
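Output:

    [[<li>
      The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
    </li>], [<li>
      The GAP Group, <span class="MathTeX">$GAP$</span><script type="math/tex">GAP</script> groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.
    </li>], [<li>
      Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://url.com, Oct A GAP 4 package [5], Version 0.6.4.
    </li>, <li>
      The GAP Group, (2008). (http://www.gap-system.org). <span class="it">GAP–Groups, Algorithms, and Programming, Version 4.4.12.</span>
    </li>], [<li>
      The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.
    </li>]]
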
I need to add the corresponding MR number at the start of each result, for example:

    MR4044696, The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018. 
    
    MR2900886, Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://url.com, Oct A GAP 4 package [5], Version 0.6.4. 
    
    MR3169623, The GAP Group, (2008). (http://www.gap-system.org). <span class="it">GAP–Groups, Algorithms, and Programming, Version 4.4.12.</span>
    
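I tried adding this inside the `find_by_text` function, but wherever I add it, everything breaks and nothing works.
For the life of me I cannot get this done. Please help, and many thanks in advance.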

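I would create a dictionary instead of a list, then iterate through the dictionary and insert the value into `match`. Another way would be to slice the URL to recover the `mrn` it was built from.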
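
The second suggestion, slicing the URL, might look roughly like the sketch below. It assumes every URL ends in `<MR number>.html`, as the ones built from `base_URL` do; the helper name `mr_number_from_url` is hypothetical and not part of the original code.

```python
from posixpath import basename, splitext
from urllib.parse import urlparse


def mr_number_from_url(url):
    """Return the file name at the end of the URL path, without its extension."""
    path = urlparse(url).path               # e.g. "/GAP/MR4044696.html"
    stem, _extension = splitext(basename(path))
    return stem


url = "https://sis1.host.cs.st-andrews.ac.uk/GAP/MR4044696.html"
print(mr_number_from_url(url))
```

With a helper like this the original `url_list` could stay a plain list, since the MR number is derived from each URL inside the loop. The listing below takes the dictionary route instead.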

    from bs4 import BeautifulSoup, NavigableString
    import requests
    import re
    import pandas as pd
    import itertools
    
    #DEFINING TWO FUNCTIONS TO HELP FINDING ONLY THE WANTED CITATIONS:
    MATCH_ALL = r'.*'
    
    
    def like(string):
        """
        Return a compiled regular expression that matches the given
        string with any prefix and postfix, e.g. if string = "hello",
        the returned regex matches r".*hello.*"
        """
        string_ = string
        if not isinstance(string_, str):
            string_ = str(string_)
        regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
        return re.compile(regex, flags=re.DOTALL)
    
    
    def find_by_text(soup, text, tag, **kwargs):
        """
        Find the tag in soup that matches all provided kwargs, and contains the
        text.
    
        If no match is found, raise ValueError.
        """
        elements = soup.find_all(tag, **kwargs)
        matches = []
        for element in elements:
            if element.find(text=like(text)):
                matches.append(element)
        if len(matches) == 0:
            raise ValueError("No matching citations were found")
        else:
            return matches
    
    #DEFINING URL LIST:
    
    base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
    mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
    url_dict = {}
    
    for i in range(len(mrn)):
        url = (base_URL + mrn[i] + '.html')
        url_dict[url] = mrn[i]
        
    print(url_dict)
    
    all_content = []
    all_matches = []
    
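    #THIS IS THE LOOP WHICH ITERATES THROUGH AND GATHERS THE RESULTS:
    
    # The loop variable is named mr_number so it does not shadow the mrn list
    for url, mr_number in url_dict.items():
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        # [0] keeps only the first matching <li> on each page
        match = find_by_text(soup, 'GAP', 'li')[0]
        # Prepend "<MR number>, " as the first child of the <li>
        match.insert(0, NavigableString("%s, " % mr_number))
        all_matches.append(match)
    print(all_matches)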
    
Output:

    [<li>MR4044696, 
      The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018. 
    
    </li>, <li>MR2900886, 
      The GAP Group, <span class="MathTeX">$GAP$</span><script type="math/tex">GAP</script> groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org. 
    
    </li>, <li>MR3169623, 
      Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://blacklistshorteners.com/, Oct A GAP 4 package [5], Version 0.6.4. 
    
    </li>, <li>MR4180136, 
      The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org. 
    
    </li>]
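
Note that the loop above takes only the first matching `<li>` per page (`[0]`), so the second GAP citation that the question's output shows for the MR3169623 page does not appear here. If every match should be labelled, one possible sketch is below. It assumes the citation text has already been pulled out of each matched tag (for instance with BeautifulSoup's `get_text()`); the function name `label_citations` and the sample dictionary are hypothetical, with the sample strings copied from the question.

```python
def label_citations(citations_by_mr):
    """Flatten {MR number: [citation text, ...]} into 'MR number, text' strings."""
    labelled = []
    for mr_number, citations in citations_by_mr.items():
        for citation in citations:
            # Collapse the newlines and indentation left over from the HTML source
            text = " ".join(citation.split())
            labelled.append(f"{mr_number}, {text}")
    return labelled


# Sample input: one page with a single match, one page with two matches
sample = {
    "MR4044696": [
        "\n  The GAP Group, GAP – groups, algorithms and programming, "
        "version 4.10, Available from http://www.gap-system.org, 2018. \n",
    ],
    "MR3169623": [
        "Distler, A., Mitchell, J. D. (2011). Smallsemi - A Library of Small "
        "Semigroups. http://url.com, Oct A GAP 4 package [5], Version 0.6.4.",
        "The GAP Group, (2008). (http://www.gap-system.org). GAP–Groups, "
        "Algorithms, and Programming, Version 4.4.12.",
    ],
}

for line in label_citations(sample):
    print(line)
```

This returns plain strings rather than modified tags, which is closer to the comma-separated format requested in the question, at the cost of dropping inline markup such as the `<span class="it">` elements.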
    