Python: How to append the corresponding HTML ending to each piece of scraped text?
In short, I am writing a program that extracts specific citations from a list of URLs. I need each scraped citation to also carry the MR number taken from the ending of the URL it came from.
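from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import itertools

# DEFINING TWO FUNCTIONS TO HELP FINDING ONLY THE WANTED CITATIONS:
MATCH_ALL = r'.*'

def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)

def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.
    If no match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) == 0:
        raise ValueError("No matching citations were found")
    else:
        return matches

# DEFINING URL LIST:
base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
url_list = []
for i in range(len(mrn)):
    url = (base_URL + mrn[i] + '.html')
    url_list.append(url)
print(url_list)
all_content = []
all_matches = []

# THIS IS THE LOOP WHICH ITERATES THROUGH AND GATHERS THE RESULTS:
for url in url_list:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    match = (find_by_text(soup, 'GAP', 'li'))
    all_matches.append(match)
print(all_matches)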
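Output:
[[<li>
The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
</li>], [<li>
The GAP Group, <span class="MathTeX">$GAP$</span><script type="math/tex">GAP</script> groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.
</li>], [<li>
Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://url.com, Oct A GAP 4 package [5], Version 0.6.4.
</li>, <li>
The GAP Group, (2008). (http://www.gap-system.org). <span class="it">GAP–Groups, Algorithms, and Programming, Version 4.4.12.</span>
</li>], [<li>
The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.
</li>]]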
I need the corresponding MR number added at the beginning of each result, for example:
MR4044696, The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
MR2900886, Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://url.com, Oct A GAP 4 package [5], Version 0.6.4.
MR3169623, The GAP Group, (2008). (http://www.gap-system.org). <span class="it">GAP–Groups, Algorithms, and Programming, Version 4.4.12.</span>
I tried adding it inside the find_by_text function, but wherever I put it, everything breaks and nothing works properly.
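For the life of me I cannot get this to work. Please help; you have my sincere thanks.

I would create a dictionary instead of a list, then iterate over that dictionary and prepend the value to each match. Another way would be to slice the URL and recover the MR number you built it from.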
from bs4 import BeautifulSoup, NavigableString
import requests
import re
import pandas as pd
import itertools

# DEFINING TWO FUNCTIONS TO HELP FINDING ONLY THE WANTED CITATIONS:
MATCH_ALL = r'.*'

def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)

def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.
    If no match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        # string= is the current name of the older text= argument
        if element.find(string=like(text)):
            matches.append(element)
    if len(matches) == 0:
        raise ValueError("No matching citations were found")
    else:
        return matches

# DEFINING URL DICTIONARY (URL -> MR NUMBER):
base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
url_dict = {}
for i in range(len(mrn)):
    url = (base_URL + mrn[i] + '.html')
    url_dict[url] = mrn[i]
print(url_dict)
all_content = []
all_matches = []

# THIS IS THE LOOP WHICH ITERATES THROUGH AND GATHERS THE RESULTS:
# (the loop variable is named mr_number so it does not shadow the mrn list)
for url, mr_number in url_dict.items():
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # [0] keeps only the first matching <li> on each page
    match = (find_by_text(soup, 'GAP', 'li'))[0]
    # put "MRxxxxxxx, " in front of the existing contents of the <li>
    match.insert(0, NavigableString("%s, " % mr_number))
    all_matches.append(match)
print(all_matches)
Output:
[<li>MR4044696,
The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
</li>, <li>MR2900886,
The GAP Group, <span class="MathTeX">$GAP$</span><script type="math/tex">GAP</script> groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org.
</li>, <li>MR3169623,
Distler, A., Mitchell, J. D. (2011). <span class="it">Smallsemi - A Library of Small Semigroups.</span> http://blacklistshorteners.com/, Oct A GAP 4 package [5], Version 0.6.4.
</li>, <li>MR4180136,
The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.
</li>]
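Note that the output above still contains the `<li>` markup and, because of the `[0]`, only the first match per page, whereas the question asks for plain lines of the form `MR number, citation`. A minimal sketch of one way to get that, assuming every matching `<li>` should be kept, is shown below. It swaps the regex helper for a plain substring test on `get_text()`, and `SAMPLE_PAGES` and `label_citations` are hypothetical names introduced only for this illustration; the inline HTML is a shortened stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content; the real pages come from requests.get(url).
SAMPLE_PAGES = {
    "MR3169623": (
        "<ul>"
        "<li>Distler, A. <span class='it'>Smallsemi.</span> A GAP 4 package.</li>"
        "<li>The GAP Group (2008). GAP, Version 4.4.12.</li>"
        "<li>Some unrelated reference.</li>"
        "</ul>"
    ),
}

def label_citations(mr_number, html, needle="GAP"):
    """Return one 'MR number, citation text' line per <li> whose text contains needle."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for li in soup.find_all("li"):
        # get_text() drops the tags; split/join collapses newlines and repeated spaces
        text = " ".join(li.get_text().split())
        if needle in text:
            lines.append(f"{mr_number}, {text}")
    return lines

all_lines = []
for mr_number, html in SAMPLE_PAGES.items():
    all_lines.extend(label_citations(mr_number, html))

for line in all_lines:
    print(line)
```

With the real pages, the `html` argument would be `page.content` from the loop in the answer. If the inline markup such as `<span class="it">` has to survive, inserting a `NavigableString` into each matching `<li>` as in the answer is the better fit.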
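The second suggestion in the answer, recovering the MR number from the URL itself, could look roughly like the sketch below. It uses only the standard library and assumes every URL ends in `<MR number>.html`, as the ones built in the question do; `mr_number_from_url` is a hypothetical helper name, not something from the original code.

```python
from posixpath import basename, splitext
from urllib.parse import urlparse

def mr_number_from_url(url):
    """Return the last path component of the URL without its file extension."""
    file_name = basename(urlparse(url).path)  # e.g. "MR4044696.html"
    stem, _extension = splitext(file_name)    # e.g. ("MR4044696", ".html")
    return stem

base_URL = "https://sis1.host.cs.st-andrews.ac.uk/GAP/"
mrn = ["MR4044696", "MR2900886", "MR3169623", "MR4180136"]
url_list = [base_URL + number + ".html" for number in mrn]

# Each URL gives back the MR number it was built from.
recovered = [mr_number_from_url(url) for url in url_list]
print(recovered)
```

Inside the scraping loop this would let the original `url_list` stay a list: call `mr_number_from_url(url)` on each URL and prepend the result to the match.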