Python: How can I make my regex function recognize more substrings?

python regex python-3.x

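I'm new to regular expressions and am currently using Python's re library to extract certain fonts embedded in CSS files. At the moment my regex skips some fonts that it should not skip, while handling others correctly. This is the regex I am using:

`font-family\s*?:\s*?(.*?)\s*?[;\}]`

This is the sample input: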

.ui-block:last-child,.ui-block.last{margin-right:0}body{font-family:"georgia","georgia-1",Georgia,Times,Times New Roman,serif}.nav-logo-mod{font-family:"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif}[...].h1,.post-txt h1,.h2,.post-txt h2,.h3,.post-txt h3,.h4,.post-txt h4,.h5,.post-txt h5,.h6,.post-txt h6{line-height:100%;margin-bottom:6px}[...]{text-decoration:none}[...]{text-decoration:underline}@media (min-width:450px){[...]{line-height:112%}}[...]{font-family:"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif;font-weight:normal;line-height:110%}.h5,.post-txt h5,.h6,.post-txt h6{font-family:"georgia","georgia-1",Georgia,Times,Times New Roman,serif;line-height:140%}.h1,.post-txt h1{font-size:27px}@media (min-width:500px){.h1,.post-txt h1{font-size:37px}}[...]{font-size:1.76923em}.h4,.post-txt h4{font-size:1.30769em}.h5,.post-txt h5{font-size:14px}.h6,.post-txt h6{font-size:12px}[...]{font-family:"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif;font-size:38px;line-height:100%;margin-bottom:16px}@media (min-width:500px){.poster-h{font-size:48px}}.section-h,.section-h1{font:normal 1.23077em/100%"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif;text-transform:uppercase;letter-spacing:4px;padding-bottom:10px;border-bottom:1px solid #ccc;margin-bottom:20px}.section-h>a,.section-h1>a{color:inherit;text-decoration:inherit;cursor:inherit}.section-h>a:active,.section-h>a:focus,.section-h1>a:active,.section-h1>a:focus{outline:none}.section-h>a:hover,.section-h1>a:hover{text-decoration:underline}.section-h2{font:normal 0.92308em/100%"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif;text-transform:uppercase;letter-spacing:4px;padding-bottom:4px;border-bottom:1px solid #ccc;margin-bottom:12px}.section-h2>a{color:inherit;text-decoration:inherit;cursor:inherit}.section-h2>a:active,.section-h2>a:focus{outline:none}.section-h2>a:hover{text-decoration:underline}

This is my sample output:

['Georgia,Georgia,Times,Times New Roman,serif', '"Helvetica Neue",Arial,Helvetica,Verdana,sans-serif', '"Helvetica Neue",Arial,Helvetica,Helvetica,Verdana,sans-serif', 'Georgia,Times,Times New Roman,serif', '"Helvetica Neue",Arial,Helvetica,Verdana,sans-serif', '"Helvetica Neue",Arial,Helvetica,Verdana,sans-serif', 'Georgia,Georgia,Times,Times New Roman,serif', '"Helvetica Neue",Arial,Helvetica,Verdana,sans-serif', '"Helvetica Neue",Arial,Helvetica,Verdana,sans-serif', 'Georgia,Times,Times New Roman,serif', '"Helvetica Neue",Arial,Helvetica,Verdana,sans-serif', 'Georgia,Georgia,Times,Times New Roman,serif', 'Georgia,Georgia,Times New Roman,serif', 'Georgia,Times,Times New Roman,serif', 'Georgia,Georgia,Times New Roman,serif', 'Georgia,Georgia,Times,Times New Roman,serif']

I want my output to include the league-gothic fonts.

Here is my Python code:


```python
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import sys
import os

if __name__ == '__main__':
    url = sys.argv[1]
    url = url.replace("\n", "").replace("\r", "")
    driver = webdriver.Chrome()
    driver.get(f"http://{url}/")
    time.sleep(5)
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    # Save the rendered page, then reload it from disk.
    with open("full_site.html", "w+") as outfile:
        outfile.write(html)
    outfile_path = os.path.abspath("full_site.html")
    driver.get('file://' + outfile_path)
    time.sleep(5)
    elems = driver.find_elements(By.XPATH, "//link[@href]")
    css_links = []
    with open("input.txt", "w+") as font_file:
        font_file.write(url + "\n")
    # Collect each stylesheet URL once.
    for elem in elems:
        href = elem.get_attribute("href")
        if "css" in href and href not in css_links:
            css_links.append(href)
    # Raw string so the backslashes reach the regex engine unchanged.
    # This pattern only matches `font-family:` declarations.
    pattern = re.compile(r'font-family\s*?:\s*?(.*?)\s*?[;\}]')
    with open("input.txt", "a") as font_file:
        for link in css_links:
            driver.get(link)
            time.sleep(5)
            html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
            for element in pattern.findall(html):
                element = element.replace("font-family", "").replace("}", "").replace(";", "").replace("{", "").replace(":", "")
                # Write every font name of the declaration, comma-separated.
                for font in element.split(","):
                    font_file.write(font + ",")
    driver.close()
```

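Some of the font family names in the input text are prefixed with just `font:` rather than `font-family:`. When `font:` is used, the declaration carries non-font-family information before the list of font family names. You need to strip out that extra information.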
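To see the effect concretely, here is a minimal sketch that runs the question's pattern over a shortened, made-up CSS string. The `css` value and the names `question_pattern` and `matches` are illustrative assumptions, not the asker's real data. The `font-family:` declaration is captured, while the `font:` shorthand is skipped entirely.

```python
import re

# Hypothetical, shortened CSS with one declaration of each kind.
css = (
    'body{font-family:"georgia","georgia-1",Georgia,serif}'
    '.section-h{font:normal 1.23077em/100% "league-gothic",Arial,sans-serif;'
    'text-transform:uppercase}'
)

# Pattern from the question: it requires the literal text "font-family".
question_pattern = re.compile(r'font-family\s*?:\s*?(.*?)\s*?[;\}]')

matches = question_pattern.findall(css)
print(matches)  # only the body rule is found; the `font:` rule is missed
```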

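I think the regex below captures all of the font family names in its capture group 3 (here `text` is assumed to hold the CSS input):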

import re 

pattern = re.compile(r'font(-family)?:([a-z]+\s\d\.\d+[a-z]+/\d+%)?([^};]*,?)')

results = re.finditer(pattern, text)

for result in results:
    print(result.group(3))

Printed output:

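"georgia","georgia-1",Georgia,Times,Times New Roman,serif
"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif
"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif
"georgia","georgia-1",Georgia,Times,Times New Roman,serif
"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif
"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif
"league-gothic","league-gothic-1","Helvetica Neue",Arial,Helvetica,Verdana,sans-serif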
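As a rough follow-up, the sketch below applies the same pattern to a shortened, made-up CSS string and splits capture group 3 into individual font names. The `css` value and the helper `extract_font_names` are assumptions introduced for illustration; they are not part of the original answer.

```python
import re

# Hypothetical, shortened CSS with one declaration of each kind.
css = (
    'body{font-family:"georgia","georgia-1",Georgia,serif}'
    '.section-h{font:normal 1.23077em/100% "league-gothic",Arial,sans-serif;'
    'text-transform:uppercase}'
)

# Pattern from the answer: group 2 optionally consumes a prefix shaped like
# "normal 1.23077em/100%", and group 3 holds the font family list.
answer_pattern = re.compile(
    r'font(-family)?:([a-z]+\s\d\.\d+[a-z]+/\d+%)?([^};]*,?)'
)

def extract_font_names(css_text):
    """Return the distinct font names in order of first appearance."""
    names = []
    for match in answer_pattern.finditer(css_text):
        for name in match.group(3).split(","):
            # Drop surrounding whitespace and quote characters.
            name = name.strip().strip('"\'')
            if name and name not in names:
                names.append(name)
    return names

fonts = extract_font_names(css)
print(fonts)
```

Group 2 only recognises a prefix of that exact shape. For any other `font:` shorthand form, group 2 is skipped and the extra tokens end up in group 3, so a broader pattern would be needed.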
