Python for re.match re.sub

python html csv text

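I am working on a CSV file. It contains a list of sources (plain SSL links), places, websites (non-SSL links), Direcciones and emails. When a piece of data is not available, it simply does not appear. Like this: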

httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, googplace@yourplace.com

> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, googplace@yourplace.com 
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace,Website, <a href='httpgoogledotcom'></a>"
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com

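However, the website "a html tag" link always appears twice, followed by several commas. The commas, in turn, are followed sometimes by Direcciones and sometimes by the next source (https). As a result, if the process is not interrupted at EOF, it can keep "replacing" for hours and produce an output file with gigabytes of redundant and wrong information. Let's take four entries as an example of Reutput.csv: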

> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> ,,Direcciones, Montain View, Email, googplace@yourplace.com
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com

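So the idea is to remove the unnecessary website "a html tag" links and the excess commas, while respecting the newline (\n) and without falling into a loop. Like this: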

httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, googplace@yourplace.com

> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, googplace@yourplace.com 
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, bing@yourplace.com
> httpsgoogledotcom, GooglePlace,Website, <a href='httpgoogledotcom'></a>"
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com

This is the latest version of the code:

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = str(reuf.read())
    for lines in text:
        d = re.match('</a>".*D?',text,re.DOTALL)
        if d is not None:
            if not 'https' in d:
                replace = re.sub(d,'</a>",Direc',lines)
        h = re.match('</a>".*?http',text,re.DOTALL|re.MULTILINE)
        if h is not None:
            if not 'Direc' in h:
                replace = re.sub(h,'</a>"\nhttp',lines)
        replace = str(replace)
        putuf.write(replace)

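Now I get a Put.csv file in which the last line repeats forever. Why does this loop happen? I have tried several ways of handling this code, but sadly I am still stuck at this point. Thanks in advance.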
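One thing worth checking in the code above, offered as a hedged observation rather than a full diagnosis: reuf.read() returns a single string, and iterating over a string yields one character at a time, not one line. The sketch below uses an in-memory io.StringIO buffer and made-up sample text (both are assumptions for illustration) to show how often a loop body written that way would run, and therefore how often a write inside it would be repeated.

```python
import io

# Hypothetical two-line sample standing in for Reutput.csv.
sample = "httpsbingdotcom, BingPlace\nhttpsgoogledotcom, GooglePlace\n"

text = io.StringIO(sample).read()

# Iterating over a str yields single characters, so a loop written as
# "for lines in text" runs once per character, not once per line.
char_iterations = sum(1 for _ in text)

# splitlines() (or iterating over the file object itself) yields lines.
line_iterations = sum(1 for _ in text.splitlines())

print(char_iterations, line_iterations)
```

If each iteration writes a full result to the output file, the output grows with the number of characters in the input, which would be consistent with a very large, repetitive Put.csv.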

When there is no match, groups will be None. You need to guard against this (or refactor the regex so that it always matches something):

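groups = re.search('</a>".*?Direc', lines, re.DOTALL)
if groups is not None:
    # groups is a match object; test the matched text via group(0)
    if 'https' not in groups.group(0):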

Note the added "is not None" condition, and the extra indentation of the lines it governs.
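To make the guard concrete, here is a minimal sketch on a made-up line fragment (an assumption, not data from the real file). The helper name matched_text is hypothetical. It also illustrates that re.search returns a match object rather than a string, so a substring test has to go through group(0).

```python
import re

# Hypothetical line fragment; the real file contents may differ.
line = '"<a href=\'httpgoogledotcom\'></a>",,,,,Direcciones, Montain View'

def matched_text(pattern, s):
    """Return the matched substring, or None when the pattern is absent."""
    m = re.search(pattern, s, re.DOTALL)
    if m is None:          # no match: there is no match object to inspect
        return None
    return m.group(0)      # the text that was actually matched

hit = matched_text(r'</a>".*?Direc', line)
miss = matched_text(r'</a>".*?NoSuchField', line)

if hit is not None and 'https' not in hit:
    print('safe to rewrite:', hit)
```

Returning None from the helper keeps the "no match" case explicit, so the caller cannot accidentally run a substring test on a missing result.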

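In the end I got the code working myself. I am posting it here in the hope that someone finds it useful. In any case, thanks for your help and for the downvotes.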

import re

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = reuf.read()
    # Collapse '</a>"' plus the junk after it into '</a>",Direc'. The lookahead
    # (?!https) stops a match from running into the next record.
    text = re.sub(r'</a>"(?:(?!https).)*?Direc', '</a>",Direc', text, flags=re.DOTALL)
    # Where a record has no Direc field, put the next 'https' source on a new
    # line. The lookahead (?!Direc) skips records already fixed above.
    text = re.sub(r'</a>"(?:(?!Direc).)*?https', '</a>"\nhttps', text, flags=re.DOTALL)
    # Both substitutions are chained on the same string and written once.
    putuf.write(text)
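A line-by-line alternative, sketched under assumptions rather than taken from the thread: drop lines that hold only the duplicated anchor tag, trim trailing commas, and glue lines that start with a comma back onto the previous record. The names ANCHOR_ONLY and clean_lines are hypothetical, and the sample is a simplified version of the entries shown in the question.

```python
import re

# A line holding only an <a ...></a> tag (optionally quoted) plus commas.
ANCHOR_ONLY = re.compile(r'^"?<a\b[^>]*>\s*</a>"?[,\s]*$')

def clean_lines(lines):
    """Drop anchor-only lines, trim trailing commas, and rejoin
    continuation lines (those starting with a comma) to the previous record."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line or ANCHOR_ONLY.match(line):
            continue                      # blank or duplicated anchor line
        line = line.rstrip(', ')          # remove the run of trailing commas
        if line.startswith(',') and records:
            # Continuation of the previous record: keep a single separator.
            records[-1] += ',' + line.lstrip(', ')
        else:
            records.append(line)
    return records

sample = [
    'httpsgoogledotcom, GooglePlace, Website, "<a href=\'httpgoogledotcom\'></a>",,,,,, \n',
    '"<a href=\'httpgoogledotcom\'></a>",,,,, \n',
    ',,Direcciones, Montain View, Email, googplace@yourplace.com\n',
    'httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, bing@yourplace.com\n',
]

cleaned = clean_lines(sample)
for record in cleaned:
    print(record)
```

Because each input line is visited exactly once, this version cannot loop, and it can be fed a file object directly instead of reading the whole file into memory.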

I tried it and got a blank file, so you are right: the match for groups must be None. Why? And how can I fix Reutput.csv? Thanks in advance, tripleee.
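On the question of why the result can be None, one plausible cause (an assumption, since the actual file is not shown) is that the question's code uses re.match, which only tries the pattern at the very start of the string, while the '</a>"' sequence sits in the middle of each record. A small sketch on a made-up record:

```python
import re

# Hypothetical record; '</a>"' is in the middle, not at position 0.
record = 'httpsgoogledotcom, GooglePlace, Website, "<a href=\'httpgoogledotcom\'></a>",,,,\n,,Direcciones'
pattern = r'</a>".*?Direc'

at_start = re.match(pattern, record, re.DOTALL)    # anchored at position 0
anywhere = re.search(pattern, record, re.DOTALL)   # scans the whole string

print(at_start)
print(anywhere.group(0) if anywhere else None)
```

If re.search also returns None on the real data, the pattern itself does not occur in the text as read, and printing a short slice of the file with repr() is a quick way to see the exact characters involved.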