
Python: Isolating search results


So I have some code (which is probably horribly inefficient, but that's another story) that pulls URLs out of a blog's HTML. I put the HTML into a .csv file, load it into Python, and then run a regex over it to get the URLs. Here is the code:

import csv, re # required imports

infile = open('Book1.csv', 'rt')  # open the csv file
reader = csv.reader(infile)  # read the csv file


strings = [] # initialize a list to read the rows into

for row in reader: # loop over all the rows in the csv file 
    strings += row  # put them into the list

link_list = []  # initialize list that all the links will be put in
for i in strings:  #  loop over the list to access each string for regex (can't regex on lists)

    links = re.search(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*', i) # regex to find the links
    if links != None: # if it finds a link..
        link_list.append(links) # put it into the list!

for link in link_list: # iterate the links over a loop so we can have them in a nice column format
    print(link)
However, when I print the results, they come out looking like this:

<_sre.SRE_Match object; span=(49, 80), match='http://buy.tableausoftware.com"'>
<_sre.SRE_Match object; span=(29, 115), match='https://c.velaro.com/visitor/requestchat.aspx?sit>
<_sre.SRE_Match object; span=(34, 117), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(32, 115), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(76, 166), match='https://www.tableau.com/about/blog/2015/6/become->
<_sre.SRE_Match object; span=(9, 34), match='http://twitter.com/share"'>


Is there a way to pull just the link out of the other junk it comes wrapped in? Or is that simply how a regex search works? Thanks.

The problem here is that re.search returns a match object, not the matched string, so you need to use the match object's methods to get at the result you want.
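
For illustration only (this snippet is not from the original post), a minimal sketch of the difference between the match object and the matched text, using a made-up input string:

import re

m = re.search(r'(https?)://\S+', 'see http://example.com for details')
print(type(m))     # a match object (e.g. <class 're.Match'>), not a string
print(m.group(0))  # 'http://example.com' -- the text that actually matched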

If you want all of the captured groups you can use the groups() method; for a specific group you can pass that group's number to group().
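
A quick sketch of that difference, reusing the URL pattern from the question on a hypothetical input:

m = re.search(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*', 'visit http://example.com today')
print(m.groups())   # ('http://', 'http') -- all captured subgroups as a tuple
print(m.group(1))   # 'http://' -- only the first captured group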

In this case it looks like you want the entire match, so you can use group(0).

group([group1, ...])

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
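
A throwaway example (not from the original answer) illustrating those rules:

m = re.search(r'(\w+) (\w+)?', 'hello ')
print(m.group(0))     # 'hello ' -- the entire match
print(m.group(1, 2))  # ('hello', None) -- multiple arguments give a tuple
print(m.group(2))     # None -- group 2 sits in a part of the pattern that did not match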

for i in strings:  #  loop over the list to access each string for regex (can't regex on lists)

    links = re.search(r'((https?|ftp)://|www\.)[^\s/$.?#].[^\s]*', i) # regex to find the links
    if links is not None: # if it finds a link..
        link_list.append(links.group(0))  # group(0) stores the matched string instead of the match object
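
As a side note (a suggestion, not part of the original answer): if only the matched strings are needed, re.findall can replace the search loop entirely, provided the capturing groups are made non-capturing so it returns the whole match. It also returns every link in a string rather than just the first one, which may or may not be what you want:

pattern = r'(?:(?:https?|ftp)://|www\.)[^\s/$.?#].[^\s]*'  # same pattern, groups made non-capturing
link_list = [link for s in strings for link in re.findall(pattern, s)]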