Extracting URLs from a text file - Python 3.6

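I am trying to extract URLs from a text file that contains the source code of a website. I want the links inside the href attributes. I wrote some code borrowed from Stack Overflow, but I can't get it to work: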

with open(sourcecode.txt) as f:
    urls = f.readlines()

urls = ([s.strip('\n') for s in urls ]) 

print(url)

You can use a regular expression for this:

import re

with open('sourcecode.txt') as f:
    text = f.read()

href_regex = r'href=[\'"]?([^\'" >]+)'
urls = re.findall(href_regex, text)

print(urls)

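You are probably getting an error like NameError: name 'sourcecode' is not defined. That is because the argument passed to open() must be a string, as in the code above.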
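To see what the href pattern above actually captures, here is a minimal sketch that runs it on an in-memory string instead of a file. The helper name extract_hrefs and the sample markup are made up for illustration; only the regular expression comes from the answer.

```python
import re

# Same pattern as in the answer: an optional quote after href=, then capture
# everything up to the next quote, space, or '>'.
HREF_REGEX = r'href=[\'"]?([^\'" >]+)'


def extract_hrefs(text):
    """Return every href value found in text, in document order."""
    return re.findall(HREF_REGEX, text)


# Hypothetical sample covering double-quoted, single-quoted and unquoted values.
sample = '<a href="https://example.com/a">A</a> <a href=\'/b\'>B</a> <a href=/c>C</a>'
print(extract_hrefs(sample))  # ['https://example.com/a', '/b', '/c']
```

Note that the pattern returns relative links such as /b as well as absolute URLs, since it looks only at the href attribute and not at the scheme.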

With a regular expression you can extract all URLs from the text file without looping over it line by line:

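import re

# Read the whole file at once instead of looping line by line
with open('/home/username/Downloads/Stack_Overflow.html') as f:
    text = f.read()

# A single capture group makes findall return plain strings, not tuples
links = re.findall(r'"(https?://.*?)"', text)
for url in links:
    print(url)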
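A page usually repeats the same link several times, so it may help to drop duplicates while keeping the original order. The sketch below is one way to do that under a few assumptions: the helper name extract_absolute_urls and the sample markup are hypothetical, and the pattern uses a single capture group so that re.findall returns plain strings.

```python
import re

# Matches any double-quoted string that starts with http:// or https://.
URL_REGEX = r'"(https?://.*?)"'


def extract_absolute_urls(text):
    """Return the unique absolute URLs in text, in first-seen order."""
    seen = set()
    result = []
    for url in re.findall(URL_REGEX, text):
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical sample: one repeated link, one image, one relative link.
sample = (
    '<a href="https://example.com/a">A</a>'
    '<img src="http://example.com/i.png">'
    '<a href="https://example.com/a">again</a>'
    '<a href="/relative">R</a>'
)
print(extract_absolute_urls(sample))
# ['https://example.com/a', 'http://example.com/i.png']
```

This pattern matches any double-quoted http(s) string, so it also picks up src attributes, and it skips relative links and single-quoted values.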

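It still gives me an error saying the source is not defined.

You may want to look into an HTML parsing library.

NameError: name 're' is not defined.

re is the regular expression module, which is part of the standard library. Add import re at the top of the script.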
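As a rough illustration of the HTML parsing suggestion, here is a sketch using html.parser from the standard library. That module is only one possible choice and not necessarily the library the commenter had in mind. The class name HrefCollector, the function extract_links and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_links(html_text):
    """Parse html_text and return the href values of its anchor tags."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical sample: the link inside the HTML comment should be ignored.
sample = (
    '<p>See <a href="https://example.com/a">A</a> and '
    '<a class="x" href=/b>B</a>. <!-- <a href="/hidden">no</a> --></p>'
)
print(extract_links(sample))  # ['https://example.com/a', '/b']
```

Unlike the regular expressions above, a parser only reports attributes of real tags, so links inside HTML comments are not returned.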