Extracting specific URLs from text with a Python regular expression

python, regex, url

So I have HTML from an NPR page, and I want to use regular expressions to extract some specific URLs for me (these URLs are for particular stories nested in the page). The actual links appear in the text (retrieved manually) like this:

I have this code to grab everything in between:
for line in npr_lines:
    re.findall('<a href="?\'?([^"\'>]*)', line)
But this doesn't return anything. So what am I doing wrong? And how can I combine a broader URL scraper with the specific words that target the URLs I want?


Please and thank you :)
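(For reference: one likely reason the snippet above appears to return nothing is that the list `re.findall` returns is never printed or stored. A minimal sketch, using a hypothetical sample line in place of the real file:)

```python
import re

# A sample line standing in for one line of the NPR HTML (hypothetical data).
line = '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">'

# re.findall returns a list; if the result is neither printed nor stored,
# the loop looks like it "returns nothing" even when the pattern matches.
urls = re.findall('<a href="?\'?([^"\'>]*)', line)
print(urls)
```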

You can do this with:

<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)
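Applied with `re.findall`, the lookahead restricts the capture to links whose URL contains one of the keywords. A sketch; the second link is a hypothetical non-story URL added to show it being filtered out:

```python
import re

# The lookahead regex from above: the (?=...) asserts that a keyword occurs
# somewhere in the URL before the URL itself is captured.
pattern = r'<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)'

html = ('<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
        'how-the-islamic-state-wages-its-propaganda-war">\n'
        '<a href="http://www.npr.org/about/">')  # second URL is hypothetical

matches = re.findall(pattern, html)
print(matches)  # only the story URL survives the lookahead
```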

You can use the re.search function to match the regex against each line, printing the line if it matches:

>>> import re
>>> file  = open('/Users/shannonmcgregor/Desktop/npr.txt', 'r')
>>> for line in file:
...     if re.search('<a href=[^>]*(parallels|thetwo-way|a-marines)', line):
...             print(line)
This will output:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

That said, it is better to use a tool designed specifically for parsing html and xml files []:


What are your input and expected output? Use an HTML parser instead. Could you post the contents of the /Users/shannonmcgregor/Desktop/npr.txt file along with the expected output?
re.match(r'.*(parallels|thetwo-way|a-marines).*', i) is better spelled re.search(r'parallels|thetwo-way|a-marines', i) in this case.
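The difference the comment points to is that `re.match` only matches at the start of the string, so it needs the `.*` padding, while `re.search` scans the whole string. A quick check, using one of the URLs from the output above:

```python
import re

url = ('http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/'
      'asked-to-stop-praying-alaska-school-won-t-host-state-tournament')

# re.match anchors at the start of the string, so it needs '.*' padding.
assert re.match(r'.*(parallels|thetwo-way|a-marines).*', url)

# re.search scans the whole string, so the bare alternation is enough.
assert re.search(r'parallels|thetwo-way|a-marines', url)

# Without the padding, re.match fails: the keyword is not at position 0.
assert re.match(r'parallels|thetwo-way|a-marines', url) is None
```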
>>> from bs4 import BeautifulSoup
>>> s = """<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">"""
>>> soup = BeautifulSoup(s) # or pass the file directly into BS like >>> soup = BeautifulSoup(open('/Users/shannonmcgregor/Desktop/npr.txt'))
>>> atag = soup.find_all('a')
>>> links = [i['href'] for i in atag]
>>> import re
>>> for i in links:
        if re.match(r'.*(parallels|thetwo-way|a-marines).*', i):
            print(i)


http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament
http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear
http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice
http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help
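If installing BeautifulSoup is not an option, the standard library's html.parser module can do the same extraction. A minimal sketch; the class name and the single-link sample input are illustrative:

```python
import re
from html.parser import HTMLParser

class StoryLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL mentions a story keyword."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and re.search(r'parallels|thetwo-way|a-marines', value):
                    self.links.append(value)

parser = StoryLinkCollector()
parser.feed('<a href="http://www.npr.org/blogs/parallels/2014/11/11/'
            '363018388/how-the-islamic-state-wages-its-propaganda-war">')
print(parser.links)
```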