Extracting specific URLs from text with a Python regular expression

python, regex, url

So I have HTML from an NPR page, and I want to use regular expressions to extract some specific URLs for me (these URLs are for particular stories nested in the page). The actual links appear in the text (retrieved manually) like this:

I have this code to grab everything in between:
for line in npr_lines:
    re.findall('<a href="?\'?([^"\'>]*)', line)
But this doesn't return anything. So what am I doing wrong? And how can I combine a broader URL scraper with the specific words that target the URLs I want?


Please and thank you :)
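(For reference: one likely reason the snippet above appears to return nothing is that the list `re.findall` returns is never printed or stored. A minimal sketch, using a hypothetical sample line in place of the real file:)

```python
import re

# A sample line standing in for one line of the NPR HTML (hypothetical data).
line = '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">'

# re.findall returns a list; if the result is neither printed nor stored,
# the loop looks like it "returns nothing" even when the pattern matches.
urls = re.findall('<a href="?\'?([^"\'>]*)', line)
print(urls)
```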

You can do this with:

<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)
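Applied with `re.findall`, the lookahead restricts the capture to links whose URL contains one of the keywords. A sketch; the second link is a hypothetical non-story URL added to show it being filtered out:

```python
import re

# The lookahead regex from above: the (?=...) asserts that a keyword occurs
# somewhere in the URL before the URL itself is captured.
pattern = r'<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)'

html = ('<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
        'how-the-islamic-state-wages-its-propaganda-war">\n'
        '<a href="http://www.npr.org/about/">')  # second URL is hypothetical

matches = re.findall(pattern, html)
print(matches)  # only the story URL survives the lookahead
```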

You can use the re.search function to match the regex against each line, printing the line if it matches:

>>> import re
>>> file  = open('/Users/shannonmcgregor/Desktop/npr.txt', 'r')
>>> for line in file:
...     if re.search('<a href=[^>]*(parallels|thetwo-way|a-marines)', line):
...             print(line)
This will output:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

That said, it is better to use a tool designed specifically for parsing html and xml files []:


What are your input and expected output? Use an HTML parser instead. Could you post the contents of the /Users/shannonmcgregor/Desktop/npr.txt file along with the expected output?
re.match(r'.*(parallels|thetwo-way|a-marines).*', i) is better spelled re.search(r'parallels|thetwo-way|a-marines', i) in this case.
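The difference the comment points to is that `re.match` only matches at the start of the string, so it needs the `.*` padding, while `re.search` scans the whole string. A quick check, using one of the URLs from the output above:

```python
import re

url = ('http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/'
      'asked-to-stop-praying-alaska-school-won-t-host-state-tournament')

# re.match anchors at the start of the string, so it needs '.*' padding.
assert re.match(r'.*(parallels|thetwo-way|a-marines).*', url)

# re.search scans the whole string, so the bare alternation is enough.
assert re.search(r'parallels|thetwo-way|a-marines', url)

# Without the padding, re.match fails: the keyword is not at position 0.
assert re.match(r'parallels|thetwo-way|a-marines', url) is None
```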
>>> from bs4 import BeautifulSoup
>>> s = """<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">"""
>>> soup = BeautifulSoup(s) # or pass the file directly into BS like >>> soup = BeautifulSoup(open('/Users/shannonmcgregor/Desktop/npr.txt'))
>>> atag = soup.find_all('a')
>>> links = [i['href'] for i in atag]
>>> import re
>>> for i in links:
        if re.match(r'.*(parallels|thetwo-way|a-marines).*', i):
            print(i)


http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament
http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear
http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice
http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help
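If installing BeautifulSoup is not an option, the standard library's html.parser module can do the same extraction. A minimal sketch; the class name and the single-link sample input are illustrative:

```python
import re
from html.parser import HTMLParser

class StoryLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL mentions a story keyword."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and re.search(r'parallels|thetwo-way|a-marines', value):
                    self.links.append(value)

parser = StoryLinkCollector()
parser.feed('<a href="http://www.npr.org/blogs/parallels/2014/11/11/'
            '363018388/how-the-islamic-state-wages-its-propaganda-war">')
print(parser.links)
```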