Python: match a URL only if certain characters are not present

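So, my question is:

I have a crawler that downloads web pages and extracts the URLs from them (for future crawling). The crawler works from a whitelist of URLs specified as regular expressions, so they look roughly like this:

(http://www.example.com/subdirectory/)(.*?)

This allows any URL that follows the pattern to be crawled later. The problem I'm having is that I want to exclude certain characters in the URL, so that (for example) an address such as the following is not crawled:

(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)

In the case above, I would like to exclude URLs that contain ?, # and = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:

(http://www.example.com/)([^=\?#](.*?))

Any help would be appreciated.

EDIT: Sorry, I should have mentioned that this is written in Python. I'm normally fairly proficient with regular expressions (although this one has me stumped).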

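EDIT 2: VoDurden's answer (the accepted answer below) almost produced the correct result; it only needed a $ character at the end of the expression, and then it worked perfectly. Example: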

(http://www.example.com/)([^=\?#]*)$
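To illustrate why the trailing $ matters, here is a minimal sketch that compares the pattern with and without the anchor. The sample URLs and the helper name is_allowed are assumptions made for illustration, and the dots are escaped here (the original pattern leaves them unescaped, which also works but matches any character in those positions).

```python
import re

# The pattern from the edit above, with and without the trailing "$".
# Dots are escaped here; that is an adjustment, not part of the original.
ANCHORED = re.compile(r"(http://www\.example\.com/)([^=?#]*)$")
UNANCHORED = re.compile(r"(http://www\.example\.com/)([^=?#]*)")

# Sample URLs chosen for illustration only.
clean_url = "http://www.example.com/subdirectory/index.php"
query_url = "http://www.example.com/subdirectory/somepage?param=1&param=5#print"


def is_allowed(pattern, url):
    """Return True if the pattern matches from the start of the URL."""
    return pattern.match(url) is not None


print(is_allowed(ANCHORED, clean_url))    # no forbidden characters
print(is_allowed(ANCHORED, query_url))    # "$" forces the check to the end
print(is_allowed(UNANCHORED, query_url))  # stops quietly before the "?"
```

Without the $, the character class simply stops at the first forbidden character and the match still succeeds on the prefix, so the URL is wrongly accepted. With the $, the whole remainder of the URL has to consist of permitted characters.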

You need to crawl the pages up to

?param=1&param=5

because param=1 and param=2 will usually give you completely different web pages.

Pick a WordPress site to confirm this.

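Try it like this; it will try to match everything before the # character.

If you do it this way, any URL that does not contain the characters you don't want will be allowed.

However, extending this approach may get a bit difficult. A better option is to have the system work in two layers: a set of matching regexes and a set of blocking regexes. Only URLs that pass both are then allowed. I think this solution would be more transparent and more flexible.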
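The two-layer idea could be sketched roughly as follows. The names ALLOW_PATTERNS, BLOCK_PATTERNS and should_crawl are hypothetical, and the rules themselves are assumptions based on the URLs in the question rather than anything the answer specifies.

```python
import re

# Hypothetical rule sets; both lists are illustrative assumptions.
ALLOW_PATTERNS = [
    re.compile(r"http://www\.example\.com/subdirectory/"),
]
BLOCK_PATTERNS = [
    re.compile(r"[=?#]"),  # any of the characters the question wants to avoid
]


def should_crawl(url):
    """Allow a URL only if it matches an allow rule and no block rule."""
    allowed = any(p.match(url) for p in ALLOW_PATTERNS)    # must start with an allowed prefix
    blocked = any(p.search(url) for p in BLOCK_PATTERNS)   # forbidden text anywhere in the URL
    return allowed and not blocked


for candidate in [
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/index.php?param=1",
    "http://www.example.com/elsewhere/",
]:
    print(candidate, should_crawl(candidate))
```

Keeping the two lists separate means a new exclusion can be added without touching any of the whitelist expressions.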

This expression should be what you're looking for:

(http://www.example.com/subdirectory/)([^=?#]*)$

[^=?#] will match any character except the ones specified.

For example:

Match
Match
No match
No match

([^=?#]*)

from urllib.parse import urlparse  # Python 3; in Python 2 this was "from urlparse import urlparse"

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

for url in urls:
    # keep only URLs with an empty query string
    # (.query is the same field as index 4 of the parse result)
    if not urlparse(url).query:
        print(url)

http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php
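Note that the check above only looks at the query string, so a URL with nothing but a #fragment would still get through. A small variant that also rejects fragments might look like this; the function name has_no_query_or_fragment is hypothetical, and it is a sketch, not part of the original answer.

```python
from urllib.parse import urlsplit


def has_no_query_or_fragment(url):
    """Return True if the URL has neither a ?query nor a #fragment part."""
    parts = urlsplit(url)
    return not parts.query and not parts.fragment


for url in [
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/index.php#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]:
    print(url, has_no_query_or_fragment(url))
```

Unlike the regex approach, this does not reject a literal = that appears in the path itself, which may or may not be what the crawler needs.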
