Python: Regex to get relative URLs from crawled JavaScript (tags: python, regex, scrapy)


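I have a crawler set up with Scrapy and am trying to process links. The problem is that the links are embedded in JavaScript, and I am struggling to write a regular expression for them. Here are the 3 examples I am trying to process: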

  • javascript:openInIFrame('main','setup.phtml%3f.op%3d3800%26.who%3daaaaaaa%26.menuItemRefNo=118')
  • javascript:window.open('overview.phtml?&.who=aaaaaa&.id=2','43425235','menubar=no,toolbar=no,location=no,resizeable=yes,maximize=yes')
  • javascript:openInIFrame('main',"page.phtml%3f.op%3d1499%26.who%3daaaaaaa%26.ifmod%3dtest&.menuItemRefNo=7")

The resulting relative URL for each is the text between the single/double quotes:

  • setup.phtml%3f.op%3d3800%26.who%3daaaaaaa%26.menuItemRefNo=118
  • overview.phtml?&.who=aaaaaa&.id=2
  • page.phtml%3f.op%3d1499%26.who%3daaaaaaa%26.ifmod%3dtest&.menuItemRefNo=7

I have tried (.*) and variations of (["'])(?:(?=(\\?))\2.)*?\1, but I can't seem to get it right. What am I missing?

Maybe try the following:

    ['"].*phtml.*['"]
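A quick check of this pattern against the second sample, as a sketch: because .* is greedy, the match runs from the first quote to the last one, so it also swallows the extra window.open arguments. The quote-excluding variant shown for comparison is an assumption of this example, not part of the suggestion above, and the sample string is shortened.

```python
import re

# Shortened version of the second sample from the question.
sample = ("javascript:window.open('overview.phtml?&.who=aaaaaa&.id=2',"
          "'43425235','menubar=no,toolbar=no,location=no')")

# The suggested pattern: greedy .* extends to the last quote in the string.
greedy = re.search(r"""['"].*phtml.*['"]""", sample)

# Excluding quote characters stops the match at the first argument's closing quote.
bounded = re.search(r"""['"]([^'"]*phtml[^'"]*)['"]""", sample)

greedy_match = greedy.group(0)
bounded_match = bounded.group(1)
print(greedy_match)
print(bounded_match)
```

The greedy result still needs trimming before it is usable as a URL, which is why the character-class approach tends to be easier to work with here.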
    
Try this:

    import re
    
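    # Match either JavaScript call, then capture the first quoted argument after it.
    # The dot in window.open is escaped so it only matches a literal dot.
    url_regex = re.compile(r"(?:javascript:openInIFrame\('main',|javascript:window\.open\()\s*(?:'|\")([^'\"]+)(?:'|\")")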
    
    samples = [
      "javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')",
      "javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');",
      "javascript:openInIFrame('main', \"page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7\")"
    ]
    
    # Print the captured relative URL, or a marker when the pattern does not match.
    for sample in samples:
        md = url_regex.search(sample)
        if md:
            print(md.group(1))
        else:
            print('NO MATCH')
    
For me, this produces:

    setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
    overview.phtml?&.who=AAAAAAAAAAAA&.id=2
    page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
    

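The trick is ([^'\"]+). It captures any run of one or more characters that are neither single nor double quotes, so it consumes everything up to the closing quote of the URL string, which is exactly the URL. Note that the \" is needed because the regex itself is delimited with ".

You will need two regexes: one for window.open and one for openInIFrame.

Can I grab the quoted value based on whether it has ".phtml" in it? Please let me know =P

If you always want the first single-quoted string containing phtml, sure: '([^']+\.phtml[^']+), provided you don't need to handle backslash escapes; then URI-decode it.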
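The last suggestion can be sketched roughly as follows: take the first quoted string that contains .phtml, then percent-decode it. The helper name extract_phtml_url is hypothetical, the pattern is widened here to accept either quote style, and backslash-escaped quotes are not handled.

```python
import re
from urllib.parse import unquote

# First quoted string containing ".phtml"; accepts ' or " as the delimiter.
PHTML_RE = re.compile(r"""['"]([^'"]+\.phtml[^'"]*)['"]""")


def extract_phtml_url(href):
    """Return the percent-decoded relative URL, or None if nothing matches."""
    match = PHTML_RE.search(href)
    return unquote(match.group(1)) if match else None


samples = [
    "javascript:openInIFrame('main','setup.phtml%3f.op%3d3800%26.who%3daaaaaaa%26.menuItemRefNo=118')",
    "javascript:window.open('overview.phtml?&.who=aaaaaa&.id=2','43425235','menubar=no')",
    "javascript:openInIFrame('main',\"page.phtml%3f.op%3d1499%26.who%3daaaaaaa%26.ifmod%3dtest&.menuItemRefNo=7\")",
]

for sample in samples:
    print(extract_phtml_url(sample))
```

Whether to decode before or after joining against the page URL depends on how the target site expects the query string, so treat the unquote step as optional.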