Perl: Removing top-level-directory-only URLs from a list of URLs?

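I have a question that I'm having trouble researching, because I don't know how to phrase it properly for a search engine.

I have a list of URLs. I would like an automated way (Perl preferred) to go through the list and remove every URL that only points at the top-level directory.

For example, I might have the following list:

In this case, I would want to remove the example.com URLs from the list, because they either are the top directory or reference files in the top directory.

I'm trying to work out how to do this. My first thought was to count the forward slashes and, if there are more than two, eliminate the URL from the list. But a URL can also end in a trailing forward slash, so that wouldn't work.

Any ideas or thoughts would be much appreciated.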

Something like this:

use strict;
use warnings;
use URI::Split qw( uri_split );

my $url = "http://www.foo.com/this/thingrighthere.html";
my ($scheme, $auth, $path, $query, $frag) = uri_split( $url );

# tr/// with an empty replacement list only counts the slashes in the path
if (($path =~ tr/\///) > 1 ) {
    print "I care about this $url\n";
}
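One caveat with counting slashes is the trailing slash raised in the question: a path such as /foo/ contains two slashes even though it is only one level deep. The following is a minimal sketch of one way to handle that, assuming a trailing slash should not count as an extra level. The helper name is_deeper_than_top_level is hypothetical and not part of URI::Split.

```perl
use strict;
use warnings;
use URI::Split qw( uri_split );

# Hypothetical helper: true when the URL's path goes deeper than one level.
# Assumption: a trailing slash does not add a level, so /foo/ counts like /foo.
sub is_deeper_than_top_level {
    my ($url) = @_;
    my (undef, undef, $path) = uri_split($url);
    $path = '' unless defined $path;

    $path =~ s{/+\z}{};                  # drop trailing slashes before counting
    my $slashes = ($path =~ tr{/}{});    # empty replacement list: count only

    return $slashes > 1;
}

for my $url (
    "http://www.foo.com/this/thingrighthere.html",
    "http://www.example.com/foo/",
    "http://www.example.com/hello.html",
) {
    printf "%-45s %s\n", $url, is_deeper_than_top_level($url) ? "keep" : "drop";
}
```

Under this assumption only the first URL is kept. If /foo/ should count as a second level instead, leave out the substitution.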

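Use the URI module from CPAN.

This is a solved problem. People have already written, tested, and debugged code that handles it. Whenever you run into a programming problem that other people have probably had to deal with, look for existing code that solves it for you.

You could do this with regular expressions, but it is far less work to let a library handle it. You won't get tripped up by unusual schemes, escapes, or extra material before and after the path (query, anchor, authority...). There is some trickiness in how path_segments() represents the path; see the comments in the code below for details.

I'm assuming that

http://www.example.com/foo/

is considered a top-level directory. Adjust as needed, but it is something you have to think about.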
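To make the path_segments() behaviour concrete, here is a small sketch that prints the raw segments next to the count left after dropping the empty ones. The helper name segments_for is hypothetical, and the URLs are only sample input.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: the raw path segments of a normalized URL.
sub segments_for {
    my ($url) = @_;
    return URI->new($url)->canonical->path_segments;
}

for my $url (
    "http://www.example.com",
    "http://www.example.com/foo/",
    "http://www.example.com/foo/bar",
) {
    my @raw  = segments_for($url);
    my @kept = grep { $_ ne '' } @raw;   # drop the empty leading/trailing segments

    printf "%-32s raw=[%s] non-empty=%d\n",
        $url, join(",", map { "'$_'" } @raw), scalar @kept;
}
```

An absolute path begins with an empty segment and a trailing slash adds another, so comparing the raw count against a threshold would treat /foo and /foo/ differently. Counting only the non-empty segments avoids that.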

#!/usr/bin/env perl

use URI;
use File::Spec;

use strict;
use warnings;

use Test::More 'no_plan';

sub is_top_level_uri {
    my $uri = shift;

    # turn it into a URI object if it isn't already
    $uri = URI->new($uri) unless eval { $uri->isa("URI") };

    # normalize it
    $uri = $uri->canonical;

    # split the path part into pieces
    my @path_segments = $uri->path_segments;

    # for an absolute path, which most are, the absoluteness will be
    # represented by an empty string.  Also /foo/ will come out as two elements.
    # Strip that all out, it gets in our way for this purpose.
    @path_segments = grep { $_ ne '' } @path_segments;

    return @path_segments <= 1;
}

my @filtered_uris = (
  "http://www.example.com/hello.html",
  "http://www.example.com/",
  "http://www.example.com",
  "https://www.example.com/",
  "https://www.example.com/foo/#extra",
  "ftp://www.example.com/foo",
  "ftp://www.example.com/foo/",
  "https://www.example.com/foo/#extra",
  "https://www.example.com/foo/?extra",
  "http://www.example.com/hello.html#extra",
  "http://www.example.com/hello.html?extra",
  "file:///foo",
  "file:///foo/",
  "file:///foo.txt",
);

my @unfiltered_uris = (
  "http://www.foo.com/this/thingrighthere.html",
  "https://www.example.com/foo/bar",
  "ftp://www.example.com/foo/bar/",
  "file:///foo/bar",
  "file:///foo/bar.txt",
);

for my $uri (@filtered_uris) {
    ok is_top_level_uri($uri), $uri;
}

for my $uri (@unfiltered_uris) {
    ok !is_top_level_uri($uri), $uri;
}
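The tests above only check the predicate. To actually remove the top-level URLs from a list, as the question asks, one option is to negate the predicate inside grep. The following is a sketch under that assumption: it repeats a compact version of is_top_level_uri so it runs on its own, and the @urls list is hypothetical sample input borrowed from the test data.

```perl
use strict;
use warnings;
use URI;

# Compact restatement of the predicate above, repeated so this sketch is
# self-contained. It accepts plain strings only.
sub is_top_level_uri {
    my ($url) = @_;
    my $uri = URI->new($url)->canonical;
    my @segments = grep { $_ ne '' } $uri->path_segments;
    return @segments <= 1;
}

# Hypothetical input list.
my @urls = (
    "http://www.example.com/hello.html",
    "http://www.example.com/",
    "http://www.foo.com/this/thingrighthere.html",
    "https://www.example.com/foo/",
    "https://www.example.com/foo/bar",
);

# Keep only the URLs that go deeper than the top-level directory.
my @kept = grep { !is_top_level_uri($_) } @urls;

print "$_\n" for @kept;
```

grep preserves the original order, so the remaining URLs come out in the same sequence as the input.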

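Which way does http://example.com/foo/ fall? Note the trailing slash.

Thank you, Andy. I figured there would be a solution, but I didn't know how to phrase the question to find it.

Thank you, Schwern! I'll modify this for my purposes. I really appreciate you taking the time to help me with this.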