Bash: how to remove lines that have the same domain

bash perl sed cmd

I have a large txt file with 1 million lines. For example:

http://e-planet.ru/hosting/
http://www.anelegantchaos.org/
http://site.ru/e3-den-vtoroj/
https://escrow.webmoney.ru/about.aspx
http://e-planet.ru/feedback.html

How can I remove the lines that have the same domain?

I need a cleaned file that keeps only one line per domain, either

http://e-planet.ru/hosting/

or

http://e-planet.ru/feedback.html

use strict;
use warnings;

open my $in, '<', 'in.txt' or die $!;

my %seen;
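while (my $line = <$in>) {
    chomp $line;
    # Capture the host: everything between "//" and the next "/" (or end of line).
    my ($domain) = $line =~ m{^https?://([^/]+)}i;
    next unless defined $domain;    # skip lines that are not http(s) URLs
    # Print a line only the first time its domain is seen.
    print "$line\n" unless $seen{$domain}++;
}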

At first I didn't understand your question. Here is an awk one-liner:
awk -F'/' '!a[$3]++' myfile

Test input:
http://e-planet.ru/hosting/
http://www.anelegantchaos.org/
http://site.ru/e3-den-vtoroj/
https://escrow.webmoney.ru/about.aspx
http://e-planet.ru/feedback.html
https://escrow.webmoney.ru/woopwoop
httpp://whatever.com/slk

Output:
http://e-planet.ru/hosting/
http://www.anelegantchaos.org/
http://site.ru/e3-den-vtoroj/
https://escrow.webmoney.ru/about.aspx
httpp://whatever.com/slk

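Here, the second occurrences of http://e-planet.ru/ and https://escrow.webmoney.ru/ have been removed.
The script splits each line using / as the separator and compares the third field (the domain) to check for duplicates. A line is printed only the first time its domain appears. Note that this only works if every URL starts with whateverprotocol//. The double slash matters, because it is what makes the third field the domain.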
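
The protocol caveat can be relaxed with a slightly longer awk program. The following is only a minimal sketch and not part of the answer above: the function name dedupe_by_domain is made up for illustration, and it assumes one URL per line. It strips an optional scheme prefix, treats everything up to the first / as the host, lowercases it, and prints a line only the first time that host appears.

```shell
# Hypothetical helper: keep the first line seen for each host.
# Assumes one URL per line; the "scheme://" prefix is optional.
dedupe_by_domain() {
    awk '{
        host = $0
        sub(/^[A-Za-z][A-Za-z0-9+.-]*:\/\//, "", host)  # drop "scheme://" if present
        sub(/\/.*$/, "", host)                          # drop the path, keep the host
        host = tolower(host)                            # compare hosts case-insensitively
        if (host != "" && !seen[host]++) print          # first occurrence only
    }' "$@"
}
```

Something like `dedupe_by_domain in.txt > out.txt` would then write the reduced list, with in.txt and out.txt being placeholder file names. Original line order is preserved, as with the one-liner.
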
Sorry, I am not able to comment on fugu's post.
I think the problem may be that some lines contain more than one URL, so try this:
use strict;
use warnings;

open my $in, '<', 'in.txt' or die $!;

my %seen;
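while (my $line = <$in>) {
    chomp $line;
    # A line may hold several whitespace-separated URLs; check each one.
    for my $url (split ' ', $line) {
        my ($domain) = $url =~ m{^https?://([^/]+)}i;
        next unless defined $domain;    # skip tokens that are not http(s) URLs
        print "$url\n" unless $seen{$domain}++;
    }
}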
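
The same idea, several URLs on one line, can also be sketched as a shell pipeline. This is an illustrative assumption rather than anything from the answers here, and the function name dedupe_urls_any_layout is hypothetical. It first puts every whitespace-separated token on its own line, then applies the field-3 test, so tokens without a protocol// prefix are dropped.

```shell
# Hypothetical helper: one URL per output line, first occurrence of each domain wins.
dedupe_urls_any_layout() {
    # tr turns each run of whitespace into a single newline, so every URL
    # lands on its own line; awk then keys on field 3 (the domain) and
    # ignores tokens that have fewer than three /-separated fields.
    cat -- "$@" | tr -s '[:space:]' '\n' | awk -F/ 'NF >= 3 && !seen[$3]++'
}
```

It reads the files named as arguments, or standard input when none are given.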

If all you care about is the domains of those URIs, I suggest extracting them first,
followed by a simple sort pass in which you ask for unique entries only:
perl -lne 'print $1 if m{//(.+?)/}' file | sort | uniq > uniq_domains.txt

sort file | uniq>file.new
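
Plain sort | uniq only drops lines that are identical in full, so two different pages on the same domain both survive. A sketch of how sort could be told to compare the domain field alone follows. The function name unique_by_domain_sorted is invented for illustration, and the approach assumes every line starts with protocol// so that field 3 is the domain.

```shell
# Hypothetical helper: one line per domain, output ordered by domain.
unique_by_domain_sorted() {
    # -t/ splits fields on "/", -k3,3 restricts the key to field 3 (the domain),
    # -u keeps a single line from each group of lines with equal keys.
    LC_ALL=C sort -t/ -k3,3 -u -- "$@"
}
```

Unlike the awk approach, this reorders the file by domain, and which of the duplicate lines is kept should not be relied on.
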
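When you come across lines with the same domain, what output do you want? Delete both? Keep the first? Output only the domain?
@Jayesh How does this handle the same domain with different pages?
It doesn't work; lines with the same domain are still in the output txt file.
What does this do that my answer doesn't? You can't just copy and paste my answer and make a few small changes! Those should be added as comments under my original answer.
@fugu He can't, because his reputation is too low.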