How can I parse a tab-delimited data file and group the extracted data in Perl?


perl graph


I am new to Perl. I need to parse a tab-delimited text file. For example:

From name   To name      Timestamp                 Interaction
a             b        Dec  2 06:40:23 IST 2000        comment
c             d        Dec  1 10:40:23 IST 2001          like
e             a        Dec  1 16:03:01 IST 2000         follow
b             c        Dec  2 07:50:29 IST 2002         share
a             c        Dec  2 08:50:29 IST 2001        comment
c             a        Dec 11 12:40:23 IST 2008          like
e             c        Dec  2 07:50:29 IST 2000         like
c             b        Dec 11 12:40:23 IST 2008        follow
b             a        Dec  2 08:50:29 IST 2001        share

After parsing, I need to create groups based on the user interactions. In this example:

a<->b
b<->a
c<->a
a<->c
b<->c
c<->b

From Name

   a                X       <>       <>         X
   b                <>      X        <>         X
   c                <>      <>       X          X 
   d                X       <>       X          X

The output should be:

(a,c, d, e)
(x,y,z)

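Parsing is easy. A simple split is probably enough. However, a dedicated module for parsing delimited text might be a better choice.

For the connections, you can use the Graph module. To use that module effectively, you need to understand at least the basics of the graph theory it builds on.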
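As a minimal sketch of the split-based approach, assuming the fields are separated by real tab characters (the sample above may have been re-aligned with spaces when it was pasted), one record can be broken into four fields like this. The helper name `parse_record` and the hash keys are hypothetical; they are simply taken from the sample header.

```perl
use strict;
use warnings;

# Split one tab-delimited record into its four fields.
# Assumption: columns are separated by single tab characters, so the
# spaces inside the timestamp stay within one field.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my ($from, $to, $timestamp, $interaction) = split /\t/, $line, 4;
    return {
        from        => $from,
        to          => $to,
        timestamp   => $timestamp,
        interaction => $interaction,
    };
}

my $record = parse_record("a\tb\tDec  2 06:40:23 IST 2000\tcomment\n");
print "$record->{from} -> $record->{to} ($record->{interaction})\n";
# prints: a -> b (comment)
```

For grouping, only the first two fields matter, so the timestamp and interaction type can be ignored once each line is split.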

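Note that a strongly connected component is defined as follows:

A directed graph is called strongly connected if there is a path from each vertex in the graph to every other vertex. In particular, this means paths in each direction: a path from a to b and also a path from b to a.

The strongly connected components of a directed graph G are its maximal strongly connected subgraphs.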

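However, note that if you have a<->b and b<->c, then a, b, and c will form a strongly connected component. That makes it a weaker requirement than every member of the group interacting with every other member in both directions.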
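The difference between the two requirements can be sketched with a small self-contained example. It deliberately avoids the Graph module and stores edges in a plain hash of hashes; the names `%edge`, `reachable`, and `all_mutual` are hypothetical helpers for illustration only.

```perl
use strict;
use warnings;

# Directed edges stored as a hash of hashes: $edge{$from}{$to} = 1.
# The data is a<->b and b<->c, with no direct edge between a and c.
my %edge;
for my $pair ([qw(a b)], [qw(b a)], [qw(b c)], [qw(c b)]) {
    my ($from, $to) = @$pair;
    $edge{$from}{$to} = 1;
}

# True if $target can be reached from $start by following edges
# (breadth-first search).
sub reachable {
    my ($edges, $start, $target) = @_;
    my %seen  = ($start => 1);
    my @queue = ($start);
    while (defined(my $node = shift @queue)) {
        return 1 if $node eq $target;
        for my $next (sort keys %{ $edges->{$node} || {} }) {
            next if $seen{$next}++;
            push @queue, $next;
        }
    }
    return 0;
}

# True if every ordered pair of distinct members has a direct edge,
# which covers both directions for each pair.
sub all_mutual {
    my ($edges, @members) = @_;
    for my $x (@members) {
        for my $y (@members) {
            next if $x eq $y;
            return 0 unless ($edges->{$x} || {})->{$y};
        }
    }
    return 1;
}

my @members = qw(a b c);
my $strongly_connected = 1;
for my $x (@members) {
    for my $y (@members) {
        $strongly_connected = 0 unless reachable(\%edge, $x, $y);
    }
}

print "strongly connected: ", ($strongly_connected ? "yes" : "no"), "\n";
print "all pairs mutual:   ", (all_mutual(\%edge, @members) ? "yes" : "no"), "\n";
# prints:
# strongly connected: yes
# all pairs mutual:   no
```

Here a and c can reach each other through b, so the three vertices are strongly connected, yet they never interact directly, so they do not form a group in the stricter sense.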

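We can still use it to reduce the search space. Once you have the candidate groups, you can check each one to see whether it fits your definition of a group. If a candidate does not meet your requirement, you can check all of its subsets with one fewer member. If no group is found among those, you can look at all subsets with two fewer members, and so on, until you reach the minimum group size.

The script below uses this idea. However, it most likely will not scale. I strongly suspect someone could come up with some SQL magic for this, but my mind is too limited for that.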

#!/usr/bin/env perl

use strict;
use warnings;

use Graph;
use Algorithm::ChooseSubsets;

use constant MIN_SIZE => 3;

my $interactions = Graph->new(
    directed => 1,
);

while (my $interaction = <DATA>) {
    last unless $interaction =~ /\S/;
    my ($from, $to) = split ' ', $interaction, 3;

    $interactions->add_edge($from, $to);
}

my @groups = map {
    is_group($interactions, $_) ? $_
                                : check_subsets($interactions, $_)
} grep @$_ >= MIN_SIZE, $interactions->strongly_connected_components;


print "Groups: \n";
print "[ @$_ ]\n" for @groups;

sub check_subsets {
    my ($graph, $candidate) = @_;

    my @groups;
    for my $size (reverse MIN_SIZE .. (@$candidate - 1)) {
        my $subsets = Algorithm::ChooseSubsets->new(
            set => $candidate,
            size => $size,
        );

        my $groups_found;
        while (my $subset = $subsets->next) {
            if (is_group($graph, $subset)) {
                ++$groups_found;
                push @groups, $subset;
            }
        }
        last if $groups_found;
    }

    return @groups;
}

sub is_group {
    my ($graph, $candidate) = @_;

    for my $member (@$candidate) {
        for my $other (@$candidate) {
            next if $member eq $other;
            return unless $graph->has_edge($member, $other);
            return unless $graph->has_edge($other, $member);
        }
    }

    return 1;
}

__DATA__
a   c   Dec  2 06:40:23 IST 2000    comment
f   g   Dec  2 06:40:23 IST 2009    like
c   a   Dec  2 06:40:23 IST 2009    like
g   h   Dec  2 06:40:23 IST 2008    like
a   d   Dec  2 06:40:23 IST 2008    like
r   t   Dec  2 06:40:23 IST 2007    share
d   a   Dec  2 06:40:23 IST 2007    share
t   u   Dec  2 06:40:23 IST 2006    follow
a   e   Dec  2 06:40:23 IST 2006    follow
k   l   Dec  2 06:40:23 IST 2009    like
e   a   Dec  2 06:40:23 IST 2009    like
j   k   Dec  2 06:40:23 IST 2003    like
c   d   Dec  2 06:40:23 IST 2003    like
l   j   Dec  2 06:40:23 IST 2002    like
d   c   Dec  2 06:40:23 IST 2002    like
m   n   Dec  2 06:40:23 IST 2005    like
c   e   Dec  2 06:40:23 IST 2005    like
m   l   Dec  2 06:40:23 IST 2011    like
e   c   Dec  2 06:40:23 IST 2011    like
h   j   Dec  2 06:40:23 IST 2010    like
d   e   Dec  2 06:40:23 IST 2010    like
o   p   Dec  2 06:40:23 IST 2009    like
e   d   Dec  2 06:40:23 IST 2009    like
p   q   Dec  2 06:40:23 IST 2000    comment
q   p   Dec  2 06:40:23 IST 2009    like
a   p   Dec  2 06:40:23 IST 2008    like
p   a   Dec  2 06:40:23 IST 2007    share
l   p   Dec  2 06:40:23 IST 2003    like
j   l   Dec  2 06:40:23 IST 2002    like
t   r   Dec  2 06:40:23 IST 2000    comment
r   h   Dec  2 06:40:23 IST 2009    like
j   f   Dec  2 06:40:23 IST 2008    like
g   d   Dec  2 06:40:23 IST 2007    share
w   q   Dec  2 06:40:23 IST 2003    like
o   y   Dec  2 06:40:23 IST 2002    like
x   y   Dec  2 06:40:23 IST 2000    comment
y   x   Dec  2 06:40:23 IST 2009    like
x   z   Dec  2 06:40:23 IST 2008    like
z   x   Dec  2 06:40:23 IST 2007    share
y   z   Dec  2 06:40:23 IST 2003    like
z   y   Dec  2 06:40:23 IST 2002    like


Output:

Groups:
[ y z x ]
[ e d a c ]

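There are at least three different problems here (reading the file, extracting data from it, structuring the data). Which one is giving you trouble? What have you tried? How did it fail to work as you expected?

Quentin: Thanks for the quick reply. I am stuck on structuring the data. @sarnold I am new to Perl. How do I handle the interactions between users and form groups?

Are you having trouble with Perl or with the structure of the data itself? I can't see a detailed structure for the data you want. You want to create "groups based on user interactions", and the interactions are shown in your example result. Your data has "from" and "to", but your output has "a<->b", which to me means "a" has the same interaction with "b" as "b" has with "a", but a->b is a comment, a…

Unur: Thanks for providing some hints. But how can we represent this logically with a graph? I am not familiar with Graph and Perl. I will give it a try. The criterion for creating a group: "every user in the gro…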
