How do I merge the rows of network-traffic flow pairs in R? [sql, r, network-traffic]


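I have a lot of SiLK flow data that I would like to do some data mining on. It looks like the destination IP column matches the source IP column of a row further down in the data. How do I merge the source id row with the destination id row in R? Here is some simplified network flow data: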

id    sip    dip    notes
1     20     30     20 is talking to 30
2     20     31     20 is talking to 31
3     20     32     20 is talking to 32
4     30     20     30 is responding to 20
5     31     20     31 is responding to 20
6     32     20     32 is responding to 20
7     20     32     20 is talking to 32 again
8     20     30     20 is talking to 30 again
9     32     20     32 is responding to 20 again
10    20     31     20 is talking to 31 again
11    31     20     31 is responding to 20 again
12    30     20     30 is responding to 20 again
13    21     30     21 is talking to 30
14    30     21     30 is responding to 21
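To experiment with the sample above, it can be typed into R as a data frame. This is only a convenience sketch: the name flowtest matches the code used later in this thread, and the notes column is rebuilt from the IPs under the assumption (taken from the SQL shorthand below) that sources below 30 are the initiators.

```r
# Rebuild the simplified sample data as a data frame named flowtest
flowtest <- data.frame(
  id  = 1:14,
  sip = c(20, 20, 20, 30, 31, 32, 20, 20, 32, 20, 31, 30, 21, 30),
  dip = c(30, 31, 32, 20, 20, 20, 32, 30, 20, 31, 20, 20, 30, 21),
  stringsAsFactors = FALSE
)

# The notes column is for the reader only; here it is regenerated from the IPs.
# Assumption: a source IP below 30 is "talking", anything else is "responding".
verb  <- ifelse(flowtest$sip < 30, "is talking to", "is responding to")
again <- ifelse(flowtest$id %in% 7:12, " again", "")
flowtest$notes <- paste0(flowtest$sip, " ", verb, " ", flowtest$dip, again)

print(flowtest)
```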
I would like to merge the rows so that they look like this:

id_S    sip_S    dip_S    notes_S                      id_D    sip_D    dip_D    notes_D
1       20       30       20 is talking to 30          4       30       20       30 is responding to 20
2       20       31       20 is talking to 31          5       31       20       31 is responding to 20
3       20       32       20 is talking to 32          6       32       20       32 is responding to 20
7       20       32       20 is talking to 32 again    9       32       20       32 is responding to 20 again
8       20       30       20 is talking to 30 again    12      30       20       30 is responding to 20 again
10      20       31       20 is talking to 31 again    11      31       20       31 is responding to 20 again
13      21       30       21 is talking to 30          14      30       21       30 is responding to 21
I have over a million rows of data. Doing this in SQL Express takes days and a lot of disk space:

WITH flowtest_merged AS(
SELECT
    s.id AS id_S,
    s.sip AS sip_S,
    s.dip AS dip_S,
    s.notes AS notes_S,
    d.id AS id_D,
    d.sip AS sip_D,
    d.dip AS dip_D,
    d.notes AS notes_D,
    ROW_NUMBER() OVER(PARTITION BY s.id ORDER BY d.id) AS RN
FROM
    flowtest AS s INNER JOIN
    flowtest AS d ON
    s.dip = d.sip AND /* The source id is talking to the destination id */
    s.sip = d.dip AND /* The destination id is responding to the source id */
    s.id < d.id AND /* The source id is the initiator of the exchange */
    s.sip < 30 /* shorthand for "I'm selecting the internal ip range here" */
)
SELECT
    id_S,
    sip_S,
    dip_S,
    notes_S,
    id_D,
    sip_D,
    dip_D,
    notes_D
FROM flowtest_merged
WHERE (RN = 1)
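When I make a feeble attempt at a merge in R:

flowtest_merged <- merge(
    flowtest[, setdiff(colnames(flowtest), "dip")],
    flowtest[, setdiff(colnames(flowtest), "sip")],
    by.x = "sip",
    by.y = "dip",
    all = FALSE,
    suffixes = c("_S", "_D"))
I get far too many rows (54 for the sample data; the output is listed further down). In other words, I don't get one row merged with exactly one other row, as I wanted. How do I merge each source id row with its destination id row?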

Thanks,

Dave

EDIT: Here is the first matching pair:

UID|SIP|DIP|PROTOCOL|SPORT|DPORT|PACKETS|BYTES|FLAGS|STIME|DURATION|ETIME|SENSOR|FLOWTYPE|ICMP_TYPE|ICMP_CODE|APPLICATION|INPUT|OUTPUT|TIMEOUT|CONTINUATION|INIT_FLAGS|SESSION_FLAGS|BLACKLIST|WHITELIST|NORMALIZED_DOMAIN|COUNTRY
720109425873|3232248427|3232248333|17|57554|53|1|70|0|2013-01-01 00:00:15.046|0|2013-01-01 00:00:15.046|THERMOPYLAE|6|||0|0|0|0|0|0|0|N|Y|erath.mechesrx.net|NULL
...
720107126014|3232248333|3232248427|17|53|57868|2|238|0|2013-01-01 00:02:15.827|0|2013-01-01 00:02:15.827|THERMOPYLAE|6|||0|0|0|0|0|0|0|N|Y|NULL|NULL

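I have identified two reasons why you may be getting too many matching rows:

  • You selected only sip/dip as the matching criterion, whereas it should be (sip, dip)/(dip, sip). Use by.x=c('sip','dip') and the corresponding by.y=c('dip','sip').

  • The "talking" rows also match the "responding again" rows, and the "talking again" rows also match the "responding" rows. This is slightly harder to solve. Let me introduce arrange(dataframe, ...) from plyr, which sorts data frames elegantly.

  • Let's arrange your data so that related communication between the same peers is adjacent, and assign IDs in that order: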

    library(plyr)
    flowtest_arranged <- arrange(flowtest, pmin(sip, dip), pmax(sip, dip), id)
    flowtest_arranged$nid <- seq_along(flowtest_arranged$id)
    flowtest_arranged$nid.lag <- flowtest_arranged$nid - 1
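The first point can be tried on the sample data. The sketch below is a base R illustration under stated assumptions, not the answerer's code: it rebuilds the sample (ids and IPs only), joins on both columns swapped, and then mimics the filters from the SQL in the question (sip < 30 as the internal range, the source row coming before the response, first response per source row). The names both, cand and pairs are made up for this example.

```r
# Sample data from the question (ids and IPs only)
flowtest <- data.frame(
  id  = 1:14,
  sip = c(20, 20, 20, 30, 31, 32, 20, 20, 32, 20, 31, 30, 21, 30),
  dip = c(30, 31, 32, 20, 20, 20, 32, 30, 20, 31, 20, 20, 30, 21)
)

# Join every row to every row that has source and destination swapped
both <- merge(flowtest, flowtest,
              by.x = c("sip", "dip"), by.y = c("dip", "sip"),
              suffixes = c("_S", "_D"))
nrow(both)  # fewer rows than the single-column merge, but still not one-to-one

# Mimic the SQL: internal initiators only, and the response must come later
cand <- both[both$sip < 30 & both$id_S < both$id_D, ]

# Keep the earliest later response for each source row (the RN = 1 step)
cand  <- cand[order(cand$id_S, cand$id_D), ]
pairs <- cand[!duplicated(cand$id_S), ]
print(pairs)
```

Note that the intermediate join still contains every talk/response combination for each pair of peers, so on real data with many repeated conversations it may grow large before the filter is applied.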
    
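You can't look at what's in the notes field; it was only put there for your information.

@DaveBabbitt, is there any way to split the data the way Roland did here with your sample data?

The only way you can tell that they are pairs is that the sip/dip switch happens a few seconds apart. But if you draw a scatterplot of the two attributes, you can see perfect bilateral symmetry about the main diagonal, so you know it happens every time. If you look at the SQL, you can see that I have an internal IP range that initiates the exchange. Sometimes, in the real data, they just talk internally.

The error is not about a, but indicates that you did not load the package (the first line of the code). You may have to install it first. Also, if you can separate initiators and responders by IP, you could do that instead. I must admit that the purpose of the whole exercise is not obvious to me.

You can't use the notes field for anything; I created that field for your information.

Did you try by.x=c('sip','dip') on the flowtest data I provided? When I try it, I get: Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns. The "R: Merge Two Data Frames" documentation shows the by.x and by.y arguments taking only one column as input.

@DaveBabbitt: I have, see my edit. Which version of R are you using?

R version 3.0.1 (2013-05-16) -- "Good Sport"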
    # Output of the merge attempt in the question: 54 rows instead of the 7 wanted
    > flowtest_merged
       sip id_S                      notes_S id_D                      notes_D
    1   20    1          20 is talking to 30    5       31 is responding to 20
    2   20    1          20 is talking to 30    6       32 is responding to 20
    3   20    1          20 is talking to 30   11 31 is responding to 20 again
    4   20    1          20 is talking to 30    4       30 is responding to 20
    5   20    1          20 is talking to 30    9 32 is responding to 20 again
    6   20    1          20 is talking to 30   12 30 is responding to 20 again
    7   20    2          20 is talking to 31    5       31 is responding to 20
    8   20    2          20 is talking to 31    6       32 is responding to 20
    9   20    2          20 is talking to 31   11 31 is responding to 20 again
    10  20    2          20 is talking to 31    4       30 is responding to 20
    11  20    2          20 is talking to 31    9 32 is responding to 20 again
    12  20    2          20 is talking to 31   12 30 is responding to 20 again
    13  20    3          20 is talking to 32    5       31 is responding to 20
    14  20    3          20 is talking to 32    6       32 is responding to 20
    15  20    3          20 is talking to 32   11 31 is responding to 20 again
    16  20    3          20 is talking to 32    4       30 is responding to 20
    17  20    3          20 is talking to 32    9 32 is responding to 20 again
    18  20    3          20 is talking to 32   12 30 is responding to 20 again
    19  20    8    20 is talking to 30 again    5       31 is responding to 20
    20  20    8    20 is talking to 30 again    6       32 is responding to 20
    21  20    8    20 is talking to 30 again   11 31 is responding to 20 again
    22  20    8    20 is talking to 30 again    4       30 is responding to 20
    23  20    8    20 is talking to 30 again    9 32 is responding to 20 again
    24  20    8    20 is talking to 30 again   12 30 is responding to 20 again
    25  20   10    20 is talking to 31 again    5       31 is responding to 20
    26  20   10    20 is talking to 31 again    6       32 is responding to 20
    27  20   10    20 is talking to 31 again   11 31 is responding to 20 again
    28  20   10    20 is talking to 31 again    4       30 is responding to 20
    29  20   10    20 is talking to 31 again    9 32 is responding to 20 again
    30  20   10    20 is talking to 31 again   12 30 is responding to 20 again
    31  20    7    20 is talking to 32 again    5       31 is responding to 20
    32  20    7    20 is talking to 32 again    6       32 is responding to 20
    33  20    7    20 is talking to 32 again   11 31 is responding to 20 again
    34  20    7    20 is talking to 32 again    4       30 is responding to 20
    35  20    7    20 is talking to 32 again    9 32 is responding to 20 again
    36  20    7    20 is talking to 32 again   12 30 is responding to 20 again
    37  21   13          21 is talking to 30   14       30 is responding to 21
    38  30    4       30 is responding to 20    1          20 is talking to 30
    39  30    4       30 is responding to 20    8    20 is talking to 30 again
    40  30    4       30 is responding to 20   13          21 is talking to 30
    41  30   14       30 is responding to 21    1          20 is talking to 30
    42  30   14       30 is responding to 21    8    20 is talking to 30 again
    43  30   14       30 is responding to 21   13          21 is talking to 30
    44  30   12 30 is responding to 20 again    1          20 is talking to 30
    45  30   12 30 is responding to 20 again    8    20 is talking to 30 again
    46  30   12 30 is responding to 20 again   13          21 is talking to 30
    47  31    5       31 is responding to 20    2          20 is talking to 31
    48  31    5       31 is responding to 20   10    20 is talking to 31 again
    49  31   11 31 is responding to 20 again    2          20 is talking to 31
    50  31   11 31 is responding to 20 again   10    20 is talking to 31 again
    51  32    9 32 is responding to 20 again    3          20 is talking to 32
    52  32    9 32 is responding to 20 again    7    20 is talking to 32 again
    53  32    6       32 is responding to 20    3          20 is talking to 32
    54  32    6       32 is responding to 20    7    20 is talking to 32 again
    
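    library(plyr)
    # Sort so that flows between the same two peers are adjacent, in id order
    flowtest_arranged <- arrange(flowtest, pmin(sip, dip), pmax(sip, dip), id)
    # nid is the position in the sorted data, nid.lag the position just before it
    flowtest_arranged$nid <- seq_along(flowtest_arranged$id)
    flowtest_arranged$nid.lag <- flowtest_arranged$nid - 1
    
    # Self-merge: each row is matched with the row directly before it,
    # provided that row has sip and dip swapped
    merge(flowtest_arranged, flowtest_arranged, by.x=c('sip', 'dip', 'nid.lag'),
          by.y=c('dip', 'sip', 'nid'))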
    
       sip dip nid.lag id.x                      notes.x nid id.y
    1   20  30       2    8    20 is talking to 30 again   3    4
    2   20  31       6   10    20 is talking to 31 again   7    5
    3   20  32      10    7    20 is talking to 32 again  11    6
    4   30  20       1    4       30 is responding to 20   2    1
    5   30  20       3   12 30 is responding to 20 again   4    8
    6   30  21      13   14       30 is responding to 21  14   13
    7   31  20       5    5       31 is responding to 20   6    2
    8   31  20       7   11 31 is responding to 20 again   8   10
    9   32  20      11    9 32 is responding to 20 again  12    7
    10  32  20       9    6       32 is responding to 20  10    3
                         notes.y nid.lag
    1     30 is responding to 20       1
    2     31 is responding to 20       5
    3     32 is responding to 20       9
    4        20 is talking to 30       0
    5  20 is talking to 30 again       2
    6        21 is talking to 30      12
    7        20 is talking to 31       4
    8  20 is talking to 31 again       6
    9  20 is talking to 32 again      10
    10       20 is talking to 32       8
    Warning message:
    In merge.data.frame(flowtest_arranged, flowtest_arranged, by.x = c("sip",  :
      column name ‘nid.lag’ is duplicated in the result
    
    library(data.table)
    # Split the data set into a "talking" part and a "responding" part
    # (this will take a few seconds for several million entries).
    # df is the flow data frame, called flowtest in the question.
    a <- data.table(df[grep('talk', df$notes, fixed=TRUE),], key=c("sip","dip"))
    b <- data.table(df[grep('responding', df$notes, fixed=TRUE),], key=c("dip","sip"))
    #create a second id for each couple
    a[,id2:=seq_len(.N),by=key(a)]
    b[,id2:=seq_len(.N),by=key(b)]
    
    #merge
    setnames(b,c("sip","dip"),c("dip","sip"))
    merge(a,b,by=c("sip","dip","id2"),all=TRUE)
    
    #    sip dip id2 id.x                   notes.x id.y                      notes.y
    # 1:  20  30   1    1       20 is talking to 30    4       30 is responding to 20
    # 2:  20  30   2    8 20 is talking to 30 again   12 30 is responding to 20 again
    # 3:  20  31   1    2       20 is talking to 31    5       31 is responding to 20
    # 4:  20  31   2   10 20 is talking to 31 again   11 31 is responding to 20 again
    # 5:  20  32   1    3       20 is talking to 32    6       32 is responding to 20
    # 6:  20  32   2    7 20 is talking to 32 again    9 32 is responding to 20 again
    # 7:  21  30   1   13       21 is talking to 30   14       30 is responding to 21
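Since the notes column does not exist in the real data, the same idea can be tried with the split done by IP range, as suggested in the comments. The following is a minimal base R sketch, not the answerer's data.table code. It assumes that sip < 30 identifies the initiating side (the shorthand from the SQL in the question), that rows are already in id order, and that the k-th flow from A to B is answered by the k-th flow from B to A. The names talk, resp, k and paired are made up for this example.

```r
# Sample data from the question (ids and IPs only)
flowtest <- data.frame(
  id  = 1:14,
  sip = c(20, 20, 20, 30, 31, 32, 20, 20, 32, 20, 31, 30, 21, 30),
  dip = c(30, 31, 32, 20, 20, 20, 32, 30, 20, 31, 20, 20, 30, 21)
)

# Assumption: internal sources (sip < 30) initiate, everything else responds
internal <- flowtest$sip < 30
talk <- flowtest[internal, ]
resp <- flowtest[!internal, ]

# k counts the flows within each (sip, dip) combination, in row order
talk$k <- ave(talk$id, talk$sip, talk$dip, FUN = seq_along)
resp$k <- ave(resp$id, resp$sip, resp$dip, FUN = seq_along)

# Pair the k-th talk with the k-th response between the same two peers
paired <- merge(talk, resp,
                by.x = c("sip", "dip", "k"), by.y = c("dip", "sip", "k"),
                suffixes = c("_S", "_D"))
paired <- paired[order(paired$id_S), ]
print(paired)
```

Because each row is matched on its running count, the join stays one-to-one and no oversized intermediate result is built. Flows without a counterpart are dropped by the inner join; passing all = TRUE to merge would keep them with NA in the missing half.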