Get the top X% of edges for each node in a large edge DataFrame?

Tags: sql, pandas, performance, dataframe, limit

I have a large pandas DataFrame, "dfTagTuple", with about 5,600,000 rows, like:

index Source Target Weight
0     a      b      2.0
1     a      d      1.2
2     a      b      2.0
3     a      d      1.2
4     a      b      2.0
5     a      d      1.2
6     a      b      2.0
7     a      d      1.2
8     b      d      0.3
9     b      d      0.3
10    b      d      0.3
11    b      d      0.3
12    b      d      0.3
13    b      d      0.3
14    c      l      0.8
and a list of the unique values from Source/Target (~91,000).

For each value in that unique list, I need the rows where the Source column equals that value, like:

df = dfTagTuple.loc[dfTagTuple["Source"] == "a"]
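Then I need to limit those rows to the top X percent (ratio, here 0.2 = 20%), i.e. the edges with the highest Weight, add them to a list/DataFrame, and build a DataFrame from the end result.

In short: for each node, keep the top X% of its connections.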

The edges are undirected, so a-b is the same as b-a. I can filter out duplicates later, or Gephi will take care of it.
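A minimal sketch of how the duplicate filtering mentioned above could be done later in pandas, assuming that "duplicate" means the same unordered (Source, Target) pair. The toy `edges` frame and the names `pair`, `canonical` and `deduped` are illustrative and not part of the question.

```python
import numpy as np
import pandas as pd

# Toy edge list; column names follow the question, the data is made up.
edges = pd.DataFrame({
    "Source": ["a", "b", "a", "c"],
    "Target": ["b", "a", "d", "l"],
    "Weight": [2.0, 2.0, 1.2, 0.8],
})

# Sort each (Source, Target) pair so that a-b and b-a share the same key.
pair = np.sort(edges[["Source", "Target"]].to_numpy(), axis=1)
canonical = edges.assign(Source=pair[:, 0], Target=pair[:, 1])

# Keep one row per undirected pair (assumption: the first occurrence wins).
deduped = canonical.drop_duplicates(subset=["Source", "Target"]).reset_index(drop=True)
print(deduped)
```

Whether the first occurrence, the maximum Weight, or a sum should survive depends on the data; the question leaves that open.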

The end result should be:

index Source Target Weight
0     a      b      2.0 # keep: Source a occurs 7x, top 20% = 7 * 0.2 -> 1 row, highest Weight
8     b      d      1.3 # keep: Source b occurs 8x, top 20% = 8 * 0.2 -> 2 rows, highest Weight
10    b      f      0.5 # keep: Source b occurs 8x, top 20% = 8 * 0.2 -> 2 rows, highest Weight
16    c      l      0.8 # keep: Source c occurs 1x, top 20% = 1 * 0.2 -> 0, raised to 1 row, highest Weight
However, my current code runs very slowly. I need a way to speed this up.

If someone knows how to do this in SQL, I could also push my DataFrame into SQLite and compute the "top X values per Source" there.
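A rough sketch of what the SQLite route could look like, assuming a SQLite build with window-function support (3.25 or newer). The table name `edges`, the toy data, and the column aliases `rn` and `n` are illustrative assumptions; the keep rule mirrors `int(n * ratio)` with a minimum of one row.

```python
import sqlite3
import pandas as pd

ratio = 0.2

# Toy data standing in for dfTagTuple (made up for illustration).
df = pd.DataFrame({
    "Source": ["a"] * 5 + ["b"] * 10 + ["c"],
    "Target": list("bcdef") + list("acdefghijk") + ["l"],
    "Weight": [2.0, 1.2, 0.5, 0.4, 0.3]
              + [float(w) for w in range(10, 0, -1)]
              + [0.8],
})

con = sqlite3.connect(":memory:")
df.to_sql("edges", con, index=False)

# Rank the edges of each Source by descending Weight and count them,
# then keep the first max(1, int(n * ratio)) rows of every Source.
query = """
SELECT Source, Target, Weight
FROM (
    SELECT Source, Target, Weight,
           ROW_NUMBER() OVER (PARTITION BY Source ORDER BY Weight DESC) AS rn,
           COUNT(*)     OVER (PARTITION BY Source) AS n
    FROM edges
)
WHERE rn <= MAX(1, CAST(n * :ratio AS INTEGER))
ORDER BY Source, Weight DESC
"""
top = pd.read_sql_query(query, con, params={"ratio": ratio})
con.close()
print(top)
```

For millions of rows, an index on (Source, Weight) may help the window sort, though that has not been measured here.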

Code so far:

import pandas as pd

keepRows = []
ratio = 0.2  # keep the top 20% of edges per node

# One row per unique Source value
dfTagTupleNodes = dfTagTuple["Source"].drop_duplicates().to_frame()

for row in dfTagTupleNodes.itertuples():
    # All edges starting at this node, highest Weight first.
    # sort_values returns a new frame, which avoids sorting a slice in place.
    df = dfTagTuple.loc[dfTagTuple["Source"] == row.Source]
    df = df.sort_values(by=["Weight"], ascending=False)
    # Keep ratio * (number of edges), but at least one row
    keepRowAmount = max(1, int(len(df.index) * ratio))
    dfKeep = df[:keepRowAmount]

    for edge in dfKeep.itertuples():
        keepRows.append([edge.Source, edge.Target, edge.Weight])

dfTagTupleTopX = pd.DataFrame(keepRows, columns=["Source", "Target", "Weight"])
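One possible way to avoid the per-node Python loop is to sort once and let `groupby` compute each row's position within its Source. This is only a sketch under the same keep rule as the loop (`int(n * ratio)`, at least one row); the toy data and the names `ordered`, `rank`, `size` and `keep` are assumptions for illustration, and no timing has been measured on the real 5.6 million rows.

```python
import pandas as pd

ratio = 0.2

# Toy data standing in for dfTagTuple (made up for illustration).
dfTagTuple = pd.DataFrame({
    "Source": ["a"] * 5 + ["b"] * 10 + ["c"],
    "Target": list("bcdef") + list("acdefghijk") + ["l"],
    "Weight": [2.0, 1.2, 0.5, 0.4, 0.3]
              + [float(w) for w in range(10, 0, -1)]
              + [0.8],
})

# Sort once so that, within each Source, rows run from highest to lowest Weight.
ordered = dfTagTuple.sort_values(
    ["Source", "Weight"], ascending=[True, False], kind="stable"
)

grouped = ordered.groupby("Source", sort=False)
rank = grouped.cumcount()                    # 0-based position within each Source
size = grouped["Weight"].transform("size")   # number of edges of that Source

# Same rule as the loop: truncate n * ratio, but keep at least one row per node.
keep = (size * ratio).astype(int).clip(lower=1)

dfTagTupleTopX = ordered.loc[
    rank < keep, ["Source", "Target", "Weight"]
].reset_index(drop=True)
print(dfTagTupleTopX)
```

Note that the expected-result table counts 8 * 0.2 = 1.6 as 2 rows, while `int()` truncates to 1. If rounding is what is wanted, `(size * ratio).round().astype(int)` could replace the truncation.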

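You want to keep the top X, but I don't see any sorting. There should be one, shouldn't there?

Right, sorry. I've added the df.sort_values() line to the code example.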