Memory problem when running code (Python, NetworkX)


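I wrote some code to build a graph with 379,613,734 edges.

However, the code cannot finish because of a memory problem. By the time it has read 62 million lines, it is using about 97% of the server's memory, so I killed it.

Do you have any idea how to solve this problem?

My code is as follows: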

import os, sys
import time
import networkx as nx


G = nx.Graph()

ptime = time.time()
j = 1

for line in open("./US_Health_Links.txt", 'r'):
#for line in open("./test_network.txt", 'r'):
    follower = line.strip().split()[0]
    followee = line.strip().split()[1]

    G.add_edge(follower, followee)

    if j%1000000 == 0:
        print j*1.0/1000000, "million lines done", time.time() - ptime
        ptime = time.time()
    j += 1

DG = G.to_directed()
#       P = nx.path_graph(DG)
Nn_G = G.number_of_nodes()
N_CC = nx.number_connected_components(G)
LCC = nx.connected_component_subgraphs(G)[0]
n_LCC = LCC.nodes()
Nn_LCC = LCC.number_of_nodes()
inDegree = DG.in_degree()
outDegree = DG.out_degree()
Density = nx.density(G)
#       Diameter = nx.diameter(G)
#       Centrality = nx.betweenness_centrality(PDG, normalized=True, weighted_edges=False)
#       Clustering = nx.average_clustering(G)

print "number of nodes in G\t" + str(Nn_G) + '\n' + "number of CC in G\t" + str(N_CC) + '\n' + "number of nodes in LCC\t" + str(Nn_LCC) + '\n' + "Density of G\t" + str(Density) + '\n'
#       sys.exit()
#   j += 1

The edge data looks like this:

1000    1001
1000245    1020191
1000    10267352
1000653    10957902
1000    11039092
1000    1118691
10346    11882
1000    1228281
1000    1247041
1000    12965332
121340    13027572
1000    13075072
1000    13183162
1000    13250162
1214    13326292
1000    13452672
1000    13844892
1000    14061830
12340    1406481
1000    14134703
1000    14216951
1000    14254402
12134   14258044
1000    14270791
1000    14278978
12134    14313332
1000    14392970
1000    14441172
1000    14497568
1000    14502775
1000    14595635
1000    14620544
1000    14632615
10234    14680596
1000    14956164
10230    14998341
112000    15132211
1000    15145450
100    15285998
1000    15288974
1000    15300187
1000    1532061
1000    15326300

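Finally, does anyone have experience analyzing Twitter link data? I am finding it quite hard to take a directed graph and compute the mean/median in-degree and out-degree of its nodes. Any help or ideas?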

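Here are some ideas:

You can name the nodes with integers instead of strings. This saves memory compared to the approach in the question, which uses strings: for multi-digit identifiers, an integer takes noticeably less memory than the equivalent string.

You can let the program run even when memory is full, because your operating system will simply start swapping. You have about 300 million edges, so I would guess they need a few GB of memory; even if your machine swaps, it may still be able to handle a graph of that size (especially with the memory saved by integer-labeled nodes).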
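
The integer-versus-string point can be checked directly. The following is a minimal sketch for Python 3, assuming every line of the edge file holds two whitespace-separated numeric IDs as in the sample data; the helper name parse_edge is made up for illustration. Exact object sizes depend on the interpreter and platform, so it is worth measuring on your own machine rather than relying on any fixed figure.

```python
import sys


def parse_edge(line):
    """Split one 'follower followee' line into a pair of integer node IDs."""
    follower, followee = line.split()
    return int(follower), int(followee)


# Compare the size of one node ID stored as a string and as an integer.
sample_id = "14258044"
str_size = sys.getsizeof(sample_id)
int_size = sys.getsizeof(int(sample_id))
print("str:", str_size, "bytes; int:", int_size, "bytes")

# Usage: convert a line from the edge file before passing it to add_edge.
print(parse_edge("1000    1001\n"))
```

In the loop from the question, this would mean calling G.add_edge(*parse_edge(line)) instead of passing the two strings.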

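First, you should consider whether you could add more RAM. Make some estimates of the memory usage, either by calculating from the data you have or by reading in subsamples of the data of various sizes to measure how things scale. The modest cost of a few GB of RAM might save you a lot of time and trouble.

Second, consider whether you actually need to build the whole graph. For example, you can determine the number of vertices and their degrees just by iterating over the file and counting: you only need to keep one line in memory at a time, plus the counts, which will be much smaller than the graph. Once you know the degrees, you can leave every vertex of degree one out of the graph when looking for the largest connected component, and then correct for the omitted nodes afterwards. You are doing data analysis, not implementing some general algorithm: learn simple things about the data first to enable the more complicated analyses.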
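
The counting approach described above could look roughly like this in Python 3. It is a sketch under a few assumptions: each line is read as a directed edge from follower to followee (suggested by the variable names in the question, not confirmed by it), repeated lines are counted as repeated edges, and the function names are invented for illustration.

```python
from collections import Counter
from statistics import mean, median


def degree_counts(lines):
    """Count out-degree and in-degree per node from 'follower followee' lines."""
    out_deg, in_deg = Counter(), Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        follower, followee = int(parts[0]), int(parts[1])
        out_deg[follower] += 1
        in_deg[followee] += 1
    return out_deg, in_deg


def summarize(deg, nodes):
    """Return (mean, median) degree over all nodes, counting absent nodes as 0."""
    values = [deg.get(n, 0) for n in nodes]
    return mean(values), median(values)


# A few lines in the format of the sample data; any iterable of lines works.
sample = ["1000 1001", "1000245 1020191", "1000 10267352", "10346 11882"]
out_deg, in_deg = degree_counts(sample)
nodes = set(out_deg) | set(in_deg)
print("number of nodes:", len(nodes))
print("out-degree mean/median:", summarize(out_deg, nodes))
print("in-degree mean/median:", summarize(in_deg, nodes))
```

For the real data, the open file object can be passed in place of the list, for example degree_counts(open("./US_Health_Links.txt")), so that only one line plus the two counters is held in memory at a time.
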
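As far as I can tell, what you are trying to do with the directed graph makes no sense. Calling G.to_directed() does not give the edges any meaningful direction, so DG.in_degree() and DG.out_degree() will both be the same as G.degree(). If you care about the difference between in-degrees and out-degrees, you need to build the graph as a directed graph from the start.

Thanks for your comment. I did not know that. I want to look at the basic statistics of the network, such as #CC (number of connected components), LCC (%) (fraction of nodes in the largest connected component), mean (median) in-degree, mean (median) out-degree, diameter, and clustering coefficient. This is my first time using networkx.

The networkx tutorial is fine, but it is neither a network analysis tutorial nor a Python tutorial; for learning purposes, it is clearly easier to start with a small network.

For basic statistics, it may be easier to try something along those lines. Their advice on handling memory problems should be especially relevant.

+1: good advice about measuring the memory requirements and buying more memory; interesting remarks about data analysis.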
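
To illustrate the point about G.to_directed() made above, here is a small sketch on three edges taken from the sample data. It assumes the direction follower to followee and wraps the degree results in dict() so that it runs both on older networkx releases, where the degree methods return dicts, and on newer ones, where they return views.

```python
import networkx as nx

edges = [(1000, 1001), (1000, 10267352), (10346, 11882)]

# Undirected graph converted afterwards: every edge becomes two opposite
# arcs, so in-degree and out-degree both equal the undirected degree.
G = nx.Graph()
G.add_edges_from(edges)
DG_converted = G.to_directed()
same_as_degree = (
    dict(DG_converted.in_degree()) == dict(G.degree())
    and dict(DG_converted.out_degree()) == dict(G.degree())
)
print("converted graph repeats G.degree():", same_as_degree)

# Directed graph built from the start: the follower -> followee
# direction of each line is preserved.
DG = nx.DiGraph()
DG.add_edges_from(edges)
out_deg = dict(DG.out_degree())
in_deg = dict(DG.in_degree())
print("node 1000 out-degree:", out_deg[1000], "in-degree:", in_deg[1000])

# Connected components of a directed graph are counted on its
# weakly connected components, which ignore edge direction.
n_wcc = nx.number_weakly_connected_components(DG)
print("weakly connected components:", n_wcc)
```

In the code from the question, this corresponds to creating G with nx.DiGraph() before the loop instead of calling to_directed() afterwards.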