Python: join all the arrays inside an ndarray in numpy
I generated an ndarray containing all possible combinations of 3 arrays, as follows:
countries = ["AF"... "Zw"]
names = ["name1",... "nameN"]
var_type = ['var1', 'var2', 'var3']
combinations = np.array(np.meshgrid(names, var_type,countries)).T.reshape(-1, 3)
This gives an array like the following:
array([
['name1', 'var1', 'AF'],
['name1', 'var2', 'AF'],
['name1', 'var3', 'AF'],
...,
['nameN', 'var1', 'ZW'],
['nameN', 'var2', 'ZW'],
['nameN', 'var3', 'ZW']
])
I want to join each individual sub-array, to get a new array with the merged values, like this:
array([
"name1-var1-AF",
"name1-var2-AF",
"name1-var3-AF",
...,
"nameN-var1-ZW",
"nameN-var2-ZW",
"nameN-var3-ZW"
])
But so far, the only way I have found by googling is a for loop like this:
columns = []
for column in combinations:
    columns.append(str('-'.join(column)))
Is there a more vectorized way to achieve this?
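numpy does not have fast compiled code for handling strings, apart from the basic array operations that work for any dtype. Even the np.char functions use the basic Python string methods.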
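For reference, a loop-free spelling of the join does exist in np.char. The sketch below is only an illustration, not a recommendation: it uses small made-up input lists (the question's real lists are elided) and chains np.char.add over the three columns. Per the point above, it should not be expected to beat a plain Python loop.

```python
import numpy as np

# Small illustrative inputs; the question's real lists are elided.
countries = ["AF", "ZW"]
names = ["name1", "name2"]
var_type = ["var1", "var2", "var3"]

combinations = np.array(np.meshgrid(names, var_type, countries)).T.reshape(-1, 3)

# Element-wise string concatenation, one column at a time:
# column 0 + "-" + column 1 + "-" + column 2
joined = np.char.add(
    np.char.add(np.char.add(combinations[:, 0], "-"), combinations[:, 1]),
    np.char.add("-", combinations[:, 2]),
)

print(joined)
```

The result is a 1-D string array with one joined label per row of combinations.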
In [12]: countries = ["AF","Zw"]
    ...: names = ["name1","name2", "nameN"]
    ...: var_type = ['var1', 'var2', 'var3']
    ...: combinations = np.array(np.meshgrid(names, var_type,countries)).T.reshape(-1, 3)
In [14]: ['-'.join(row) for row in _]
Out[14]:
['name1-var1-AF',
'name1-var2-AF',
'name1-var3-AF',
'name2-var1-AF',
...
'nameN-var3-Zw']
This is basically a list operation, and iterating over a list is faster than iterating over an array:
In [18]: timeit ['-'.join(row) for row in combinations]
62.3 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [19]: timeit ['-'.join(row) for row in combinations.tolist()]
6.55 µs ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %%timeit alist = combinations.tolist()
...: ['-'.join(row) for row in alist]
...:
...:
2.88 µs ± 3.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
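The comparison above can be reproduced outside IPython with the standard timeit module. This is a minimal sketch under assumed inputs (the same small lists as in the session); the helper names join_array_rows and join_list_rows are hypothetical, and the absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np

countries = ["AF", "Zw"]
names = ["name1", "name2", "nameN"]
var_type = ["var1", "var2", "var3"]
combinations = np.array(np.meshgrid(names, var_type, countries)).T.reshape(-1, 3)


def join_array_rows():
    # Iterates the ndarray directly: each row is a small array of np.str_.
    return ['-'.join(row) for row in combinations]


def join_list_rows():
    # Converts to nested Python lists first, then iterates plain str objects.
    return ['-'.join(row) for row in combinations.tolist()]


t_array = timeit.timeit(join_array_rows, number=1000)
t_list = timeit.timeit(join_list_rows, number=1000)
print(f"array rows: {t_array:.4f} s, list rows: {t_list:.4f} s (1000 runs each)")
```

Both helpers return the same list of strings; only the iteration cost differs.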
If I include the time it takes to create the combinations:
In [29]: %%timeit
...: combinations = np.array(np.meshgrid(names, var_type,countries)).T.reshape(-1, 3)
...: ['-'.join(row) for row in combinations]
...:
...:
164 µs ± 925 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
On the other hand, using itertools.product:
In [30]: timeit ['-'.join(tup) for tup in product(names, var_type, countries)]
4.17 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
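One detail worth noting: product(names, var_type, countries) varies the country fastest, so its output order differs from the rows of the meshgrid-based array, where var_type varies fastest and the country slowest. If the original order matters, a sketch along these lines (with small made-up input lists, since the question's real ones are elided) reproduces it without numpy:

```python
from itertools import product

# Small illustrative inputs; the question's real lists are elided.
countries = ["AF", "ZW"]
names = ["name1", "name2"]
var_type = ["var1", "var2", "var3"]

# Iterate countries slowest and var_type fastest (the row order of the
# meshgrid-based array), then write each label as name-var-country.
columns = [f"{n}-{v}-{c}" for c, n, v in product(countries, names, var_type)]

for label in columns:
    print(label)
```

If the order is irrelevant, the simpler product(names, var_type, countries) form from the timing above is enough.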
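In this case, numpy doesn't help.
What is the exact dtype? Can you show exactly how the original array is constructed, e.g. as an MCVE?
I've edited the question to show how the array is constructed.
Do the columns in combinations always have a consistent number of characters, or is it arbitrary? In the latter case, use a loop.
It can vary, because the names have different lengths.
Then there isn't much you can do. You could use linear indexing and map positions that way, but it would be much slower than just running a Python loop. Numpy works best with elements of uniform size, and these are not.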
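The remark about uniform element size can be seen directly in the dtype of a string array. The sketch below uses made-up values: numpy's default string arrays are fixed-width, so every element is stored at the width of the longest string.

```python
import numpy as np

# Made-up values of clearly different lengths.
values = ["AF", "name1", "a-much-longer-name"]
arr = np.array(values)

longest = max(len(v) for v in values)

# The dtype is a fixed-width unicode type sized for the longest element,
# and every element occupies the same number of bytes (4 per character).
print(arr.dtype, arr.itemsize, longest)
```

Joining columns of different widths therefore means producing a new, wider fixed-width array, which is part of why a plain Python loop over lists is the natural fit here.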