Python: expanding_corr returns NaN
I have a data file, and I just want to compute the pairwise correlations between the columns of two DataFrames:
In [7]: import os
In [8]: import pandas as pd
In [9]: import numpy as np
In [10]: from pandas import Series, DataFrame
In [12]: blog_dat = pd.read_table("blogdata.txt", index_col="Blog")
In [13]: blog_dat = blog_dat.astype(float)
In [14]: all(blog_dat.notnull())
Out[14]: True
In [15]: x = DataFrame(np.random.randn(99*4).reshape((99, 4)))
In [16]: pd.expanding_corr(blog_dat.iloc[:, :4], blog_dat.iloc[:, :4], pairwise=True)[-1, :, :]
Out[16]:
china kids music yahoo
china 1.000000 0.053069 0.026599 0.246957
kids 0.053069 1.000000 0.409978 0.094636
music 0.026599 0.409978 1.000000 0.055923
yahoo 0.246957 0.094636 0.055923 1.000000
In [17]: pd.expanding_corr(blog_dat.iloc[:, :4], x, pairwise=True)[-1, :, :]
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1240: RuntimeWarning: unorderable types: str() < int(), sort order is undefined for incomparable objects
"incomparable objects" % e, RuntimeWarning)
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1240: RuntimeWarning: unorderable types: int() < str(), sort order is undefined for incomparable objects
"incomparable objects" % e, RuntimeWarning)
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1254: RuntimeWarning: unorderable types: str() > int(), sort order is undefined for incomparable objects
"incomparable objects" % e, RuntimeWarning)
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1254: RuntimeWarning: unorderable types: int() > str(), sort order is undefined for incomparable objects
"incomparable objects" % e, RuntimeWarning)
Out[17]:
0 1 2 3
china NaN NaN NaN NaN
kids NaN NaN NaN NaN
music NaN NaN NaN NaN
yahoo NaN NaN NaN NaN
Even if I give x an index and column names, the NaNs do not go away.

Make x and blog_dat share the same index:
import pandas as pd
import numpy as np
np.random.seed(1)
blog_dat = pd.read_table("data", sep=r'\s+')
x = pd.DataFrame(np.random.randn(4*4).reshape((4, 4)),
                 index=blog_dat.index)
pd.expanding_corr(blog_dat.iloc[:, :4], x, pairwise=True)[-1, :, :]
yields
0 1 2 3
china 0.684896 0.260795 -0.990586 0.281298
kids 0.077209 -0.871448 0.702822 0.241313
music -0.203808 0.071436 0.581267 -0.783753
yahoo -0.630744 0.373339 -0.060623 0.258728
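
For readers on a newer pandas, where the top-level `pd.expanding_corr` function is no longer available, here is a minimal sketch of what I believe is the equivalent call through the `.expanding()` accessor. The blog labels and the random data are hypothetical stand-ins for the real file. The result is a MultiIndexed DataFrame, so the last block is selected by row label instead of with `[-1, :, :]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical row labels standing in for the "Blog" index of the real file.
labels = ["blog_a", "blog_b", "blog_c", "blog_d", "blog_e", "blog_f"]

# Stand-in for blog_dat.iloc[:, :4]; the values are random, not real data.
blog = pd.DataFrame(
    rng.standard_normal((6, 4)),
    index=labels,
    columns=["china", "kids", "music", "yahoo"],
)

# x shares blog's index, so every row of x pairs up with a row of blog.
x = pd.DataFrame(rng.standard_normal((6, 4)), index=blog.index)

# Expanding pairwise correlation between the columns of blog and x.
result = blog.expanding().corr(x, pairwise=True)

# The outer index level holds the row labels; the block for the last
# label contains the correlations computed over the full sample.
last_label = result.index.get_level_values(0)[-1]
final_block = result.loc[last_label]
print(final_block)

# The same values with disjoint row labels: nothing aligns, so every
# entry of the result is NaN.
x_bad = pd.DataFrame(x.to_numpy(), index=[f"row_{i}" for i in range(6)])
result_bad = blog.expanding().corr(x_bad, pairwise=True)
print(result_bad.isna().all().all())
```

Only the index of `x` was changed between the two calls; the column names of `x` play no part in the alignment.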
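Giving x arbitrary index labels is not enough; they have to match the index of blog_dat. In fact, only the index (not the column names) needs to be in sync with blog_dat.

But I can't understand why this is necessary.

Many pandas operations are index-based. When a correlation is computed, the data points of the two Series are not paired up by integer position, as NumPy would do. Instead, they are aligned by index label. If the indexes do not match, the data points miss each other entirely, the correlation is unknown, and so NaN is returned.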
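
The alignment rule described above can be seen on a single pair of Series. This is a small sketch with made-up labels and random values (the names `a`, `b_values`, and the `blog_*` / `row_*` labels are assumptions for illustration only). It compares pandas' label-based pairing with NumPy's positional pairing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A Series with string labels, like a column of blog_dat.
a = pd.Series(rng.standard_normal(5),
              index=["blog_a", "blog_b", "blog_c", "blog_d", "blog_e"])
b_values = rng.standard_normal(5)

# Same values, two different indexes.
b_mismatched = pd.Series(b_values,
                         index=["row_0", "row_1", "row_2", "row_3", "row_4"])
b_matched = pd.Series(b_values, index=a.index)

# pandas aligns on labels first. With no labels in common there is
# nothing left to correlate, so the result is NaN.
corr_mismatched = a.corr(b_mismatched)

# With a shared index every point finds its partner.
corr_matched = a.corr(b_matched)

# NumPy ignores labels and pairs values purely by position.
corr_numpy = np.corrcoef(a.to_numpy(), b_values)[0, 1]

print(corr_mismatched)
print(np.isclose(corr_matched, corr_numpy))
```

A default integer index on one side and string labels on the other are disjoint in the same way, which is the situation in the question.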