Ignoring empty DataFrames when reading a file with pandas in Python
I have a txt file like this:
`Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
0 1 2 \
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
3 4
46 NaN NaN
47 NaN NaN
48 NaN NaN
49 NaN NaN
50 NaN NaN`
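I want to read it into an array with 3 columns. Currently I am trying pd.read_csv(self.filename, delim_whitespace=True), but this gives me a lot of errors when it reaches the empty DataFrames. How can I make the program ignore these parts?
Edit:
The best solution would be for my file not to contain empty DataFrames at all. The file is the result of searching through many files, some of which are empty. I thought I had filtered out the empty files by catching an exception, so that the result of searching an empty file would not be stored in the results. I guess I did it wrong. Can someone correct me?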
import pandas as pd
from numpy import mean as nm

def find_same_direction_chain(self, results):
    # split each 'a/b/c' entry into its own columns
    separation = lambda x: pd.Series([i for i in x.split('/')])
    left_chain = self.data[0].apply(separation)
    right_chain = self.data[1].apply(separation)
    i = 1
    try:
        while i < len(self.data[:]) - 5:
            if nm(left_chain[2][i:i+3]) >= nm(left_chain[2][i+2:i+5]) and nm(right_chain[2][i:i+3]) >= nm(right_chain[2][i+2:i+5]) and len(self.data[:]) > 0:
                if nm(left_chain[2][i+2:i+5]) >= nm(left_chain[2][i+4:i+7]) and nm(right_chain[2][i+2:i+5]) >= nm(right_chain[2][i+4:i+7]):
                    results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))
                else:
                    pass
            i += 1
    except ValueError:
        # files that could not be searched
        results.bin.append(self.filename)
    except TypeError:
        results.data_structure_error.append(self.filename)
You can use:
import pandas as pd
import io
temp=u"""Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
0 1 2 \
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
3 4
46 NaN NaN
47 NaN NaN
48 NaN NaN
49 NaN NaN
50 NaN NaN"""
Or a solution using the skiprows parameter:
Edit:
You can try changing (I have no sample data, so this is untested):
to:
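I don't think I can use skiprows, because the empty DataFrame sections are placed irregularly in my file.
OK, then try the first solution, which does not use skiprows. But it would be better to filter out the empty DataFrames before writing to the file, e.g. print [df for df in dfs if len(df) > 0] (dfs is a list of DataFrames).
That may be what I need. However, when the condition is met for some elements of a DataFrame, I save them to a list, like: results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5])), and then I save this list to a file with: with open("chains.txt", "a+") as f: f.write("\n".join(self.results.chains)). So I wonder why there are empty DataFrames in my file. How did they get there?
I think it is really hard to help you, because the code is incomplete and test data is missing. But if the result is a list of DataFrames, try adding a check for an empty df before appending: if len(self.data[0:3][i:i+5]) > 0: results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5])). Without data, though, I cannot test it.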
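The suggestion above to filter empty DataFrames before writing can be sketched as follows. Converting an empty DataFrame to a string yields exactly the "Empty DataFrame / Columns / Index" lines seen in the file, so dropping empty frames first keeps them out of the output. The helper name non_empty_frames and the sample dfs list are hypothetical, made up for illustration, since the real search results are not available.

```python
import pandas as pd


def non_empty_frames(dfs):
    """Return only the DataFrames that contain at least one row."""
    return [df for df in dfs if not df.empty]


# Hypothetical search results: two empty frames and one frame with data.
dfs = [
    pd.DataFrame(columns=range(5)),
    pd.DataFrame(columns=range(5)),
    pd.DataFrame({0: ["RNA/4v6p.csv,46AA/U/551"],
                  1: ["RNA/4v6p.csv,46AA/A/33"]}),
]

kept = non_empty_frames(dfs)

# Only the frames in `kept` are turned into text, so no
# "Empty DataFrame" block can end up in the written file.
text = "\n".join(df.to_string() for df in kept)
```

The same check (not df.empty) could be applied to a single slice right before it is appended to the results list.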
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=r'\s+', names=range(7))
# remove rows with NaN in columns 0 - 3
df = df.dropna(subset=[0,1,2,3])
# remove rows where the first column contains the text 'Columns'
df = df[~df.iloc[:,0].str.contains('Columns')]
# the first remaining row is the header fragment '0 1 2' fused with row 46
# (the backslash-newline inside temp joins the two lines), so shift it left by 3;
# in a real file the lines are not joined, so this step may need adjusting
df.iloc[0,:] = df.iloc[0,:].shift(-3)
# set the first column as the index
df = df.set_index(df.iloc[:,0])
# remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print(df)
1 2 3
0
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=r'\s+', names=range(7), skiprows=6)
# remove rows with NaN in columns 0 - 3
df = df.dropna(subset=[0,1,2,3])
# the first remaining row is the header fragment '0 1 2' fused with row 46
# (the backslash-newline inside temp joins the two lines), so shift it left by 3
df.iloc[0,:] = df.iloc[0,:].shift(-3)
# set the first column as the index
df = df.set_index(df.iloc[:,0])
# remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print(df)
1 2 3
0
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
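If the empty-DataFrame blocks appear at irregular positions, a fixed skiprows value does not help. One possible approach is to drop those lines from the text before handing it to read_csv. This is only a minimal sketch and has not been run on the real file: the helper drop_empty_frame_lines and the shortened sample in raw are assumptions, and the sample leaves out the wrapped header lines ("0 1 2 \", "3 4") and the NaN columns, which would still need the clean-up steps shown above.

```python
import io

import pandas as pd


def drop_empty_frame_lines(lines):
    """Yield only lines that are not part of an 'Empty DataFrame' printout."""
    skip_prefixes = ("Empty DataFrame", "Columns:", "Index:")
    for line in lines:
        if not line.strip().startswith(skip_prefixes):
            yield line


# Hypothetical, simplified file content with irregularly placed empty blocks.
raw = """Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
"""

# For a real file, iterate over open(filename) instead of io.StringIO(raw).
cleaned = "".join(drop_empty_frame_lines(io.StringIO(raw)))

# Split on whitespace; the first field becomes the index, leaving 3 columns.
df = pd.read_csv(io.StringIO(cleaned), sep=r"\s+", header=None, index_col=0)
```

Filtering by line content rather than by position is what makes this independent of where the empty blocks occur.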
results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))
if len(self.data[0:3][i:i+5]) > 0:
    results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))