Python: how to remove empty cells from a DataFrame row-wise

Tags: python, pandas, dataframe

I have CSV data in the following format:

ab   aback  abandon  abate  Class
ab   NaN    abandon  NaN    A
NaN  aback  NaN      NaN    A
NaN  aback  abandon  NaN    B
ab   NaN    NaN      abate  C
NaN  NaN    abandon  abate  C

I want to remove the NaN cells and rearrange the data as follows:

ab       abandon  A
aback    A
aback    abandon  B
ab       abate    C
abandon  abate    C
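
The header is not needed in the processed output. I have gone through many related threads, but they all give column-wise solutions.

The original CSV file has empty cells, and when I display it as a DataFrame, every empty cell shows up as NaN. Here is the code: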

import pandas as pd

df = pd.read_csv('C:/Users/ABRAR/Google Drive/Tourism Project/Small_sample.csv', low_memory=False)
print(df) 

Output:

         ab   aback
14    access        
18    accept        
23    access        
24      able  accept
47  accepted        

Maybe I misunderstand your goal, but this is easy to do with a few lines of plain Python:

#!/usr/bin/env python

new_lines = []
with open('data.csv', 'r') as csv:
    # skip the first line
    csv.readline()
    for line in csv.readlines():
        words = line.strip().split()
        new_words = [w for w in words if w != 'NaN']
        new_lines.append(' '.join(new_words))

for line in new_lines:
    print(line)

pandas

df.dropna(how='all').apply(lambda x: pd.Series(x.dropna().values), 1).fillna('')

            0           1
14     access            
18     accept            
23     access            
24       able      accept
47   accepted            
58       able  acceptable
60     access            
69  abundance            
78    academy            
87     access            
93     accept            
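
To see what the one-liner does, here is a minimal sketch that applies the same steps one at a time. The DataFrame below is a hand-built copy of the sample table from the question, used as a stand-in for the real CSV file.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample table from the question (stand-in for the CSV).
df = pd.DataFrame({
    "ab":      ["ab", np.nan, np.nan, "ab", np.nan],
    "aback":   [np.nan, "aback", "aback", np.nan, np.nan],
    "abandon": ["abandon", np.nan, "abandon", np.nan, "abandon"],
    "abate":   [np.nan, np.nan, np.nan, "abate", "abate"],
    "Class":   ["A", "A", "B", "C", "C"],
})

# 1. Drop rows in which every cell is NaN (there are none in this sample).
non_empty = df.dropna(how="all")

# 2. For each row, keep only the non-null values and re-number them 0, 1, 2, ...
#    Rows of different lengths are padded with NaN when pandas aligns them.
compacted = non_empty.apply(lambda row: pd.Series(row.dropna().values), axis=1)

# 3. Replace the padding NaN with empty strings for display.
result = compacted.fillna("")
print(result)
```

The per-row `apply` is what makes this variant the slowest of the three in the timings, since it builds one Series per row.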

numpy

v = df.values
i, j = np.where(df.notnull().values)
split_idx = np.where(np.append(False, i[1:] != i[:-1]))[0]
pd.DataFrame(np.split(v[i, j], split_idx), pd.unique(i)).fillna('')

            0           1
14     access            
18     accept            
23     access            
24       able      accept
47   accepted            
58       able  acceptable
60     access            
69  abundance            
78    academy            
87     access            
93     accept            
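
The least obvious part here is `split_idx`. As a rough sketch, again on a hand-built copy of the sample table from the question (an assumption, not the real file), the intermediate arrays look like this: `i` lists the row position of every non-null cell in row-major order, and `split_idx` marks the places where that row position changes.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample table from the question (stand-in for the CSV).
df = pd.DataFrame({
    "ab":      ["ab", np.nan, np.nan, "ab", np.nan],
    "aback":   [np.nan, "aback", "aback", np.nan, np.nan],
    "abandon": ["abandon", np.nan, "abandon", np.nan, "abandon"],
    "abate":   [np.nan, np.nan, np.nan, "abate", "abate"],
    "Class":   ["A", "A", "B", "C", "C"],
})

v = df.values
# Row and column positions of every non-null cell, in row-major order.
i, j = np.where(df.notnull().values)

# True wherever the row position differs from the previous one;
# the leading False keeps the flags aligned with i.
row_changes = np.append(False, i[1:] != i[:-1])
split_idx = np.where(row_changes)[0]

# v[i, j] is one flat array of all non-null cells; cut it at the row boundaries.
rows = [chunk.tolist() for chunk in np.split(v[i, j], split_idx)]

print(i.tolist())
print(split_idx.tolist())
for row in rows:
    print(row)
```

Rows that are entirely NaN never appear in `i`, so they drop out without a separate `dropna` call.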

mind-numbing comprehension

pd.DataFrame(*list(map(
            list,
            zip(*[(v[m], i) for v, m, i in
                  zip(df.values, df.notnull().values, df.index)
                  if m.any()])
        ))).fillna('')

            0           1
14     access            
18     accept            
23     access            
24       able      accept
47   accepted            
58       able  acceptable
60     access            
69  abundance            
78    academy            
87     access            
93     accept            
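
If the nested `zip`/`map` is hard to follow, the same idea can be written as a plain loop. This is only an unrolled sketch of the comprehension above, not a faster variant; the small DataFrame is made up to resemble the output shown, including one fully empty row.

```python
import numpy as np
import pandas as pd

# Made-up frame resembling the output above; row 20 is entirely empty.
df = pd.DataFrame(
    {"ab": ["access", "accept", np.nan, "able"],
     "aback": [np.nan, np.nan, np.nan, "accept"]},
    index=[14, 18, 20, 24],
)

rows, labels = [], []
# Walk the values, the not-null mask and the index labels in lockstep.
for values, mask, label in zip(df.values, df.notnull().values, df.index):
    if mask.any():                          # skip rows that are entirely NaN
        rows.append(values[mask].tolist())  # keep only the non-null cells
        labels.append(label)

# Shorter rows are padded with missing values, which fillna turns into ''.
result = pd.DataFrame(rows, index=labels).fillna("")
print(result)
```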

timing

%timeit df.dropna(how='all').apply(lambda x: pd.Series(x.dropna().values), 1).fillna('')
100 loops, best of 3: 7.21 ms per loop

%%timeit
v = df.values
i, j = np.where(df.notnull().values)
split_idx = np.where(np.append(False, i[1:] != i[:-1]))[0]
pd.DataFrame(np.split(v[i, j], split_idx), pd.unique(i)).fillna('')
1000 loops, best of 3: 1.29 ms per loop

%%timeit
pd.DataFrame(*list(map(
            list,
            zip(*[(v[m], i) for v, m, i in
                  zip(df.values, df.notnull().values, df.index)
                  if m.any()])
        ))).fillna('')
1000 loops, best of 3: 1.44 ms per loop

%%timeit
d1 = df.apply(lambda x: sorted(x.values.astype(str)), axis=1).replace('nan','')
d1 = d1.drop(d1.index[d1.eq('').all(axis=1)])
d1.drop(d1.columns[d1.eq('').all()],axis=1)
10 loops, best of 3: 20.1 ms per loop

Thanks to @Perional for the suggestion above. In the end, I did the following:

import pandas as pd

new_lines = []
with open('data.csv', 'r') as csv:
    # skip the first line
    csv.readline()
    for line in csv.readlines():
        words = line.strip().split(',')
        new_words = [w for w in words if w and w.strip()]
        #skip the empty lines
        if len(new_words) != 0:
            new_lines.append(','.join(new_words))
df = pd.DataFrame(new_lines)
df.to_csv('results.csv', sep=',')
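
@Scott's solution is elegant, but for reasons I don't understand it always throws a MemoryError exception.

One more thing: I don't want the row numbers in the result file. If anyone can help with that, I'd appreciate it. For now, I removed that column using Excel :)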
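
On the row numbers: if you stay with pandas, `to_csv` accepts `index=False` (and `header=False`) to leave out the index column and the header row. Another option, sketched below under the assumption that the input really is comma-separated, is to skip pandas and stream the rows through the standard `csv` module. `compact_rows` and its two path arguments are names made up for this illustration. Because it handles one row at a time it may also sidestep the MemoryError, though that has not been checked against the original file.

```python
import csv


def compact_rows(in_path, out_path):
    """Copy in_path to out_path, dropping the header, empty cells and empty rows."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        next(reader, None)  # skip the header line, if there is one
        for row in reader:
            kept = [cell for cell in row if cell.strip()]  # drop empty cells
            if kept:                                       # drop empty rows
                writer.writerow(kept)
```

Calling `compact_rows('data.csv', 'results.csv')` would then write one compacted row per line, with no row numbers and no quoting around the joined values.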

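
The following code removes a row if it contains a certain value (in this case "Amine"):

Specifically: this creates a new DataFrame named "df" that includes all rows whose value in the "Name" column is not equal to "Amine".

To remove rows that contain NaN in a certain column, this code is useful: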
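
The snippet this answer refers to is not shown here. Going by the description, it was probably a boolean-mask filter along the following lines; treat this as a guess at the intent, with made-up sample data.

```python
import pandas as pd

# Made-up data; only the "Name" column and the value "Amine" come from the answer.
people = pd.DataFrame({"Name": ["Amine", "Sara", "Omar"], "Age": [30, 25, 41]})

# Keep every row whose "Name" is not equal to "Amine".
df = people[people["Name"] != "Amine"]
print(df)
```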

df[pd.notnull(df.Name)]
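
As a small sketch with made-up data, the same filter can also be written with `dropna` and its `subset` parameter, which reads a little more directly when several columns are involved.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing "Name".
df = pd.DataFrame({"Name": ["Sara", np.nan, "Omar"], "Age": [25, 33, 41]})

by_mask = df[pd.notnull(df.Name)]        # the form used above
by_dropna = df.dropna(subset=["Name"])   # same rows, via dropna

print(by_mask.equals(by_dropna))
```

Note that both forms drop whole rows, so they do not address the row-wise compaction asked about in the question.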

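
Comments:

Do you just want a list without the NaNs, or do you want a DataFrame? If you want a DataFrame, the second row is a problem, because it contains only two elements. All the others have three.

The output above is what is required, not a DataFrame. Any structure works for me. Don't go by the number of elements in this sample; the actual file contains a large number of elements.

The NaN values can be removed, but how does the restructuring work?

When I apply it to the file, I get output like this: ,,,,ab,,,A

I wrote it for the pattern you showed in your question. Are you using an actual comma-separated CSV? In that case, try split(','). But for a real CSV you are better off using the csv module.

Can you apply it to the file I attached in the original post? When I run it, I get an error message.

@Abrar See the edit... I broke it down because I think you are trying to compress the dataset by both rows and columns.

@Scott I have a problem: when I use the original file, I get a MemoryError.

Hmm... it may need to be done in two chunks or batches. I will see whether this can be optimized.

Let's drop all the empty rows right after read_csv: df = df[~df.isnull().all(axis=1)]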
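
The last two comments suggest working in batches and dropping empty rows early. A minimal sketch of how that might look uses the `chunksize` parameter of `pd.read_csv`. The function name `compact_in_chunks`, the chunk size and the inline sample data are all made up for illustration, and whether this cures the MemoryError on the original file is untested.

```python
import io

import pandas as pd


def compact_in_chunks(source, chunksize=1000):
    """Yield each non-empty row as a list of its non-null cells, chunk by chunk."""
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Drop fully empty rows, as suggested in the comments.
        chunk = chunk[~chunk.isnull().all(axis=1)]
        for values, mask in zip(chunk.values, chunk.notnull().values):
            yield values[mask].tolist()


# Tiny in-memory stand-in for the real file; the row ",," is entirely empty.
sample = io.StringIO("ab,aback,Class\nab,,A\n,,\n,aback,B\n")
rows = list(compact_in_chunks(sample, chunksize=2))
print(rows)
```

Only one chunk is held in memory at a time, so the peak memory use depends on `chunksize` and not on the size of the whole file.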