Python 3.x: Get the length of columns in a DataFrame without a header

python-3.x pandas


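I have a beginner question.

I have a pandas DataFrame whose source is a comma-separated CSV file. The file has no header.

I need to know how many columns each row has, and then I need to drop the rows whose length is above some value, for example 5.

What I have: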

1,2,3,4,5,6

1,2,3

9,6,8

1,2,3,5,6

Expected output:

1,2,3

9,6,8

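I searched through existing questions and answers, but as far as I can tell they always filter by column name. Since my file has no header and the number of columns differs from row to row, I don't know how to do this.

Can you help?

Thanks in advance.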

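If you pass the argument header=None to pandas.read_csv(), the column names are integers indexed from 0. So, if you have the following "file.csv":

1,2,3,4,5,6
1,2,3
9,6,8
1,2,3,5,6

you can read it into a DataFrame with the following code: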

import pandas as pd

df = pd.read_csv("file.csv", header=None, dtype="Int64")

If you run print(df), the result will be:

   0  1  2     3     4     5
0  1  2  3     4     5     6
1  1  2  3  <NA>  <NA>  <NA>
2  9  6  8  <NA>  <NA>  <NA>
3  1  2  3     5     6  <NA>
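The question also asks for the length of each row. Assuming the frame was read as above, so that missing trailing fields show up as missing values, a row's original field count can be sketched as the number of non-missing entries in that row. The sample text and the variable names below are illustrative, and an empty field in the middle of a line would not be counted.

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for file.csv (same rows as in the question).
sample = "1,2,3,4,5,6\n1,2,3\n9,6,8\n1,2,3,5,6\n"
df = pd.read_csv(StringIO(sample), header=None, dtype="Int64")

# Count the non-missing entries in each row.
row_lengths = df.notna().sum(axis=1)
print(row_lengths.tolist())
```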

Now, if you want to drop every row that has five or more non-missing values, use the following code:

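# walk over the rows and drop every row with five or more non-missing values
for index, row in df.iterrows():
    if sum(row.notnull()) >= 5:
        df.drop(index, inplace=True)

# drop the columns that are now completely empty
df.dropna(axis=1, how="all", inplace=True)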

If you now run print(df), the new result will be:

   0  1  2
1  1  2  3
2  9  6  8
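The same filter can probably be written without an explicit loop. The sketch below is an alternative formulation, not the answer's own code. It assumes the same sample rows (held in an in-memory string for illustration) and builds a boolean mask from the per-row count of non-missing values.

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for file.csv (same rows as in the question).
sample = "1,2,3,4,5,6\n1,2,3\n9,6,8\n1,2,3,5,6\n"
df = pd.read_csv(StringIO(sample), header=None, dtype="Int64")

# Keep only the rows with fewer than five non-missing values.
short = df[df.notna().sum(axis=1) < 5]
# Drop the columns that are now entirely missing.
short = short.dropna(axis=1, how="all")
print(short)
```

Because nothing is modified in place, the original df stays available for comparison.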

Now, if you want to overwrite file.csv with the longer rows removed, just do the following:

df.to_csv("file.csv", header=False, index=False)
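To see what to_csv writes without touching a real file, the sketch below sends the output to an in-memory buffer instead. The buffer, the sample data and the helper to_lines are stand-ins introduced for illustration. It also suggests why dropping the all-missing columns first matters, since otherwise each line would carry trailing empty fields.

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for file.csv (same rows as in the question).
sample = "1,2,3,4,5,6\n1,2,3\n9,6,8\n1,2,3,5,6\n"
df = pd.read_csv(StringIO(sample), header=None, dtype="Int64")
df = df[df.notna().sum(axis=1) < 5]


def to_lines(frame):
    """Hypothetical helper: write the frame as CSV text and return its lines."""
    buffer = StringIO()
    frame.to_csv(buffer, header=False, index=False)
    return buffer.getvalue().splitlines()


with_empty_columns = to_lines(df)
cleaned = to_lines(df.dropna(axis=1, how="all"))
print(with_empty_columns)
print(cleaned)
```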

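I think there are three possible ways to do this:

1. Read the file twice (the first pass counts the fields, the second pass reads the data in using skiprows)
2. Read it into memory, filter out the invalid lines, then parse the result using StringIO
3. Read it with all columns (or the number of desired columns + 1), then only keep the rows whose surplus columns contain nothing but NaN

The examples below use the variable len_threshold, which should be set to the number of columns a row is allowed to have, and your_file_name, which should contain the name of the CSV text file.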

Method 1: read the file twice

For convenience, you can do this with pandas. Like this:

import pandas as pd

# first pass: count the fields on every line (separators + 1)
# (plain Python is used here because recent pandas versions reject sep='\n')
with open(your_file_name) as f:
    field_counts = pd.Series([line.count(',') + 1 for line in f])
# line numbers of all rows that have more than len_threshold fields
rows_to_skip = field_counts[field_counts > len_threshold].index.to_list()
# second pass: parse the file and skip the rows that are too long
df = pd.read_csv(your_file_name, names=list(range(len_threshold)), index_col=False, skiprows=rows_to_skip)

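Note that to apply this method you should make sure the fields themselves do not contain the separator, because it does not check whether a comma sits inside quoted text.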
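For Method 2 below (read the lines into memory, filter them, then parse the filtered text via StringIO), a minimal sketch could look like this. It is an assumption about how that approach might be written, not code from the answer: the function name read_short_rows and the sample text are hypothetical, and it takes the CSV text directly so the example is self-contained. It uses the standard csv module to split each line, so commas inside quoted fields are not miscounted.

```python
import csv
from io import StringIO

import pandas as pd


def read_short_rows(text, len_threshold):
    """Parse CSV text, keeping only lines with at most len_threshold fields."""
    kept = StringIO()
    writer = csv.writer(kept)
    for fields in csv.reader(StringIO(text)):
        # Skip blank lines and lines with too many fields.
        if 0 < len(fields) <= len_threshold:
            writer.writerow(fields)
    kept.seek(0)
    return pd.read_csv(kept, header=None, names=list(range(len_threshold)))


sample = "1,2,3,4,5,6\n1,2,3\n9,6,8\n1,2,3,5,6\n"
result = read_short_rows(sample, len_threshold=3)
print(result)
```

For a real file, the text could be obtained with something like open(your_file_name).read() before calling the function.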

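Method 2: read into memory / a variable

Read the data line by line.

Method 3: read the DataFrame, then filter out the incorrect rows

This is the simplest method.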

import pandas as pd

# read the whole dataframe with all columns
# (this relies on the first line having at least as many fields as any later line)
df = pd.read_csv(your_file_name, header=None, index_col=False)
# define an indexer that considers a row to be good when
# it has nothing but `NaN` in the excess columns
if len(df.columns) > len_threshold:
    good_rows = df.iloc[:, len_threshold:].isna().all(axis='columns')
    # drop the rows that are too long
    df.drop(df[~good_rows].index, inplace=True)
    # drop the excess columns, which are now empty
    df.drop(df.columns[len_threshold:], axis='columns', inplace=True)
Note that Method 3 also accepts rows with surplus field separators, as long as the surplus fields are empty. In the version above, it also accepts rows with fewer than 3 columns. If, for example, the third column always contains something in valid rows, it is easy to exclude the rows that are too short. You only need to change the "good_rows" line to:

    good_rows = df.iloc[:, len_threshold:].isna().all(axis='columns') & ~df.iloc[:, 2].isna()

Thank you for the multiple answers! I used Method 1 (read the file twice) and it works great!