Python: Filtering a pandas DataFrame based on chained splits

python pandas

I have a pandas DataFrame with a column (named filenames) that holds file names. The file names look like this:

long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...

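To filter, I do the following (with, say, select_string = "0"):

df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]

But I get this: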

Traceback (most recent call last):
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python_file.py", line 118, in <module>
    main()
  File "inference.py", line 57, in main
    _=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
  File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
    logger=logger, select_string=select_string)
  File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
    df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
    return self._get_value(key)
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
    loc = self.index.get_loc(label)
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    raise KeyError(key) from err
KeyError: 0

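I suspect it does not like the way I chain the splits, but I vaguely remember doing this at some point and it worked, so I am confused about why it throws this error.

PS: I do know how to solve this with .contains, but I would like to use this approach of comparing strings.

Any pointers would be great.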

Assuming all rows contain .jpg (if they do not, change the split accordingly):

select_string = str(0)  # select_string should be of type str

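This part:

df_fp["filenames"].str.split(".jpg")[0]

returns the first row of the Series (a label lookup for index 0), not the first element of each list. If the index has no label 0, for example because rows were dropped earlier, that lookup raises KeyError: 0.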
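
To make the difference concrete, here is a minimal sketch. The frame and its index values (5, 6, 7) are hypothetical, chosen only to mimic a DataFrame whose index no longer contains the label 0; they are not taken from the question's data.

```python
import pandas as pd

# Hypothetical frame whose index does not contain the label 0
# (e.g. because earlier filtering dropped the first rows).
df = pd.DataFrame(
    {"filenames": ["long_file1_name_0.jpg",
                   "long_file2_name_1.jpg",
                   "long_file3_name_0.jpg"]},
    index=[5, 6, 7],
)

# Series of lists, one list per row.
parts = df["filenames"].str.split(".jpg")

# parts[0] looks up the index label 0; it does not take element 0 of each list.
try:
    parts[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .str[0] indexes into every row's list instead.
stems = parts.str[0]

print(lookup_failed)
print(stems.tolist())
```

Going through the .str accessor again is what keeps the operation element-wise; plain [0] on a Series always addresses rows.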

What you are looking for is expand (it creates a new column for each element of the list produced by split):

# Column 3 holds the last "_"-separated token for names like long_file1_name_0
df[df['filenames'].str.split('.jpg', expand=True)[0].str.split('_', expand=True)[3] == '0']

Alternatively, you can use apply:

df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
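
If the goal is to keep the chained-split style from the question, one possible sketch is to stay inside the .str accessor at every step. The sample frame below is an assumption built from the file names shown in the question, and the variable names last_token, matching and not_matching are made up for illustration.

```python
import pandas as pd

# Sample data modelled on the file names in the question (an assumption).
df = pd.DataFrame({"filenames": ["long_file1_name_0.jpg",
                                 "long_file2_name_1.jpg",
                                 "long_file3_name_0.jpg"]})
select_string = "0"

# Each step goes through .str, so it acts on every row's value
# rather than on the Series index.
last_token = (
    df["filenames"]
    .str.split(".jpg").str[0]   # drop the extension
    .str.split("_").str[-1]     # keep the last "_"-separated token
)

mask = last_token == select_string
matching = df[mask]         # rows whose last token equals select_string
not_matching = df[~mask]    # negate the finished mask, not the strings

print(matching)
print(not_matching)
```

Note that ~ binds more tightly than ==, so an expression of the form ~series == value applies ~ to the strings first. Building the mask and negating it afterwards, or writing ~(series == value), avoids that.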

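That said, contains is definitely more appropriate here.

Here is another approach, using .str.extract(). Given a DataFrame df with a filename column, create a boolean mask. The squeeze() method ensures we end up with a Series, so the mask will work: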
mask = (df['filename'].str.extract(r'\w+_(\d+)\.jpg')  # \. matches a literal dot
          .astype(int)
          .eq(0)
          .squeeze())

print(df.loc[mask])

                filename
0  long_file1_name_0.jpg
2  long_file3_name_0.jpg
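
A possible variant of the same idea, sketched under a few assumptions: passing expand=False makes .str.extract() return a Series directly when the pattern has a single group, so squeeze() is not needed, and comparing as strings avoids the integer cast failing on rows that do not match. The notes.txt row is a made-up example of such a row.

```python
import pandas as pd

# Sample frame; the last row is a hypothetical non-matching file name.
df = pd.DataFrame({"filename": ["long_file1_name_0.jpg",
                                "long_file2_name_1.jpg",
                                "long_file3_name_0.jpg",
                                "notes.txt"]})

# One capture group + expand=False -> a Series (NaN where nothing matched).
suffix = df["filename"].str.extract(r"_(\d+)\.jpg$", expand=False)

# Compare as strings; the non-matching row compares as False.
mask = suffix.eq("0")
result = df.loc[mask]

print(result)
```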

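Do all rows contain .jpg? Please verify. df_fp = df_fp[~(df_fp.filenames.str.rstrip(".jpg").str.split("_").str.get(-1) == select_string)]
Or, more directly: df_fp = df_fp[~df_fp.filenames.str.contains(select_string + '.jpg', regex=False)]
@RichieV Thanks for your help. May I ask what regex=False does here?
Many .str accessor methods, .str.contains among them, treat the pattern as a regex by default. From your data it looks like leaving regex on would make no difference, but if such a value were in the column, it would also match a string containing "0[any other character: 2 34&*@#]jpg" (the . in the pattern passed to the regex acts as a wildcard).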
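
The effect of regex=False can be seen in a small sketch. The second file name below is invented purely to trigger the wildcard behaviour described in the comment.

```python
import pandas as pd

# The middle entry is a made-up name with "x" where the dot would be.
names = pd.Series(["long_file1_name_0.jpg",
                   "long_file2_name_0xjpg",
                   "long_file3_name_1.jpg"])

select_string = "0"
pattern = select_string + ".jpg"

# Default: the pattern is a regex, so "." matches any character.
as_regex = names.str.contains(pattern)

# regex=False: the pattern is matched as a literal substring.
as_literal = names.str.contains(pattern, regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Either way, contains checks for a substring, so a name ending in _10.jpg would also match the pattern 0.jpg. When that matters, comparing the extracted token for equality is the stricter test.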
Sample DataFrame for the .str.extract() example:

import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg',]
                  })
