Python: select by partial string from a DataFrame

python string pandas dataframe

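I have a DataFrame with 4 columns, 2 of which contain string values. I was wondering whether there is a way to select rows based on a partial string match against a particular column.

In other words, a function or lambda function that would do something like re.search(pattern, cell_in_question) and return a boolean. I am familiar with the syntax of df[df['A'] == "hello world"], but I can't seem to find a way to do the same with a partial string match, say "hello".

Would someone be able to point me in the right direction?

Below is what I ended up doing for partial string matches. If anyone has a more efficient way of doing this, please let me know.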

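import re
from pandas import DataFrame, concat

def stringSearchColumn_DataFrame(df, colName, regex):
    # Collect every row whose value in colName matches the regex.
    newdf = DataFrame()
    # Series.iteritems() was removed in pandas 2.0; items() replaces it.
    for idx, record in df[colName].items():
        if re.search(regex, record):
            newdf = concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf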

Based on a GitHub issue, it looks like you will soon be able to do the following:

df[df['A'].str.contains("hello")]

Update: this is available in pandas 0.8.1 and later.

Quick note: if you want to select based on a partial string contained in the index, try the following:

df['stridx'] = df.index
df[df['stridx'].str.contains("Hello|Britain")]
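As a possible shortcut, the index exposes the same .str accessor as a Series, so the helper column is not strictly required. The sketch below assumes a small made-up frame with string index labels; the data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical frame whose index holds string labels.
df = pd.DataFrame({'value': [1, 2, 3]},
                  index=['Hello world', 'Great Britain', 'other'])

# Build a boolean mask straight from the index labels.
mask = df.index.str.contains("Hello|Britain")

# Boolean indexing keeps the rows whose label matched either alternative.
selected = df[mask]
print(selected)
```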

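I tried the solution proposed above:

df[df["A"].str.contains("Hello|Britain")]

and got an error:

ValueError: cannot mask with array containing NA / NaN values

You can convert the NA values to False, like this:

df[df["A"].str.contains("Hello|Britain", na=False)]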

Suppose you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create a filter:

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option in apply to pass elements to the lambda function row by row, as opposed to column by column.
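To turn that boolean result into an actual selection, it can be used as a mask. This is a minimal sketch built on the same two-row frame; the names mask and matched are just illustrative.

```python
import pandas as pd

df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']],
                  columns=['a', 'b'])

# Row-wise test: is the value in column 'a' a substring of the value in 'b'?
mask = df.apply(lambda row: row['a'] in row['b'], axis=1)

# Boolean indexing keeps only the rows where the mask is True.
matched = df[mask]
print(matched)
```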

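In case anyone wonders how to handle the related problem, "select columns by partial string", use DataFrame.filter with its like argument.

To select rows by partial string match, pass axis=0 to filter: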

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
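For comparison, here is a small sketch of both directions of filter on a made-up frame. The column and index labels are assumptions chosen only so that some of them contain "hello".

```python
import pandas as pd

# Hypothetical frame: some column names and one index label contain "hello".
df = pd.DataFrame(
    {'hello_count': [1, 2], 'other': [3, 4], 'say_hello': [5, 6]},
    index=['hello_row', 'plain_row'],
)

# Without axis, filter works on the column labels of a DataFrame.
cols = df.filter(like='hello')

# With axis=0, the same substring test is applied to the index labels.
rows = df.filter(like='hello', axis=0)

print(cols)
print(rows)
```

Note that filter only looks at labels, never at the cell values.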

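How do I select by partial string from a pandas DataFrame? This post is meant for readers who want to

search for a substring in a string column (the simplest case)
search for multiple substrings
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")
match multiple whole words
understand the reason behind "ValueError: cannot index with vector containing NA / NaN values"

...and would like to know more about which methods should be preferred over others.

(Aside: I have seen a lot of questions on similar topics, so I thought it would be good to leave this here.)

Friendly disclaimer: this post is long.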

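Basic substring search

str.contains can be used to perform either substring searches or regex-based searches. The search defaults to regex-based unless you explicitly disable it.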

Here is an example of a regex-based search:

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes a regex search is not required, so specify regex=False to disable it:

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
   
      col
0     foo
1  foobar
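The thread never shows how df1 is built, so the following sketch assumes a four-row frame inferred from the printed outputs. It also swaps the lookahead for the simpler pattern r'foo.' ("foo" followed by at least one more character), which gives the same rows on this data.

```python
import pandas as pd

# Assumed setup, inferred from the outputs shown in this answer.
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})

# Regex search: "foo" followed by at least one more character.
regex_hits = df1[df1['col'].str.contains(r'foo.')]

# Literal substring search with the regex engine switched off.
literal_hits = df1[df1['col'].str.contains('foo', regex=False)]

print(regex_hits)
print(literal_hits)
```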

Performance-wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid regex-based search if you do not need it.

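Addressing ValueErrors

Sometimes, performing a substring search and filtering on the result will raise

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column: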

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

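Anything that is not a string cannot have string methods applied to it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data: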

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool
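Putting the two steps together, the mask produced with na=False can be used for indexing directly. This is a short sketch on the same mixed Series; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])

# na=False replaces the NaN results (missing and non-string entries)
# with False, so the mask is purely boolean and safe to index with.
mask = s.str.contains('foo|bar', na=False)
filtered = s[mask]

print(filtered)
```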

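How do I apply this to multiple columns at once?

The answer is in the question: use apply.

All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is fine in my book, as long as you do not have too many columns).

If you have a DataFrame with mixed columns and want to search only the object/string columns, restrict the operation to those columns first.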
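One way this could look, as a sketch rather than the answer's own code: the frame below is made up, and the string columns are assumed to be known in advance and listed by name.

```python
import pandas as pd

# Hypothetical frame with two string columns and one numeric column.
df = pd.DataFrame({
    'a': ['foo', 'abc', 'xyz'],
    'b': ['qux', 'foobar', 'xyz'],
    'n': [1, 2, 3],
})

# Assumption: the string columns are known up front.
string_cols = ['a', 'b']

# apply() calls the lambda once per column and returns a boolean frame.
hits = df[string_cols].apply(lambda col: col.str.contains('foo|bar', na=False))

# Keep the rows where at least one string column matched.
matched = df[hits.any(axis=1)]
print(matched)
```

Using any(axis=1) means "match in at least one column"; all(axis=1) would require a match in every column instead.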

Multiple substring search

This is most easily achieved through a regex search using the regex OR pipe (|).

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms and then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

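Sometimes it is wise to escape your terms in case they contain characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters

. ^ $ * + ? { } [ ] \ | ( )

then you will need to escape them with re.escape, which escapes the special characters so that they are treated literally: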

re.escape(r'.foo^')
# '\\.foo\\^'
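As a rough illustration of escaping combined with the joined pattern, the sketch below uses made-up data whose search terms contain parentheses and a dot. The frame and the terms are assumptions for the example.

```python
import re
import pandas as pd

# Hypothetical data where the search terms contain regex metacharacters.
df = pd.DataFrame({'col': ['price (usd)', 'a.b', 'axb', 'plain']})
terms = ['(usd)', 'a.b']

# Escape each term so "(", ")" and "." are matched literally,
# then join the escaped terms with the regex OR pipe.
pattern = '|'.join(map(re.escape, terms))
matched = df[df['col'].str.contains(pattern)]

print(pattern)
print(matched)
```

Without the escaping, the unescaped "a.b" would also match "axb", because a bare dot matches any character.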

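Matching entire words

By default, the substring search looks for the specified substring or pattern regardless of whether it is a full word. To match only full words, we need to use regular expressions here; in particular, the pattern needs to specify word boundaries (\b).

For example,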

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

versus the same search with word boundaries (\b) around "blue", which matches only the first row.
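The whole-word variant of this search is not shown in the thread; a minimal sketch of it, assuming the same df3, might look like this.

```python
import pandas as pd

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})

# \b anchors the pattern at word boundaries, so "bluejay" no longer matches.
whole_word = df3[df3['col'].str.contains(r'\bblue\b')]
print(whole_word)
```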

Multiple whole word search

Similar to the above, except that we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

where p looks like this:

p
# '\\b(?:foo|baz)\\b'

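A great alternative: use list comprehensions! Because you can! They are usually a little faster than string methods, because string methods are hard to vectorize and usually have loopy implementations.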

Instead of

df1[df1['col'].str.contains('foo', regex=False)]

use the in operator inside a list comprehension:

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

use re.compile (to cache the regex) plus the compiled pattern's search method inside a list comprehension:

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If "col" has NaNs, then str.contains needs na=False:

df1[df1['col'].str.contains(regex_pattern, na=False)]

A list comprehension needs an equivalent guard, because p.search raises a TypeError when it is given a non-string value.
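One possible guard is sketched below. The helper name try_search and the sample frame are assumptions for illustration; the idea is simply to treat anything that is not a string as "no match".

```python
import re
import numpy as np
import pandas as pd

# Hypothetical column containing a missing value.
df = pd.DataFrame({'col': ['foo', 'foobar', np.nan, 'bar']})

# "foo" not immediately followed by the end of the string.
p = re.compile(r'foo(?!$)')

def try_search(pattern, value):
    # Hypothetical helper: non-string values (such as NaN) count as no match.
    try:
        return bool(pattern.search(value))
    except TypeError:
        return False

matched = df[[try_search(p, x) for x in df['col']]]
print(matched)
```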

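More options for partial string matching

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find

Supports substring searches only (read: no regex).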
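The thread gives no code for this option, so the following is only a sketch of how np.char.find could be used, assuming the df4 frame from earlier in this answer.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})

# Convert the column to a NumPy string array, then look up the substring.
# np.char.find returns the lowest index of the match, or -1 when absent.
positions = np.char.find(df4['col'].to_numpy().astype(str), 'foo')

# Rows with a non-negative position contain the substring.
matched = df4[positions > -1]
print(matched)
```

Be aware that astype(str) turns missing values into the text "nan", so columns with NaNs may need cleaning first.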
np.vectorize

This is a wrapper around a loop, but with less overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions are possible as well:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query

Supports string methods through the python engine:

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

df[df['A'].str.find("hello") != -1]

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

df[df['A'].str.contains("hello", case=False)]

df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

searchfor = '.*cat.*hat.*|.*the.*dog.*'

df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)

    col1             col2           TrueFalse
0   cat andhat       1000.0         True
1   hat              2000000.0      False
2   the small dog    1000.0         True
3   fog              330000.0       False
4   pet              330000.0       False

mask = df['ENTITY'].str.contains('DM')

df = df.loc[~(mask)].copy(deep=True)

df[df['A'].astype(str).str.contains("Hello|Britain")]