Python Pandas: What's the fastest way to find whitespace and unusual characters in column values?

python pandas jupyter-notebook

In Pandas, I have a column, colone, whose cells originally contain comma-separated values:

['a, b, e, g, o', 'a, b, d', 'a, b, c, f, g', 'a, b, c, f', 'a, c, e', 'a, b, c, o', 'b, c, h, n', 'a, b, c, g, o', 'a, b, c, f', 'a, b, c, g, h, o', 'b', 'a, b, f, m', 'a, b, c, g, h', 'a, b, d, f, g', 'a, c, n', 'j', 'b, c, f', 'a, b, g, l', 'b', 'b', 'a, b, d, e ', 'a, b, c', 'a, b, e, g', 'a, b, c, d, f, g', 'd, k, l', 'a, b, c, f, g ', 'a, b, c, f', 'a, b, c, d,  g', 'b, d, e', 'b, d', 'a', 'b, o', 'c, o', 'b, c, o', 'c', 'a, g, i', 'b, c, n', 'a, b', 'b, c, o, n', 'b, c, h', 'a, b, c, f, g, h', 'a, b, c, d', 'a, b, d', 'a, e, g', 'a, b, c, e, g, k, m', 'b, c, o', 'a, b, f, k', 'd, l', 'a, b, l', 'a, b, c', 'a', 'c, d, g, l', 'b, d, e, o', 'b, d', 'a, b, c, d, e, f, o', 'b', 'a, b, c, f', 'b, c, g', 'b, c, g, k', 'a', 'c', 'b, c, o', 'b, c, n, o']

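I used

str.split(', ').explode().value_counts().reset_index()

to get the count of each individual letter. But in the resulting table some letters appear twice, probably because some strings contain trailing spaces. Unfortunately, these are not visible in the Jupyter notebook rendering of the result table, since they are just whitespace.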
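To make the symptom concrete, here is a minimal sketch that reproduces it on a tiny, made-up DataFrame. The three sample rows are hypothetical and are not taken from the real data. Wrapping each index value in repr() is one way to make the otherwise invisible blanks show up in the output.

```python
import pandas as pd

# Hypothetical sample data with one trailing space and one doubled space
df = pd.DataFrame({'colone': ['a, b, g', 'a, g ', 'b,  g']})

# Same chain as in the question, without reset_index()
counts = df['colone'].str.split(', ').explode().value_counts()

# repr() wraps each label in quotes, so 'g', 'g ' and ' g' can be told apart
labels = [repr(label) for label in counts.index]
print(labels)
```

In this sample, g is counted under three different labels, which is the same kind of duplication seen in the result table.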

Using

col_one_list = df["letter"].tolist()
print(col_one_list)

gave me a list of all counted values. In this list I was able to spot a trailing space ('g '). But how can I do this better?

['b', 'a', 'c', 'g', 'd', 'f', 'o', 'e', 'n', 'h', 'l', 'k', 'm', 'j', 'g ', ' g', 'e ', 'i']

You can replace the whitespace with '' and then continue with split-explode-value_counts, or you can use get_dummies:

s.str.replace(r'\s+', '', regex=True).str.get_dummies(',').sum()

Output:

a    36
b    49
c    35
d    15
e     9
f    13
g    18
h     5
i     1
j     1
k     4
l     5
m     2
n     5
o    13
dtype: int64
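The first option mentioned in this answer (strip the whitespace, then continue with split, explode and value_counts) could look roughly like the sketch below. The four-row series s is a made-up stand-in for the real column, so the counts differ from the output shown above.

```python
import pandas as pd

# Hypothetical stand-in for the real column
s = pd.Series(['a, b, e, g, o', 'a, b, d, e ', 'a, b, c, d,  g', 'b'])

counts = (
    s.str.replace(r'\s+', '', regex=True)  # remove every run of whitespace
     .str.split(',')                       # split on the bare comma
     .explode()                            # one row per letter
     .value_counts()
)
print(counts)
```

Because all whitespace is removed before splitting, 'e ' and ' g' end up under e and g rather than as separate entries.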

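I would look at the exploded series and check which values end with a space:

import pandas as pd

letter_series = pd.Series(['b', 'a', 'c', 'g', 'd', 'f', 'o', 'e', 'n', 'h', 'l', 'k', 'm', 'j', 'g ', ' g', 'e ', 'i'])
letter_series.str.endswith(' ')

Or check which values are longer than one character:

letter_series.str.len()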
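Both checks return a value for every row, so in practice it may help to use them as filters. The sketch below is one possible way to do that on a shortened copy of the list; the variable names padded and too_long are illustrative only.

```python
import pandas as pd

# Shortened copy of the list from the question
letter_series = pd.Series(['b', 'a', 'g', 'g ', ' g', 'e ', 'i'])

# Values that change when stripped have leading or trailing whitespace
padded = letter_series[letter_series != letter_series.str.strip()]

# Values longer than one character (assumes every clean value is one letter)
too_long = letter_series[letter_series.str.len() > 1]

# repr() adds quotes so the blanks are visible when printed
print(padded.map(repr).tolist())
```

Comparing against str.strip() also catches leading blanks such as ' g', which str.endswith(' ') alone would miss.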

You can use str.findall(r'\w') instead of str.split.

I like the second approach, because it would catch any other invisible characters, if present.
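As a rough sketch of the str.findall suggestion, the chain below extracts the word characters directly instead of splitting on a separator. The series s is again a made-up sample rather than the real column.

```python
import pandas as pd

# Hypothetical stand-in for the real column
s = pd.Series(['a, b, e, g, o', 'a, b, d, e ', 'a, b, c, d,  g', 'b'])

# findall returns a list of single word characters per cell;
# commas and whitespace never match \w, so they are dropped
counts = s.str.findall(r'\w').explode().value_counts()
print(counts)
```

Note that r'\w' matches one character at a time, so this fits single-letter values; for longer tokens a pattern such as r'\w+' would be the closer equivalent.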