Python: splitting a DataFrame column on non-ASCII characters
python, pandas, non-ascii-characters
This is the column that contains the data together with the non-ASCII characters:
Summary 1
United Kingdom - ��Global Consumer Technology - ��American Express
United Kingdom - ��VP Technology - Founder - ��Hogarth Worldwide
Aberdeen - ��SeniorCore Analysis Specialist - ��COREX Group
London, - ��ED, Equit Technology, London - ��Morgan Stanley
United Kingdom - ��Chief Officer, Group Technology - ��BP
How can I split these values and save the parts in separate columns?
The code I am using is:
import io
import pandas as pd
df = pd.read_csv("/home/vipul/Desktop/dataminer.csv", sep='\s*\+.*?-\s*')
df = df.reset_index()
df.columns = ["First Name", "Last Name", "Email", "Profile URL", "Summary 1", "Summary 2"]
df.to_csv("/home/vipul/Desktop/new.csv")
Say you have the column as a Series, like this:
s
0 United Kingdom - ��Global Consumer Technolog...
1 United Kingdom - ��VP Technology - Founder -...
2 Aberdeen - ��SeniorCore Analysis Specialist ...
3 London, - ��ED, Equit Technology, London - �...
4 United Kingdom - ��Chief Officer, Group Tech...
Name: Summary 1, dtype: object
Option 1: str.split with expand=True, splitting on each hyphen that is followed by non-ASCII characters:
s.str.split(r'-\s*[^\x00-\x7f]+', expand=True)
0 1 2
0 United Kingdom Global Consumer Technology American Express
1 United Kingdom VP Technology - Founder Hogarth Worldwide
2 Aberdeen SeniorCore Analysis Specialist COREX Group
3 London, ED, Equit Technology, London Morgan Stanley
4 United Kingdom Chief Officer, Group Technology BP
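For anyone who wants to reproduce Option 1 without the original CSV, here is a minimal, self-contained sketch. The sample strings are an assumption: they mimic the question's rows, with U+FFFD standing in for whatever non-ASCII characters the real file contains. The pattern also adds a leading \s*, which is not in the answer above, so that the space before each hyphen is trimmed as well.

```python
import pandas as pd

# Hypothetical sample rows modelled on the question's data;
# "\ufffd" is a stand-in for the real non-ASCII characters.
s = pd.Series(
    [
        "United Kingdom - \ufffd\ufffdGlobal Consumer Technology - \ufffd\ufffdAmerican Express",
        "United Kingdom - \ufffd\ufffdVP Technology - Founder - \ufffd\ufffdHogarth Worldwide",
        "Aberdeen - \ufffd\ufffdSeniorCore Analysis Specialist - \ufffd\ufffdCOREX Group",
    ],
    name="Summary 1",
)

# Separator: optional spaces, a hyphen, optional spaces,
# then one or more characters outside the ASCII range.
parts = s.str.split(r"\s*-\s*[^\x00-\x7f]+", expand=True)
print(parts)
```

A plain hyphen followed by ASCII text, as in "VP Technology - Founder", is not treated as a separator, so that field stays in one piece.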
Option 2: str.extractall + unstack:
s.str.extractall(r'([\x00-\x7f]+)')[0].str.rstrip('- ').unstack()
match 0 1 2
0 United Kingdom Global Consumer Technology American Express
1 United Kingdom VP Technology - Founder Hogarth Worldwide
2 Aberdeen SeniorCore Analysis Specialist COREX Group
3 London, ED, Equit Technology, London Morgan Stanley
4 United Kingdom Chief Officer, Group Technology BP
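To see what the capture group in Option 2 picks up, the same pattern can be tried on a single string with the standard re module, outside pandas. This is only a small sketch; the sample string is made up, with U+FFFD as a stand-in for the real non-ASCII characters.

```python
import re

# Hypothetical sample row; "\ufffd" stands in for the real non-ASCII characters.
row = "United Kingdom - \ufffd\ufffdVP Technology - Founder - \ufffd\ufffdHogarth Worldwide"

# Each match is a maximal run of ASCII characters,
# so the non-ASCII characters act as the separators.
runs = re.findall(r"[\x00-\x7f]+", row)

# Remove the trailing hyphen and spaces left at the end of each run,
# which is what .str.rstrip('- ') does in Option 2.
fields = [run.rstrip("- ") for run in runs]
print(fields)
```

In pandas, extractall returns one row per match, indexed by the original row label and the match number, and unstack then pivots the match number into columns.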
Another approach:
a
0 United Kingdom - ��Global Consumer Technolog...
1 United Kingdom - ��VP Technology - Founder -...
2 Aberdeen - ��SeniorCore Analysis Specialist ...
3 London, - ��ED, Equit Technology, London - �...
4 United Kingdom - ��Chief Officer, Group Tech...
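Use this function, which relies only on built-in functions, to extract the ASCII characters (those whose Unicode code point is below 128):
def extract_ascii(x):
    # Keep only the characters whose code point is in the ASCII range (< 128).
    string_list = filter(lambda y: ord(y) < 128, x)
    return ''.join(string_list)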
The result is as follows:
0 1 2 3
0 United Kingdom Global Consumer Technology American Express None
1 United Kingdom VP Technology Founder Hogarth Worldwide
2 Aberdeen SeniorCore Analysis Specialist COREX Group None
3 London, ED, Equit Technology, London Morgan Stanley None
4 United Kingdom Chief Officer, Group Technology BP None
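The answer shows the function and the result but not the step in between. Judging by the four-column output, the cleaned strings were presumably split on the hyphen afterwards. That step is an assumption, sketched here in plain Python on made-up sample rows (U+FFFD stands in for the real non-ASCII characters):

```python
def extract_ascii(x):
    # Keep only the characters whose code point is below 128.
    return "".join(ch for ch in x if ord(ch) < 128)


# Hypothetical sample rows modelled on the question's data.
rows = [
    "United Kingdom - \ufffd\ufffdGlobal Consumer Technology - \ufffd\ufffdAmerican Express",
    "United Kingdom - \ufffd\ufffdVP Technology - Founder - \ufffd\ufffdHogarth Worldwide",
]

# Assumed follow-up step: split the cleaned string on every hyphen
# and trim the surrounding spaces from each piece.
table = [[part.strip() for part in extract_ascii(row).split("-")] for row in rows]

for fields in table:
    print(fields)
```

On a Series, the equivalent would be something like a.apply(extract_ascii).str.split('-', expand=True), which pads the shorter rows with None. Because every hyphen is now a separator, "VP Technology - Founder" ends up in two columns, unlike in the regex-based options.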
Your CSV and the code that loads it don't have much in common…
The CSV contains a huge amount of data; I have only shown one column!
@coldspeed split does not work on my system. Is there any other way?
@VipulRao Can you see my edit? Why doesn't my answer work on your machine?
@coldspeed Yes, I saw the edit, but it does not work on my PC. pandas is up to date and python3 is installed as well. Can anyone help me?
@VipulRao How am I supposed to help you? Can I sudo into your machine and write the code for you? C'mon, please at least try to figure out why it doesn't work. You did the same thing on your last question.
@coldspeed Yes, I will figure it out. Thank you very much.