Python: splitting a DataFrame column on non-ASCII characters

Tags: python, pandas, non-ascii-characters

This is the column that contains the data along with non-ASCII characters:

Summary 1

United Kingdom - �â��Global Consumer Technology - �â��American Express 
United Kingdom - �â��VP Technology - Founder - �â��Hogarth Worldwide
Aberdeen - �â��SeniorCore Analysis Specialist - �â��COREX Group
London, - �â��ED, Equit Technology, London - �â��Morgan Stanley
United Kingdom - �â��Chief Officer, Group Technology - �â��BP

How can I split these values and save the pieces in separate columns?

The code I am using is:

import io
import pandas as pd

df = pd.read_csv("/home/vipul/Desktop/dataminer.csv", sep='\s*\+.*?-\s*')
df = df.reset_index()
df.columns = ["First Name", "Last Name", "Email", "Profile URL", "Summary 1", "Summary 2"]

df.to_csv("/home/vipul/Desktop/new.csv")

Suppose you have the column as a Series that looks like this:

s

0    United Kingdom - �â��Global Consumer Technolog...
1    United Kingdom - �â��VP Technology - Founder -...
2    Aberdeen - �â��SeniorCore Analysis Specialist ...
3    London, - �â��ED, Equit Technology, London - �...
4    United Kingdom - �â��Chief Officer, Group Tech...
Name: Summary 1, dtype: object

Option 1
Use str.split with expand=True to split on the runs of non-ASCII characters:

s.str.split(r'-\s*[^\x00-\x7f]+', expand=True)

                 0                                 1                  2
0  United Kingdom        Global Consumer Technology    American Express
1  United Kingdom           VP Technology - Founder   Hogarth Worldwide
2        Aberdeen    SeniorCore Analysis Specialist         COREX Group
3         London,      ED, Equit Technology, London      Morgan Stanley
4  United Kingdom   Chief Officer, Group Technology                  BP
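
The split keeps whatever whitespace surrounds each piece, so a value such as "United Kingdom" can come back with a trailing space. The following is a minimal, self-contained sketch that rebuilds two of the rows and strips every piece. The SEP constant, the sample Series, and the variable names are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Assumed stand-in for the garbled run in the data
# (U+FFFD, U+00E2, U+FFFD, U+FFFD), written with escapes to stay encoding-safe.
SEP = "\ufffd\u00e2\ufffd\ufffd"

s = pd.Series(
    [
        f"United Kingdom - {SEP}Global Consumer Technology - {SEP}American Express",
        f"United Kingdom - {SEP}VP Technology - Founder - {SEP}Hogarth Worldwide",
    ],
    name="Summary 1",
)

# Split on: a hyphen, optional whitespace, then one or more non-ASCII characters.
# A plain " - " between ASCII words (as in "VP Technology - Founder") is not a match.
parts = s.str.split(r"-\s*[^\x00-\x7f]+", expand=True)

# Strip the whitespace left around each piece, column by column.
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Because the pattern requires a non-ASCII character after the hyphen, "VP Technology - Founder" stays together in one column.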

Option 2
Use str.extractall followed by unstack:

s.str.extractall(r'([\x00-\x7f]+)')[0].str.rstrip('- ').unstack()

match               0                                1                  2
0      United Kingdom       Global Consumer Technology   American Express
1      United Kingdom          VP Technology - Founder  Hogarth Worldwide
2            Aberdeen   SeniorCore Analysis Specialist        COREX Group
3             London,     ED, Equit Technology, London     Morgan Stanley
4      United Kingdom  Chief Officer, Group Technology                 BP
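
To make the chain easier to follow, here is a step-by-step sketch of the same idea on two sample rows. The SEP constant, the sample Series, and the intermediate variable names are assumptions for illustration only.

```python
import pandas as pd

# Assumed stand-in for the garbled run in the data.
SEP = "\ufffd\u00e2\ufffd\ufffd"

s = pd.Series(
    [
        f"United Kingdom - {SEP}Global Consumer Technology - {SEP}American Express",
        f"United Kingdom - {SEP}VP Technology - Founder - {SEP}Hogarth Worldwide",
    ],
    name="Summary 1",
)

# Step 1: capture every run of ASCII characters.
# The result is a DataFrame indexed by (original row, match number).
matches = s.str.extractall(r"([\x00-\x7f]+)")

# Step 2: take capture group 0 and drop trailing hyphens and spaces.
# rstrip treats its argument as a set of characters, not as a pattern.
pieces = matches[0].str.rstrip("- ")

# Step 3: pivot the match number from the index into columns.
wide = pieces.unstack()

print(wide)
```

Rows with fewer matches than the widest row end up with missing values in the extra columns after unstack.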


Another approach:

a
0   United Kingdom - �â��Global Consumer Technolog...
1   United Kingdom - �â��VP Technology - Founder -...
2   Aberdeen - �â��SeniorCore Analysis Specialist ...
3   London, - �â��ED, Equit Technology, London - �...
4   United Kingdom - �â��Chief Officer, Group Tech...

Use this function to keep only the ASCII characters (those whose Unicode code point is below 128), using the built-in ord function:

def extract_ascii(x):
    string_list = filter(lambda y : ord(y) < 128, x)
    return ''.join(string_list)

The result looks like this:

                0                                1                  2                  3
0  United Kingdom       Global Consumer Technology   American Express               None
1  United Kingdom                    VP Technology            Founder  Hogarth Worldwide
2        Aberdeen   SeniorCore Analysis Specialist        COREX Group               None
3         London,     ED, Equit Technology, London     Morgan Stanley               None
4  United Kingdom  Chief Officer, Group Technology                 BP               None
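
The answer does not show the step between extract_ascii and the table. One plausible way to get a table of that shape, offered as an assumption rather than the answerer's exact code, is to apply the function to every row and then split on " - ". The SEP constant, the sample Series, and the names cleaned and result are illustrative.

```python
import pandas as pd

# Assumed stand-in for the garbled run in the data.
SEP = "\ufffd\u00e2\ufffd\ufffd"


def extract_ascii(x):
    # Keep only the characters whose code point is below 128.
    return "".join(ch for ch in x if ord(ch) < 128)


a = pd.Series(
    [
        f"United Kingdom - {SEP}Global Consumer Technology - {SEP}American Express",
        f"United Kingdom - {SEP}VP Technology - Founder - {SEP}Hogarth Worldwide",
    ]
)

# Drop the non-ASCII characters row by row, leaving " - " between the pieces.
cleaned = a.apply(extract_ascii)

# Split on the remaining " - " separators; shorter rows are padded with missing values.
result = cleaned.str.split(" - ", expand=True)

print(result)
```

Unlike the two regex options, this also splits on hyphens that were part of the original text, which is why "VP Technology - Founder" lands in two columns and the table gains a fourth column.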


Comments:

Your CSV and the code that loads it don't have much in common…

The CSV holds a huge amount of data; I have only shown one column!

@COLDSPEED split doesn't work on my system. Is there any other way?

@VipulRao Can you see my edit? Why doesn't my answer work on your machine?

@coldspeed Yes, I saw the edit, but it doesn't work on my computer. pandas is up to date and python3 is installed. Can anyone help me?

@VipulRao How am I supposed to help you? Can I sudo into your machine and write the code for you? Cmon, please try to at least figure out why it isn't working. You did the same thing with your last question.

@coldspeed Yes, I will figure it out, thank you very much.