Python: creating space between newline characters \n
Tags: python, regex, pandas, text, replace
Background
I have the following df, which contains a Text column that has been tokenized with the nltk SpaceTokenizer in order to preserve \n
import pandas as pd

# Each row is a list of space-delimited tokens; '\n' can be embedded inside a token
text = [
    ['\n[PROV', 'REPORT]\nPerson', 'Name:', '\n', 'John', 'Dear\nProgram', 'Date:', '1/11/2000', '10:42', 'AM\nMR'],
    ['\nToday', 'Name:', '\n', 'James', 'Jay\nProgram', 'Date:', '3/11/2000', '1:45', 'PM\nmissing'],
    ['\n[NEWS', 'REPORT]\nPerson', 'Name:', '\n', 'Jane', 'Doe\nProgram', 'Date:', '3/11/2000', '1:45', 'PM\nMR'],
    ['\n[PROV', 'REPORT]\nPerson', 'Name:', '\n', 'Amy', 'Army\nProgram', 'Date:', '10/1/2000', '11:45', 'AM\nMR'],
]
df = pd.DataFrame({'Text': text,
                   'ID': [1, 2, 3, 4],
                   'P_ID': ['A', 'B', 'C', 'D'],
                   })
df
ID P_ID Text
0 1 A [\n[PROV, REPORT]\nPerson, Name:, \n, John, Dear\nProgram, Date:, 1/11/2000, 10:42, AM\nMR]
1 2 B [\nToday, Name:, \n, James, Jay\nProgram, Date:, 3/11/2000, 1:45, PM\nmissing]
2 3 C [\n[NEWS, REPORT]\nPerson, Name:, \n, Jane, Doe\nProgram, Date:, 3/11/2000, 1:45, PM\nMR]
3 4 D [\n[PROV, REPORT]\nPerson, Name:, \n, Amy, Army\nProgram, Date:, 10/1/2000, 11:45, AM\nMR]
Output
Using the following code
df['Text'].values
gives the following output
array([ list(['\n[PROV', 'REPORT]\nPerson', 'Name:', '\n', 'John', 'Dear\nProgram', 'Date:', '1/11/2000', '10:42', 'AM\nMR']),
list(['\nToday', 'Name:', '\n', 'James', 'Jay\nProgram', 'Date:', '3/11/2000', '1:45', 'PM\nmissing']),
list(['\n[NEWS', 'REPORT]\nPerson', 'Name:', '\n', 'Jane', 'Doe\nProgram', 'Date:', '3/11/2000', '1:45', 'PM\nMR']),
list(['\n[PROV', 'REPORT]\nPerson', 'Name:', '\n', 'Amy', 'Army\nProgram', 'Date:', '10/1/2000', '11:45', 'AM\nMR'])], dtype=object)
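Goal
1) Separate out \n, so that each \n becomes its own token (e.g. \n[PROV becomes \n and [PROV; REPORT]\nPerson becomes REPORT], \n and Person; Doe\nProgram becomes Doe, \n and Program; etc.)
2) Create a new column
Tried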
df['New_Text'] = df['Text'].replace(r'\n', ' \n ', regex=True)
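A note on why this attempt probably does not help, plus a minimal sketch of one alternative. Regex replacement in pandas works on string values, and each cell here is a list, so the call most likely leaves the lists untouched. Even on plain strings, padding \n with spaces would not by itself split a token in two. The sketch below splits each token with re.split and a capturing group so that the \n separators are kept. The helper name split_newlines is hypothetical, not something from the question.

```python
import re

def split_newlines(tokens):
    # Split every token on '\n' and keep each '\n' as a separate token.
    out = []
    for tok in tokens:
        # The capturing group makes re.split return the separators too.
        parts = re.split(r'(\n)', tok)
        # re.split yields empty strings at token edges; drop them.
        out.extend(p for p in parts if p)
    return out

# First row of the Text column from the question
row = ['\n[PROV', 'REPORT]\nPerson', 'Name:', '\n', 'John',
       'Dear\nProgram', 'Date:', '1/11/2000', '10:42', 'AM\nMR']
new_row = split_newlines(row)
print(new_row)
```

With the df above, something like `df['New_Text'] = df['Text'].apply(split_newlines)` should then create the new column while leaving Text unchanged.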
Desired output
Using the following code
df['New_Text'].values
I would like the following output
array([list(['\n', '[PROV', 'REPORT]', '\n' ,'Person', 'Name:', '\n', 'John', 'Dear', '\n', 'Program', 'Date:', '1/11/2000', '10:42', 'AM', '\n', 'MR']),
list(['\n', 'Today', 'Name:', '\n', 'James', 'Jay', '\n', 'Program', 'Date:', '3/11/2000', '1:45', 'PM','\n', 'missing']),
list(['\n', '[NEWS', 'REPORT]','\n', 'Person', 'Name:', '\n', 'Jane', 'Doe', '\n', 'Program', 'Date:', '3/11/2000', '1:45', 'PM', '\n', 'MR']),
list(['\n', '[PROV', 'REPORT]', '\n', 'Person', 'Name:', '\n', 'Amy', 'Army', '\n', 'Program', 'Date:', '10/1/2000', '11:45', 'AM', '\n', 'MR'])], dtype=object)
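Question
How do I achieve my desired output?
Odd structure, but achievable with some ... and ...
Can you find a better way to load/parse your text? It looks very weird at the moment, and fixing that might save a lot of work down the line.
@SamMason The text has been tokenized, which I think is why it is formatted like this. Is that what you mean when you say it looks "very weird"?
What kind of tokenization is this? If you are doing any kind of NLP, I would not expect line breaks in the middle of "words".
I used SpaceTokenizer to preserve \n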
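Following up on the suggestion to parse the text differently: if the raw, untokenized text is still available, one option is to tokenize it so that \n never ends up inside a word in the first place. The sketch below swaps SpaceTokenizer for a plain re.findall pattern. That is a substitution for illustration, not the tokenizer used in the question, and the raw string is a hypothetical reconstruction of the first row.

```python
import re

# Hypothetical raw text, reconstructed from the first row of the Text column
raw = '\n[PROV REPORT]\nPerson Name: \n John Dear\nProgram Date: 1/11/2000 10:42 AM\nMR'

# A token is either a single newline, or a run of characters
# that are neither a space nor a newline.
tokens = re.findall(r'\n|[^ \n]+', raw)
print(tokens)
```

This treats only the space character and \n as delimiters. Tabs or other whitespace would remain inside tokens, so the pattern may need adjusting for real data.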