Python: convert my data into a DataFrame where each row contains the list of tuples for one sentence
I want to read a .dat file in Python. I tried different ways of reading it and ended up with the following code:
datContent = open("..\\data\\train.dat.abs", 'r')
MyList=[]
for line in datContent:
    print(line)
This prints the content in the following form:
1 Should O
2 students O
3 be O
4 taught O
5 to O
6 compete O
7 or O
8 to O
9 cooperate O
10 ? O
------------------> THIS SHOWS, STARTING OF THE NEXT SENTENCES
1 It O
2 is O
3 always O
4 said O
5 that O
6 competition O
7 can O
8 effectively O
9 promote O
10 the O
11 development O
12 of O
13 economy O
14 . O
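But I want to extract the word and its tag (the second and third columns) as a list of tuples:
[(Should, O), (students, O), (be, O), (taught, O), (to, O), (compete, O), (or, O), (to, O), (cooperate, O), (?, O)]
Each sentence (sentences are separated by a blank line in the original file) should become one row of the DataFrame. I tried splitting the lines.
This is what I have done so far: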
datContent = open("..\\data\\train.dat.abs", 'r', encoding='utf-8' )
MyList=[]
for line in datContent:
    a=line.split()
    print(a)
The result is:
['1', 'Should', 'O']
['2', 'students', 'O']
['3', 'be', 'O']
['4', 'taught', 'O']
['5', 'to', 'O']
['6', 'compete', 'O']
['7', 'or', 'O']
['8', 'to', 'O']
['9', 'cooperate', 'O']
['10', '?', 'O']
[]
['1', 'It', 'O']
['2', 'is', 'O']
['3', 'always', 'O']
['4', 'said', 'O']
['5', 'that', 'O']
['6', 'competition', 'O']
['7', 'can', 'O']
['8', 'effectively', 'O']
['9', 'promote', 'O']
['10', 'the', 'O']
['11', 'development', 'O']
['12', 'of', 'O']
['13', 'economy', 'O']
['14', '.', 'O']
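As I said, I want to save:
[(Should, O), (students, O), (be, O), (taught, O), (to, O), (compete, O), (or, O), (to, O), (cooperate, O), (?, O)]
as one row of the DataFrame df (basically items 2 and 3 of each list above). As you can see, an empty list [] separates the sentences, and so on.

Try this: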
import re

# Result form: abc['row1'], abc['row2'] ...
# `data` is the input string shown further down.
def getRowContainer(data):
    rowContainer={}
    rowData=[]
    rowCount=1
    # Token lines fill both groups; the "---->" separator line matches with both groups empty.
    dataSet=re.findall(r'(?:^\d{1,14}\s+([a-zA-Z0-9?!.,]{1,20})\s+([^\s]+))|^-{1,20}>',data,flags=re.MULTILINE)
    for item in (dataSet):
        if item[0]=='':  # separator line: start a new row
            rowCount+=1
            rowData=[]
            continue
        rowData.append(item)
        rowContainer[f'row{rowCount}']=rowData
    return rowContainer

rows=getRowContainer(data)
for x in range(1,len(rows)+1):
    print (f'row {x}')
    print (rows[f'row{x}'])
My copy of your input data is as follows:
data='''
1 Should O
2 students O
3 be O
4 taught O
5 to O
6 compete O
7 or O
8 to O
9 cooperate O
10 ? O
------------------> THIS SHOWS, STARTING OF THE NEXT SENTENCES
1 It O
2 is O
3 always O
4 said O
5 that O
6 competition O
7 can O
8 effectively O
9 promote O
10 the O
11 development O
12 of O
13 economy O
14 . O'''
The output I get:
row 1
[('Should', 'O'), ('students', 'O'), ('be', 'O'), ('taught', 'O'), ('to', 'O'), ('compete', 'O'), ('or', 'O'), ('to', 'O'), ('cooperate', 'O'), ('?', 'O')]
row 2
[('It', 'O'), ('is', 'O'), ('always', 'O'), ('said', 'O'), ('that', 'O'), ('competition', 'O'), ('can', 'O'), ('effectively', 'O'), ('promote', 'O'), ('the', 'O'), ('development', 'O'), ('of', 'O'), ('economy', 'O'), ('.', 'O')]
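
The check item[0]=='' in the code above relies on how re.findall reports capture groups that did not take part in a match: they come back as empty strings. The following is a minimal sketch of that behaviour, using a shortened, made-up sample rather than the real file; the names sample, pattern and matches are only for illustration.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's data
sample = "1 Should O\n2 students O\n------> NEXT\n1 It O"

# Same pattern as in the answer: either a token line (two groups) or a separator line
pattern = r'(?:^\d{1,14}\s+([a-zA-Z0-9?!.,]{1,20})\s+([^\s]+))|^-{1,20}>'

# Each match is a (word, tag) tuple; the separator line yields ('', '')
matches = re.findall(pattern, sample, flags=re.MULTILINE)
print(matches)
```

Note that the character class for the word only covers letters, digits and a few punctuation marks, so tokens containing other characters (a hyphen or an apostrophe, for example) would be skipped silently by this pattern.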
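Simply put, the solution is to collect the required items of each line in a temporary list for the current sentence, append that temporary list to MyList whenever a sentence ends, and finally build the DataFrame, as follows: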
import pandas as pd

MyList = []
tmp_list = []
with open("..\\data\\train.dat.abs", 'r', encoding='utf-8') as datContent:
    for line in datContent:
        a = line.split()
        if len(a) == 0:  # blank line between sentences
            if tmp_list:  # ignore repeated blank lines
                MyList.append(tmp_list)
            tmp_list = []
            continue
        tmp_list.append((a[1], a[2]))  # (word, tag)
if len(tmp_list) > 0:  # append the last sentence if the file does not end with a blank line
    MyList.append(tmp_list)
df = pd.DataFrame({'sentence': MyList})
print(df)
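
As an alternative sketch of the same grouping idea, itertools.groupby from the standard library can split the lines into runs of token lines and runs of everything else (blank lines or a "---->" separator). The names parse_sentences, is_token and sample are hypothetical, and the code assumes that every token line has exactly three whitespace-separated fields starting with a number.

```python
from itertools import groupby


def parse_sentences(lines):
    """Group tagged token lines into one list of (word, tag) tuples per sentence."""
    def is_token(line):
        parts = line.split()
        # Assumption: a token line looks like "<index> <word> <tag>"
        return len(parts) == 3 and parts[0].isdigit()

    sentences = []
    # groupby yields consecutive runs of lines that share the same is_token value
    for token_run, group in groupby(lines, key=is_token):
        if token_run:
            sentences.append([tuple(line.split()[1:3]) for line in group])
    return sentences


# Shortened, made-up sample in the same layout as the question's file
sample = """1 Should O
2 students O
10 ? O

1 It O
2 is O
14 . O
"""
sentences = parse_sentences(sample.splitlines())
print(sentences)
```

The resulting list of lists can be passed to pd.DataFrame({'sentence': sentences}) in the same way as MyList. Because anything that is not a token line acts as a separator, repeated blank lines do not produce empty rows.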