Python: extracting specific information from data

python python-3.x nltk stanford-nlp information-retrieval

How can I convert data such as:

James Smith was born on November 17, 1948

into

("James Smith", DOB, "November 17, 1948")

without relying on positional indices into the string?

I tried the following:

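import nltk  # needed for nltk.RegexpParser below
from nltk import word_tokenize, pos_tag

new = "James Smith was born on November 17, 1948"
sentences = word_tokenize(new)     # split the sentence into tokens
sentences = pos_tag(sentences)     # tag each token with its part of speech
grammar = "Chunk: {<NNP*><NNP*>}"  # chunk two consecutive proper nouns
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences)
print(result)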

How can I go further and get the output in the desired format?

Split the string on "was born on", then strip the whitespace from each part and assign the parts to name and dob.
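A minimal sketch of that split approach, assuming each sentence contains the exact phrase "was born on". The function name `extract_dob` and the use of the string "DOB" as the middle label are illustrative choices, not something given in the question.

```python
def extract_dob(sentence, marker="was born on"):
    """Return (name, "DOB", date) by splitting on the marker phrase, or None."""
    # partition() returns (before, marker, after); the middle slot is empty
    # when the marker does not occur in the sentence.
    name, found, dob = sentence.partition(marker)
    if not found:
        return None
    # Strip the spaces that surrounded the marker phrase.
    return (name.strip(), "DOB", dob.strip())


print(extract_dob("James Smith was born on November 17, 1948"))
```

This avoids positional indices entirely, but it only works while the wording of the sentence stays fixed.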

You can always use a regular expression. The regex (\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+) will match a string in the format above and return its parts.

Here it is in action, as a regex in Python:

import re

regex = r"(\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)"
test_str = "James Smith was born on November 17, 1948"

matches = re.search(regex, test_str)

# group 0 is the entire match; groups 1-5 are the captured parts

print(matches.group(1)) # James
print(matches.group(2)) # Smith
print(matches.group(3)) # November
print(matches.group(4)) # 17
print(matches.group(5)) # 1948
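
The groups above still have to be joined to get the tuple asked for in the question. One possible way to do that is sketched below with named groups. The pattern, the name `to_triple`, and the "DOB" string label are assumptions for illustration, and the date part assumes the "Month D, YYYY" layout of the example sentence.

```python
import re

# Hypothetical pattern: the name is everything before "was born on",
# the date is a month word, a 1-2 digit day, a comma and a 4-digit year.
PATTERN = re.compile(
    r"(?P<name>.+?)\s+was born on\s+(?P<dob>\w+ \d{1,2}, \d{4})"
)


def to_triple(sentence):
    """Return (name, "DOB", date), or None if the sentence does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("name"), "DOB", match.group("dob"))


print(to_triple("James Smith was born on November 17, 1948"))
```

Capturing the name as a single lazy group, rather than two \S+ groups, also lets names with more than two words through.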