Python: separating dictionaries from text

python dictionary nlp


I have many strings like this (this is one string, not several):

I want to separate the text and the dictionaries, like this:

1. Hello , I , am, fine.
2. {'type': 'bold', 'text': 'Multi class f1 score'}
3. {'type': 'mention', 'text': '@Abhishek'}
4. Singh you can continue with the deep learning specialization from Andrew Ng. It is very much informative and lots to learn and its very smplified and for the certificate you can apply for financial aid option..the courses will be available in 15 days

Splitting on "," will not help, because it causes two problems:

The key-value pairs of each dictionary get broken apart at the comma, so the result looks like {'type': 'mention' 'text': '@Abhishek'}

I would lose all the "," characters from part 1.
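A minimal sketch of both problems, using a shortened, hypothetical sample in the same shape as the string described above (the variable names are illustrative only):

```python
# Shortened, hypothetical sample; not the full string from the question.
sample = "'Hello , I , am, fine.' ,{'type': 'mention', 'text': '@Abhishek'}"

# Naive approach: split on every comma.
parts = sample.split(",")

# The sentence is cut into four fragments (its commas are gone),
# and the dictionary is cut into two fragments.
for part in parts:
    print(repr(part))
```

The text loses its commas and the dictionary no longer survives as one piece, so the separator alone cannot tell text from dictionaries.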

Note that the text may also contain emojis in UTF-8 encoded form.

How can I do this?

Try using a regular expression. First extract the dictionary parts from the string, then extract the parts enclosed in single quotes.
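One way this two-step idea might look is sketched below. It assumes the dictionaries contain no nested braces and the text contains no single quotes; the sample string and variable names are hypothetical.

```python
import re

# Shortened, hypothetical sample in the same shape as the string in the question.
sample = "'Hello , I , am, fine.' ,{'type': 'bold', 'text': 'Multi class f1 score'}, ' Singh you can continue'"

# Step 1: pull out every {...} block (assumes no nested braces).
dict_parts = re.findall(r"\{.*?\}", sample)

# Step 2: remove the dictionaries, then collect what is left inside single quotes.
remainder = re.sub(r"\{.*?\}", "", sample)
text_parts = re.findall(r"'(.*?)'", remainder)

print(dict_parts)
print(text_parts)
```

This keeps dictionaries and text in two separate lists, so the original interleaved order is lost. If the order matters, a single pattern that matches both kinds of item in one pass is a better fit.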

You can use a regular expression to split the content the way you need:

import re

string = "'Hello , I , am, fine.' ,{'type': 'bold', 'text': 'Multi class f1 score'}, {'type': 'mention', 'text': '@Abhishek'}, ' Singh you can continue with the deep learning specialization from Andrew Ng. It is very much informative and lots to learn and its very smplified and for the certificate you can apply for financial aid option..the courses will be available in 15 days'"
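# Match either a single-quoted chunk (group 1) or a {...} dictionary (group 2)
results = re.findall(r"'(.*?)'|(\{.*?\})", string)
# findall returns (group1, group2) tuples; keep only the non-empty group of each
results = [item for elem in results for item in elem if item]
for e in results:
    print(e)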

This returns:

Hello , I , am, fine.
{'type': 'bold', 'text': 'Multi class f1 score'}
{'type': 'mention', 'text': '@Abhishek'}
 Singh you can continue with the deep learning specialization from Andrew Ng. It is very much informative and lots to learn and its very smplified and for the certificate you can apply for financial aid option..the courses will be available in 15 days
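As an alternative to a regular expression, and only under the assumption that the whole string is a well-formed, comma-separated sequence of Python literals (quoted strings and dict literals, with any apostrophes inside the text properly escaped), ast.literal_eval from the standard library can parse it directly. The sample below is a shortened, hypothetical version of the string.

```python
import ast

# Shortened, hypothetical sample in the same shape as the string in the question.
sample = "'Hello , I , am, fine.' ,{'type': 'mention', 'text': '@Abhishek'}, ' Singh you can continue'"

# A comma-separated sequence of literals parses as a tuple.
items = ast.literal_eval(sample)

# Separate plain text from dictionaries by type; the dictionaries are real dict objects.
texts = [item for item in items if isinstance(item, str)]
dicts = [item for item in items if isinstance(item, dict)]

print(texts)
print(dicts)
```

This yields actual dict objects rather than their string form, and commas or emojis inside the text need no special handling. It raises an exception (typically SyntaxError or ValueError) on input that is not valid literal syntax, so the regular expression may still be the more forgiving option for messy data.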

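What have you tried?

I am trying to analyze a Telegram chat whose data is stored in JSON format, for example: {"id": 9860, "type": "message", "date": "2020-06-11T01:01:25", "from": "A.", "from_id": 1072244642, "text": ["mohak pl check out", {"type": "link", "text": "https:\/\/www.kaggle.com\/abilashivs\/kernel3e217ae073"}. I have a column named text that contains values shaped like this, and I want to extract the text from it. But when I try, I do not get the desired result.

Please try with text = "'Hey', {'type': 'mention_name', 'text': 'Krish', 'user_id': 935251183}, '\nI have a doubt.\nIf we have an imbalanced dataset. The target variable has Yes/No values.\nYes-9500\nNo-500\nWe have 10 features.\nNow 4 features have around 3000 null values.\n\nSince we ultimately have to balance the dataset, should we drop the 3000 records?\nWe also should not drop the columns and then balance the dataset.'"

Got the solution by removing the escape sequences.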
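A possible explanation for the failure on that input, offered as an assumption rather than a confirmed diagnosis: by default "." in a regular expression does not match a newline, so a quoted text that spans several lines is never captured. Instead of stripping the newlines, the re.DOTALL flag can be passed. The sample below is a shortened, hypothetical version of the string from the comment.

```python
import re

# Hypothetical sample: the last quoted text contains real newline characters.
sample = "'Hey', {'type': 'mention_name', 'text': 'Krish'}, '\nI have a doubt.\nYes-9500\nNo-500'"

pattern = r"'(.*?)'|(\{.*?\})"

# Default flags: '.' stops at newlines, so the multi-line text is missed.
default_hits = [g for m in re.findall(pattern, sample) for g in m if g]

# With re.DOTALL, '.' also matches newlines, so the multi-line text is captured whole.
dotall_hits = [g for m in re.findall(pattern, sample, flags=re.DOTALL) for g in m if g]

print(len(default_hits), len(dotall_hits))
```

With re.DOTALL the newlines are preserved in the extracted text, which may matter if the line structure of the message is needed later.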
