Python natural language processing: data extraction (python, nltk, data-extraction, information-extraction)


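I need help processing unstructured data from day-trading, swing-trading and investment recommendations. I have the unstructured data saved as a CSV file.

Below are 3 sample paragraphs from which the data needs to be extracted: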

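Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with an intra-day target price of Rs 338. The current market price of Coal India Ltd. is 325.15. Chandan Taparia recommended keeping the stop loss at Rs 318.

Kotak Securities Limited has a buy call on Engineers India Ltd. with a target price of Rs 335. The current market price of Engineers India Ltd. is Rs 266.05. The time period given by the analyst is a year for the Engineers India Ltd. price to reach the defined target. Engineers India enjoys a healthy market share in the hydrocarbon consultancy segment. It enjoys a prolific relationship with a few of the major oil and gas companies such as HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a recovery in infrastructure spending in the hydrocarbon sector.

Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a target price of Rs 1150. The current market price of Ceat Ltd. is 1199.6. The time period given by the analyst is 1-3 days for the Ceat Ltd. price to reach the defined target. Kunal Bothra maintained a stop loss at Rs 1240.
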
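The challenge is to extract 4 pieces of information from each paragraph. Every recommendation is framed differently, but essentially contains:

Target price

Stop-loss price

Current price

Duration

Also, not every recommendation provides all of this information. Each recommendation has at least a target price.

I tried using regular expressions, but without much success. Can anyone show me how to extract this information using nltk?

The code I have used so far to clean the data: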

    import pandas as pd
    import re

    # etanalysis_final.csv has 4 columns:
    # - the 0th column holds the date and time
    # - the 1st column holds a simple hint such as
    #   'Sell Ceat Ltd. target Rs 1150 : Kunal Bothra,Sell Ceat Ltd. at a price
    #   target of Rs 1150 and a stoploss at Rs 1240 from entry point'.
    #   Not all hints look the same, but I can rely on them for the
    #   recommender, Buy or Sell, and the stock.
    # - the 4th column holds the detailed recommendation.
    df = pd.read_csv('etanalysis_final.csv', encoding='ISO-8859-1')
    df.DATE = pd.to_datetime(df.DATE)
    df.dropna(inplace=True)
    df['RECBY'] = df['C1'].apply(lambda x: re.split(':|\x96', x)[-1].strip())
    df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
    df['STK'] = df['C1'].apply(lambda x: re.split(r'\.|\,|:| target| has| and|Buy|Sell| with', x)[1])
    # Getting the target price - not always correct
    df['TGT'] = df['C4'].apply(lambda x: re.findall(r'\d+.', x)[0])
    # Getting the stop loss price - not always correct
    df['STL'] = df['C4'].apply(lambda x: re.findall(r'\d+.\d+', x)[-1])

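This is a hard question to answer, because each of the 4 pieces of information can be written in many different ways. Here is a naive approach that may work, although it needs validation. I will show it for the target price, but you can extend it to any of the other fields: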

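    CONTEXT = 6

    def is_float(x):
        try:
            float(x)
            return True
        except ValueError:
            return False

    def get_target_price(s):
        words = s.split()
        n = words.index('target')  # raises ValueError if 'target' is absent
        # clamp the start so a negative index cannot wrap around the list
        words_in_range = words[max(0, n - CONTEXT):n + CONTEXT]
        # return the first token in the window that parses as a float
        return float(list(filter(is_float, words_in_range))[0])

This is a simple approach to get you started, but you can add extra checks to make it safer. Things that could be improved:

Check that the token immediately before the float that was found is Rs.

If no float is found within the context range, expand the context.

If there is ambiguity, i.e. multiple instances of target or multiple floats within the context range, add user verification.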
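
The three improvements listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `find_price_after`, its parameters and the `(price, ambiguous)` return shape are hypothetical, and unlike the symmetric window used earlier it searches only the tokens after the keyword, which is a simplification that happens to fit the sample paragraphs.

```python
import re

# A plain integer or decimal number such as "338" or "325.15".
NUMBER = re.compile(r'\d+(?:\.\d+)?')


def find_price_after(text, keyword, context=6, max_context=24):
    """Return (price, ambiguous) for the first number found after keyword.

    price is None when the keyword or a number cannot be found.
    Hypothetical helper: searches forward only (an assumption).
    """
    # Strip sentence punctuation so that "338." and "Rs." compare cleanly.
    words = [w.strip('.,;:') for w in text.split()]
    lowered = [w.lower() for w in words]
    hits = [i for i, w in enumerate(lowered) if w == keyword]
    if not hits:
        return None, False
    n = hits[0]
    while context <= max_context:
        window = range(n + 1, min(len(words), n + 1 + context))
        numbers = [i for i in window if NUMBER.fullmatch(words[i])]
        if numbers:
            # Prefer a number that is directly preceded by "Rs".
            with_rs = [i for i in numbers if lowered[i - 1] == 'rs']
            chosen = (with_rs or numbers)[0]
            # Flag the result for manual review when there was a choice.
            ambiguous = len(hits) > 1 or len(numbers) > 1
            return float(words[chosen]), ambiguous
        context *= 2  # nothing found: widen the window and retry
    return None, False


sample = (
    "Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. "
    "with an intra-day target price of Rs 338. The current market price "
    "of Coal India Ltd. is 325.15. Chandan Taparia recommended keeping "
    "the stop loss at Rs 318."
)

for key in ('target', 'current', 'stop'):
    print(key, find_price_after(sample, key))
```

Rows that come back with the ambiguous flag set are the ones worth checking by hand, instead of prompting the user on every row.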
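
I found a solution:

The code here contains only the part that solves the problem in the question. The solution could be improved considerably by using a library.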

    from nltk import word_tokenize

    periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's",
               'week', "week's", 'intra-day', 'intraday']
    stop = ['target', 'current', 'stop', 'period', 'stoploss']

    def isfloat(x):
        # helper that the original snippet called but did not define
        try:
            float(x)
            return True
        except ValueError:
            return False

    def value_after(tks, key):
        # token that directly follows the first occurrence of key, or ''
        if key in tks:
            i = tks.index(key)
            if i + 1 < len(tks):
                return tks[i + 1]
        return ''

    def extractinfo(row):
        # lowercase everything so the keyword lookups below are case-insensitive
        row = row.lower().replace('intra day', 'intra-day')
        # keep only keywords, period words and numbers; period words must be
        # kept here, otherwise the period lookups below can never match
        tks = [w for w in word_tokenize(row)
               if w in stop or w in periods or isfloat(w)]
        tgt = value_after(tks, 'target')
        crt = value_after(tks, 'current')
        stp = value_after(tks, 'stop')
        prd = ''
        if 'period' in tks:
            i = tks.index('period')
            pdd = tks[i:i + 3]
            prr = [w for w in pdd if w in periods]
            if prr:
                j = pdd.index(prr[0])
                # keep a preceding number, e.g. "3 days", when there is one
                if j > 1 and isfloat(pdd[j - 1]):
                    prd = ' '.join(pdd[j - 1:j + 1])
                else:
                    prd = prr[0]
        else:
            prdd = [w for w in tks if w in periods]
            if prdd:
                prd = prdd[0]
        return (crt, tgt, stp, prd)

The solution is fairly self-explanatory. If it is not, please let me know.

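You are doing an awful lot of regex work for a simple search("target price of Rs (\d+)").

The target price is not always given as "target price of Rs". Sometimes it is "target 500", sometimes another variant of that, and so on.

...but not in the data you provided. Anyway, natural language processing is hard to get right, and you do not seem to have actually tried using it yet.

True, I have never used natural language processing. I did try ipython, but nothing worth mentioning.

ipython has nothing to do with nltk. Your post is at risk of being closed as too broad unless you make an attempt at solving the problem with it. That is my point.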
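
The regex idea raised in the comments can be made a little more tolerant of the phrasing variants mentioned there. The following is a minimal sketch, assuming only the wordings that appear in the sample paragraphs plus the bare "target 500" form; the names `PRICE_PATTERNS`, `DURATION` and `extract_fields` are hypothetical, and the patterns would almost certainly need extending for real data.

```python
import re

NUM = r'(\d+(?:\.\d+)?)'      # integer or decimal price
RS = r'(?:Rs\.?\s*)?'         # optional "Rs" / "Rs." prefix

# One tolerant pattern per price field (assumed phrasings, not exhaustive).
PRICE_PATTERNS = {
    'target': re.compile(
        r'target(?:\s+price)?(?:\s+(?:of|at))?\s+' + RS + NUM, re.I),
    'stop_loss': re.compile(
        r'stop\s*-?\s*loss(?:\s+(?:of|at))?\s+' + RS + NUM, re.I),
    'current': re.compile(
        r'current\s+market\s+price\b.*?\bis\s+' + RS + NUM, re.I),
}

# Durations such as "intra-day", "1-3 days" or "a year".
DURATION = re.compile(
    r'\b(intra-?\s?day'
    r'|\d+(?:-\d+)?\s+(?:day|week|month|year)s?'
    r'|a\s+(?:day|week|month|year))\b',
    re.I,
)


def extract_fields(text):
    """Return a dict of the four fields; missing fields are None."""
    result = {}
    for name, pattern in PRICE_PATTERNS.items():
        match = pattern.search(text)
        result[name] = float(match.group(1)) if match else None
    match = DURATION.search(text)
    result['duration'] = match.group(1) if match else None
    return result


ceat = (
    "Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a "
    "target price of Rs 1150. The current market price of Ceat Ltd. is "
    "1199.6. The time period given by the analyst is 1-3 days for the "
    "Ceat Ltd. price to reach the defined target. Kunal Bothra maintained "
    "a stop loss at Rs 1240."
)

print(extract_fields(ceat))
print(extract_fields("Buy XYZ, target 500"))
```

Returning None for a missing field keeps rows without a stop loss or a duration usable, which matches the point in the question that only the target price is always present.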