Java: How to relate similar messages using NLP (java, nlp, opennlp)


I have several tweets to process. I am trying to find messages that indicate harm to a person. How can I achieve this with NLP?

I bought my son a toy gun
I shot my neighbor with a gun
I don't like this gun
I would love to own this gun
This gun is a very good buy
Feel like shooting myself with a gun

Of the sentences above, the second and the sixth are the ones I want to find.

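If the problem is limited to guns and shooting, you can use a dependency parser (such as the Stanford Parser) to find verbs and their (prepositional) objects: start from the verb and follow its dependents in the parse tree. For example, in sentences 2 and 6 this gives "shoot with gun".

You can then use lists of synonyms (or near-synonyms) of "shoot" ("kill", "murder", "wound", etc.) and of "gun" ("weapon", "rifle", etc.) to check whether they occur in this pattern (verb, preposition, noun) in each sentence.

There are other ways to express the same idea, e.g. "I bought a gun to shoot my neighbor", where the dependency relation is different. You would need to detect those kinds of dependencies as well.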
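The pattern check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the dependency triples are written by hand rather than produced by a parser, the relation names (such as `prep_with` and `vmod`) are illustrative labels, and `has_harm_pattern`, `HARM_VERBS`, and `WEAPON_NOUNS` are hypothetical names, not part of any library.

```python
# Hypothetical synonym lists taken from the examples in the text; extend as needed.
HARM_VERBS = {"shoot", "kill", "murder", "wound"}
WEAPON_NOUNS = {"gun", "weapon", "rifle"}


def has_harm_pattern(dependencies):
    """Return True if a harm verb governs a weapon noun through a preposition.

    `dependencies` is a list of (relation, head_lemma, dependent_lemma) triples.
    Prepositional relations are assumed to be named like "prep_with".
    """
    return any(
        relation.startswith("prep_")
        and head in HARM_VERBS
        and dependent in WEAPON_NOUNS
        for relation, head, dependent in dependencies
    )


# Hand-written triples (an assumption, not real parser output).
shot_neighbor = [
    ("nsubj", "shoot", "I"),
    ("dobj", "shoot", "neighbor"),
    ("prep_with", "shoot", "gun"),
]
toy_gun = [
    ("nsubj", "buy", "I"),
    ("iobj", "buy", "son"),
    ("dobj", "buy", "gun"),
    ("nn", "gun", "toy"),
]
bought_to_shoot = [
    ("nsubj", "buy", "I"),
    ("dobj", "buy", "gun"),
    ("vmod", "gun", "shoot"),
    ("dobj", "shoot", "neighbor"),
]

print(has_harm_pattern(shot_neighbor))    # verb-preposition-noun pattern present
print(has_harm_pattern(toy_gun))          # no harm verb at all
print(has_harm_pattern(bought_to_shoot))  # harm verb present, but not in this pattern
```

The third example is not matched by this single pattern, which is why differently structured sentences need additional rules of their own.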

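All of vpekar's suggestions are good. Here is some Python code that will at least parse the sentences and check whether they contain a verb from a user-defined set of harm words. Note: most "harm words" probably have multiple senses, many of which may have nothing to do with harm. This approach does not attempt word sense disambiguation.

(This code assumes you have NLTK and Stanford CoreNLP.)

This produces the following output: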

This message could indicate harm: [['PRP', 'I'], ['VBD', 'shot'], ['PRP$', 'my'], ['NN', 'neighbor'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]

This message could indicate harm: [['NNP', 'Feel'], ['IN', 'like'], ['VBG', 'shooting'], ['PRP', 'myself'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]

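I would take a look at SenticNet.

It provides an open-source knowledge base and a parser that assign sentiment values to fragments of text. Using the library, you could train it to recognize the statements you are interested in.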

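There is a lot of research on this. It would be a good idea to start by reading some papers or book chapters on classification and semantic processing.

You don't need to worry about it at all. Let the NSA handle it.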

import os
import subprocess
from xml.dom import minidom
from nltk.corpus import wordnet as wn

def StanfordCoreNLP_Plain(inFile):
    #Create the startup info so the java program runs in the background (for windows computers)
    startupinfo = None
    if os.name == 'nt':
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    #Execute the stanford parser from the command line
    #Build the classpath with os.pathsep (';' on Windows, ':' elsewhere) so it works on both
    classpath = os.pathsep.join(['stanford-corenlp-1.3.5.jar', 'stanford-corenlp-1.3.5-models.jar', 'xom.jar', 'joda-time.jar'])
    cmd = ['java', '-Xmx1g', '-cp', classpath, 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit,pos', '-file', inFile]
    subprocess.Popen(cmd, stdout=subprocess.PIPE, startupinfo=startupinfo).communicate()
    #The output is expected as <input file name>.xml in the current working directory
    xmldoc = minidom.parse(os.path.basename(inFile) + '.xml')
    itemlist = xmldoc.getElementsByTagName('sentence')
    Document = []
    #Get the data out of the xml document and into python lists
    for item in itemlist:
        SentNum = item.getAttribute('id')
        sentList = []
        tokens = item.getElementsByTagName('token')
        for d in tokens:
            word = d.getElementsByTagName('word')[0].firstChild.data
            pos = d.getElementsByTagName('POS')[0].firstChild.data
            sentList.append([str(pos.strip()), str(word.strip())])
        Document.append(sentList)
    return Document

def FindHarmSentence(Document):
    #Loop through sentences in the document.  Look for verbs in the Harm Words Set.
    VerbTags = ['VBN', 'VB', 'VBZ', 'VBD', 'VBG', 'VBP', 'V']
    HarmWords = ("shoot", "kill")
    ReturnSentences = []
    for Sentence in Document:
        for word in Sentence:
            if word[0] in VerbTags:
                #morphy returns None when WordNet has no base form for the token
                wordRoot = wn.morphy(word[1].lower(), wn.VERB)
                if wordRoot in HarmWords:
                    print("This message could indicate harm: " + str(Sentence))
                    ReturnSentences.append(Sentence)
    return ReturnSentences

#Assuming your input is a string, we need to put the strings in some file.
Sentences = "I bought my son a toy gun. I shot my neighbor with a gun. I don't like this gun. I would love to own this gun. This gun is a very good buy. Feel like shooting myself with a gun."
ProcessFile = "ProcFile.txt"
with open(ProcessFile, 'w') as OpenProcessFile:
    OpenProcessFile.write(Sentences)

#Sentence split, tokenize, and part of speech tag the data using Stanford Core NLP
Document = StanfordCoreNLP_Plain(ProcessFile)

#Find sentences in the document with harm words
HarmSentences = FindHarmSentence(Document)