Python 如何在我的话语列表中准确地挑出正确的字符串？_Python_String_Numpy_Pandas_Nltk

Python 如何在我的话语列表中准确地挑出正确的字符串？

python string numpy pandas

Python 如何在我的话语列表中准确地挑出正确的字符串？,python,string,numpy,pandas,nltk,Python,String,Numpy,Pandas,Nltk,因此，我正在编写一个python脚本，使用NumPy和Pandas以及NLTK从CHILDES数据库的Providence语料库中获取话语作为参考，我的脚本的想法是为语料库中的每个孩子填充一个数据框，其中包括他们的名字、包含我要寻找的语言特征的话语（否定类型）、他们说话时的年龄以及他们说话时的MLU 太好了现在，用户将能够在数据帧中填充此信息后进入，并将每个话语标记为特定类别，控制台将打印出他们将在两侧标记一行上下文的话语（如果他们只是看到孩子说“不”，很难在没有看到妈妈之前说了什么或有人之

因此，我正在编写一个python脚本，使用NumPy和Pandas以及NLTK从CHILDES数据库的Providence语料库中获取话语

作为参考，我的脚本的想法是为语料库中的每个孩子填充一个数据框，其中包括他们的名字、包含我要寻找的语言特征的话语（否定类型）、他们说话时的年龄以及他们说话时的MLU

太好了

现在，用户将能够在数据帧中填充此信息后进入，并将每个话语标记为特定类别，控制台将打印出他们将在两侧标记一行上下文的话语（如果他们只是看到孩子说“不”，很难在没有看到妈妈之前说了什么或有人之后说了什么的情况下说出他们的意思）

所以我的诀窍是获取上下文行。我在程序中用其他方法设置了它，以实现这一切，但我想让您看看其中一种方法的一部分，用于最初填充数据帧，如行所示： “if line==line_context:”，为我提供了大约91个误报

我知道为什么，因为我正在对每个文件进行一行一行的临时复制，以便对于最终有否定的每个语句，该语句在子数据帧中的索引将用作三个字符串列表的HashMap（或Python中的dict）中的键（嗯，字符串列表，因为这是儿童点播者给我的句子），话语，它前面的话语，它后面的话语

所以我有一个错误的行“if line==line_context”来检查它在遍历字符串列表（文件的逐行语句副本，或“line_context”）时，是否与“line”或正在遍历的孩子的语句行对齐，以便稍后我可以得到匹配的索引

问题是，这些“句子”中有许多是相同的字符序列，（['no']本身显示了很多！）因此我的程序将看到它是相同的，看到它有一个否定，并将其保存到数据帧，但它会在每次找到['no'的实例时保存它在我的文件副本中，这与该文件中的一行只有孩子的讲话相同，所以我得到了大约91个相同的额外实例

呸！不管怎样，有没有办法让我得到像“if line==line_context”这样的东西，在文件中选择一个['no']的实例，这样我就知道我在文件的两边都处于同一点上了？？？我使用的是NLTK CHILDESCorpusReader，它似乎没有用于这类内容的资源（否则，我就不必使用这种可笑的迂回方式来获得上下文信息！）

也许有一种方法，当我遍历我为每个文件制作的话语列表时，当一个话语与我也在遍历的孩子的话语匹配后，我可以更改和/或删除话语列表中的该项，以防止它再给我一个误报c.91次

谢谢

下面是le代码（我添加了一些额外的注释，希望能帮助您准确理解每一行应该做什么）：

考虑分离操作：首先从XML创建DataFrame（s），然后合并/CONTAT/整形，然后进行计算。目前还不清楚（至少对我来说）。你的问题是关于哪一部分的…混合使用xml解析和计算会让人困惑！尼克--这是一个已解决的问题吗？如果不是的话，我可以尝试一下。我解决了这个问题，但可能不是以一种有效的方式，因为自从我从事这个项目以来，我已经做了很多编程。

for file in value_corpus.fileids(): #iterates through the .xml files in the corpus_map
    for line_total in value_corpus.sents(fileids=file, speaker='ALL'): #creates a copy of the utterances by all speakers 
        utterance_list.append(line_total) #adds each line from the file to the list
    for line_context in utterance_list: #iterates through the newly created list
        for line in value_corpus.sents(fileids=file, speaker='CHI'): #checks through the original file's list of children's utterances
            if line == line_context: #tries to make sure that for each child's utterance, I'm at the point in the embedded for loop where the utterance in my utterance_list and the utterance in the file of child's sentences is the same exact sentence BUGGY(many lines are the same --> false positives)
                for type in syntax_types: #iterates through the negation syntactic types
                    if type in line: #if the line contains a negation
                        value_df.iat[i,5] = type #populates the "Syntactic Type" column
                        value_df.iat[i,3] = line #populates the "Utterance" column
                        MLU = str(value_corpus.MLU(fileids=file, speaker='CHI'))
                        MLU = "".join(MLU)
                        value_df.iat[i,2] =  MLU #populates the "MLU" column
                        value_df.iat[i,1] = value_corpus.age(fileids=file, speaker='CHI',month=True) #populates the "Ages" column
                        utterance_index = utterance_list.index(line_context)
                        try:
                            before_line = utterance_list[utterance_index - 1]
                        except IndexError: #if no line before, doesn't look for context
                            before_line = utterance_list[utterance_index]
                        try:
                            after_line = utterance_list[utterance_index + 1]
                        except IndexError: #if no line after, doesn't look for context
                            after_line = utterance_list[utterance_index] 
                            value_dict[i] = [before_line, line, after_line]
                            i = i + 1 #iterates to next row in "Utterance" column of df