Python: counting the total number of modal verbs in a text


I am trying to create a custom collection of words grouped into the following categories:

Modal    Tentative    Certainty    Generalizing
Can      Anyhow       Undoubtedly  Generally
May      anytime      Ofcourse     Overall
Might    anything     Definitely   On the Whole
Must     hazy         No doubt     In general
Shall    hope         Doubtless    All in all
ought to hoped        Never        Basically
will     uncertain    always       Essentially
need     undecidable  absolute     Most
Be to    occasional   assure       Every
Have to  somebody     certain      Some
Would    someone      clear        Often
Should   something    clearly      Rarely
Could    sort         inevitable   None
Used to  sorta        forever      Always
I am reading the text line by line from a CSV file:

import nltk
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

count = defaultdict(int)
header_list = ["Modal", "Tentative", "Certainty", "Generalizing"]
categorydf = pd.read_csv('Custom-Dictionary1.csv', names=header_list)

def analyze(file):
    df = pd.read_csv(file)
    modals = str(categorydf['Modal'])
    tentative = str(categorydf['Tentative'])
    certainity = str(categorydf['Certainty'])
    generalizing = str(categorydf['Generalizing'])
    for text in df['Text']:
        tokenize_text = text.split()
        for w in tokenize_text:
            if w in modals:
                count[w] += 1

analyze("test1.csv")
print(sum(count.values()))
print(count)
I want to find the number of Modal/Tentative/Certainty/Generalizing words from the table above occurring in each row of test1.csv, but I am unable to. The code above generates word frequencies with numbers:

19
defaultdict(<class 'int'>, {'to': 7, 'an': 1, 'will': 2, 'a': 7, 'all': 2})

I am stuck and getting nothing useful. How should I proceed?
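For comparison, here is a minimal self-contained sketch of an exact-membership approach using one lowercase set per category. This is my own illustration, not code from the question or answer: a two-row excerpt of the table is inlined via StringIO as an assumption so the sketch runs without external files (in practice you would read Custom-Dictionary1.csv and test1.csv).

```python
import io
from collections import defaultdict
import pandas as pd

# Hypothetical two-row excerpt of the category table, inlined so the
# sketch is self-contained; in practice read Custom-Dictionary1.csv.
header_list = ["Modal", "Tentative", "Certainty", "Generalizing"]
categorydf = pd.read_csv(io.StringIO(
    "Can,Anyhow,Undoubtedly,Generally\n"
    "will,uncertain,always,Essentially\n"), names=header_list)

# One lowercase set per category: `w in some_set` is an exact membership
# test, unlike `w in str(series)`, which is a substring search over the
# Series' printed representation and produces false matches.
category_sets = {col: set(categorydf[col].str.lower()) for col in header_list}

count = defaultdict(int)
texts = ["the goal was to devise an efficient will system"]  # stand-in for df['Text']
for text in texts:
    for w in text.lower().split():
        if any(w in words for words in category_sets.values()):
            count[w] += 1

print(sum(count.values()))   # 1
print(dict(count))           # {'will': 1}
```

The key difference from the question's code is building sets once and testing exact membership, instead of substring-matching against the string form of a Series.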

I have solved the task for your initial CSV format; XML input could be supported too if needed.

I used quite a fancy solution, which is why it may look a bit complex, but it runs very fast and is suited to big data, even gigabytes.

It uses a sorted table of words, also sorts the words of the text for counting, and then does a sorted search in the table, hence it works in O(n log n) time.
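The sorted-search idea described here can be sketched with Python's standard bisect module. This is my own illustration of the technique, not the answer's actual code; the table excerpt is hypothetical.

```python
import bisect

# Sorted list of (word, category) pairs built from a table excerpt.
table = sorted([("can", "Modal"), ("clear", "Certainty"),
                ("generally", "Generalizing"), ("will", "Modal")])
keys = [w for w, _ in table]

def lookup(word):
    # Binary search in the sorted key list: O(log n) per word
    # instead of a linear scan over the whole table.
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return table[i][1]
    return None

# Sorting the text's words as well groups duplicates together,
# which is what makes the overall pass O(n log n).
words = sorted("it became clear that will happen".lower().split())
hits = {w: lookup(w) for w in words if lookup(w) is not None}
print(hits)   # {'clear': 'Certainty', 'will': 'Modal'}
```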

For each text it outputs the original line first, then on the Found line it lists, in sorted order, every word found in the table as (count, category, (TableRow, TableCol)), and then on the Non-Found line the words not found in the table together with their counts (the number of times each word occurred in the text).

After the first solution there is a similar but simpler (and slower) one.

The second solution is implemented in pure Python, just for simplicity, using only the standard Python modules io and csv.

It outputs:

'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
I am reading the CSV content from StringIO; that is just for convenience, so that the code is all-inclusive and needs no extra files. In your case you will of course read the files directly, which you can do with the next code block and at the next link (named Try it online!):


Could you please provide both tables not only as pictures but also as plain text? Then we could use that data to test a correct version of the script. As "efficient will" shows, you need contextual analysis to decide whether it is a modal verb or a noun; "can" and (to a lesser extent) "must" have similar problems.
str(categorydf['modal'])
is not the correct way to check for a string in a
pd.Series. I have pasted the content into a Google Sheet. Here is the link: @HenryYik Thanks for pointing that out. Since the df returns an object, I converted it to a string. Thank you for the reply @Arty. I am still trying to understand your code. I get KeyError: 'Text' in: texts = [e['Text'] for e in texts]. @NEERAJKUMAR That depends on how your texts CSV file is organized. If you look at the CSV example inside my code you can see what my CSV looks like. Its first line is "Text", which gives the name of the text column.
e['Text']
basically fetches a cell's value by column name. @NEERAJKUMAR I have just updated the code to reflect this change. I removed "Text" from the first line and added the param
fieldnames=['Text']
to the dict reader. Now it looks like your CSV file. Try it! Now you can run it on your CSV without any changes. @NEERAJKUMAR BTW, I hope you are not going to put the whole CSV text into the code like I did in my example, but just provide a file name; I converted it to code so that you don't have to in my e
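The KeyError / fieldnames behaviour discussed in these comments can be demonstrated with a tiny self-contained sketch (my own example data, not the answer's files):

```python
import io, csv

raw = '"first text line"\n"second text line"\n'

# Without fieldnames, DictReader treats the first data row as the header,
# so there is no 'Text' key and e['Text'] raises KeyError.
# With fieldnames=['Text'], every row (including the first) is data.
with_header = csv.DictReader(io.StringIO(raw), fieldnames=['Text'])

texts = [e['Text'] for e in with_header]
print(texts)   # ['first text line', 'second text line']
```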
0 : Text: When LIWC was first developed, the goal was to devise an efficient will system
Found: "will": (1, Modal, (6, 0))
Non-Found: "an": 1, "developed,": 1, "devise": 1, "efficient": 1, "first": 1, "goal": 1, "liwc": 1, "system": 1, "the": 1, "to": 1, "was": 2, "when": 1

1 : Text: Within a few years, it became clear that there are two very broad categories of words
Found: "clear": (1, Certainty, (10, 2))
Non-Found: "a": 1, "are": 1, "became": 1, "broad": 1, "categories": 1, "few": 1, "it": 1, "of": 1, "that": 1, "there": 1, "two": 1, "very": 1, "within": 1, "words": 1, "years,": 1

2 : Text: Content words are generally nouns, regular verbs, and many adjectives and adverbs.
Found: "generally": (1, Generalizing, (0, 3))
Non-Found: "adjectives": 1, "adverbs.": 1, "and": 2, "are": 1, "content": 1, "many": 1, "nouns,": 1, "regular": 1, "verbs,": 1, "words": 1

3 : Text: They convey the content of a communication.
Found:
Non-Found: "a": 1, "communication.": 1, "content": 1, "convey": 1, "of": 1, "the": 1, "they": 1

4 : Text: To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”
Found:
Non-Found: "a": 1, "and": 2, "are:": 1, "back": 1, "content": 1, "dark": 1, "go": 1, "night”": 1, "phrase": 1, "stormy": 1, "the": 2, "to": 2, "was": 1, "words": 1, "“dark,”": 1, "“it": 1, "“night.”": 1, "“stormy,”": 1
import io, csv

# Instead of io.StringIO(...) just read from filename.
tab = csv.DictReader(io.StringIO("""Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))

texts = csv.DictReader(io.StringIO("""
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
"""), fieldnames = ['Text'])

tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]

for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
import io, csv

tab = csv.DictReader(open('table.csv', 'r', encoding = 'utf-8-sig'))
texts = csv.DictReader(open('texts.csv', 'r', encoding = 'utf-8-sig'), fieldnames = ['Text'])

tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]

for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
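To get the single total the question originally asked for, the per-text counts could also be merged with collections.Counter. This is a small addition of mine; the per-text dictionaries below are hypothetical stand-ins for the `cnt` dicts produced inside the loop above.

```python
from collections import Counter

# Hypothetical per-text count dicts, one per CSV row.
per_text = [{'will': 1}, {'clear': 1}, {'generally': 1}]

# Counter.update adds counts rather than overwriting keys,
# so repeated words across texts accumulate correctly.
total = Counter()
for cnt in per_text:
    total.update(cnt)

print(sum(total.values()))   # 3
print(dict(total))           # {'will': 1, 'clear': 1, 'generally': 1}
```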