Python LDA (gensim): how to update a Postgres database with the correct topic number for each document?


I fetch different documents from a database and check with LDA (gensim) which latent topics they contain. That part works well. What I would like to do is save each document's most probable topic back into the database, and I am not sure what the best solution is. For example, I could pull each document's unique id together with text_column from the database at the start and carry it through the processing, so that at the end I know which id belongs to which topic number. Or should I do it in the last part, where I print the documents and their topics? But then I don't know how to connect that back to the database. By comparing the text column with the document and assigning the corresponding topic number? I would appreciate any comments.
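
The first option (selecting the unique id together with text_column and keeping the two aligned) boils down to something like the sketch below, using the same cursor (cur) as in the code that follows; the variable names are only illustrative, and the answer further down fleshes this out:

rows = cur.fetchall()                 # with SELECT id, text_column ...; each row is (id, text_column)
ids = [row[0] for row in rows]        # stays index-aligned with the texts
raw_texts = [row[1] for row in rows]  # run LDA on these; result i belongs to ids[i]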

# Python 2 code (print statements, izip); conn and cur are assumed to be an
# already opened psycopg2 connection and cursor
import string
from itertools import izip

from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import ldamodel

stop = stopwords.words('english')

sql = """SELECT text_column FROM table where NULLIF(text_column, '') IS NOT NULL;"""
cur.execute(sql)
dbrows = cur.fetchall()
conn.commit()

documents = []
for i in dbrows:
    documents = documents + list(i)   # each row is a 1-tuple (text_column,), so this builds a flat list of texts

# remove all the words from the stoplist and tokenize
stoplist = stopwords.words('english')

additional_list = set("``;''".split(";"))

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist
          and word not in string.punctuation
          and word.lower() not in additional_list]
         for document in documents]

# remove words that appear two times or fewer
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
texts = [[word for word in text if word not in tokens_once]
     for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
my_num_topics = 10

# lda itself
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
corpus_lda = lda[corpus]

# print the most contributing words for selected topics
for top in lda.show_topics(my_num_topics):
    print top

# print the most probable topic and the document
for l,t in izip(corpus_lda,documents):
    selected_topic = max(l,key=lambda item:item[1])
    if selected_topic[1] != 1.0 / my_num_topics:  # 1.0 so the comparison is not against 0 via integer division
        selected_topic_number = selected_topic[0]
        print selected_topic
        print t

As wildplasser commented, I just needed to select the id together with text_column. I had tried that before, but because of the way I was adding the data to the list it was not suitable for further processing. The code below works and creates a table with the id and the most probable topic for each document.
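
The pitfall is easy to see on a single fetched row (values are made up):

row = (42, 'some document text')   # one (id, text_column) row

kept = []
kept.append(row)                   # [(42, 'some document text')] -- the id stays paired with its text

flattened = []
flattened = flattened + list(row)  # [42, 'some document text'] -- the pairing is lost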

# the imports and the psycopg2 connection are the same as in the first snippet
stop = stopwords.words('english')

sql = """SELECT id, text_column FROM table where NULLIF(text_column, '') IS NOT NULL;"""
cur.execute(sql)
dbrows = cur.fetchall()
conn.commit()

documents = []
for i in dbrows:
    documents.append(i)   # keep each row as an (id, text_column) tuple

# remove all the words from the stoplist and tokenize
stoplist = stopwords.words('english')

additional_list = set("``;''".split(";"))

texts = [[word.lower() for word in document[1].split()
          if word.lower() not in stoplist
          and word not in string.punctuation
          and word.lower() not in additional_list]
         for document in documents]

# remove words that appear two times or fewer
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
texts = [[word for word in text if word not in tokens_once]
 for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
my_num_topics = 10

# lda itself
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
corpus_lda = lda[corpus]

# print the most contributing words for selected topics
for top in lda.show_topics(my_num_topics):
    print top

# collect the most probable topic for each document, together with its id
lda_topics = []
for l,t in izip(corpus_lda,documents):
    selected_topic = max(l,key=lambda item:item[1])
    if selected_topic[1] != 1.0 / my_num_topics:  # 1.0 so the comparison is not against 0 via integer division
        selected_topic_number = selected_topic[0]
        lda_topics.append((selected_topic[0],int(t[0])))

cur.execute("""CREATE TABLE table_topic (id bigint PRIMARY KEY, topic int);""")
for j in lda_topics:
    my_id = j[1]
    topic = j[0]
    cur.execute("INSERT INTO table_topic (id, topic) VALUES (%s, %s)", (my_id,topic))
    conn.commit()
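
As a small variation (not part of the original answer), the row-by-row inserts could be batched with psycopg2's executemany and committed once:

cur.executemany("INSERT INTO table_topic (id, topic) VALUES (%s, %s)",
                [(doc_id, topic) for topic, doc_id in lda_topics])
conn.commit()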

Normally you would select the PK along with the text from the database, like SELECT id, text_column FROM table WHERE ... In Python you can then put the key->value pairs into a dict with the id as key, or into a set/array of 2-tuples.

Thanks! I was just overcomplicating things in my head. Using documents.append(i) in the first loop works perfectly. Before, with documents = documents + list(i), as soon as I added the id to the select query the words started getting split into single letters. One line also differs, just in case someone needs the code: texts = [[word.lower() for word in document[1].split() if word.lower() not in stoplist and word not in string.punctuation and word.lower() not in additional_list] for document in documents]
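
For reference, the dict variant suggested in that first comment might look like this (a sketch over the dbrows fetched above):

docs_by_id = {row[0]: row[1] for row in dbrows}   # id -> text_column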