
Python: generating a text file by repeating words according to frequency


I know that by Stack Overflow's standards this question may not be a good fit, but I have been practicing coding for a few months to parse and analyze text. I had never programmed before, and I have gotten help from this forum.

I analyzed several XML files using frequency analysis and stored the results in a MySQL database as [word, count] pairs.

I want to produce a text file by repeating each word according to its frequency (e.g. breakfast, 6 => breakfast repeated six times), with a space between the repetitions, ordered from the lowest frequency (at the beginning of the text) to the highest ("a" or "the" would be the most frequent and would end up in the last part of the text).

I would appreciate any ideas, libraries, or code examples. Thank you very much.

import re
import requests
import MySQLdb as mdb
from xml.etree import ElementTree
from collections import Counter




### MYSQL ###

db = mdb.connect(host="****", user="****", passwd="****", db="****")

cursor = db.cursor()
sql = "DROP TABLE IF EXISTS Table1"
cursor.execute(sql)
db.commit()
sql = "CREATE TABLE Table1(Id INT PRIMARY KEY AUTO_INCREMENT, keyword TEXT, frequency INT)"
cursor.execute(sql)
db.commit()



## XML PARSING
def main(n=1000):

    # A list of feeds to process and their xpath


    feeds = [
        {'url': 'http://www.nyartbeat.com/list/event_type_print_painting.en.xml', 'xpath': './/Description'},
        {'url': 'http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml', 'xpath': './/description'},
        {'url': 'http://www.artandeducation.net/category/announcement/feed/', 'xpath': './/description'},
        {'url': 'http://www.blouinartinfo.com/rss/visual-arts.xml', 'xpath': './/description'},
        {'url': 'http://feeds.feedburner.com/ContemporaryArtDaily?format=xml', 'xpath': './/description'}
    ]



    # A place to hold all feed results
    results = []

    # Loop all the feeds
    for feed in feeds:
        # Append feed results together
        results = results + process(feed['url'], feed['xpath'])

    # Join all results into a big string
    contents=",".join(map(str, results))

    # Collapse runs of whitespace into single spaces
    contents = re.sub(r'\s+', ' ', contents)

    # Remove everything that is not a letter or a space
    contents = re.sub(r'[^A-Za-z ]+', '', contents)

    # Create a list of lower-case words (keep everything at least 1 character long)
    words = [w.lower() for w in contents.split() if len(w) >= 1]


    # Count the words
    word_count = Counter(words)

    # Clean the content a little
    filter_words = ['art', 'artist']
    for word in filter_words:
        if word in word_count:
            del word_count[word]



    # Add to DB
    for word, count in word_count.most_common(n):
        sql = """INSERT INTO Table1 (keyword, frequency) VALUES (%s, %s)"""
        cursor.execute(sql, (word, count))
        db.commit()

def process(url, xpath):
    """
    Downloads a feed url and extracts the results with a variable path
    :param url: string
    :param xpath: string
    :return: list
    """
    contents = requests.get(url)
    root = ElementTree.fromstring(contents.content)
    # Return plain strings; encoding to bytes here would leak b'...' into the joined text
    return [element.text if element.text is not None else '' for element in root.findall(xpath)]





if __name__ == "__main__":
    main()
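Since the feeds above require network access, here is a minimal offline check of the same extraction and cleaning steps; the XML snippet and its contents are made up purely for illustration:

```python
import re
from collections import Counter
from xml.etree import ElementTree

# Hypothetical feed snippet standing in for a downloaded RSS document
xml = "<rss><channel><item><description>Breakfast art! art?</description></item></channel></rss>"
root = ElementTree.fromstring(xml)

# Same steps as main(): extract text, collapse whitespace, strip non-letters, lowercase
texts = [el.text if el.text is not None else '' for el in root.findall('.//description')]
contents = ",".join(texts)
contents = re.sub(r'\s+', ' ', contents)
contents = re.sub(r'[^A-Za-z ]+', '', contents)
words = [w.lower() for w in contents.split()]

print(Counter(words))  # Counter({'art': 2, 'breakfast': 1})
```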

Assuming the word_count.most_common(n) you use in your for loop returns a list of (word, count) tuples, in that order:

Let's store it in a variable:

words = word_count.most_common(n)
# Ex: [('a',5),('apples',2),('the',4)]
Using itemgetter, sort it by count:

from operator import itemgetter
words = sorted(words, key = itemgetter(1))
# words = [('apples', 2), ('the', 4), ('a', 5)]
Now go through each entry and append the word to a list, repeated count times:

out = []
for word, count in words:
    out += [word]*count
# out = ['apples', 'apples', 'the', 'the', 'the', 'the', 'a', 'a', 'a', 'a', 'a']
The next line joins it all into one long string:

final = " ".join(out)
# final = "apples apples the the the the a a a a a"
Now just write it to a file:

with open("filename.txt","w+") as f:
    f.write(final)
The full code looks like this:

from operator import itemgetter

words = word_count.most_common(n)
words = sorted(words, key = itemgetter(1))

out = []
for word, count in words:
    out += [word]*count

final = " ".join(out)

with open("filename.txt","w+") as f:
    f.write(final)
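As an aside, the repeat-and-join step can also be expressed with Counter.elements(), which yields each key as many times as its count. This is only a sketch with a hand-made Counter, and it relies on dictionaries preserving insertion order (Python 3.7+):

```python
from collections import Counter

word_count = Counter({'a': 5, 'the': 4, 'apples': 2})

# Rebuild the counter with keys inserted in ascending count order;
# elements() then yields each word repeated `count` times in that order
ascending = Counter(dict(sorted(word_count.items(), key=lambda kv: kv[1])))
final = " ".join(ascending.elements())

print(final)  # apples apples the the the the a a a a a
```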

I'm voting to close this question because this is neither a code-writing nor a tutorial service. I will add the code I have already written for parsing and analysis. Thanks!