Python: best way to parse through scraped data

python json web-scraping scrapy

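I have managed to scrape a large amount of data with Scrapy, and all of it is currently stored as JSON objects in MongoDB. What I mostly want to know is how to parse and make sense of the data efficiently. I want to extract the data into subsections. For example, suppose I have data stored as: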

{
  "data": "category 1: test test \n category2: test test \n test test \n category 3: test test test \n category 4: this is data in category 4 "
}

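Basically, I want to extract, by keyword, everything that follows a keyword up to the next keyword. All of the information after category 1 ("test test") should be stored under "category 1". There is no real rhyme or reason to the order of the categories, nor to the amount of text after each category, but all of the categories have ...

I would like to know whether there are any libraries I could use to write a script that does this, or any tools that would do it for me automatically. A pointer to a resource where I can learn how to do something like this would also help.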

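This sounds like a specific enough task that you will probably need another pass of data processing. My go-to library for interacting with data in a Mongo database from Python is the one MongoDB itself recommends.

To parse the string itself, read up on regular expressions and, in particular, the methods Python offers for applying them.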
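As a minimal sketch of the regular-expression route, assuming the category labels are known in advance and each one is followed by a colon: the snippet below uses `re.split` from the standard library. The `keywords` list and the variable names are illustrative assumptions, not something the answer itself specifies.

```python
import re

# Sample record from the question.
data = ("category 1: test test \n category2: test test \n test test \n "
        "category 3: test test test \n category 4: this is data in category 4 ")

# Assumed list of labels; in practice it comes from what you know about the data.
keywords = ["category 1", "category2", "category 3", "category 4"]

# Build an alternation such as (category 1|category2|...) followed by a colon.
# re.escape guards against labels that contain regex metacharacters.
pattern = re.compile("(" + "|".join(re.escape(k) for k in keywords) + "):")

# Because the pattern has a capturing group, re.split keeps the matched labels:
# [text_before_first_label, label_1, text_1, label_2, text_2, ...]
parts = pattern.split(data)

# Pair every label with the text that follows it, trimming outer whitespace.
sections = {label: text.strip() for label, text in zip(parts[1::2], parts[2::2])}
print(sections)
```

Requiring the colon after the label keeps the trailing "category 4" inside the last value from being treated as a new section.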

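I would create a list of keywords and then find the index of each keyword in the data, if it is present. (I rearranged the order in which the keywords appear in the data to illustrate a later point.)

The start position of each keyword found in the data is given by its value in kw_indices, where -1 means the keyword was not found in the data.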
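A small sketch of that lookup, assuming a shortened sample string in which "category 4" is deliberately absent. It uses `str.find`, which already returns -1 for a missing substring, so it is a slightly different route to the same list than an explicit membership check followed by `str.index`.

```python
# Shortened, hypothetical sample: "category 3" appears before "category2",
# and "category 4" does not appear at all.
data = "category 1: test test \n category 3: test test test \n category2: test test \n "
keywords = ["category 1", "category2", "category 3", "category 4"]

# str.find returns the index of the first occurrence, or -1 when the
# substring is absent, which matches the "-1 means not found" convention.
kw_indices = [data.find(kw) for kw in keywords]
print(kw_indices)
```

The positions come back in keyword order, not in order of appearance, which is why a sorted copy of the list is needed in the next step.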

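To find the end index for each keyword, we look up the next start index in the kw_indices_sorted list, work out which keyword has that start index, and then take that next start index as the end of the current keyword's data.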

data_by_category = {}
for j in range(len(keywords)):
    kw = keywords[j]

    if kw_indices[j] > -1:
        # The keyword was found in the data and we know where in the string it starts
        kw_start = kw_indices[j]
        sorted_index = kw_indices_sorted.index(kw_start)
        if sorted_index < len(kw_indices_sorted) - 1:
            # This index is not the last/largest value in the list of sorted indices
            # so there will be a next value.
            next_kw_start = kw_indices[kw_indices.index(kw_indices_sorted[sorted_index + 1])]
            kw_data = data[kw_start:next_kw_start]
        else:
            kw_data = data[kw_start:]

        # If you don't want the keyword included in the result you can strip it out here
        kw_data = kw_data.replace(kw + ':', '')
        data_by_category[kw] = kw_data
    else:
        # The keyword was not found in the data, enter an empty value for it or handle this 
        # however else you want.
        data_by_category[kw] = ''

print(data_by_category)

Output:

{'category 1': ' test test \n ', 'category2': ' test test \n test test \n ', 'category 3': ' test test test \n ', 'category 4': ' this is data in category 4 '}

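Comments:

The OP is asking how to parse his string. The fact that it is stored in MongoDB is irrelevant to that.

Thanks for the suggestion! I agree I will probably have to do multiple passes. I am not too worried about the database part, since it is not very relevant to the question of how to actually parse this data. I feel I could do this locally in Python in a very dumb way (probably very inefficient and unable to handle all the data), but I was wondering whether there is a better way to do it!

However, I only used the word "category" as an example. It will not work if the categories are not named "category x". Suppose the categories are named "name", "age", "score". Is there a way to use regular expressions to account for the different category names? Basically, find an instance of the expression, take everything after it, and store it under the matching label, such as "name". Thanks again!

Yes, although the more complex the regular expression gets, the more brittle things become. Read up on this carefully; in your case it would probably change the regex to something more like r'(name|age|score): (.*) ...
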
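A rough sketch of how arbitrary label names such as name, age and score might be handled, assuming the labels are known up front and each is followed by a colon. The sample text and its values are hypothetical; only the three label names come from the comments. This uses `re.finditer` to locate the labels and slices between them, which is one possible approach and not necessarily the pattern the commenter had in mind.

```python
import re

# Hypothetical record; the values are made up for illustration.
text = "name: Alice Smith \n age: 30 \n score: 12 \n 15 \n"

labels = ["name", "age", "score"]
# \b stops "age:" from matching inside a longer word such as "page:".
label_re = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r"):")

matches = list(label_re.finditer(text))
fields = {}
# Pair each label match with the one after it (None for the last label).
for current, following in zip(matches, matches[1:] + [None]):
    # A section ends where the next label starts, or at the end of the text.
    end = following.start() if following else len(text)
    fields[current.group(1)] = text[current.end():end].strip()

print(fields)
```

Slicing between matches avoids having to express "everything up to the next label" inside the pattern itself, which tends to be the brittle part.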
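Complete script for the keyword-index approach:

d = {"data": "category 1: test test \n category 3: test test test \n category2: test test \n test test \n category 4: this is data in category 4 " }
keywords = ['category 1', 'category2', 'category 3', 'category 4']
kw_indices = [-1]*len(keywords)  # -1 marks a keyword that was not found
data = d['data']

# Record where each keyword first occurs in the data
for i in range(len(keywords)):
    kw = keywords[i]
    if kw in data:
        kw_indices[i] = data.index(kw)

# Start positions in ascending order, used to look up the next keyword's start
kw_indices_sorted = sorted(kw_indices)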

data_by_category = {}
for j in range(len(keywords)):
    kw = keywords[j]

    if kw_indices[j] > -1:
        # The keyword was found in the data and we know where in the string it starts
        kw_start = kw_indices[j]
        sorted_index = kw_indices_sorted.index(kw_start)
        if sorted_index < len(kw_indices_sorted) - 1:
            # This index is not the last/largest value in the list of sorted indices
            # so there will be a next value.
            next_kw_start = kw_indices[kw_indices.index(kw_indices_sorted[sorted_index + 1])]
            kw_data = data[kw_start:next_kw_start]
        else:
            kw_data = data[kw_start:]

        # If you don't want the keyword included in the result you can strip it out here
        kw_data = kw_data.replace(kw + ':', '')
        data_by_category[kw] = kw_data
    else:
        # The keyword was not found in the data, enter an empty value for it or handle this 
        # however else you want.
        data_by_category[kw] = ''

print(data_by_category)
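
A more compact variant of the same idea, offered as a sketch: it sorts (start, keyword) pairs and takes each section's end from the next pair's start. Two behaviours differ from the script above and are assumptions of this sketch: keywords that do not occur are left out of the result, where the script above stores an empty string for them, and only the first "keyword:" in each section is stripped.

```python
data = ("category 1: test test \n category 3: test test test \n "
        "category2: test test \n test test \n category 4: this is data in category 4 ")
keywords = ["category 1", "category2", "category 3", "category 4"]

# (start, keyword) pairs for the keywords that occur, ordered by position.
found = sorted((data.find(kw), kw) for kw in keywords if kw in data)

# Each section ends where the next one starts; the last runs to the end.
starts = [start for start, _ in found]
ends = starts[1:] + [len(data)]

compact = {
    kw: data[start:end].replace(kw + ":", "", 1)  # drop the leading label
    for (start, kw), end in zip(found, ends)
}
print(compact)
```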