Python: improving DataFrame (isin) processing performance


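My Python script runs out of memory (OOM) when processing large files. (It works on a small subset of records, ~10K.)

I am working with 2 files:

  • companys.csv (file size 19 MB), ~43K records
  • competitor_companies.csv (file size 427 MB), ~4.5 million records

File 1 has a field named uuid.

I need to compare:

  • If the company name is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor companies dataframe.

  • If the website is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor companies dataframe.

When I process the files on a server (~30 GB RAM), the script gets stuck at the following line: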

    logging.info('Matching TLD.')
    match_tld = competitor_companies.tld.isin(companies.tld)
    
The script then stops, and I see this line in /var/log/syslog:

    Out of memory: Kill process 177106 (company_generat) score 923 or sacrifice child
    
Python code:

    def MatchCompanies(
        companies: pandas.DataFrame,
        competitor_companies: pandas.DataFrame) -> Optional[Sequence[str]]:
      """Find Competitor companies in companies dataframe and generate a new list.
    
      Args:
        companies: A dataframe with company information from CSV file.
        competitor_companies: A dataframe with Competitor information from CSV file.
    
      Returns:
        A sequence of matched companies and their UUID.
    
      Raises:
        ValueError: No companies found.
      """
    
      if _IsEmpty(companies):
        raise ValueError('No companies found')
      # Clean up empty fields.
      companies = companies.fillna('')
      logging.info('Found: %d records.', len(competitor_companies))
      competitor_companies = competitor_companies.fillna('')
      # Create a column to define if we found a match or not.
      competitor_companies['match'] = False
      # Add Top Level Domain (tld) column to compare matching companies.
      companies.rename(columns={'website': 'tld'}, inplace=True)
      logging.info('Cleaning up company name.')
      companies.company_name = companies.company_name.apply(_NormalizeText)
      competitor_companies.company_name = competitor_companies.company_name.apply(
          _NormalizeText)
      # Create a new column since AppAnnie already contains TLD in company_url.
      competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
      logging.info('Matching TLD.')
      match_tld = competitor_companies.tld.isin(companies.tld)
      logging.info('Matching Company Name.')
      match_company_name = competitor_companies.company_name.isin(
          companies.company_name)
      # Updates match column if TLD or company_name or similar companies matches.
      competitor_companies['match'] = match_tld | match_company_name
      # Extracts UUID for TLD matches.
      logging.info('Extracting UUID')
      merge_tld = competitor_companies.merge(
          companies[['tld', 'uuid']], on='tld', how='left')
      # Extracts UUID for company name matches.
      merge_company_name = competitor_companies.merge(
          companies[['company_name', 'uuid']], on='company_name', how='left')
      # Combines dataframes.
      competitor_companies['uuid'] = merge_tld['uuid'].combine_first(
          merge_company_name['uuid'])
      match_companies = len(competitor_companies[competitor_companies['match']])
      total_companies = len(competitor_companies)
      logging.info('Results found: %d out of %d', match_companies, total_companies)
      competitor_companies.drop('match', axis=1, inplace=True)
      competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
      return competitor_companies
    
Here is how I read the files:

    def LoadDataSet(filename: str) -> pandas.DataFrame:
      """Reads CSV file where company information is stored.
    
      Header information exists in CSV file.
    
      Args:
        filename: Source CSV file. Header is present in file.
    
      Returns:
        A pandas dataframe with company information.
    
      Raises:
         FileError: Unable to read filename.
      """
      # Binary mode, because io.BytesIO below requires bytes rather than str.
      with open(filename, 'rb') as input_file:
        data = input_file.read()
        dataframe = pandas.read_csv(
            io.BytesIO(data), header=0, low_memory=False, memory_map=True)
        return dataframe.where((pandas.notnull(dataframe)), None)
    
I'm looking for suggestions on how to improve my code.

Output of the top command while the script is running:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                             
    190875 myuser   20   0 4000944   2.5g 107532 R 100.7   8.5   5:01.93 company_generat   
    

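Why not just use pd.merge?

You can create two dataframes, one for company_name matching and a second for website matching, then left-merge competitor_companies on each of them: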

    # Create 2 matching tables
    c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
    c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})
    
    # Merge on each of these tables
    result = competitor_companies\
    .merge(c_website, how='left', on='website')\
    .merge(c_name, how='left', on='company_name')
    
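Then you need to reconcile the two values, for example by giving priority to uuid_from_name over uuid_from_website:

    # Requires: import numpy as np
    result['uuid'] = np.where(
        result.uuid_from_name.notnull(),
        result.uuid_from_name,
        result.uuid_from_website)
    del result['uuid_from_name']
    del result['uuid_from_website']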
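To make the merge approach above concrete, here is a minimal sketch on toy data. The sample values and the `build_lookup` helper are hypothetical and not part of the original post. It also assumes both frames share a `website` column, as the snippet above does, whereas the question's competitor file calls it `company_url`. The sketch additionally drops empty and duplicate keys before merging, because a left merge against a lookup table with repeated keys multiplies rows, which may matter after `fillna('')`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration).
companies = pd.DataFrame({
    'uuid': ['u1', 'u2', 'u3'],
    'company_name': ['acme', 'globex', 'globex'],  # duplicate name on purpose
    'website': ['acme.com', 'globex.com', ''],     # empty website on purpose
})
competitor_companies = pd.DataFrame({
    'company_name': ['acme', 'initech', 'globex', 'hooli'],
    'website': ['other.com', 'acme.com', '', ''],
})


def build_lookup(frame, key, uuid_column):
    """Hypothetical helper: one row per non-empty key, first uuid wins."""
    lookup = frame[['uuid', key]].rename(columns={'uuid': uuid_column})
    lookup = lookup[lookup[key].notna() & (lookup[key] != '')]
    return lookup.drop_duplicates(subset=key, keep='first')


c_website = build_lookup(companies, 'website', 'uuid_from_website')
c_name = build_lookup(companies, 'company_name', 'uuid_from_name')

# Two left merges; with unique lookup keys the row count stays unchanged.
result = (
    competitor_companies
    .merge(c_website, how='left', on='website')
    .merge(c_name, how='left', on='company_name')
)

# Prefer the name match and fall back to the website match.
result['uuid'] = np.where(
    result['uuid_from_name'].notnull(),
    result['uuid_from_name'],
    result['uuid_from_website'])
result = result.drop(columns=['uuid_from_name', 'uuid_from_website'])
print(result)
```

Keeping the first uuid for a duplicated key is an arbitrary choice made for this sketch; real data may need a different tie-breaking rule.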
    

It should be much faster than using pd.Series.isin.
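If the intermediate merged frames are themselves a memory concern, one alternative (not the approach proposed above, just a sketch) is to look the uuid up with `Series.map` against a uniquely indexed Series. This writes the result straight into the competitor frame. The toy data and the `unique_lookup` helper are hypothetical, and the shared `website` column name is again an assumption.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration).
companies = pd.DataFrame({
    'uuid': ['u1', 'u2', 'u3'],
    'company_name': ['acme', 'globex', 'globex'],
    'website': ['acme.com', 'globex.com', ''],
})
competitor_companies = pd.DataFrame({
    'company_name': ['acme', 'initech', 'globex', 'hooli'],
    'website': ['other.com', 'acme.com', '', ''],
})


def unique_lookup(frame, key):
    """Hypothetical helper: Series mapping each non-empty key to its first uuid."""
    valid = frame[frame[key].notna() & (frame[key] != '')]
    valid = valid.drop_duplicates(subset=key, keep='first')
    # Series.map needs a uniquely valued index, hence the drop_duplicates above.
    return valid.set_index(key)['uuid']


uuid_by_name = unique_lookup(companies, 'company_name')
uuid_by_website = unique_lookup(companies, 'website')

# Keys without a match map to NaN.
uuid_from_name = competitor_companies['company_name'].map(uuid_by_name)
uuid_from_website = competitor_companies['website'].map(uuid_by_website)

# Prefer the name match and fall back to the website match.
competitor_companies['uuid'] = uuid_from_name.combine_first(uuid_from_website)
competitor_companies['match'] = competitor_companies['uuid'].notna()
print(competitor_companies)
```

With this variant the separate `isin` calls become unnecessary, since the `match` flag can be derived from whether a uuid was found.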

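Have you tried profiling the application to see whether there is a memory leak?

Not yet, but it seems that the pandas isin function is slow and takes up most of the memory.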
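As a first profiling step, it may help to check how much memory each dataframe actually holds before and after each stage. The sketch below uses `DataFrame.memory_usage(deep=True)`; the `memory_report` helper and the sample frame are hypothetical.

```python
import pandas as pd


def memory_report(frame):
    """Hypothetical helper: per-column memory usage in MiB, largest first."""
    # deep=True also counts the contents of string (object) columns,
    # not just the pointers to them.
    usage = frame.memory_usage(deep=True)
    return (usage / 2**20).sort_values(ascending=False)


# Made-up sample frame standing in for a loaded CSV file.
frame = pd.DataFrame({
    'company_name': ['acme', 'globex'] * 1000,
    'employees': range(2000),
})

report = memory_report(frame)
print(report)
print('total MiB: %.3f' % report.sum())
```

Calling such a helper after loading, after `fillna('')`, and after each merge would show which step makes the footprint grow.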