Python: Parsing a list of URLs with regex patterns


python regex python-3.x


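I have a large text file of URLs (>1 million URLs). The URLs represent product pages across several different domains.

I'm trying to parse the SKU and product name out of each URL, for example:

www.amazon.com/totes-Mens-Mike-Duck-Boot/dp/B01HQR3ODE/
- totes-Mens-Mike-Duck-Boot
- B01HQR3ODE
www.bestbuy.com/site/apple-airpods-white/5577872.p?skuId=5577872
- apple-airpods-white
- 5577872

I have already worked out the individual regex patterns for parsing the two components of the URL (product name and SKU) for every domain in my list. That comes to nearly 100 different patterns.

While I know how to test these one URL/pattern at a time, I'm having trouble working out how to architect a script that reads in my entire list and then parses each line with the relevant regex pattern. Any suggestions on how best to tackle this?

If my input is one column (URL), my desired output is four columns (URL, domain, product name, SKU).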

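While it is possible to roll all of this into one massive regex, that is probably not the easiest approach. Instead, I would use a two-pass strategy. Build a dict that maps each domain name to the regex pattern that applies to that domain. In the first pass, detect the domain of the line using a single regex that works for all URLs. Then use the discovered domain to look up the appropriate regex in your dict and extract the fields for that domain.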
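The two-pass strategy described above might be sketched roughly as follows. This is only an illustration: the names `DOMAIN_RE`, `PATTERNS`, and `parse_url` are hypothetical, and the two per-domain patterns are guesses written against the example URLs in the question, not verified against the real URL schemes of those sites.

```python
import re

# Pass 1: one generic pattern that pulls the host out of any URL,
# with or without a scheme or a leading "www.".
DOMAIN_RE = re.compile(r'^(?:https?://)?(?:www\.)?([^/:?#]+)', re.IGNORECASE)

# Pass 2: per-domain patterns (illustrative guesses, see the note above).
# Every pattern exposes the same named groups so callers can treat them uniformly.
PATTERNS = {
    'amazon.com': re.compile(r'/(?P<name>[^/]+)/dp/(?P<sku>[A-Z0-9]+)'),
    'bestbuy.com': re.compile(r'/site/(?P<name>[^/]+)/(?P<sku>\d+)\.p'),
}

def parse_url(url):
    """Return (url, domain, name, sku); fields that cannot be found are None."""
    url = url.strip()
    domain_match = DOMAIN_RE.match(url)
    if domain_match is None:
        return (url, None, None, None)
    domain = domain_match.group(1).lower()
    pattern = PATTERNS.get(domain)  # None when the domain has no registered pattern
    match = pattern.search(url) if pattern else None
    if match is None:
        return (url, domain, None, None)
    return (url, domain, match.group('name'), match.group('sku'))
```

With these assumed patterns, `parse_url('www.amazon.com/totes-Mens-Mike-Duck-Boot/dp/B01HQR3ODE/')` yields the domain `amazon.com`, the name `totes-Mens-Mike-Duck-Boot`, and the SKU `B01HQR3ODE`. Compiling each pattern once up front, rather than on every line, is likely to matter with more than a million URLs.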


Since extracting the domain name from a URL is fairly easy, you can map each domain name to the pattern for that domain.

Like this:

domain_patterns = {  # named domain_patterns so the built-in dict is not shadowed
    'domain1.com': r'regex_pattern_for_domain1',
    'domain2.com': r'regex_pattern_for_domain2',
}

Now read your file line by line and apply a general regex to extract the domain name, which you then use to look up the domain-specific regex:

def extract_data(url, regex_pattern):
    # code to extract the product name and SKU using regex_pattern
    return ['product_name', 'sku']

def extract_domain(url):
    # apply a general regex pattern to extract the domain from the URL
    return 'domain name'

parsed_data = []
with open('urls.txt') as f:
    for line in f:  # iterate over every line instead of reading only the first one
        url = line.strip()
        if not url:
            continue  # skip blank lines
        domain = extract_domain(url)  # extract the domain from the URL
        domain_regex = domain_patterns.get(domain)  # look up the regex for this domain
        if domain_regex is None:
            continue  # no pattern registered for this domain
        data = extract_data(url, domain_regex)  # extract product name and SKU
        data.append(domain)
        data.append(url)
        parsed_data.append(data)  # or write to another file if the results are too big to fit in memory
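To make the skeleton above concrete, here is one possible way to fill in the two helper functions and stream the results straight to a CSV file with the four columns the question asks for. Treat it as a sketch under assumptions: the two entries in `domain_patterns` and the general `domain_re` pattern are hypothetical stand-ins based on the question's example URLs, and `parse_urls` and the output file name are invented for illustration.

```python
import csv
import re

# Hypothetical stand-ins for the ~100 real patterns. Each uses the same named
# groups ("name", "sku") so a single extraction function can serve every domain.
domain_patterns = {
    'amazon.com': re.compile(r'/(?P<name>[^/]+)/dp/(?P<sku>[A-Z0-9]+)'),
    'bestbuy.com': re.compile(r'/site/(?P<name>[^/]+)/(?P<sku>\d+)\.p'),
}

# General pattern (an assumption): optional scheme, optional "www.", then the host.
domain_re = re.compile(r'^(?:https?://)?(?:www\.)?([^/:?#]+)', re.IGNORECASE)

def extract_domain(url):
    """Return the lowercased host without "www.", or '' if none is found."""
    match = domain_re.match(url)
    return match.group(1).lower() if match else ''

def extract_data(url, pattern):
    """Return [product_name, sku], using empty strings when nothing matches."""
    match = pattern.search(url) if pattern else None
    return [match.group('name'), match.group('sku')] if match else ['', '']

def parse_urls(lines, out):
    """Stream URLs from `lines` and write url, domain, name, sku rows to `out`."""
    writer = csv.writer(out)
    writer.writerow(['url', 'domain', 'product_name', 'sku'])
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        domain = extract_domain(url)
        name, sku = extract_data(url, domain_patterns.get(domain))
        writer.writerow([url, domain, name, sku])

# Usage (the output file name is an assumption):
# with open('urls.txt') as src, open('parsed.csv', 'w', newline='') as dst:
#     parse_urls(src, dst)
```

Because each row is written as soon as it is parsed, memory use stays flat regardless of how many URLs the input holds. URLs whose domain has no registered pattern still produce a row with empty name and SKU fields, which makes gaps in the pattern table easy to spot afterwards.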


Show your current code/regexps/etc.

To clarify, I'm not asking anyone to write the code for me; I'm looking for guidance on how best to approach this. All I have so far is a handful of queries that I've tested using various regex patterns together with the match and sub functions from the re library.