Python IndexError: list index out of range with regex


python regex web-scraping


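I am trying to scrape data from this link and I get the error below. I don't understand what is wrong, because I have run this code before and it worked.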

import re
import requests
import csv
import json


with open("selog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "Type", "Prix", "Code_postal", "Ville", "Departement", "Nombre_pieces", "Nbr_chambres", "Type_cuisine", "Surface"])

    for i in range(1, 500):
        url = str('https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=' + str(i))
        r = requests.get(url, headers = {'User-Agent' : 'Mozilla/5.0'})
        p = re.compile('var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)
        x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')
        x = re.sub(r'\s{2,}|\\r\\n', '', x)
        data = json.loads(x)
        f = csv.writer(open("Seloger.csv", "wb+"))

        for product in data['products']:
            ID = product['idannonce']
            prix = product['prix']
            surface = product['surface']
            code_postal = product['codepostal']
            nombre_pieces = product['nb_pieces']
            nbr_chambres = product['nb_chambres']
            Type = product['typedebien']
            type_cuisine = product['idtypecuisine']
            ville = product['ville']
            departement = product['departement']
            etage = product['etage']
            writer.writerow([ID, Type, prix, code_postal, ville, departement, nombre_pieces, nbr_chambres, type_cuisine, surface])

This results in the error:

Traceback (most recent call last):
  File "Seloger.py", line 20, in <module>
    x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')
IndexError: list index out of range


This line is wrong:

x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

What do you need to find in the text?

To run the clean-up on the whole response text instead, change the line above to:

x = r.text.strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

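You get the error when looking for what you need because sometimes there is no match, and you are then trying to access an item that does not exist in an empty list. The same error can be reproduced with

print(re.findall("s", "d")[0])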
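A minimal sketch of that reproduction: `re.findall` returns an empty list when nothing matches, so any index into it fails. The helper name `first_or_none` is hypothetical and exists only for this illustration.

```python
import re

def first_or_none(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    matches = re.findall(pattern, text)
    # An empty list is falsy, so check it before indexing with [0].
    return matches[0] if matches else None

print(re.findall("s", "d"))       # [] because "s" does not occur in "d"
print(first_or_none("s", "d"))    # None instead of an IndexError
print(first_or_none("s", "sad"))  # s
```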

To fix this, replace the line

x = p.findall(r.text)[0].strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

with

x = ''
xm = p.search(r.text)
if xm:
    x = xm.group(1).strip().replace('\r\n    ','').replace('\xa0',' ').replace('\\','\\\\')

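Notes

When you use `p.findall(r.text)[0]`, you want the first match in the input, so `re.search` is the better fit here because it returns only the first match.
To get the substring captured by the first capturing group, use `matchObject.group(1)`.
The `if xm:` check is important: if there is no match, `x` stays an empty string; otherwise it is assigned the modified value of group 1.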
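The notes above could be wrapped in a small helper, sketched here under a few assumptions: the function name `extract_ava_data` and the sample page text are made up for illustration, the pattern is the one from the question written as a raw string, and the question's extra clean-up calls (`replace`, `re.sub`) are left out, so this only works when the captured text is already valid JSON.

```python
import json
import re

# Pattern from the question, compiled once instead of on every loop iteration.
AVA_DATA = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)

def extract_ava_data(html):
    """Return the parsed ava_data object, or None if the page has no match."""
    match = AVA_DATA.search(html)   # search() returns a match object or None
    if match is None:
        return None
    raw = match.group(1).strip()    # group(1) is the first capturing group
    return json.loads(raw)

# Made-up page fragment that mimics the structure the pattern expects.
sample = 'var ava_data = {"products": [{"idannonce": "1", "prix": "1000"}]};\r\n    ava_data.logged = logged;'
print(extract_ava_data(sample))
print(extract_ava_data("<html>error page</html>"))  # None
```

Returning `None` rather than an empty string lets the calling loop skip the page with a plain `if data is None: continue` before it reaches `data['products']`.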

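"list index out of range" means that the index `[0]` is the problem, so the list returned by `p.findall(r.text)` is empty. First check what `print(p.findall(r.text))` shows.

Then you can check `r.text`. You can save it to a file and open it in a web browser; it may contain useful information or a warning about bots/scripts or a captcha. When I run the code, I sometimes get a page with a French error message (ending "Merci de ressayer ultérieurement.") that means "Oops, a technical error occurred. Please try again later." In that case `findall()` returns an empty list, so there is no item at index `[0]`, and the code fails with "list index out of range".

In short, the problem is that the page sometimes shows this error message instead of the listings, and then `findall()` cannot find the expected text.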
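One way to act on that diagnosis is to retry the request a few times and, if the data never appears, save the last response for inspection. The sketch below is only an illustration: `find_data`, its parameters, and the fake pages are hypothetical, and the download is passed in as a `fetch` callable so the example runs without network access.

```python
import re
import time

PATTERN = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)

def find_data(fetch, url, attempts=3, delay=0.0, dump_path=None):
    """Call fetch(url) up to `attempts` times; return group 1 of the first match, else None."""
    text = ""
    for attempt in range(attempts):
        text = fetch(url)
        match = PATTERN.search(text)
        if match:
            return match.group(1).strip()
        if attempt < attempts - 1:
            time.sleep(delay)  # the error page may be temporary, so wait and retry
    if dump_path is not None:
        # Save the last response so it can be opened in a browser.
        with open(dump_path, "w", encoding="utf-8") as fh:
            fh.write(text)
    return None

# Fake downloader: the first call returns an error page, the second the real data.
pages = iter(["<html>error</html>",
              'var ava_data = {"products": []};\r\n  ava_data.logged = logged;'])
print(find_data(lambda url: next(pages), "https://example.invalid/page"))
```

With `requests`, `fetch` could be `lambda u: requests.get(u, headers={'User-Agent': 'Mozilla/5.0'}).text`, matching the call in the question, with a non-zero `delay` between attempts.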