Python link-fetching script: need a wildcard search

python python-3.x


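I have the code below. You give it the URL of a page containing a bunch of links, and it returns the list of links. This works well, except that I only want the links that start with ... As it stands, it returns every link, including home/back/etc. Is there a way to use a wildcard or a "starts with" function?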

from bs4 import BeautifulSoup
import requests

url = ""

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the attribute href in the <a> tags.
for tags in tags:
    print(tags.get('href'))

Also, is there a way to export to Excel? I'm not very good at Python and, honestly, I don't know how I got this far.

Thanks,

If tag.get returns a string, you should be able to filter on whatever starting string you need, like this:

URLs = [URL for URL in [tag.get('href') for tag in tags] 
        if URL.startswith('/some/path/')]

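Edit:

In your case, tag.get does not always return a string. For tags that contain no link, the return type is NoneType, and we can't use string methods on NoneType. It is easy to check whether the return value of tag.get is None before calling the string method startswith: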

URLs = [URL for URL in [tag.get('href') for tag in tags] 
        if URL is not None and URL.startswith('/some/path/')]

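Note the added "URL is not None and". It must come before URL.startswith; otherwise Python will try to call a string method on None and complain. You can read the condition like an English sentence, which highlights one of Python's great strengths: the code is easier to read than in most other programming languages, which makes it well suited to communicating ideas to other people.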
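The ordering point can be checked without fetching a page. The following is a minimal sketch, assuming a hand-written list named hrefs that stands in for the values tag.get('href') would return; both that list and the helper name filter_by_prefix are illustrative and not part of the original code.

```python
def filter_by_prefix(hrefs, prefix):
    """Keep only the href values that are strings starting with `prefix`."""
    # `href is not None` is evaluated first; `and` short-circuits, so
    # `startswith` is never called on None.
    return [href for href in hrefs
            if href is not None and href.startswith(prefix)]


# Hypothetical values, standing in for [tag.get('href') for tag in tags].
hrefs = ['/some/path/a', None, '/home', '/some/path/b', '#top']

urls = filter_by_prefix(hrefs, '/some/path/')
for url in urls:
    print(url)
```

Swapping the two halves of the condition would raise AttributeError on the None entry, which is the error reported in the comments.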
You can use startswith().

Here is an updated version of your code that gets all of the https hrefs from the page:
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the attribute href in the <a> tags.
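for tag in tags:
    href = tag.get('href')
    # tag.get returns None for <a> tags without an href, so check before calling startswith.
    if href is not None and href.startswith('https'):
        print(href)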
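If a prefix test is not enough and a genuine wildcard is wanted, the standard-library fnmatch module may be an option. This is only a sketch under assumptions: the list hrefs is made up to stand in for the values returned by tag.get('href'), and the helper name filter_by_pattern and the example.com URLs are illustrative.

```python
from fnmatch import fnmatchcase


def filter_by_pattern(hrefs, pattern):
    """Keep the href values that match a shell-style wildcard pattern."""
    # Skip None first, because anchors without an href yield None.
    return [href for href in hrefs
            if href is not None and fnmatchcase(href, pattern)]


# Hypothetical values, standing in for [tag.get('href') for tag in tags].
hrefs = [
    'https://example.com/docs/a.html',
    None,
    '/home',
    'https://example.com/img/b.png',
    'http://example.com/docs/c.html',
]

matches = filter_by_pattern(hrefs, 'https://*/docs/*')
for match in matches:
    print(match)
```

In fnmatch patterns, * matches any run of characters (including /), while ? and [...] are also special, so a pattern containing a literal ? from a query string would need its ? written as [?].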

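As for your second question, whether there is a way to export to Excel: I have been using the Python module XlsxWriter, which lets the code follow basic Excel conventions. I am new to Python myself, and I had it installed, running, and working easily on the first try.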
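If installing an extra module is not convenient, one alternative (a different technique from XlsxWriter, named here plainly) is to write a CSV file with the standard-library csv module, since Excel can open CSV files. The sketch below assumes a list named urls that has already been filtered; the helper name write_urls_csv, the column header, and the example.com URLs are all illustrative.

```python
import csv
import io


def write_urls_csv(urls, fileobj):
    """Write one URL per row, under a single header row, to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(['url'])
    for url in urls:
        writer.writerow([url])


# Hypothetical, already-filtered list of links.
urls = ['https://example.com/a', 'https://example.com/b']

# An in-memory buffer is used for demonstration; to create a real file,
# pass open('links.csv', 'w', newline='') instead.
buffer = io.StringIO()
write_urls_csv(urls, buffer)
print(buffer.getvalue())
```

The result is plain text rather than a native .xlsx workbook, so formatting and formulas are not available this way.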
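First, I think "for tags in tags" is going to cause you problems. Try "for tag in tags". Does tag.get return a string? If so, you should be able to use str.startswith to do exactly what you want. I'll post an answer below shortly.

This is going to sound bad, and I'm sorry, but how do I know whether it returns a string? I'm still not very good with Python. Thanks.

Don't be sorry. You can insert a line such as print(type(tag.get('href'))), and if it prints <class 'str'>, you have a string.

Hi, thanks, that was simple. It returns a string.

Hi, I tried yours, and when I changed only the url to google I got this error: TypeError: descriptor 'startswith' requires a 'str' object but received a 'NoneType'

Hello, I tried running yours but got the following error: AttributeError: 'NoneType' object has no attribute 'startswith'. That was with pre replaced by 'http'; ideally I'd like to replace it with the complete one, but that doesn't work at all. Sorry, and thank you.

@DefcaTrick According to the error message, tag.get('href') does not seem to be a string.

Hello, sorry, do I replace the "for tag in tags: print(tag.get('href'))" with the above? Or do I put it where url = '' is? Thank you.

@DefcaTrick You can put it right after the tags are created. It should build the list of URLs but not print them. For that you need a line like "for URL in URLs: print(URL)". I don't know what you are hoping to do by passing an empty string to requests.get.

Hello, I'm sure it's because I'm not getting this, but I replaced "for tag in tags: print(tag.get('href'))" with the above and got the error: AttributeError: 'NoneType' object has no attribute 'startswith'

@DefcaTrick That means the tag is not a link and has no href field. tag.get returns None in that case, not an empty string. I have updated my answer to filter out None.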
# XlsxWriter example for the Excel question: write a small table to a workbook.
import xlsxwriter

# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses01.xlsx')
worksheet = workbook.add_worksheet()

# Some data we want to write to the worksheet.
expenses = (
    ['Rent', 1000],
    ['Gas',   100],
    ['Food',  300],
    ['Gym',    50],
)

# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0

# Iterate over the data and write it out row by row.
for item, cost in (expenses):
    worksheet.write(row, col,     item)
    worksheet.write(row, col + 1, cost)
    row += 1

# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')

workbook.close()