String.IndexOf() returns unexpected value - cannot extract substring between two search strings
I'm writing a script to manipulate some proper names in a web story, to help my reading tool pronounce them correctly. I get the content of the web page via
$webpage = (Invoke-WebRequest -URI 'https://wanderinginn.com/2018/03/20/4-20-e/').Content
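This $webpage should be of type String.
Now, calling IndexOf() on it returns an unexpected value, and I need an explanation of why, or of how I can find the error myself.
In theory, the script should cut out the "body" of the page, run it through a list of proper nouns I want to replace, and push the result into an .htm file.
All of that works, but the value of IndexOf("Prev…") does not.
Edit:
After calling Invoke-WebRequest, I can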
Set-Clipboard $webrequest
and then paste it into Notepad++, where I can find both 'div class="entry-content"' and 'Previous Chapter'.
If I do something like
Set-Clipboard $webpage.substring(
$webpage.IndexOf('<div class="entry-content">'),
$webpage.IndexOf('PreviousChapter')
)
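I would expect PowerShell to correctly determine the first instance of each of those two strings and cut between them. My clipboard should then hold the content I want, but the string extends well past the first instance of the second search string.

tl;dr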
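- You have a misconception about how String.Substring() works: the second argument must be the length of the substring to extract, not the end index (character position) - see below.
- As an alternative, you can use a more concise (albeit more complex) regex operation with -replace to extract the substring of interest in a single operation - see below.
- Overall, it is better to use an HTML parser to extract the desired information, because string processing is brittle (HTML allows variations in whitespace, quoting style, and so on).
As noted, you have a misconception about how String.Substring() works. Its arguments are:
- a start index (a 0-based character position),
- a length: the number of characters to return, starting at that index.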
# Sample input from which to extract the substring
# '>>this up to here'
# or, better,
# 'this up to here'.
$webpage = 'Return from >>this up to here<<'
# WRONG (your attempt):
# *index* of 2nd substring is mistakenly used as the *length* of the
# substring to extract, which in this case even *breaks*, because a length
# that exceeds the bounds of the string is specified.
$webpage.Substring(
$webpage.IndexOf('>>'),
$webpage.IndexOf('<<')
)
# OK, extracts '>>this up to here'
# The difference between the two indices is the correct length
# of the substring to extract.
$webpage.Substring(
($firstIndex = $webpage.IndexOf('>>')),
$webpage.IndexOf('<<') - $firstIndex
)
# BETTER, extracts 'this up to here'
$startDelimiter = '>>'
$endDelimiter = '<<'
$webpage.Substring(
($firstIndex = $webpage.IndexOf($startDelimiter) + $startDelimiter.Length),
$webpage.IndexOf($endDelimiter) - $firstIndex
)
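The index arithmetic above can be wrapped in a small helper. The following is only a sketch: the function name Get-TextBetween and its parameters are hypothetical and not part of the original answer. It assumes you want the first occurrence of the start delimiter and the first end delimiter that follows it, and it returns $null when either one is missing, because IndexOf() returns -1 in that case.

```powershell
# Hypothetical helper (not from the original answer): returns the text between
# the first $Start delimiter and the first $End delimiter that follows it.
function Get-TextBetween {
    param(
        [Parameter(Mandatory)] [string] $Text,
        [Parameter(Mandatory)] [string] $Start,
        [Parameter(Mandatory)] [string] $End
    )

    $startIndex = $Text.IndexOf($Start, [StringComparison]::Ordinal)
    if ($startIndex -lt 0) { return $null }   # IndexOf() returns -1 when not found

    # Skip past the start delimiter itself.
    $contentStart = $startIndex + $Start.Length

    # Look for the end delimiter only *after* the start delimiter.
    $endIndex = $Text.IndexOf($End, $contentStart, [StringComparison]::Ordinal)
    if ($endIndex -lt 0) { return $null }

    # The second argument is a *length*, not an end index.
    $Text.Substring($contentStart, $endIndex - $contentStart)
}

# Outputs 'this up to here'
Get-TextBetween -Text 'Return from >>this up to here<<' -Start '>>' -End '<<'
```

Starting the second search at $contentStart also avoids a negative length when the end delimiter happens to appear earlier in the text than the start delimiter.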
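That said, you can use a single regex operation with -replace to extract the substring of interest. Notes on the regex:
- The inline option s (specified as (?s) at the start of the regex) ensures that the metacharacter . also matches newlines (so that .* matches across lines), which it does not do by default.
- Note that if your search strings happen to contain regex metacharacters (characters with special meaning in the context of a regex), you may have to escape the search strings before embedding them in the regex:
- For embedded literal strings, \-escape characters as needed; e.g., escape .txt as \.txt
- If the string to embed comes from a variable, apply [regex]::Escape() to its value first; e.g.: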
$var = '.txt'
# [regex]::Escape() yields '\.txt', which ensures
# that '.txt' doesn't also match '_txt'
'a_txt a.txt' -replace ('a' + [regex]::Escape($var)), 'a.csv'
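(Invoke-WebRequest -URI 'https://wanderinginn.com/2018/03/20/4-20-e/').Content.IndexOf('Previous Chapter')
This gives me 87859. What is wrong with that? Does it count line numbers as character numbers?
IndexOf() only returns the integer index of the requested string. You need to use that information to cut out the content you want. The .SubString() method takes either StartIndex alone or StartIndex, Length. You are giving it two start index numbers; you need to make the second number the difference between the two index values.
Oh my, I'm an idiot. Thank you very much. I think I made it harder on myself because the Notepad++ find always showed a different character count than PowerShell.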
'abc'.Substring(4) # ERROR "startIndex cannot be larger than length of string"
'abc'.Substring(1, 3) # ERROR "Index and length must refer to a location within the string"
$webpage = 'Return from >>this up to here<<'
# Outputs 'this up to here'
$webpage -replace '^.*?>>(.*?)<<.*', '$1'
$webpage -replace '(?s).*?<div class="entry-content">(.*?)Previous Chapter.*', '$1'
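One caveat with the -replace approach is that -replace returns the input string unchanged when the regex does not match, so a missing delimiter goes unnoticed. A possible variation, sketched here under the assumption that the delimiters are held in variables (the variable names are illustrative only), is to use [regex]::Match() with escaped delimiters and check whether the match succeeded:

```powershell
$webpage        = 'Return from >>this up to here<<'
$startDelimiter = '>>'
$endDelimiter   = '<<'

# Build the pattern from escaped delimiters; (?s) lets . match newlines too.
$pattern = '(?s)' + [regex]::Escape($startDelimiter) + '(.*?)' + [regex]::Escape($endDelimiter)

$match = [regex]::Match($webpage, $pattern)
if ($match.Success) {
    # Capture group 1 holds the text between the delimiters.
    # Outputs 'this up to here'
    $match.Groups[1].Value
} else {
    Write-Warning 'Delimiters not found.'
}
```

For the page in the question, the same pattern should apply with '<div class="entry-content">' and 'Previous Chapter' as the two delimiters, provided the markup matches those strings exactly.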