Python: How can I extract only the network indicators from the strings output of a sample?

python regex http batch-file ip

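I have a sample of potentially malicious software, and I want to display all of its network indicators, such as the website names and IP addresses it connects to.

Running strings on the sample gives the following output: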

    $ strings 6787c54e6a2c5cffd1576dcdc8c4f42c954802b7
    %PDF-1.5
    1 0 obj
    <</Type/Page/Parent 80 0 R/Contents 36 0 R/MediaBox[0 0 612 792]/Annots[2 0 R 4 0 R 6 0 R 8 0 R 10 0 R 12 0 R 14 0 R 16 0 R 18 0 R]/Group 20 0 R/StructParents 1/Tabs/S/Resources<</Font<</F1 21 0 R/F2 23 0 R/F3 26 0 R/F4 29 0 R/F5 31 0 R>>/XObject<</Image6 33 0 R/Image9 34 0 R>>>>>>
    endobj
    2 0 obj
    <</Type/Annot/Subtype/Link/Rect[139.10001 398.20001 449.84 726.20001]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8880)/P 1 0 R/StructParent 0/A 3 0 R>>
    endobj
    3 0 obj
    <</S/URI/URI(http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info@narainsfashionfabrics.com)>>
    endobj
    4 0 obj
    <</Type/Annot/Subtype/Link/Rect[232.39999 618.03003 370.14999 629.53003]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8881)/P 1 0 R/StructParent 2/A 5 0 R>>
    endobj
    5 0 obj
    <</S/URI/URI(>>
    endobj
    6 0 obj
    <</Type/Annot/Subtype/Link/Rect[278.87 583.20001 324.88 594.13]/Border[0 0 0]/F 4/NM(PDFE-48D407B4789BA8882)/P 1 0 R/StructParent 3/A 7 0 R>>
    endobj
    7 0 obj
    <</S/URI/URI()>>
    endobj
    8 0 obj
    <</Type/Annot/Subtype/Link/Rect[185.75999 377.28 398.16 733.67999]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4183FB09C5EC13)/P 1 0 R/A 9 0 R/H/N>>
    endobj
    9 0 obj
    <</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa@hotmail.com)>>
    endobj
    10 0 obj
    <</Type/Annot/Subtype/Link/Rect[185.75999 373.67999 398.88 734.40002]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4183FB09C5EC14)/P 1 0 R/A 11 0 R/H/N>>
    endobj
    11 0 obj
    <</S/URI/URI(http://sajiye.net/file/website/file/main/index.php?userid=kitja@siamdee2558.com)>>
    endobj
    12 0 obj
    <</Type/Annot/Subtype/Link/Rect[132.48 0 474.48001 772.56]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D460B5879C4D8C5)/P 1 0 R/A 13 0 R/H/N>>
    endobj
    13 0 obj
    <</S/URI/URI(http://nurking.pl/wp-admin/user/email.163.htm?login=)>>
    endobj
    14 0 obj
    <</Type/Annot/Subtype/Link/Rect[0 0 612 792]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D465334C760A446)/P 1 0 R/A 15 0 R/H/N>>
    endobj
    15 0 obj
    <</S/URI/URI(https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1)>>
    endobj
    16 0 obj
    <</Type/Annot/Subtype/Link/Rect[.72 0 612 789.84003]/Border[0 0 0]/C[0 0 0]/F 4/NM(PDFE-48D4C7F946F3F02B7)/P 1 0 R/A 17 0 R/H/N>>
    endobj
    17 0 obj
    <</S/URI/URI(https://www.dropbox.com/s/28aaqjdradyy4io/Swift-Copy_pdf.uue?dl=1)>>
    endobj
    18 0 obj
    <</Type/Annot/Subtype/Link/Rect[0 5.76 612 792]/Border[0 0 0]/C[0 0 0]/F 4/P 1 0 R/A 19 0 R/H/N>>
    endobj
    19 0 obj
    <</S/URI/URI(https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
    endobj
    20 0 obj
    <</S/Transparency/CS/DeviceRGB>>
    endobj
    21 0 obj
    <</Type/Font/Subtype/TrueType/BaseFont/TimesNewRoman/FirstChar 32/LastChar 252/Encoding/WinAnsiEncoding/FontDescriptor 22 0 R/Widths[250 333 408 500 500 833 777 180 333 333 500 563 250 333 250 277 500 500 500 500 500 500 500 500 500 500 277 277 563 563 563 443 920 722 666 666 722 610 556 722 722 333 389 722 610 889 722 722 556 722 666 556 610 722 722 943 722 722 610 333 277 333 469 500 333 443 500 443 500 443 333 500 500 277 277 500 277 777 500 500 500 500 333 389 277 500 500 722 500 500 443 479 200 479 541 350 500 350 333 500 443 1000 500 500 333 1000 556 333 889 350 610 350 350 333 333 443 443 350 500 1000 333 979 389 333 722 350 443 722 250 333 500 500 500 500 200 500 333 759 275 500 563 333 759 500 399 548 299 299 333 576 453 333 333 299 310 500 750 750 750 443 722 722 722 722 722 722 889 666 610 610 610 610 333 333 333 333 722 722 722 722 722 722 722 563 722 722 722 722 722 722 556 500 443 443 443 443 443 443 666 443 443 443 443 443 277 277 277 277 500 500 500 500 500 500 500 548 500 500 500 500 500]>>
    endobj
    22 0 obj
    <</Type/FontDescriptor/FontName/TimesNewRoman/Flags 32/FontBBox[-568 -215 2045 891]/FontFamily(Times New Roman)/FontWeight 400/Ascent 891/CapHeight 693/Descent -215/MissingWidth 777/StemV 0/ItalicAngle 0/XHeight 485>>
    endobj
    23 0 obj
    <</Type/Font/Subtype/TrueType/BaseFont/ABCDEE+Calibri,BoldItalic/FirstChar 32/LastChar 117/Name/F2/Encoding/WinAnsiEncoding/FontDescriptor 24 0 R/Widths[226 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 630 0 459 0 0 0 0 0 0 0 0 668 532 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 528 0 412 0 491 316 0 0 246 0 0 246 804 527 527 0 0 0 0 347 527]>>
    endobj
    24 0 obj
    <</Type/FontDescriptor/FontName/ABCDEE+Calibri,BoldItalic/FontWeight 700/Flags 32/FontBBox[-691 -250 1265 750]/Ascent 750/CapHeight 750/Descent -250/StemV 53/ItalicAngle -11/AvgWidth 536/MaxWidth 1956/XHeight 250/FontFile2 25 0 R>>
    endobj
<</Type/Pages/Count 1/Kids[1 0 R]>>
endobj
81 0 obj
<</Type/Catalog/Pages 80 0 R/Lang(en-US)/MarkInfo<</Marked true>>/Metadata 83 0 R/StructTreeRoot 37 0 R>>
endobj
82 0 obj
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
endobj
83 0 obj
<</Type/Metadata/Subtype/XML/Length 1031>>stream
<?xpacket begin="
" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="DynaPDF 4.0.11.30, http://www.dynaforms.com">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>
<xmp:CreateDate>2016-08-25T07:52:02+01:00</xmp:CreateDate>
<xmp:CreatorTool>RAD PDF</xmp:CreatorTool>
<xmp:MetadataDate>2017-07-11T01:25:32-08:00</xmp:MetadataDate>
<xmp:ModifyDate>2017-07-11T01:25:32-08:00</xmp:ModifyDate>
<dc:creator><rdf:Seq><rdf:li xml:lang="x-default">alesk</rdf:li></rdf:Seq></dc:creator>
<xmpMM:DocumentID>uuid:a184332f-8592-38c8-908c-45914e523218</xmpMM:DocumentID>
<xmpMM:VersionID>1</xmpMM:VersionID>
<xmpMM:RenditionClass>default</xmpMM:RenditionClass>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
84 0 obj
<</Type/XRef/Size 85/Root 81 0 R/Info 82 0 R/ID[<299C21286E590F03363518EFD9FBBF99><299C21286E590F03363518EFD9FBBF99>]/W[1 3 0]/Filter/FlateDecode/Length 239>>stream
cx?{
endstream
endobj
startxref
204273
%%EOF

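Because findstr offers only basic regular expression support, I recommend using PowerShell.

If necessary, wrap the command in a batch file.

Note that this rough RegEx does not strip the trailing characters from the http lines: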

> gc .\sample.txt |sls '^.*?(https?:\/\/.*)$'|%{$_.Matches.Groups[1].Value}
http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info@narainsfashionfabrics.com)>>
http://sajiye.net/file/website/file/main/index.php?userid=alwaha_alghannaa@hotmail.com)>>
http://sajiye.net/file/website/file/main/index.php?userid=kitja@siamdee2558.com)>>
http://nurking.pl/wp-admin/user/email.163.htm?login=)>>
https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1)>>
https://www.dropbox.com/s/28aaqjdradyy4io/Swift-Copy_pdf.uue?dl=1)>>
https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1)>>
http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
http://www.dynaforms.com">
http://www.w3.org/1999/02/22-rdf-syntax-ns#">
http://ns.adobe.com/pdf/1.3/"
http://purl.org/dc/elements/1.1/"
http://ns.adobe.com/xap/1.0/"
http://ns.adobe.com/xap/1.0/mm/">
http://www.radpdf.com</pdf:Producer>
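
Since the question is tagged Python, the trailing characters can also be trimmed there. The following is only a minimal sketch, not part of the answer above: the helper name extract_urls and the set of delimiter characters are assumptions. It ends each URL at whitespace, a parenthesis, an angle bracket, or a quote, which drops residue such as `)>>` and `">`.

```python
import re

# Assumed delimiter set: a URL ends at whitespace, (, ), <, >, or a quote.
URL_RE = re.compile(r'https?://[^\s()<>"\']+')

def extract_urls(text):
    """Return the URLs found in text, in order, without duplicates."""
    seen = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        if url not in seen:
            seen.append(url)
    return seen

# Three lines taken from the strings output in the question.
sample = (
    '<</S/URI/URI(http://nurking.pl/wp-admin/user/email.163.htm?login=)>>\n'
    '<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>\n'
    'xmlns:pdf="http://ns.adobe.com/pdf/1.3/"\n'
)

for url in extract_urls(sample):
    print(url)
```

A URL that legitimately contains a parenthesis would be cut short by this pattern, so the delimiter set may need adjusting for other samples.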

An equally rough pattern for possible IP addresses:

> gc .\sample.txt |sls '^(.*?(\d{1,3}\.){3}\d{1,3}.*)$'|%{$_.Matches.Groups[1].Value}
<</Producer(RAD PDF 2.36.8.0 - http://www.radpdf.com)/Author(alesk)/Creator(RAD PDF)/RadPdfCustomData(pdfescape.com-open-AC00E8D5A4B4C84BC37A2054F4EC794B0297765728CB8415)/CreationDate(D:20160825075202+01'00')/ModDate(D:20170711012532-08'00')>>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="DynaPDF 4.0.11.30, http://www.dynaforms.com">
<pdf:Producer>RAD PDF 2.36.8.0 - http://www.radpdf.com</pdf:Producer>
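
The rough pattern returns whole lines and accepts any four dotted groups of digits. A Python sketch of a somewhat stricter check is shown here; the helper name extract_ipv4 and the made-up addresses at the end of the sample text are assumptions for illustration. It extracts only the dotted quad and discards candidates that the standard ipaddress module rejects.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_RE = re.compile(r'(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])')

def extract_ipv4(text):
    """Return syntactically valid IPv4 candidates found in text, in order."""
    found = []
    for match in IPV4_RE.finditer(text):
        candidate = match.group(0)
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # rejected, for example an octet above 255
        if candidate not in found:
            found.append(candidate)
    return found

# First two fragments come from the question; the last two addresses are made up.
sample = 'RAD PDF 2.36.8.0 - x:xmptk="DynaPDF 4.0.11.30, connect 10.0.0.999 or 192.168.1.10"'
print(extract_ipv4(sample))
```

Version numbers such as 2.36.8.0 are syntactically valid addresses and still pass, so the result is a list of candidates that needs manual review.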

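Yes, this is possible. You can find all the URLs and then extract each one through a capturing group, read back with group(1). Capturing groups and back-references are described in the documentation of Python's re module.

Note:

You should use pattern.finditer, because it lets you iterate over every match of the pattern in the text, which is called string here. From the re.finditer documentation:

re.finditer(pattern, string, flags=0): Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.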
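
To illustrate the quoted behaviour, here is a small sketch comparing finditer with findall. The sample text uses shortened, made-up paths on two hosts from the question, and the simple pattern is an assumption chosen for brevity.

```python
import re

# Shortened sample text; the paths /a and /b are placeholders.
text = 'URI(http://sajiye.net/a) URI(https://www.dropbox.com/b)'
pattern = re.compile(r'https?://[^\s)]+')

# finditer yields match objects one at a time, so positions are available too.
matches = [(m.start(), m.group(0)) for m in pattern.finditer(text)]

# findall returns only the matched strings.
strings = pattern.findall(text)

print(matches)
print(strings)
```

Because finditer is lazy, it can also be stopped early on large inputs without scanning the rest of the text.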

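So you want a regex to run over your strings output?

Yes, a regex that I can pass this input to and that returns the network indicators.

OK, that is possible in batch. But you know how Stack Overflow works: you need to show your own effort.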

Aliases used in the PowerShell commands: gc = Get-Content, sls = Select-String, % = ForEach-Object

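import re

# Text to search; assumed here to be the strings output saved as sample.txt
with open('sample.txt', errors='replace') as f:
    string = f.read()

# URLs inside the PDF /URI(...) entries; the group excludes the parentheses,
# and @ is allowed so that URLs carrying an e-mail address also match
pattern = re.compile(r'\((https?[:_%A-Z=?/a-z0-9.@-]+)\)')

# List where we store all URLs
urls = []

# For each URL match found in string, append the captured URL to the list
for url in pattern.finditer(string):
    urls.append(url.group(1))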
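
The question also asks for website names. Given a list of URLs like the one collected above, the hostnames can be pulled out with the standard urllib.parse module. This is a sketch only; the helper name hostnames is an assumption, and the list below repeats four URLs from the question so the block runs on its own.

```python
from urllib.parse import urlsplit

def hostnames(urls):
    """Return the sorted, de-duplicated hostnames contained in urls."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)

# URLs copied from the strings output in the question.
urls = [
    'http://www.pdfupdatersacrobat.top/website/hts-cache/index.php?userid=info@narainsfashionfabrics.com',
    'http://sajiye.net/file/website/file/main/index.php?userid=kitja@siamdee2558.com',
    'https://www.dropbox.com/s/d71h5a56r16u3f0/swift_copy.jar?dl=1',
    'https://www.dropbox.com/s/76jr9jzg020gory/Swift%20Copy.uue?dl=1',
]

for host in hostnames(urls):
    print(host)
```

The e-mail addresses in the query strings do not affect the result, because urlsplit reads the hostname only from the part between `//` and the first `/`.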