Python: simplifying a URL inside a string [Regex]

python regex string filter replace


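So I have been wondering how I can simplify a URL that sits inside a string.

This is my idea:

Use a regular expression to find all links in the string
Simplify each link in the string

This is what I have tried so far: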

import re

string = "Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better"

filter = re.findall(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))""", string)

#Output: ['https://www.youtube.com/watch?v=d1YBv2mWll0']

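I just don't quite understand how to strip the front and the end off the URL and then put it back into the original string.

I have also read that re.sub might be the way to go, but I need help with that too, because I am still very new to regex.

Thanks in advance.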

You can do something like this without re:

string = "Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better"
ss = string.split(' ')
ss = [x for x in ss if x.startswith('http')]
string = string.replace(ss[0], 'youtube.com')
print(string)

'Whoever visits the site youtube.com deserves no better'

Or like this, with re:

import re
new_string = re.sub(r"http\S+", "youtube.com", string)
print(new_string)
'Whoever visits the site youtube.com deserves no better'
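Both snippets above hard-code "youtube.com" as the replacement. If the replacement should be derived from each link instead, re.sub also accepts a function as its second argument. The following is only a minimal sketch of that idea: the names URL_RE, simplify_urls and to_host are made up for illustration, and it assumes that a link is any run of non-whitespace characters starting with http:// or https://, which is much cruder than the long pattern in the question.

```python
import re
from urllib.parse import urlparse

# Assumption: a link is any whitespace-delimited run starting with http:// or https://.
URL_RE = re.compile(r"https?://\S+")


def simplify_urls(text):
    """Replace every http(s) link in text with its host, minus a leading 'www.'."""

    def to_host(match):
        # netloc is the part between '//' and the next '/', e.g. 'www.youtube.com'
        host = urlparse(match.group(0)).netloc
        return host[4:] if host.startswith("www.") else host

    # re.sub calls to_host once per match and inserts whatever it returns
    return URL_RE.sub(to_host, text)


print(simplify_urls("Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better"))
# Whoever visits the site youtube.com deserves no better
```

Because \S+ runs to the next whitespace, punctuation directly after a link (a comma or a full stop) is treated as part of the link, so it may need separate handling.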

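If the string can contain several links, and the links can have different domains, parse each match with urlparse from urllib.parse.

So for your case: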

from urllib.parse import urlparse
import re
string = "Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better"

filter = re.findall(r"""(?i)(.*?)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))(.*?)""", string)
print(filter)  # Output: [('Whoever visits the site ', 'https://www.youtube.com/watch?v=d1YBv2mWll0', ''), (' deserves no better Whoever visits the site ', 'https://www.youtube.com/watch?v=d1YBv2mWll0', '')]

final_string = ""
for y in filter:
    parsed_uri = urlparse(y[1])
    shorter_url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    final_string += y[0] + shorter_url + y[2]
print(final_string)  # Output: Whoever visits the site https://www.youtube.com/ deserves no better Whoever visits the site https://www.youtube.com/

This comment should notify you of the changes I made to the answer.
Thanks :) It works great, but it still keeps the www. in front of it.. Any idea how to remove that? I really only want to have "youtube.com", for example.
I will get you a proper solution.
Yes, you are right, I can reproduce it. That is tricky.
I was able to solve it, see the new code in the answer. This could probably be done more simply, but my regex skills have reached their limit.
This works like a charm! Like I said, thank you very much, man.
from urllib.parse import urlparse
# from urlparse import urlparse  # Python 2
parsed_uri = urlparse('http://stackoverflow.com/questions/1234567/blah-blah-blah-blah' )
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
print(result)

# gives
'http://stackoverflow.com/'
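For reference, here is a small sketch of what urlparse returns for the link from the question, and for a bare domain that has no '//' in front of it. The variable names parts and bare are only illustrative.

```python
from urllib.parse import urlparse

parts = urlparse("https://www.youtube.com/watch?v=d1YBv2mWll0")
print(parts.scheme)  # https
print(parts.netloc)  # www.youtube.com
print(parts.path)    # /watch
print(parts.query)   # v=d1YBv2mWll0

# Without a leading '//' urlparse does not recognize a netloc;
# the whole input ends up in the path component instead.
bare = urlparse("www.testing.com")
print(repr(bare.netloc))  # ''
print(repr(bare.path))    # 'www.testing.com'
```

This is why code that wants the host of a bare domain has to fall back to the path when netloc is empty.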

from urllib.parse import urlparse
import re
string = "Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better"

filter = re.findall(r"""(?i)(.*?)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))(.*?)""", string)
print(filter)  # Output: [('Whoever visits the site ', 'https://www.youtube.com/watch?v=d1YBv2mWll0', ''), (' deserves no better Whoever visits the site ', 'https://www.youtube.com/watch?v=d1YBv2mWll0', '')]

final_string = ""
for y in filter:
    parsed_uri = urlparse(y[1])
    shorter_url = '{uri.netloc}'.format(uri=parsed_uri)
    final_string += y[0] + shorter_url + y[2]
print(final_string)  # Output: Whoever visits the site www.youtube.com deserves no better Whoever visits the site www.youtube.com
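The loop above loses whatever text follows the last link, because the trailing lazy group (.*?) is allowed to match the empty string. One way around that, shown here only as a rough sketch, is to split the string on the links and put the pieces back together. URL_SPLIT_RE and shorten_links are hypothetical names, and the pattern assumes that every link starts with http:// or https:// and contains no whitespace; it is not the long pattern used above.

```python
import re
from urllib.parse import urlparse

# Assumption: links always start with http:// or https:// and contain no whitespace.
URL_SPLIT_RE = re.compile(r"(https?://\S+)")


def shorten_links(text):
    """Return text with every link replaced by its network location (host)."""
    pieces = URL_SPLIT_RE.split(text)
    # With one capturing group, re.split alternates plain text (even indexes)
    # and the captured links (odd indexes), so no surrounding text is lost.
    for i in range(1, len(pieces), 2):
        pieces[i] = urlparse(pieces[i]).netloc
    return "".join(pieces)


text = ("Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better "
        "Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better")
print(shorten_links(text))
# Whoever visits the site www.youtube.com deserves no better Whoever visits the site www.youtube.com deserves no better
```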

import re
from urllib.parse import urlparse
import tldextract


def shorter(text_with_urls):
    result = re.findall(
        r"""(?i)(.*?)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))(.*)""",
        text_with_urls)
    if len(result) == 1:  # success
        result = result[0]
    else:
        return text_with_urls
    parsed_uri = urlparse(result[1])
    # Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly
    # introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
    # Prefer the netloc, fall back to the path (bare domains), then to the raw match
    shorter_url = parsed_uri.netloc or parsed_uri.path or result[1]
    extracted = tldextract.extract(shorter_url)
    if extracted.domain and extracted.suffix:
        shorter_url = "{}.{}".format(extracted.domain, extracted.suffix)
    return result[0] + shorter_url + shorter(result[2])


# string = "Hi"
# string = "Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better Whoever visits the site https://www.youtube.com/watch?v=d1YBv2mWll0 deserves no better"
string = "This fails to recover any valid url when I input the string. Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, https://www.webdesignerdepot.com/2012/10/creating-a-modal-window-with-html5-and-css3/, www.testing.com. So what should I do"
# every string from above works.
print(string)
final_string = shorter(string)
print(final_string)