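This article shows how to use regular expressions in PHP to convert relative paths into absolute paths. It includes complete sample code that you can wrap into a single function and call wherever you need it.
Preface
When writing a web crawler, you often need to process the hyperlinks it finds and rewrite them all as absolute paths. This article presents a set of regular expressions that perform this processing.
A crawl typically turns up links like the following: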
- <!-- Empty hyperlink -->
- <a href=""></a>
- <!-- Whitespace only -->
- <a href=" " rel="external nofollow" > </a>
- <!-- <a> tag with other attributes -->
- <a href="index.html" rel="external nofollow" alt="hyperlink"> index.html </a>
- <a href="/" rel="external nofollow" target="_blank"> / target="_blank" </a>
- <a target="_blank" href="/" rel="external nofollow" alt="hyperlink" > target="_blank" / alt="hyperlink" </a>
- <a target="_blank" title="hyperlink" href="/" rel="external nofollow" alt="hyperlink" > target="_blank" title="hyperlink" / alt="hyperlink" </a>
- <!-- Root directory -->
- <a href="/" rel="external nofollow" > / </a>
- <a href="a" rel="external nofollow" > a </a>
- <!-- With query parameters -->
- <a href="/index.html?id=1" rel="external nofollow" > /index.html?id=1 </a>
- <a href="?id=2" rel="external nofollow" > ?id=2 </a>
- <!-- // -->
- <a href="//index.html" rel="external nofollow" > //index.html </a>
- <a href="//www.mafutian.net" rel="external nofollow" > //www.mafutian.net </a>
- <!-- Internal (same-site) link -->
- <a href="http://www.hole_1.com/index.html" rel="external nofollow" > http://www.hole_1.com/index.html </a>
- <!-- External links -->
- <a href="http://www.mafutian.net" rel="external nofollow" > http://www.mafutian.net </a>
- <a href="http://www.numberer.net" rel="external nofollow" > http://www.numberer.net </a>
- <!-- Links to image and text files -->
- <a href="1.jpg" rel="external nofollow" > 1.jpg </a>
- <a href="1.jpeg" rel="external nofollow" > 1.jpeg </a>
- <a href="1.gif" rel="external nofollow" > 1.gif </a>
- <a href="1.png" rel="external nofollow" > 1.png </a>
- <a href="1.txt" rel="external nofollow" > 1.txt </a>
- <!-- Ordinary links -->
- <a href="index.html" rel="external nofollow" > index.html </a>
- <a href="index.html" rel="external nofollow" > index.html </a>
- <a href="./index.html" rel="external nofollow" > ./index.html </a>
- <a href="../index.html" rel="external nofollow" > ../index.html </a>
- <a href=".../" rel="external nofollow" > .../ </a>
- <a href="..." rel="external nofollow" > ... </a>
- <!-- Non-links that contain a colon -->
- <a href="javascript:void(0)" rel="external nofollow" > javascript:void(0) </a>
- <a href="a:b" rel="external nofollow" > a:b </a>
- <a href="/a#a:b" rel="external nofollow" > /a#a:b </a>
- <a href="mailto:'mafutian@126.com'" rel="external nofollow" > mailto:'mafutian@126.com' </a>
- <a href="/tencent://message/?uin=335134463" rel="external nofollow" > /tencent://message/?uin=335134463 </a>
- <!-- Relative paths -->
- <a href="." rel="external nofollow" > . </a>
- <a href=".." rel="external nofollow" > .. </a>
- <a href="../" rel="external nofollow" > ../ </a>
- <a href="/a/b/.." rel="external nofollow" > /a/b/.. </a>
- <a href="/a" rel="external nofollow" > /a </a>
- <a href="./b" rel="external nofollow" > ./b </a>
- <a href="./././././././././b" rel="external nofollow" > ./././././././././b </a> <!-- equivalent to ./b -->
- <a href="../c" rel="external nofollow" > ../c </a>
- <a href="../../d" rel="external nofollow" > ../../d </a>
- <a href="../a/../b/c/../d" rel="external nofollow" > ../a/../b/c/../d </a>
- <a href="./../e" rel="external nofollow" > ./../e </a>
- <a href="http://www.hole_1.org/./../e" rel="external nofollow" > http://www.hole_1.org/./../e </a>
- <a href="./.././f" rel="external nofollow" > ./.././f </a>
- <a href="http://www.hole_1.org/../a/.../../b/c/../d/.." rel="external nofollow" > http://www.hole_1.org/../a/.../../b/c/../d/.. </a>
- <!-- With a port number -->
- <a href=":8081/index.html" rel="external nofollow" > :8081/index.html </a>
- <a href="http://www.mafutian.net:80/index.html" rel="external nofollow" > :80/index.html </a>
- <a href="http://www.mafutian.net:8081/index.html" rel="external nofollow" > http://www.mafutian.net:8081/index.html </a>
- <a href="http://www.mafutian.net:8082/index.html" rel="external nofollow" > http://www.mafutian.net:8082/index.html </a>
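Before any path handling, the crawler has to pull the href values out of markup like the list above and set aside the entries that are not real links. The following is a minimal sketch of that filtering, not code from the original article: the function name `extract_hrefs` is hypothetical, and it assumes quoted href attributes. Matching HTML with a regular expression is fragile, so a DOM parser may be a better fit for messy pages.

```php
<?php
// Hypothetical helper (not from the original article): collect href values
// from <a> tags and drop those that cannot become a crawlable HTTP URL.
function extract_hrefs(string $html): array
{
    // Capture the quoted href value of each <a> tag, wherever the attribute sits.
    preg_match_all('/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is', $html, $matches);

    $hrefs = [];
    foreach ($matches[2] as $href) {
        $href = trim($href);
        if ($href === '') {
            continue; // empty or whitespace-only link
        }
        // Skip a leading "scheme:" other than http/https (javascript:, mailto:, a:b).
        if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href)
            && !preg_match('/^https?:/i', $href)) {
            continue;
        }
        $hrefs[] = $href;
    }
    return $hrefs;
}

$html = '<a href=""></a> <a href=" "> </a>'
      . '<a target="_blank" href="/a#a:b">x</a>'
      . '<a href="javascript:void(0)">y</a>'
      . '<a href="../c">z</a>';
print_r(extract_hrefs($html)); // keeps '/a#a:b' and '../c'
```

Note that `/a#a:b` survives because the colon sits in the fragment rather than in a leading scheme.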
The first step is to turn each link into an absolute path:
- http:// ... / ../ ../
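The article does not show the code for this first step, so here is one possible sketch of it. The function name `to_absolute` and the base page path `/a/b/page.html` are assumptions made for illustration, and the base is assumed to be a well-formed absolute http(s) URL. The result deliberately still contains `./` and `../`, which the next step cleans up.

```php
<?php
// Hypothetical helper: prefix a link with the crawled page's scheme, host, and directory.
// $base is assumed to be a well-formed absolute http(s) URL.
function to_absolute(string $href, string $base): string
{
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (preg_match('/^https?:\/\//i', $href)) {
        return $href;                            // already absolute
    }
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;   // protocol-relative
    }
    if ($href !== '' && $href[0] === '/') {
        return $origin . $href;                  // root-relative
    }
    if ($href !== '' && $href[0] === '?') {
        return $origin . $path . $href;          // same page, new query string
    }
    // Document-relative: keep the base path up to and including its last '/'.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

$base = 'http://www.mytest.org/a/b/page.html';
echo to_absolute('../c', $base), "\n";               // http://www.mytest.org/a/b/../c
echo to_absolute('?id=2', $base), "\n";              // http://www.mytest.org/a/b/page.html?id=2
echo to_absolute('//www.mafutian.net', $base), "\n"; // http://www.mafutian.net
```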
Next comes the code that removes './', '../', and '/..' from an absolute path:
- function url_to_absolute($relative) {
-     // Remove every './' that forms a whole segment (preceded by '/' or at the start),
-     // so a name that merely ends in a dot, such as 'a./', is left alone.
-     $absolute = preg_replace('/(?<![^\/])\.\//', '', $relative);
-     // Repeatedly collapse '/abc/../' into '/' until a pass replaces nothing.
-     // The lookbehind keeps the '//' after the scheme from being treated as a segment.
-     do {
-         $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', '/', $absolute, -1, $count);
-     } while ($count >= 1);
-     // Collapse a trailing '/abc/..' into '/', then drop any leftover trailing '/..'.
-     $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
-     $absolute = preg_replace('/\/\.\.$/', '', $absolute);
-     // Remove any remaining '../' segments, which would point above the root.
-     $absolute = preg_replace('/(?<![^\/])\.\.\//', '', $absolute);
-     return $absolute;
- }
- $relative = 'http://www.mytest.org/../a/.../../b/c/../d/..';
- var_dump(url_to_absolute($relative));
- // Output: string 'http://www.mytest.org/a/b/' (length=26)
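As a cross-check on the regex approach, the same cleanup can be sketched with a stack of path segments instead of repeated replacements. This is an alternative technique, not the article's method. The function name `normalize_url_path` is hypothetical, and the sketch assumes the input is already an absolute URL. Unlike the regex version, it leaves the query string and fragment untouched and drops empty segments such as the one in `/a//b`.

```php
<?php
// Alternative sketch (not the article's method): resolve '.' and '..' by
// walking the path segments with a stack instead of regular expressions.
function normalize_url_path(string $url): string
{
    // Split into "scheme://host[:port]", path, and the query/fragment tail.
    if (!preg_match('~^([a-z][a-z0-9+.\-]*://[^/?#]+)([^?#]*)(.*)$~i', $url, $m)) {
        return $url; // not an absolute URL; leave it untouched
    }
    [, $origin, $path, $tail] = $m;

    $stack    = [];
    $segments = explode('/', $path);
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($stack);       // go up one level; a no-op at the root
        } elseif ($segment !== '.' && $segment !== '') {
            $stack[] = $segment;     // ordinary name, including odd ones like '...'
        }
    }

    $result = '/' . implode('/', $stack);
    // Keep a trailing slash when the path ended in '/', '/.' or '/..'.
    $last = end($segments);
    if ($stack !== [] && ($last === '' || $last === '.' || $last === '..')) {
        $result .= '/';
    }
    return $origin . $result . $tail;
}

var_dump(normalize_url_path('http://www.mytest.org/../a/.../../b/c/../d/..'));
// Yields 'http://www.mytest.org/a/b/', the same result as url_to_absolute() above.
```

Working on whole segments avoids the lookbehind subtleties of the regex version, at the cost of a few more lines.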
Summary
That is all for this article. I hope it helps with your study or work. If you have any questions, feel free to leave a comment.
Source: http://www.phperz.com/article/17/0811/340440.html