Regular expressions are not unique to Python; in Python they are provided by the re module.
re.match
re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.
re.match(pattern,string,flags=0)
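Because a failed match returns None instead of raising, calling .group() on the result without a check fails with AttributeError. A minimal sketch of the usual guard; the helper name leading_word and the sample strings are made up for illustration:

```python
import re

def leading_word(text):
    """Return the word at the very start of text, or None if there is none."""
    m = re.match(r'\w+', text)  # re.match only tries position 0
    return m.group() if m else None

print(leading_word('Hello World'))   # Hello
print(leading_word(' Hello World'))  # None, because position 0 is a space
```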
The most basic match
- import re
- content = 'Hello 123 4567 World_This is a Regex Demo'
- result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
- print(result)
- print(len(content))
- print(result.span())
- print(result.group())
- <_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
- 41
- (0, 41)
- Hello 123 4567 World_This is a Regex Demo
Generic matching
- import re
- content = 'Hello 123 4567 World_This is a Regex Demo'
- result = re.match('^Hello.*Demo$',content)
- print(result)
- <_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Capturing a target
- content = 'Hello 1234456 World, Nice_to meet u'
- result = re.match(r'^Hello\s(\d+)\sWorld', content)
- print(result)
- print(result.group(1))
- <_sre.SRE_Match object; span=(0, 19), match='Hello 1234456 World'>
- 1234456
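Parentheses can appear more than once, and each pair gets its own group number. The sketch below reuses the sample string above; the group names greeting and number are assumptions added for illustration, not part of the original example:

```python
import re

content = 'Hello 1234456 World, Nice_to meet u'
m = re.match(r'^(?P<greeting>\w+)\s(?P<number>\d+)\s(\w+)', content)

first = m.group(1)              # groups are numbered from 1, left to right
number = m.group('number')      # a named group can also be read by name
all_groups = m.groups()         # every captured group as a tuple
number_span = m.span('number')  # (start, end) indices of one group

print(first, number, all_groups, number_span)
```

group(0), or group() with no argument, still returns the whole match.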
Greedy matching
- content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'
- result = re.match(r'^He.*(\d+).*Demo$', content)
- print(result)
- print(result.group(1))
- <_sre.SRE_Match object; span=(0, 56), match='Hello 1234456 World, Nice to meet u_This is a Reg>
- 6 # the greedy .* consumes as much as possible, leaving only the last digit for (\d+)
Non-greedy matching
- content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'
- result = re.match(r'^He.*?(\d+).*Demo$', content)
- print(result)
- print(result.group(1))
- <_sre.SRE_Match object; span=(0, 56), match='Hello 1234456 World, Nice to meet u_This is a Reg>
- 1234456 # .*? matches as few characters as possible
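The two behaviours are easiest to compare side by side. A small sketch on a shortened version of the sample string (the shortened string is an assumption made for brevity):

```python
import re

content = 'Hello 1234456 World'

# Greedy: .* grabs everything, then backtracks just far enough
# for (\d+) to match a single digit.
greedy = re.match(r'^He.*(\d+)', content).group(1)

# Non-greedy: .*? stops as early as it can, so (\d+) gets the whole number.
lazy = re.match(r'^He.*?(\d+)', content).group(1)

print(greedy)  # 6
print(lazy)    # 1234456
```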
Match flags
By default, . does not match a newline character.
- content = '''Hello 1234456 World, Nice to meet u_This
- is A Regex Demo
- '''
- result = re.match(r'^He.*?(\d+).*Demo$', content)
- print(result)
- None
Pass re.S as the third argument so that . also matches newlines:
- result = re.match(r'^He.*?(\d+).*Demo$', content, re.S)
- print(result)
- print(result.group(1))
- <_sre.SRE_Match object; span=(0, 57), match='Hello 1234456 World, Nice to meet u_This \nis A R>
- 1234456
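re.S is the short name for re.DOTALL, and several flags can be combined with the | operator. The sketch below adds re.I (ignore case), which the original example does not use, purely to illustrate combining flags:

```python
import re

content = 'Hello 1234456 World\nis A Regex DEMO'
pattern = r'^he.*?(\d+).*demo$'

# Without flags the case differs and . cannot cross the newline.
no_flag = re.match(pattern, content)

# re.S lets . match '\n'; re.I makes the match case-insensitive.
with_flags = re.match(pattern, content, re.S | re.I)

print(re.S == re.DOTALL)    # True
print(no_flag)              # None
print(with_flags.group(1))  # 1234456
```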
Escaping
- content = 'price is $5.00'
- result = re.match('price is $5.00',content)
- print(result)
- None
After escaping the special characters:
- result = re.match(r'price is \$5\.00', content)
- print(result)
- <_sre.SRE_Match object; span=(0, 14), match='price is $5.00'>
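When the text to match is a plain literal, re.escape can add the backslashes for you. The sketch below also shows why the dot needs escaping as well as the dollar sign; the string 'price is $5x00' is a made-up counterexample:

```python
import re

content = 'price is $5.00'

# re.escape backslash-escapes the regex metacharacters in a literal string.
literal = re.escape(content)
escaped_match = re.match(literal, content)

# Escaping only the $ is not enough: a bare . still matches any character.
loose = re.match(r'price is \$5.00', 'price is $5x00')
strict = re.match(r'price is \$5\.00', 'price is $5x00')

print(escaped_match.group())  # price is $5.00
print(loose is not None)      # True
print(strict)                 # None
```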
Summary: prefer generic matching, use parentheses to capture the target, prefer non-greedy mode, and use re.S when the text contains newlines.
re.search
re.search scans the entire string and returns the first successful match.
- import re
- content = 'Extra strings Hello 1234556 World_This is a Regex Demo Extra strings'
- result = re.match(r'Hello.*?(\d+).*?Demo', content)
- print(result)
- None
re.match finds nothing because the string does not start with Hello.
- result = re.search(r'Hello.*?(\d+).*?Demo', content)
- print(result)
- print(result.group(1))
- <_sre.SRE_Match object; span=(14, 54), match='Hello 1234556 World_This is a Regex Demo'>
- 1234556
Summary: for convenience, use search rather than match whenever you can.
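The difference comes down to where the engine is allowed to start. A short sketch on a shortened sample string (an assumption for brevity) shows that search with a leading ^ behaves like match:

```python
import re

content = 'Extra strings Hello 1234556 World'
pattern = r'Hello\s(\d+)'

by_match = re.match(pattern, content)         # only tries position 0
by_search = re.search(pattern, content)       # tries every position
anchored = re.search('^' + pattern, content)  # ^ pins search to the start

print(by_match)          # None
print(by_search.span())  # (14, 27)
print(anchored)          # None
```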
Matching practice
- import re
- HTML = '''<div id="songs-list">
- <h2 class="title">经典老歌</h2>
- <p class="introduction">经典老歌列表</p>
- <ul id="list" class="list-group">
- <li data-view="2">一路上有你</li>
- <li data-view="7">
- <a href="2.mp3" singer="任贤齐">沧海一声笑</a>
- </li>
- <li data-view="4" class="active">
- <a href="3.mp3" singer="齐秦">往事随风</a>
- </li>
- <li data-view="6"><a href="4.mp3" singer="beyond">光辉岁月</a></li>
- <li data-view="5"><a href="5.mp3" singer="陈慧琳">记事本</a></li>
- <li data-view="5">
- <a href="6.mp3" singer="邓丽君"><i class="fa fa-user"></i>但愿人长久</a>
- </li>
- </ul>
- </div>
- '''
- result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', HTML, re.S)
- if result:
-     print(result.group(1), result.group(2))
齐秦 往事随风
- result = re.search('<li.*?singer="(.*?)">(.*?)</a>', HTML, re.S)
- if result:
-     print(result.group(1), result.group(2))
任贤齐 沧海一声笑
By default, re.search returns only the first match.
- result = re.search('<li.*?singer="(.*?)">(.*?)</a>', HTML)
- if result:
-     print(result.group(1), result.group(2))
beyond 光辉岁月
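This is the result without re.S: . can no longer cross newlines, so only an <li> written on a single line can match.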
re.findall
Searches the string and returns all matching substrings as a list.
- results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',HTML,re.S)
- print(results)
- [('2.mp3', '任贤齐', '沧海一声笑'), ('3.mp3', '齐秦', '往事随风'), ('4.mp3', 'beyond', '光辉岁月'), ('5.mp3', '陈慧琳', '记事本'), ('6.mp3', '邓丽君', '但愿人长久')]
- for result in results:
-     print(result[0], result[1], result[2])
2.mp3 任贤齐 沧海一声笑
3.mp3 齐秦 往事随风
4.mp3 beyond 光辉岁月
5.mp3 陈慧琳 记事本
6.mp3 邓丽君 但愿人长久
- results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', HTML, re.S)
- print(results)
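- [('', '一路上有你', ''), ('<a href="2.mp3" singer="任贤齐">', '沧海一声笑', '</a>'), ('<a href="3.mp3" singer="齐秦">', '往事随风', '</a>'), ('<a href="4.mp3" singer="beyond">', '光辉岁月', '</a>'), ('<a href="5.mp3" singer="陈慧琳">', '记事本', '</a>'), ('<a href="6.mp3" singer="邓丽君"><i class="fa fa-user"></i>', '但愿人长久', '</a>')]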
- for result in results:
-     print(result[1])
一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久
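The shape of the list that findall returns depends on how many groups the pattern has, which is why the examples above index into tuples. A minimal sketch on a made-up string:

```python
import re

text = 'a=1, b=22, c=333'

# No group: a list of whole matches.
whole = re.findall(r'\w=\d+', text)

# One group: a list of that group's text only.
one_group = re.findall(r'\w=(\d+)', text)

# Two or more groups: a list of tuples, one item per group.
two_groups = re.findall(r'(\w)=(\d+)', text)

print(whole)       # ['a=1', 'b=22', 'c=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```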
re.sub
Replaces every matching substring in the string and returns the resulting string.
- content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
- content = re.sub(r'\d+', '', content)
- print(content)
- Extra strings Hello  World_This is a Regex Demo Extra strings
- content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
- content = re.sub(r'\d+', 'Replacement', content)
- print(content)
- Extra strings Hello Replacement World_This is a Regex Demo Extra strings
In the replacement string, \1 refers to the text captured by the first group.
- content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
- content = re.sub(r'(\d+)', r'\1 8910', content)
- print(content)
- Extra strings Hello 1234567 8910 World_This is a Regex Demo Extra strings
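Two related forms are worth knowing. When the backreference is followed directly by a digit, \g<1> keeps the group number unambiguous, and the replacement can also be a function that receives each match object. A sketch with made-up inputs:

```python
import re

content = 'Hello 1234567 World'

# \g<1> is the explicit form of \1; it stays readable when digits follow.
appended = re.sub(r'(\d+)', r'\g<1>8910', content)

# A callable replacement computes the new text from each match.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a 2 b 10')

print(appended)  # Hello 12345678910 World
print(doubled)   # a 4 b 20
```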
- HTML = re.sub('<a.*?>|</a>','',HTML)
- print(HTML)
- <div id="songs-list">
- <h2 class="title">经典老歌</h2>
- <p class="introduction">经典老歌列表</p>
- <ul id="list" class="list-group">
- <li data-view="2">一路上有你</li>
- <li data-view="7">
- 沧海一声笑
- </li>
- <li data-view="4" class="active">
- 往事随风
- </li>
- <li data-view="6">光辉岁月</li>
- <li data-view="5">记事本</li>
- <li data-view="5">
- <i class="fa fa-user"></i>但愿人长久
- </li>
- </ul>
- </div>
- results = re.findall('<li.*?>(.*?)</li>',HTML,re.S)
- print(results)
- ['一路上有你', '\n沧海一声笑\n', '\n往事随风\n', '光辉岁月', '记事本', '\n<i class="fa fa-user"></i>但愿人长久\n']
- for result in results:
-     print(result.strip())
一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
<i class="fa fa-user"></i>但愿人长久
re.compile
Compiles a regex string into a regular expression object so that the pattern can be reused.
- content = '''Hello 1234567 World_This
- is a Regex Demo'''
- pattern = re.compile('Hello.*Demo', re.S)
- result = re.match(pattern, content)
- print(result)
- <_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex Demo'>
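A compiled pattern also exposes match, search, findall and sub as its own methods, so the flags are stated once and the object is reused. A sketch, assuming a small made-up list of input strings:

```python
import re

# Compile once; re.S is baked into the pattern object.
pattern = re.compile(r'Hello.*?(\d+).*Demo', re.S)

texts = [
    'Hello 1234567 World_This\nis a Regex Demo',
    'Extra Hello 42 Demo',
    'no digits here',
]

numbers = []
for text in texts:
    m = pattern.search(text)  # same as re.search(pattern, text)
    numbers.append(m.group(1) if m else None)

print(numbers)  # ['1234567', '42', None]
```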
Practice example
(Warning: this may hang your machine.)
Fetching book information from Douban
- import requests
- import re
- content = requests.get('https://book.douban.com/').text
- pattern = re.compile('<li.*?cover.*?href="(.*?)".*?alt="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',re.S)
- results = re.findall(pattern,content)
- for result in results:
-     url, name, author, date = result
-     author = re.sub(r'\s', '', author)
-     date = re.sub(r'\s', '', date)
-     print(url, name, author, date)
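To try the same pattern without a network request, it can be run against a small local fragment. The markup below is hypothetical: it is invented to fit the pattern and is not a copy of the real Douban page, whose structure may differ or change.

```python
import re

# Hypothetical markup, invented for illustration only.
SAMPLE_HTML = '''
<ul>
<li>
  <div class="cover"><a href="https://example.com/book/1"><img alt="First Book"></a></div>
  <div class="more-meta">
    <span class="author">
      Anonymous
    </span>
    <span class="year">
      2018-1
    </span>
  </div>
</li>
<li>
  <div class="cover"><a href="https://example.com/book/2"><img alt="Second Book"></a></div>
  <div class="more-meta">
    <span class="author">
      Somebody
    </span>
    <span class="year">
      2017-12
    </span>
  </div>
</li>
</ul>
'''

# Same pattern as above, split across two raw strings for readability.
pattern = re.compile(
    r'<li.*?cover.*?href="(.*?)".*?alt="(.*?)".*?more-meta'
    r'.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',
    re.S,
)

books = []
for url, name, author, date in pattern.findall(SAMPLE_HTML):
    # Strip the whitespace that surrounds the captured text.
    books.append((url, name, re.sub(r'\s', '', author), re.sub(r'\s', '', date)))

for book in books:
    print(*book)
```

Note that re.sub(r'\s', '', ...) removes every whitespace character, including spaces inside a name, so str.strip() may be the better choice for authors whose names contain spaces.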
Source: http://www.bubuko.com/infodetail-2869394.html