- Date: 2019-07-03
- Author: Sun
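Goals of this section:
(1) Learn regular expressions and how to use the re module
(2) Work with regular expressions in Python, including greedy and non-greedy matching
(3) Learn the common functions find, findall, search, match, split, and so on (note that find is a str method; the re module itself has no find function)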
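Regular expressions
A regular expression is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters").
A regular expression is written as a single string that describes, and matches, a whole set of strings that follow a given syntactic rule.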
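1 Why use regular expressions?
A few concrete examples make the case:
(1) Check whether a string contains a digit; return true if it does, false otherwise.
(2) Given a string str, check whether it contains the same letter (a-zA-Z) twice in a row; return true if it does, false otherwise.
(3) Extract the data we want from a large body of text.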
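The three tasks above can each be sketched in a line or two with the re module. This is a minimal sketch, not the author's code: the function names are made up for illustration, "repeated letter" is taken to mean the same letter in the same case twice in a row, and task (3) is narrowed to pulling runs of digits out of a string.

```python
import re

def contains_digit(text):
    """Task (1): True if the text contains at least one digit."""
    return re.search(r"\d", text) is not None

def has_repeated_letter(text):
    """Task (2): True if the same ASCII letter appears twice in a row."""
    # ([a-zA-Z]) captures one letter; \1 requires that same letter again.
    return re.search(r"([a-zA-Z])\1", text) is not None

def extract_numbers(text):
    """Task (3), narrowed for illustration: return every run of digits."""
    return re.findall(r"\d+", text)

print(contains_digit("abc1"))                  # True
print(contains_digit("abc"))                   # False
print(has_repeated_letter("hello"))            # True ("ll")
print(has_repeated_letter("helo"))             # False
print(extract_numbers("one1two22three333"))    # ['1', '22', '333']
```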
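Requirements like these also come up often at work:
1. Given a string, display the links, numbers, phone numbers, and so on in different colors.
2. Given text containing custom emoticon codes, find the emoticons and replace them with local emoticon images.
3. Given user input, decide whether it is a WeChat ID, a mobile number, an email address, a pure number, and so on.
Hints:
For cases 1 and 2, regular expressions combined with rich text handle the job easily.
For case 3, wrap the rules in your own small regex library once and reuse it everywhere.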
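For case 3, a tiny "regex library" might look like the sketch below. Everything in it is an assumption made for illustration: the names PATTERNS and classify are hypothetical, and the patterns are deliberately simplified (real rules for mobile numbers, email addresses, and WeChat IDs are stricter and change over time).

```python
import re

# Hypothetical, simplified patterns. Order matters: a mobile number is also
# a pure number, so the more specific pattern is tried first.
PATTERNS = [
    ("phone", re.compile(r"1[3-9]\d{9}")),
    ("digits", re.compile(r"\d+")),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("wechat_id", re.compile(r"[a-zA-Z][a-zA-Z0-9_-]{5,19}")),
]

def classify(text):
    """Return the name of the first pattern that matches the whole input."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "unknown"

print(classify("13812345678"))       # phone
print(classify("123456"))            # digits
print(classify("user@example.com"))  # email
print(classify("sun_2019"))          # wechat_id
print(classify("hi"))                # unknown
```

Using fullmatch rather than match keeps trailing junk from slipping through, which matters when the result drives validation.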
Common regex matching tools
Online matching tools:
- 1 http://www.regexpal.com/
- 2 http://rubular.com/
Regex matching software:
McTracer http://pan.baidu.com/s/19Yn49
- (1) ^ $ * ? + {2} {2,} {2,5} |
- (2) [], [^], [a-z], [0-9], [4|5]
- (3) \s, \S, \w, \W
- (4) [\u4E00-\u9FA5] () \d
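To make the metacharacter list above concrete, here is a short, hedged tour that exercises most of the items with re.findall and re.split. The sample strings are invented for illustration. Note the `[4|5]` line: inside a character class the `|` is a literal character, not alternation.

```python
import re

# (1) Anchors, quantifiers, alternation
print(re.findall(r"^\d{2}", "2019-07-03"))       # ['20']
print(re.findall(r"\d{2,}", "1 22 333 4444"))    # ['22', '333', '4444']
print(re.findall(r"\d{2,5}", "1234567"))         # ['12345', '67']
print(re.findall(r"cat|dog", "cat dog bird"))    # ['cat', 'dog']

# (2) Character classes
print(re.findall(r"[a-z]+", "abc123DEF"))        # ['abc']
print(re.findall(r"[^0-9]+", "abc123DEF"))       # ['abc', 'DEF']
print(re.findall(r"[4|5]", "4|5"))               # ['4', '|', '5']

# (3) Shorthand classes
print(re.findall(r"\w+", "hello, world"))        # ['hello', 'world']
print(re.split(r"\s+", "a  b\tc"))               # ['a', 'b', 'c']

# (4) Common CJK ideographs, written as a Unicode range
print(re.findall(r"[\u4e00-\u9fa5]+", "hello 世界 2019 你好"))  # ['世界', '你好']
```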
- import re
- a = 'one1two2three3four4'
- ret = re.findall(r'(\d+)', a)
- print(ret)
- ['1', '2', '3', '4']
- import re
- p = re.compile(r"(\d+)")
- a = 'one1two2three3four4'
- res = p.findall(a)
- print(res)
- ['1', '2', '3', '4']
- import re
- a = 'hello alex alex adn acd'
- n = re.findall(r'(a)(\w+)', a)
- print(n)  # groups are numbered left to right, outer to inner
- # [('a', 'lex'), ('a', 'lex'), ('a', 'dn'), ('a', 'cd')]
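The "left to right, outer to inner" numbering is easier to see with nested groups. In this small sketch (the date strings are made up for illustration), group 1 is the outer group and groups 2 and 3 sit inside it:

```python
import re

pattern = re.compile(r"((\d{4})-(\d{2}))-(\d{2})")

m = pattern.search("Date: 2019-07-03")
print(m.group(0))   # 2019-07-03  (the whole match)
print(m.groups())   # ('2019-07', '2019', '07', '03')

# findall returns one tuple per match, in the same group order
print(pattern.findall("2019-07-03 and 2020-01-15"))
# [('2019-07', '2019', '07', '03'), ('2020-01', '2020', '01', '15')]
```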
- # -*- coding: utf-8 -*-
- __author__ = 'sun'
- __date__ = '2019/7/03 9:48 AM'
- import re
- line = "liu dehua was older than you"
- # re.match only tries to match at the beginning of the string
- matchObj = re.match(r'^liu (.*) was (.*?) .*', line, re.M | re.I)
- if matchObj:
-     print("matchObj.group() :", matchObj.group())
-     print("matchObj.group(1) :", matchObj.group(1))
-     print("matchObj.group(2) :", matchObj.group(2))
- else:
-     print("No match!!")
- # Output:
- # matchObj.group() : liu dehua was older than you
- # matchObj.group(1) : dehua
- # matchObj.group(2) : older
- # -*- coding: utf-8 -*-
- __author__ = 'sun'
- __date__ = '2019/7/03 9:48 AM'
- import re
- line = "liu dehua was older than you"
- # re.search scans the string for the first position where the pattern matches
- matchObj = re.search(r'^liu (.*) was (.*?) .*', line, re.M | re.I)
- if matchObj:
-     print("matchObj.group() :", matchObj.group())
-     print("matchObj.group(1) :", matchObj.group(1))
-     print("matchObj.group(2) :", matchObj.group(2))
- else:
-     print("No match!!")
- # Output:
- # matchObj.group() : liu dehua was older than you
- # matchObj.group(1) : dehua
- # matchObj.group(2) : older
- import re
- # re.match matches only at the start of the string; it returns a match object, or None if there is no match
- ret_match = re.match("c", "abcde")
- if ret_match:
-     print("ret_match:" + ret_match.group())
- else:
-     print("ret_match:None")
- # re.search scans the whole string and stops at the first match; it returns None if there is no match
- ret_search = re.search("c", "abcde")
- if ret_search:
-     print("ret_search:" + ret_search.group())
- # Output:
- # ret_match:None
- # ret_search:c
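One detail the match/search comparison above does not show is how the two behave with the re.M flag on multi-line text. In the sketch below (the two-line sample string is invented for illustration), match stays anchored at position 0 while search lets `^` match at the start of any line; re.fullmatch is included for contrast because it requires the whole string to match.

```python
import re

text = "first line\nliu dehua was older than you"

# re.match is anchored at position 0, even with re.M
print(re.match(r"^liu", text, re.M))               # None

# re.search with re.M lets ^ match at the start of the second line
print(re.search(r"^liu", text, re.M).group())      # liu

# re.fullmatch succeeds only if the pattern covers the entire string
print(re.fullmatch(r"abc", "abcde"))               # None
print(re.fullmatch(r"abc\w*", "abcde").group())    # abcde
```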
- import re
- a = "123abc456"
- m = re.search("([0-9]*)([a-z]*)([0-9]*)", a)
- m.group(0)  # '123abc456', the whole match; group() defaults to group(0)
- m.group(1)  # '123'
- m.group(2)  # 'abc'
- m.group(3)  # '456'
- import re
- # sub
- ret_sub = re.sub(r'(one|two|three)', 'ok', 'one word two words three words')
- print(ret_sub)
- # ok word ok words ok words
- # subn
- ret_subn = re.subn(r'(one|two|three)', 'ok',
-                    'one word two words three words')
- print(ret_subn)
- # ('ok word ok words ok words', 3); the 3 is the number of replacements made
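Beyond a fixed replacement string, re.sub also accepts a count limit, a function, and backreferences in the replacement. The following is a small sketch of those three options on the same sample sentence; the last input string is made up for illustration.

```python
import re

text = 'one word two words three words'

# Limit the number of replacements with the count keyword
print(re.sub(r'(one|two|three)', 'ok', text, count=1))
# ok word two words three words

# Use a function to compute each replacement from the match object
print(re.sub(r'(one|two|three)', lambda m: m.group(1).upper(), text))
# ONE word TWO words THREE words

# Refer to captured groups in the replacement with \1, \2, ...
print(re.sub(r'(\w+) (words?)', r'\2 \1', 'one word'))
# word one
```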
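- import re
- ret = re.split(r'\d+', 'one1two2three3four4')
- print(ret)
- #### output ####
- # After splitting at 1 the pieces are 'one' and 'two2three3four4'; after splitting at 2
- # they are 'one', 'two' and 'three3four4'; and so on. The string ends with a match,
- # so the last piece is an empty string:
- # ['one', 'two', 'three', 'four', '']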
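The step-by-step description above can be reproduced directly with the maxsplit argument. This sketch also shows two related behaviors: a capturing group keeps the separators in the result, and a list comprehension drops the trailing empty string.

```python
import re

text = 'one1two2three3four4'

# Stop after the first or second split, as in the walkthrough
print(re.split(r'\d+', text, maxsplit=1))   # ['one', 'two2three3four4']
print(re.split(r'\d+', text, maxsplit=2))   # ['one', 'two', 'three3four4']

# A capturing group keeps the separators in the result
print(re.split(r'(\d+)', 'one1two2'))       # ['one', '1', 'two', '2', '']

# Drop empty pieces, such as the one produced by a trailing separator
print([p for p in re.split(r'\d+', text) if p])   # ['one', 'two', 'three', 'four']
```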
- >>> import re
- >>> s = "This is a number 234-235-22-423"
- >>> r = re.match(r".+(\d+-\d+-\d+-\d+)", s)  # greedy: .+ takes as much as it can
- >>> r.group(1)
- '4-235-22-423'
- >>> r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)  # non-greedy: .+? takes as little as it can
- >>> r.group(1)
- '234-235-22-423'
- # greedy
- >>> re.match(r"aa(\d+)", "aa2343ddd").group(1)
- '2343'
- # non-greedy
- >>> re.match(r"aa(\d+?)", "aa2343ddd").group(1)
- '2'
- # with a required suffix, the non-greedy group must still expand until ddd can match
- >>> re.match(r"aa(\d+)ddd", "aa2343ddd").group(1)
- '2343'
- >>> re.match(r"aa(\d+?)ddd", "aa2343ddd").group(1)
- '2343'
- # greedy
- ret_greed = re.findall(r'a(\d+)', 'a23b')
- print(ret_greed)
- # ['23']
- # non-greedy
- ret_no_greed = re.findall(r'a(\d+?)', 'a23b')
- print(ret_no_greed)
- # ['2']
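A common place where the greedy/non-greedy difference bites is matching delimited text such as tags. The snippet below is an illustrative sketch with a made-up HTML fragment; it is not meant as a way to parse HTML in general.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last '>' in the string
print(re.findall(r"<.*>", html))
# ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' it can
print(re.findall(r"<.*?>", html))
# ['<b>', '</b>', '<i>', '</i>']
```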
- text = "i love 2,45 china v5 , 6666, yes"  # renamed from str to avoid shadowing the built-in
- res = re.findall(r".*?(.*)yes$", text)
- print(res)
- # ['i love 2,45 china v5 , 6666, ']
- p = re.compile(r"^(6\d{5}[124])")  # [124] rather than [1,2,4], which would also match a comma
- print(p.match("6256432"))
- ^(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}$
- ^((13\d|14[57]|15[^4\D]|17[13678]|18\d)\d{8}|170[^346\D]\d{7})$
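As a hedged sketch of how such a mobile-number expression might be used from Python, the snippet below wraps the first pattern in a helper. The names MOBILE_RE and is_mobile are hypothetical, the prefixes are taken from the expression as given and may not reflect current number allocations, and the character classes are written without `|` because inside `[...]` a `|` is a literal character, as the last two lines demonstrate.

```python
import re

# Assumption: prefixes copied from the first expression above, written with
# plain character classes ([57], [0-35-9]) instead of [5|7], [0|1|2|...].
MOBILE_RE = re.compile(r"^(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}$")

def is_mobile(number):
    """Return True if number looks like an 11-digit mobile number."""
    return MOBILE_RE.match(number) is not None

print(is_mobile("13812345678"))   # True
print(is_mobile("15412345678"))   # False (154 is not an accepted prefix)
print(is_mobile("1381234567"))    # False (only 10 digits)

# Why the '|' should not appear inside a character class:
print(re.fullmatch(r"14[5|7]", "14|") is not None)   # True, '|' is matched literally
print(re.fullmatch(r"14[57]", "14|") is not None)    # False
```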
- if (!s.match(/^[a-zA-Z]+:\/\//))
- {
-     s = 'http://' + s;
- }
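The JavaScript check above prepends http:// when a string has no URL scheme. A rough Python equivalent could look like this; the names SCHEME_RE and ensure_scheme and the default parameter are assumptions for illustration.

```python
import re

# One or more letters followed by "://" at the start of the string
SCHEME_RE = re.compile(r"^[a-zA-Z]+://")

def ensure_scheme(url, default="http://"):
    """Prepend a default scheme if the URL does not already start with one."""
    if not SCHEME_RE.match(url):
        return default + url
    return url

print(ensure_scheme("example.com"))           # http://example.com
print(ensure_scheme("https://example.com"))   # https://example.com
print(ensure_scheme("ftp://host/file"))       # ftp://host/file
```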
- # -*- coding: utf-8 -*-
- __author__ = 'sun'
- __date__ = '2019/7/03 3:24 PM'
- import re
- def check_card_isvalid(card_str):
-     # 6-digit region code, 8-digit birth date (YYYYMMDD), 3-digit sequence number, check character
-     p = re.compile(r"^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$")
-     return p.match(card_str)
- card_str = "422101198808100412"
- res = check_card_isvalid(card_str)
- print(res)  # a match object if the format is valid, otherwise None
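The ID-number pattern above only checks the shape of the string, so a date such as February 30 still passes. One possible refinement, sketched here with hypothetical names (ID_RE, parse_birth_date), splits the same pattern into named groups and lets datetime.date reject impossible dates. It does not verify the check character.

```python
import re
from datetime import date

# Same structure as the pattern above, split into named groups (names are illustrative)
ID_RE = re.compile(
    r"^(?P<region>[1-9]\d{5})"
    r"(?P<year>[12]\d{3})(?P<month>0[1-9]|1[012])(?P<day>0[1-9]|[12][0-9]|3[01])"
    r"(?P<seq>\d{3})(?P<check>[0-9xX])$"
)

def parse_birth_date(card_str):
    """Return the birth date encoded in the ID string, or None if it is invalid."""
    m = ID_RE.match(card_str)
    if m is None:
        return None
    try:
        return date(int(m["year"]), int(m["month"]), int(m["day"]))
    except ValueError:  # the regex accepted an impossible date, e.g. February 30
        return None

print(parse_birth_date("422101198808100412"))   # 1988-08-10
print(parse_birth_date("422101198802300412"))   # None
print(parse_birth_date("12345"))                # None
```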
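- '''
- Regular expression matching
- '''
- import re
- regstr = ("[DEBUG][2018-09-10 09:10:34][192.169.11.34][function1]"
-           "[this is our log file, has error]")
- # [^\]]* stops at the closing bracket, and \. matches a literal dot in the IP address
- p = re.compile(r"\[(?P<log_level>[^\]]*)\]\[(?P<time_local>[^\]]*)\]"
-                r"\[(?P<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\]")
- res = p.findall(regstr)
- print(res)  # a list of (log_level, time_local, ip_address) tuples
- print(dir(res))  # attributes of the result, which is a plain list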
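Since the log pattern uses named groups, the names are only useful if the match object is kept; findall discards them and returns plain tuples. A minimal sketch using match and groupdict follows. The name LOG_RE is an assumption, and the pattern is a slightly tightened variant with escaped dots in the IP address.

```python
import re

LOG_RE = re.compile(
    r"\[(?P<log_level>[^\]]*)\]"
    r"\[(?P<time_local>[^\]]*)\]"
    r"\[(?P<ip_address>\d{1,3}(?:\.\d{1,3}){3})\]"
)

line = ("[DEBUG][2018-09-10 09:10:34][192.169.11.34][function1]"
        "[this is our log file, has error]")

m = LOG_RE.match(line)
if m:
    fields = m.groupdict()          # group name -> matched text
    print(fields["log_level"])      # DEBUG
    print(fields["time_local"])     # 2018-09-10 09:10:34
    print(m.group("ip_address"))    # 192.169.11.34
```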
Source: http://www.bubuko.com/infodetail-3112548.html