Regex is short for regular expression: a character pattern used for searching text.
Sets of matching characters
A set matches any one character from a group; the group is defined with square brackets.
import re
#match a or A
pattern = re.compile(r'[aA]', flags=re.IGNORECASE)
#note: a comma inside the brackets is a literal character, so this set also matches ','
pattern = re.compile(r'[a,A]', flags=re.IGNORECASE)
#match any single letter between a and z or A and Z; append + to match a run of any length
pattern = re.compile(r'[a-z]', flags=re.IGNORECASE)
#match an email address made of letters and digits in a .com domain
pattern = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.com')
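As a quick check of the sets above, the sketch below runs each pattern against a few made-up strings. The sample strings and the names `letter_a`, `comma_set`, `letters` and `email` are illustrative only. It also shows that a comma inside the brackets is matched literally.

```python
import re

# A set matches exactly one character from the group.
letter_a = re.compile(r'[aA]')
print(letter_a.findall('Banana A'))       # ['a', 'a', 'a', 'A']

# A comma inside brackets is a literal comma, not a separator.
comma_set = re.compile(r'[a,A]')
print(comma_set.findall('a,b'))           # ['a', ',']

# A range plus the + quantifier matches whole runs of letters.
letters = re.compile(r'[a-z]+', flags=re.IGNORECASE)
print(letters.findall('Hello World 42'))  # ['Hello', 'World']

# Letters and digits, a single @, letters, then ".com".
email = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.com')
print(bool(email.fullmatch('user42@example.com')))   # True
print(bool(email.fullmatch('user42@@example.com')))  # False
```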
negation
NOT operation: a leading ^ inside the brackets matches any character except the ones listed.
pattern = re.compile(r'[^a-z]', flags=re.IGNORECASE)
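A small sketch of how the negated set behaves, assuming a made-up sample string: with IGNORECASE, `[^a-z]` picks out everything that is not a letter, which makes it handy for stripping such characters.

```python
import re

# [^a-z] with IGNORECASE matches any character that is NOT a letter.
non_letters = re.compile(r'[^a-z]', flags=re.IGNORECASE)

sample = 'Room 101, Block-B'   # illustrative input
print(non_letters.findall(sample))   # [' ', '1', '0', '1', ',', ' ', '-']

# Replacing every non-letter with '' keeps only the letters.
letters_only = non_letters.sub('', sample)
print(letters_only)                  # RoomBlockB
```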
special characters and shortcuts
\w - any single letter, digit or underscore
\W - matches anything not covered by \w
\d - matches numerical digits 0-9
\D - matches any non-digit character (letters, punctuation, whitespace)
\s - matches whitespace (spaces, tabs, newlines)
\S - matches non-whitespace
\n - matches new lines
\r - matches carriage returns
\t - matches tabs
. - any single character except the newline character
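The shortcuts above can be tried out on a short string. This is a minimal sketch; the sample `text` is made up for illustration.

```python
import re

text = 'Order 66:\tship 3 boxes_now\n'   # illustrative input

print(re.findall(r'\d+', text))  # ['66', '3']
print(re.findall(r'\w+', text))  # ['Order', '66', 'ship', '3', 'boxes_now']
print(re.findall(r'\s', text))   # [' ', '\t', ' ', ' ', '\n']

# The dot needs one more character after 'now', but only a newline is left.
print(re.search(r'now.', text))  # None
```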
match quantities
* - zero or more
+ - one or more
? - zero or one
{n} - exactly n occurrences
{n,} - n or more occurrences
{n,m} - between n and m occurrences
{m,n}? - m to n occurrences, matched in a non-greedy fashion (as few as possible)
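To make the quantifiers concrete, here is a short sketch on made-up inputs. The last two lines contrast the greedy and non-greedy forms of the same range.

```python
import re

print(re.findall(r'ab*', 'a ab abbb'))         # ['a', 'ab', 'abbb']
print(re.findall(r'ab+', 'a ab abbb'))         # ['ab', 'abbb']
print(re.findall(r'colou?r', 'color colour'))  # ['color', 'colour']
print(re.findall(r'\d{3}', '12 345 6789'))     # ['345', '678']
print(re.findall(r'\d{2,}', '1 22 333'))       # ['22', '333']

# Greedy takes as many digits as allowed; non-greedy takes as few as allowed.
print(re.findall(r'\d{2,4}', '123456'))        # ['1234', '56']
print(re.findall(r'\d{2,4}?', '123456'))       # ['12', '34', '56']
```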
regex built-in methods
- match tries the pattern at one position only (the start of the string, or pos if given)
prog = re.compile(r'ing')
words = ['Spring', 'cycling', 'Ringtone']
for w in words:
    #if it matches, match() returns a match object, otherwise None
    if prog.match(w, pos=len(w) - 3) is not None:
        print('last three letters are ing')
- search returns the first match object; apply the group() method on the object to get the matched string
match_obj = prog.search(w)
start = match_obj.span()[0]
end = match_obj.span()[1]
matched_string = match_obj.group()
- findall returns a list of all the matching strings
- finditer returns an iterator over the match objects
- split is used to get rid of extraneous characters that clutter a regular sentence
#replace runs of ';', ',', whitespace and '_' in text with a single space
" ".join(re.split(r'[;,\s_]+', text))
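The methods above can be compared side by side on one string. A minimal sketch, assuming a made-up `sentence`; the offsets in the comments refer to that string only.

```python
import re

prog = re.compile(r'ing')
sentence = 'Spring cycling; ringtone_singing'   # illustrative input

# match() only tries at the start position (here, position 0).
print(prog.match(sentence))                     # None

# search() scans forward and returns the first match object.
first = prog.search(sentence)
print(first.span(), first.group())              # (3, 6) ing

# findall() returns the matched strings; finditer() yields match objects.
print(prog.findall(sentence))                   # ['ing', 'ing', 'ing', 'ing', 'ing']
starts = [m.start() for m in prog.finditer(sentence)]
print(starts)                                   # [3, 11, 17, 26, 29]

# split() cuts the string on every run of separator characters.
cleaned = ' '.join(re.split(r'[;,\s_]+', sentence))
print(cleaned)                                  # Spring cycling ringtone singing
```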
matching string
- ^ (caret) matches a pattern at the beginning of a string, not anywhere else
pattern = re.compile(r'^Com')
- $ (dollar sign) matches a pattern at the end of the string
pattern = re.compile(r'ing$')
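A brief sketch of the two anchors in use. The test words and the names `starts_with_com` and `ends_with_ing` are made up for illustration.

```python
import re

starts_with_com = re.compile(r'^Com')
ends_with_ing = re.compile(r'ing$')

print(bool(starts_with_com.search('Compile')))     # True
print(bool(starts_with_com.search('re.Compile')))  # False, 'Com' is not at the start
print(bool(ends_with_ing.search('matching')))      # True
print(bool(ends_with_ing.search('ringtone')))      # False, 'ing' is not at the end
```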
others
combining multiple patterns
#p0 and p1 are two pattern strings, not yet compiled
pattern = p0 + '|' + p1
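As an illustration of joining patterns with `|`, the sketch below assumes two hypothetical date formats for `p0` and `p1`. It also wraps the alternation in a non-capturing group before adding anchors, since `|` binds more loosely than everything around it.

```python
import re

p0 = r'\d{4}-\d{2}-\d{2}'   # hypothetical pattern: 2024-01-31 style date
p1 = r'\d{2}/\d{2}/\d{4}'   # hypothetical pattern: 31/01/2024 style date

pattern = re.compile(p0 + '|' + p1)
found = pattern.findall('Due 2024-01-31 or 31/01/2024')
print(found)   # ['2024-01-31', '31/01/2024']

# Group the alternatives so that ^ and $ apply to both of them.
anchored = re.compile(r'^(?:' + p0 + '|' + p1 + r')$')
print(bool(anchored.search('31/01/2024')))     # True
print(bool(anchored.search('on 31/01/2024')))  # False
```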
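Look-ahead/behind Assertions - (?<= ) | (?= ) | (?<! ) | (?! )
(?<= ): lookbehind assertion, the text before the match must fit the pattern
(?= ): lookahead assertion, the text after the match must fit the pattern
(?<! ): negative lookbehind assertion, the text before the match must not fit the pattern
(?! ): negative lookahead assertion, the text after the match must not fit the pattern
#search for the username in the HTML of a link to James Briggs, held in the string a
#the username starts with @, follows a slash and is followed by ?source
match_obj = re.search(r'(?<=/)@.*(?=\?source)', a)
if match_obj:
    username = match_obj.group()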
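To show all four assertions at work, here is a self-contained sketch. The link markup in `a` is a hypothetical stand-in invented for this example, not the HTML the notes refer to, and the other sample strings are made up as well.

```python
import re

# Hypothetical link markup, made up for this sketch.
a = '<a href="/@example_user?source=post_page">Example User</a>'

# (?<=/) requires a slash just before; (?=\?source) requires "?source" just after.
match_obj = re.search(r'(?<=/)@.*(?=\?source)', a)
username = match_obj.group()
print(username)        # @example_user

# Negative lookbehind: whole numbers that are NOT preceded by a dollar sign.
not_prices = re.findall(r'(?<!\$)\b\d+\b', 'paid $30 for 2 items')
print(not_prices)      # ['2']

# Negative lookahead: whole numbers that are NOT followed by a percent sign.
not_percent = re.findall(r'\b\d+\b(?!%)', '50% of 80')
print(not_percent)     # ['80']
```

The assertions only check the surrounding text; none of it becomes part of the match, which is why the slash and `?source` are absent from `username`.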
Modifiers - (?sm)
Single line [s] - Allows the . metacharacter (which matches everything except newlines) to match newlines too
Multi-line [m] - ^ and $ match the beginning/end of each line, rather than the default behavior of matching the beginning/end of the entire string
Insensitive [i] - Upper and lower-case characters are matched interchangeably, e.g. A = a
Extended [x] - Ignores whitespace in the pattern. To include a space, escape it with \. Also allows comments inside the regex with #
ASCII [a] - Match to ASCII-only characters, rather than the full Unicode character set
#Adding both the single line and insensitive modifiers using modifier flags
re.match('[a-z]+01.*', text, re.S|re.I)
#or adding inline modifier with the (? ) syntax within the expression
#some regex flavors turn a modifier off with (?- ); Python (3.6+) supports this only in the scoped form, e.g. (?-i:...)
re.match('(?si)[a-z]+01.*', text)
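The effect of each modifier is easiest to see by switching it on and off. A rough sketch with a made-up `text`; the scoped `(?-i:...)` form assumes Python 3.6 or later.

```python
import re

text = 'ABC01 first line\nsecond line'   # illustrative input

# Without re.I, [a-z]+ cannot match 'ABC'.
print(re.match(r'[a-z]+01.*', text))                 # None
# With re.I only, the dot still stops at the newline.
print(re.match(r'[a-z]+01.*', text, re.I).group())   # ABC01 first line
# With re.S as well, the dot runs across the newline to the end.
print(re.match(r'[a-z]+01.*', text, re.S | re.I).group() == text)  # True
print(re.match(r'(?si)[a-z]+01.*', text).group() == text)          # True

# re.M lets ^ match at the start of every line.
print(re.findall(r'^\w+', text))         # ['ABC01']
print(re.findall(r'^\w+', text, re.M))   # ['ABC01', 'second']

# re.X ignores pattern whitespace and allows comments; escape a literal space.
verbose = re.compile(r'''
    [a-z]+   # one or more letters
    \        # an escaped, literal space
    \d{2}    # two digits
''', re.X)
print(verbose.search('room 42').group())  # room 42

# Scoped flag removal: 'a' is case-insensitive, 'b' is not.
print(re.findall(r'a(?-i:b)', 'AB Ab ab', re.I))  # ['Ab', 'ab']
```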
Conditionals (If|Else) - (…)?(?(1)True|False)
#if group 1 (hello) matched, expect " world", otherwise expect " bye"
for t in text:
    print(re.search(r"(hello)?(?(1) world| bye)!", t))
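A runnable sketch of the conditional, assuming a few made-up sample strings. It uses `fullmatch` rather than `search` so that each string is judged as a whole.

```python
import re

# If group 1 ("hello") took part in the match, " world" must follow;
# otherwise " bye" must follow.
pattern = re.compile(r'(hello)?(?(1) world| bye)!')

samples = ['hello world!', ' bye!', 'hello bye!', ' world!']   # illustrative
results = {s: bool(pattern.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(repr(s), '->', ok)
# 'hello world!' -> True
# ' bye!' -> True
# 'hello bye!' -> False
# ' world!' -> False
```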
third-party library
Pampy
1) HEAD and TAIL
from pampy import match, HEAD, TAIL, _
x = [-1, -2, -3, 0, 1, 2, 3]
print(match(x, [-1, TAIL], lambda t: [-1, tuple(t)]))
# => [-1, (-2, -3, 0, 1, 2, 3)]
2) match dict
from pampy import match, HEAD, TAIL, _
my_dict = {
    'global_setting': [1, 3, 3],
    'user_setting': {
        'face': ['beautiful', 'ugly'],
        'mind': ['smart', 'stupid']
    }
}
result = match(my_dict, { _: {'face': _}}, lambda key, son_value: (key, son_value))
print(result)
# => ('user_setting', ['beautiful', 'ugly'])
3) match regex
import re
from pampy import match, HEAD, TAIL, _
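def what_is(pet):
    return match(
        pet,
        #group 1 is the text before the comma, group 2 is the first character after it
        re.compile(r'(\w+),(\w)\w+鳕鱼$'),
        lambda mygod, you: you + "像鳕鱼"
    )
#the input means "My god, you really look like a cod"
print(what_is('我的天,你长得真像鳕鱼'))
# => '你像鳕鱼' ("you look like a cod")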