当前位置：

首页
/
IT
/
程序
/
Python
/
Pandas 高级教程之: 处理 text 数据

Pandas 高级教程之: 处理 text 数据

简介

在 1.0 之前, 只有一种形式来存储 text 数据, 那就是 object. 在 1.0 之后, 添加了一个新的数据类型叫做 StringDtype . 今天将会给大家讲解 Pandas 中 text 中的那些事.

创建 text 的 DF

先看下常见的使用 text 来构建 DF 的例子:

In [1]: pd.Series(['a', 'b', 'c'])
Out[1]:
0    a
1    b
2    c
dtype: object

如果要使用新的 StringDtype, 可以这样:

In [2]: pd.Series(['a', 'b', 'c'], dtype="string")
Out[2]:
0    a
1    b
2    c
dtype: string
In [3]: pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
Out[3]:
0    a
1    b
2    c
dtype: string

或者使用 astype 进行转换:

In [4]: s = pd.Series(['a', 'b', 'c'])
In [5]: s
Out[5]:
0    a
1    b
2    c
dtype: object
In [6]: s.astype("string")
Out[6]:
0    a
1    b
2    c
dtype: string

String 的方法

String 可以转换成大写, 小写和统计它的长度:

In [24]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....:
In [25]: s.str.lower()
Out[25]:
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string
In [26]: s.str.upper()
Out[26]:
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string
In [27]: s.str.len()
Out[27]:
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

还可以进行 trip 操作:

In [28]: idx = pd.Index(['jack', 'jill', 'jesse', 'frank'])
In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [31]: idx.str.rstrip()
Out[31]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

columns 的 String 操作

因为 columns 是 String 表示的, 所以可以按照普通的 String 方式来操作 columns:

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')
In [35]: df.columns.str.lower()
Out[35]: Index(['column a', 'column b'], dtype='object')
In [32]: df = pd.DataFrame(np.random.randn(3, 2),
   ....:                   columns=['Column A', 'Column B'], index=range(3))
   ....:
In [33]: df
Out[33]:
    Column A    Column B
0    0.469112   -0.282863
1   -1.509059   -1.135632
2    1.212112   -0.173215

分割和替换 String

Split 可以将一个 String 切分成一个数组.

In [38]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
In [39]: s2.str.split('_')
Out[39]:
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

要想访问 split 之后数组中的字符, 可以这样:

In [40]: s2.str.split('_').str.get(1)
Out[40]:
0       b
1       d
2    <NA>
3       g
dtype: object
In [41]: s2.str.split('_').str[1]
Out[41]:
0       b
1       d
2    <NA>
3       g
dtype: object

使用 expand=True 可以将 split 过后的数组扩展成为多列:

In [42]: s2.str.split('_', expand=True)
Out[42]:
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

可以指定分割列的个数:

In [43]: s2.str.split('_', expand=True, n=1)
Out[43]:
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

replace 用来进行字符的替换, 在替换过程中还可以使用正则表达式:

s3.str.replace('^.a|dog', 'XX-XX', case=False)

String 的连接

使用 cat 可以连接 String:

In [64]: s = pd.Series(['a', 'b', 'c', 'd'], dtype="string")
In [65]: s.str.cat(sep=',')
Out[65]: 'a,b,c,d'

使用 .str 来 index

pd.Series 会返回一个 Series, 如果 Series 中是字符串的话, 可通过 index 来访问列的字符, 举个例子:

In [99]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
   ....:                'CABA', 'dog', 'cat'],
   ....:               dtype="string")
   ....:
In [100]: s.str[0]
Out[100]:
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string
In [101]: s.str[1]
Out[101]:
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
extract

Extract 用来从 String 中解压数据, 它接收一个 expand 参数, 在 0.23 版本之前, 这个参数默认是 False. 如果是 false,extract 会返回 Series,index 或者 DF . 如果 expand=true, 那么会返回 DF.0.23 版本之后, 默认是 true.

extract 通常是和正则表达式一起使用的.

In [102]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'([ab])(\d)', expand=False)
   .....:
Out[102]:
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

上面的例子将 Series 中的每一字符串都按照正则表达式来进行分解. 前面一部分是字符, 后面一部分是数字.

注意, 只有正则表达式中 group 的数据才会被 extract .

下面的就只会 extract 数字:

In [106]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'[ab](\d)', expand=False)
   .....:
Out[106]:
0       1
1       2
2    <NA>
dtype: string

还可以指定列的名字如下:

In [103]: pd.Series(['a1', 'b2', 'c3'],
   .....:           dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)',
   .....:                                       expand=False)
   .....:
Out[103]:
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>
extractall

和 extract 相似的还有 extractall, 不同的是 extract 只会匹配第一次, 而 extractall 会做所有的匹配, 举个例子:

In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
   .....:               dtype="string")
   .....:
In [113]: s
Out[113]:
A    a1a2
B      b1
C      c1
dtype: string
In [114]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
In [115]: s.str.extract(two_groups, expand=True)
Out[115]:
  letter digit
A      a     1
B      b     1
C      c     1

extract 匹配到 a1 之后就不会继续了.

In [116]: s.str.extractall(two_groups)
Out[116]:
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

extractall 匹配了 a1 之后还会匹配 a2.

contains 和 match

contains 和 match 用来测试 DF 中是否含有特定的数据:

In [127]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.contains(pattern)
   .....:
Out[127]:
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean
In [128]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.match(pattern)
   .....:
Out[128]:
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean
In [129]: pd.Series(['1', '2', '3a', '3b', '03c', '4dx'],
   .....:           dtype="string").str.fullmatch(pattern)
   .....:
Out[129]:
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

String 方法总结

最后总结一下 String 的方法:

Method	Description
cat()	Concatenate strings
split()	Split strings on delimiter
rsplit()	Split strings on delimiter working from the end of the string
get()	Index into each element (retrieve i-th element)
join()	Join strings in each element of the Series with passed separator
get_dummies()	Split strings on the delimiter returning DataFrame of dummy variables
contains()	Return boolean array if each string contains pattern/regex
replace()	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
repeat()	Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()	Add whitespace to left, right, or both sides of strings
center()	Equivalent to str.center
ljust()	Equivalent to str.ljust
rjust()	Equivalent to str.rjust
zfill()	Equivalent to str.zfill
wrap()	Split long strings into lines with length less than a given width
slice()	Slice each string in the Series
slice_replace()	Replace slice in each string with passed value
count()	Count occurrences of pattern
startswith()	Equivalent to str.startswith(pat) for each element
endswith()	Equivalent to str.endswith(pat) for each element
findall()	Compute list of all occurrences of pattern/regex for each string
match()	Call re.match on each element, returning matched groups as list
extract()	Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()	Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()	Compute string lengths
strip()	Equivalent to str.strip
rstrip()	Equivalent to str.rstrip
lstrip()	Equivalent to str.lstrip
partition()	Equivalent to str.partition
rpartition()	Equivalent to str.rpartition
lower()	Equivalent to str.lower
casefold()	Equivalent to str.casefold
upper()	Equivalent to str.upper
find()	Equivalent to str.find
rfind()	Equivalent to str.rfind
index()	Equivalent to str.index
rindex()	Equivalent to str.rindex
capitalize()	Equivalent to str.capitalize
swapcase()	Equivalent to str.swapcase
normalize()	Return Unicode normal form. Equivalent to unicodedata.normalize
translate()	Equivalent to str.translate
isalnum()	Equivalent to str.isalnum
isalpha()	Equivalent to str.isalpha
isdigit()	Equivalent to str.isdigit
isspace()	Equivalent to str.isspace
islower()	Equivalent to str.islower
isupper()	Equivalent to str.isupper
istitle()	Equivalent to str.istitle
isnumeric()	Equivalent to str.isnumeric
isdecimal()	Equivalent to str.isdecimal

本文已收录于 http://www.flydean.com/06-python-pandas-text/

最通俗的解读, 最深刻的干货, 最简洁的教程, 众多你不知道的小技巧等你来发现!

来源: https://segmentfault.com/a/1190000040223153

与本文相关文章

暂无,快来抢沙发吧！