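Iteration and Batch Processing in pandas

When working with data in pandas, iterating over rows or columns and applying a function to every element are common operations. This article summarizes the main ways to do both.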

1. Preparing sample data

import numpy as np
import pandas as pd

# Random scores in [40, 100) for five students across ten subjects
df = pd.DataFrame(np.random.randint(40, 100, (5, 10)), columns=[f's{i}' for i in range(10)], index=['john', 'bob', 'mike', 'bill', 'lisa'])
# A student passes when the s9 score is above 60
df['is_passed'] = df.s9.map(lambda x: x > 60)

Printing df gives the following (the scores are random, so your values will differ):

      s0  s1  s2  s3  s4  s5  s6  s7  s8  s9  is_passed
john  56  70  85  91  92  80  63  81  45  57      False
bob   99  93  80  42  91  81  53  75  61  78       True
mike  76  92  76  80  57  98  94  79  87  94       True
bill  81  83  92  91  51  55  40  77  96  90       True
lisa  85  82  56  57  54  56  49  43  99  51      False

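2. Iteration

pandas provides three methods for iterating over a DataFrame:

2.1. iterrows

Iterates row by row, yielding each row of the DataFrame as an (index, Series) pair. Elements can be accessed as row[col] or row.col, where col is a column name.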

>>> for index, row in df.iterrows():
...     print(row['s0'])  # row.s0 also works
56
99
76
81
85
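
One caveat worth knowing: because iterrows packs each row into a single Series, the row takes on one common dtype, so values may be upcast. The following is a minimal sketch of that behavior; the frame `demo` and its column names are made up for illustration and are not the `df` used in this article.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
demo = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5]})

# iterrows builds one Series per row, so the integer is upcast to float.
_, first_row = next(demo.iterrows())
print(first_row['count'], first_row.dtype)

# The column dtypes of the frame itself are unchanged.
print(demo.dtypes)
```

If the original dtypes matter, itertuples is usually the safer choice.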

2.2. itertuples

Iterates row by row, yielding each row of the DataFrame as a named tuple. Elements can be accessed as row.col, and it is generally faster than iterrows.

>>> for row in df.itertuples():
...     print(row.s0)
56
99
76
81
85
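
As a small illustration of what itertuples yields, the sketch below uses a cut-down, made-up frame (`scores`) rather than the random `df` above. It shows that the row label is exposed as the `Index` field, and that the `index` and `name` parameters change the shape of each tuple.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
scores = pd.DataFrame({'s0': [56, 99], 's1': [70, 93]}, index=['john', 'bob'])

# By default the index is the first field, available as row.Index.
labelled = [(row.Index, row.s0) for row in scores.itertuples()]

# index=False drops the index; name=None yields plain tuples instead of named tuples.
plain = list(scores.itertuples(index=False, name=None))

print(labelled)
print(plain)
```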

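2.3. items (formerly iteritems)

Iterates column by column, yielding each column of the DataFrame as a (column name, Series) pair. Elements of the column can be accessed by position with col.iloc[i] or by label with col[label]. iteritems was deprecated in pandas 1.5 and removed in 2.0; use items instead.

>>> for name, col in df.items():
...     print(col.iloc[0])
56
70
85
91
92
80
63
81
45
57
False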
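
Column-wise iteration is handy when each column needs its own summary. The sketch below, on a made-up frame (`scores`), collects the label of the highest scorer per column; the dictionary name `top_scorer` is only illustrative.

```python
import pandas as pd

# Hypothetical frame for illustration.
scores = pd.DataFrame(
    {'s0': [56, 99, 76], 's1': [70, 93, 92]},
    index=['john', 'bob', 'mike'],
)

# items() yields (column name, Series) pairs, one per column.
top_scorer = {}
for col_name, column in scores.items():
    # idxmax returns the index label of the largest value in the column.
    top_scorer[col_name] = column.idxmax()

print(top_scorer)
```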

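3. Batch processing

3.1. The map method

Like Python's built-in map(), the pandas Series.map() method takes a function, a dictionary, or any other callable that accepts a single input value, and applies it to each element of a single column one by one. map() also has an na_action parameter, similar to na.action in R, which controls how missing values are handled. It is either None (the default) or 'ignore'; with 'ignore', NaN values are passed through unchanged instead of being handed to the function.

For example, to replace True with 1 and False with 0 in the is_passed column, any of the following approaches works: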

3.1.1. Dictionary mapping

>>> # Define the mapping dictionary
... score_map = {True: 1, False: 0}
>>> # Use map() to get the mapped version of the is_passed column
... df.is_passed.map(score_map)
john    0
bob     1
mike    1
bill    1
lisa    0
Name: is_passed, dtype: int64

3.1.2. `lambda` function

>>> # Same approach as when the column was created
... df.is_passed.map(lambda x: 1 if x else 0)
john    0
bob     1
mike    1
bill    1
lisa    0
Name: is_passed, dtype: int64

3.1.3. Regular function

>>> def bool_to_num(x):
...     return 1 if x else 0
...
>>> df.is_passed.map(bool_to_num)

3.1.4. Other callables

Any callable that takes a single input value and returns an output can also be passed to map(), for example a bound string method:

>>> df.is_passed.map('is passed: {}'.format)
john    is passed: False
bob      is passed: True
mike     is passed: True
bill     is passed: True
lisa    is passed: False
Name: is_passed, dtype: object
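
The na_action parameter described at the start of this section is easiest to understand side by side. The following is a minimal sketch on a made-up Series (`passed`) that contains one missing value; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value.
passed = pd.Series([True, np.nan, False], index=['john', 'bob', 'mike'])

# Default (na_action=None): the function is also called on NaN.
# NaN is truthy in Python, so the missing value is silently mapped to 1.
with_default = passed.map(lambda x: 1 if x else 0)

# na_action='ignore': NaN is passed through without calling the function.
with_ignore = passed.map(lambda x: 1 if x else 0, na_action='ignore')

print(with_default)
print(with_ignore)
```

With the default, the missing value turns into a misleading 1, which is why 'ignore' is often the safer setting when a column may contain NaN.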

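3.2. The apply method

apply() is used much like map(): its main argument is a function that takes an input and returns an output. The difference is that map() only works on a single Series, whereas one apply() call can operate on a single column or on several columns at once, which covers a wide range of use cases. Each case is described below.

3.2.1. Single column

Passing a lambda function:

df.is_passed.apply(lambda x: 1 if x else 0)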

3.2.2. Multiple input columns

With axis=1, apply() passes each row to the function as a Series, so several columns can be read at once:

>>> def gen_describe(s9, is_passed):
...     return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"
...
>>> df.apply(lambda r: gen_describe(r['s9'], r['is_passed']), axis=1)
john    s9's score is 57, so failed
bob     s9's score is 78, so passed
mike    s9's score is 94, so passed
bill    s9's score is 90, so passed
lisa    s9's score is 51, so failed
dtype: object
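
When only a couple of columns are needed, the same result can be built without constructing a Series per row. The sketch below is one possible alternative, not the approach used in this article: it zips the two columns and rebuilds a Series on the original index. The frame `scores` is a made-up subset for illustration.

```python
import pandas as pd

# Hypothetical two-row subset of the data.
scores = pd.DataFrame(
    {'s9': [57, 78], 'is_passed': [False, True]},
    index=['john', 'bob'],
)

def gen_describe(s9, is_passed):
    return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"

# Iterate over the two columns in lockstep and keep the original row labels.
described = pd.Series(
    [gen_describe(s, p) for s, p in zip(scores['s9'], scores['is_passed'])],
    index=scores.index,
)

print(described)
```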

3.2.3. Multiple output values

Returning a tuple from the function yields a Series whose elements are tuples:

>>> df.apply(lambda row: (row['s9'], row['s8']), axis=1)
john    (57, 45)
bob     (78, 61)
mike    (94, 87)
bill    (90, 96)
lisa    (51, 99)
dtype: object
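
To get real columns instead of a Series of tuples, apply() can expand the returned values. The following is a minimal sketch on a made-up frame (`scores`); the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical two-row subset of the data.
scores = pd.DataFrame({'s8': [45, 61], 's9': [57, 78]}, index=['john', 'bob'])

# result_type='expand' turns each returned tuple into separate columns.
expanded = scores.apply(
    lambda row: (row['s9'], row['s8']), axis=1, result_type='expand'
)
expanded.columns = ['s9_copy', 's8_copy']  # hypothetical column names

# Returning a Series from the function gives named columns directly.
named = scores.apply(
    lambda row: pd.Series({'best': row.max(), 'worst': row.min()}), axis=1
)

print(expanded)
print(named)
```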

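3.3. The applymap method

applymap is the DataFrame counterpart of Series.map. It takes a function and returns the corresponding results, but unlike Series.map it accepts only a callable, not a dictionary. The function is applied to every element of the DataFrame. Since pandas 2.1, applymap is deprecated in favor of DataFrame.map. For example, to raise every score in df that is below 50 up to 50: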

>>> def at_least_get_50(x):
...     # bool is a subclass of int, so exclude it to leave is_passed untouched
...     if isinstance(x, int) and not isinstance(x, bool) and x < 50:
...         return 50
...     return x
...
>>> df.applymap(at_least_get_50)  # on pandas >= 2.1, use df.map(at_least_get_50)
      s0  s1  s2  s3  s4  s5  s6  s7  s8  s9  is_passed
john  56  70  85  91  92  80  63  81  50  57      False
bob   99  93  80  50  91  81  53  75  61  78       True
mike  76  92  76  80  57  98  94  79  87  94       True
bill  81  83  92  91  51  55  50  77  96  90       True
lisa  85  82  56  57  54  56  50  50  99  51      False
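
For a simple floor like this one, a vectorized call is usually an option as well. The sketch below is an alternative to the element-wise function above, not a replacement the article prescribes: it clips only the numeric columns of a made-up frame (`scores`) and leaves the boolean column alone.

```python
import pandas as pd

# Hypothetical two-row subset of the data.
scores = pd.DataFrame(
    {'s0': [56, 42], 's1': [40, 93], 'is_passed': [False, True]},
    index=['john', 'bob'],
)

# Select only the numeric score columns; bool columns are not matched by 'number'.
score_cols = scores.select_dtypes(include='number').columns

# clip(lower=50) raises every value below 50 up to 50 in one vectorized call.
floored = scores.copy()
floored[score_cols] = scores[score_cols].clip(lower=50)

print(floored)
```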

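Appendix: adding a progress bar to apply with tqdm

When processing a large amount of data in Jupyter, you are often left waiting after running a cell until it either fails or finishes. tqdm shows the progress in real time. Install it first with pip install tqdm. A usage example: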

from tqdm import tqdm

def gen_describe(s9, is_passed):
    return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"

# Register progress_apply on pandas objects; desc sets the progress bar label
tqdm.pandas(desc='apply')
df.progress_apply(lambda r: gen_describe(r['s9'], r['is_passed']), axis=1)

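Reference

（数据科学学习手札69）详解pandas中的map、apply、applymap、groupby、agg (Data Science Study Notes 69: a detailed guide to map, apply, applymap, groupby, and agg in pandas)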