Advanced Python: A Deeper Look at Iterators and Generators

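Lists, tuples, dictionaries, and sets are all containers. You can think of a container as a unit that holds multiple elements together; what distinguishes one container from another is the data structure used to implement it internally.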

All of these containers are iterable. Strings can be iterated over as well.

An analogy for iterators

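Think of iteration as buying apples from a seller who never tells you how many are in stock. Each time, you ask for one apple, and the seller does one of two things: hands you an apple, or tells you the apples are sold out. You never need to know how the apples are arranged in the warehouse.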

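Strictly speaking, an iterator provides a __next__ method, normally called through the built-in next(), which hands you the elements one at a time with no repeats and no omissions. Each call either returns the next object in the container or raises a StopIteration exception (the apples are sold out).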
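As a minimal sketch of the protocol described above, the built-in iter() and next() can be called by hand, which is roughly what a for loop does for you. The names apples, seller, and received are made up for this illustration.

```python
# Walk through a small list by hand, one element per request.
apples = ['apple 1', 'apple 2', 'apple 3']
seller = iter(apples)                    # ask the container for an iterator

received = []
while True:
    try:
        received.append(next(seller))    # "one apple, please"
    except StopIteration:                # "sold out"
        break

print(received)  # ['apple 1', 'apple 2', 'apple 3']
```

Once the iterator is exhausted it stays exhausted; to walk the list again you have to call iter(apples) again.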

Example: checking whether an object is iterable

from collections.abc import Iterable

params = [
    1234,
    '1234',
    [1, 2, 3, 4],
    set([1, 2, 3, 4]),
    {1:1, 2:2, 3:3, 4:4},
    (1, 2, 3, 4)
]

for param in params:
    print('{} is iterable? {}'.format(param, isinstance(param, Iterable)))

# Output
# 1234 is iterable? False
# 1234 is iterable? True
# [1, 2, 3, 4] is iterable? True
# {1, 2, 3, 4} is iterable? True
# {1: 1, 2: 2, 3: 3, 4: 4} is iterable? True
# (1, 2, 3, 4) is iterable? True
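Being iterable is not the same as being an iterator. The following sketch, using the made-up names numbers and numbers_iter, shows that a list is iterable but has no __next__ of its own, while the object returned by iter() is both.

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3, 4]
numbers_iter = iter(numbers)   # the iterator that actually walks the list

# A list can be iterated over, but it is not itself an iterator.
print(isinstance(numbers, Iterable), isinstance(numbers, Iterator))            # True False

# The object returned by iter() is both iterable and an iterator.
print(isinstance(numbers_iter, Iterable), isinstance(numbers_iter, Iterator))  # True True
```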

An analogy for generators

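Think of a generator as buying apples from a seller who keeps no stock at all. Each time you ask for one apple, the seller either produces one on the spot (very quickly) and hands it to you, or tells you that there are no more apples.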

A generator is a lazy version of an iterator.
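To make "lazy" concrete, here is a small sketch in which the generator records every apple at the moment it is produced. The function make_apples and the list log are hypothetical names used only for this illustration.

```python
def make_apples(n, log):
    """Yield n apples, noting each one in `log` at the moment it is produced."""
    for i in range(1, n + 1):
        log.append(i)
        yield 'apple {}'.format(i)

log = []
seller = make_apples(3, log)

produced_before = list(log)   # snapshot taken before anything was requested
print(produced_before)        # []

first = next(seller)          # only now does the function body start running
print(first, log)             # apple 1 [1]
```

Creating the generator runs none of the function body; each next() call runs just far enough to produce one value.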

Example: a list versus a generator (test_iterator builds the whole list up front)

import os
import psutil  # third-party package: pip install psutil
import time
import functools

def log_execution_time(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        res = func(*args, **kwargs)
        end = time.perf_counter()
        print('{} took {} ms'.format(func.__name__, (end - start) * 1000))
        return res
    return wrapper

# Show how much memory the current Python process is using
def show_memory_info(hint):
    pid = os.getpid()
    p = psutil.Process(pid)
    info = p.memory_full_info()
    memory = info.uss / 1024. / 1024
    print('{} memory used: {} MB'.format(hint, memory))

@log_execution_time
def test_iterator():
    show_memory_info('initing iterator')
    list_1 = [i for i in range(100000000)]
    show_memory_info('after iterator initiated')
    print(sum(list_1))
    show_memory_info('after sum called')

@log_execution_time
def test_generator():
    show_memory_info('initing generator')
    list_2 = (i for i in range(100000000))
    show_memory_info('after generator initiated')
    print(sum(list_2))
    show_memory_info('after sum called')

test_iterator()
print()
test_generator()

########## Output ##########
# initing iterator memory used: 10.16796875 MB
# after iterator initiated memory used: 3664.34765625 MB
# 4999999950000000
# after sum called memory used: 3664.34765625 MB
# test_iterator took 6179.794754018076 ms

# initing generator memory used: 19.140625 MB
# after generator initiated memory used: 19.14453125 MB
# 4999999950000000
# after sum called memory used: 19.171875 MB
# test_generator took 4912.561981996987 ms

A list is a finite collection held entirely in memory, whereas a generator can describe an infinite sequence.
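As a small sketch of an unbounded generator, the hypothetical function naturals below never finishes on its own; itertools.islice is used to take only as many values as we want.

```python
from itertools import islice

def naturals():
    """Yield 1, 2, 3, ... without end."""
    n = 1
    while True:
        yield n
        n += 1

# Take just the first five values; the rest are never computed.
first_five = list(islice(naturals(), 5))
print(first_five)  # [1, 2, 3, 4, 5]
```

Calling list(naturals()) directly would never return, so an unbounded generator always needs something on the consuming side that decides when to stop.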

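We do not need to keep that many values in memory at once. To sum the elements, for instance, we only need to know each element's value at the moment it is added; after that it can be thrown away.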

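This is where generators come in: the next value is produced only when next() is called. In Python, a generator expression is written with parentheses, so (i for i in range(100000000)) initializes a generator.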

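As the output shows, the generator does not take up a large amount of memory the way the list does, because each value is produced only when it is needed. Initializing the generator also does no generation work: unlike test_iterator(), test_generator() never builds and stores a list of one hundred million elements, and in this run it finished sooner (about 4.9 s versus about 6.2 s).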
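For a quick check that needs only the standard library, the sketch below compares the size of the two objects with sys.getsizeof. It assumes a much smaller n than the example above so that it runs quickly; note that sys.getsizeof reports only the size of the list object itself (its array of references), not of the integers it points to, so the real gap is larger still.

```python
import sys

n = 100000   # assumption: far smaller than the 100000000 used above

as_list = [i for i in range(n)]   # all n values exist right now
as_gen = (i for i in range(n))    # nothing has been produced yet

list_bytes = sys.getsizeof(as_list)   # grows with n
gen_bytes = sys.getsizeof(as_gen)     # small and independent of n

print(list_bytes, gen_bytes)

# Both give the same total; the generator is used up afterwards.
same_total = sum(as_list) == sum(as_gen)
print(same_total)  # True
```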

Example: numerically checking the identity (1 + 2 + 3 + ... + n)^2 = 1^3 + 2^3 + 3^3 + ... + n^3

def generator(k):
    i = 1
    while True:
        yield i ** k
        i += 1

gen_1 = generator(1)
gen_3 = generator(3)
print(gen_1)
print(gen_3)

def get_sum(n):
    sum_1, sum_3 = 0, 0
    for i in range(n):
        next_1 = next(gen_1)
        next_3 = next(gen_3)
        print('next_1 = {}, next_3 = {}'.format(next_1, next_3))
        sum_1 += next_1
        sum_3 += next_3
    print(sum_1 * sum_1, sum_3)

get_sum(8)

########## Output ##########
# <generator object generator at 0x10c30d3d0>
# <generator object generator at 0x10c6d61d0>
# next_1 = 1, next_3 = 1
# next_1 = 2, next_3 = 8
# next_1 = 3, next_3 = 27
# next_1 = 4, next_3 = 64
# next_1 = 5, next_3 = 125
# next_1 = 6, next_3 = 216
# next_1 = 7, next_3 = 343
# next_1 = 8, next_3 = 512
# 1296 1296

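The yield statement is the key to the trick. As a beginner, you can think of it this way: when the function reaches this line, execution pauses and jumps out. Jumps out to where? Back to the next() call. And what is i ** k for? It becomes the return value of next(). Each time next(gen) is called, the paused function comes back to life and continues from the yield onward; note that the local variable i is not cleared, but keeps incrementing. We can see next_1 go from 1 to 8 and next_3 from 1 to 512.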
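The pause-and-resume behaviour can be made visible with a small trace. This is only a sketch: the function counter and the list events are hypothetical names, and the list simply records what the generator body does and when.

```python
def counter(events):
    """Yield 1, 2, 3, ... and record each step of the body in `events`."""
    events.append('started')
    i = 1
    while True:
        events.append('about to yield {}'.format(i))
        yield i                    # pause here and hand i back to next()
        events.append('resumed after yielding {}'.format(i))
        i += 1                     # the local variable i survived the pause

events = []
gen = counter(events)
events_before_first_next = list(events)   # still empty: the body has not run

a = next(gen)   # runs up to the first yield
b = next(gen)   # resumes after that yield and runs up to the second one

for line in events:
    print(line)
# started
# about to yield 1
# resumed after yielding 1
# about to yield 2
```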

Example: given two sequences, determine whether one is a subsequence of the other.

The LeetCode link: https://leetcode.com/problems/is-subsequence/

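Let's first unpack the problem itself. A sequence here is simply a list, and a subsequence is a list whose elements all appear in the second list in the same order, though not necessarily next to each other. For example, [1, 3, 5] is a subsequence of [1, 2, 3, 4, 5], while [1, 4, 3] is not.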

# Returns True if sub is a subsequence of ls
def is_subsequence(ls, sub):
    ls = iter(ls)
    return all(i in ls for i in sub)

print(is_subsequence([1, 2, 3, 4, 5],[1, 3, 5]))
print(is_subsequence([1, 2, 3, 4, 5],[1, 4, 3]))

########## Output ##########
# True
# False
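The one-liner works because the in operator, applied to an iterator, consumes elements until it finds a match, so each later search starts where the previous one stopped. The sketch below first shows that consumption and then spells the same logic out with explicit loops; is_subsequence_verbose is a hypothetical name for this unrolled version.

```python
# `in` on an iterator advances it past the match.
it = iter([1, 2, 3, 4, 5])
found = 3 in it          # consumes 1, 2, 3
remaining = list(it)
print(found, remaining)  # True [4, 5]

def is_subsequence_verbose(ls, sub):
    """Return True if every element of sub appears in ls, in order."""
    it = iter(ls)
    for wanted in sub:
        # Same effect as `wanted in it`: keep advancing until a match.
        for candidate in it:
            if candidate == wanted:
                break
        else:
            # The iterator ran out before `wanted` was found.
            return False
    return True

print(is_subsequence_verbose([1, 2, 3, 4, 5], [1, 3, 5]))  # True
print(is_subsequence_verbose([1, 2, 3, 4, 5], [1, 4, 3]))  # False
```

For [1, 4, 3], the search for 4 consumes 1 through 4, leaving only 5, so the later search for 3 fails.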