pandas - groupby in depth and data-cleaning examples

import pandas as pd

import numpy as np

Split-apply-combine

(similar to MapReduce in big data)

The most general-purpose GroupBy method is apply, which is the subject of the rest of this section. As illustrated in Figure 10-2, apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.
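As a rough illustration of those three steps, the sketch below does the split, apply, and combine by hand on a toy DataFrame, then lets apply do the same work in one call. The column names key and value and the helper demean are made up for this example.

```python
import pandas as pd

# Toy data; the column names are illustrative only
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})


def demean(values):
    """Subtract the group mean from every value in the group."""
    return values - values.mean()


# Split the column into pieces and apply the function to each piece
pieces = {name: demean(group["value"]) for name, group in df.groupby("key")}

# Combine: concatenate the pieces, labelling each one with its group name
manual = pd.concat(pieces)

# groupby(...).apply runs the same three steps in a single call
via_apply = df.groupby("key")["value"].apply(demean)
```

The exact index of via_apply depends on the pandas version (newer releases add the group key as an outer level), but the values agree with the manual version.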

Returning to the tipping dataset from before, suppose you wanted to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:

tips = pd.read_csv('../examples/tips.csv')

tips.head(2)

	total_bill	tip	smoker	day	time	size
0	16.99	1.01	No	Sun	Dinner	2
1	10.34	1.66	No	Sun	Dinner	3

tips['tip_pct'] = tips['tip'] / tips['total_bill']

def top(df, n=5, column='tip_pct'):

    """Return the n rows with the largest values in the given column."""

    return df.sort_values(by=column)[-n:]

top(tips, n=6)

	total_bill	tip	smoker	day	time	size	tip_pct
109	14.31	4.00	Yes	Sat	Dinner	2	0.279525
183	23.17	6.50	Yes	Sun	Dinner	4	0.280535
232	11.61	3.39	No	Sat	Dinner	2	0.291990
67	3.07	1.00	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Yes	Sun	Dinner	2	0.710345

Now, if we group by smoker, say, and call apply with this function, we get the following:

# Group by smoker first, then call top within each group

tips.groupby('smoker').apply(top)

		total_bill	tip	smoker	day	time	size	tip_pct
smoker
No	88	24.71	5.85	No	Thur	Lunch	2	0.236746
	185	20.69	5.00	No	Sun	Dinner	5	0.241663
	51	10.29	2.60	No	Sun	Dinner	2	0.252672
	149	7.51	2.00	No	Thur	Lunch	2	0.266312
	232	11.61	3.39	No	Sat	Dinner	2	0.291990
Yes	109	14.31	4.00	Yes	Sat	Dinner	2	0.279525
	183	23.17	6.50	Yes	Sun	Dinner	4	0.280535
	67	3.07	1.00	Yes	Sat	Dinner	1	0.325733
	178	9.60	4.00	Yes	Sun	Dinner	2	0.416667
	172	7.25	5.15	Yes	Sun	Dinner	2	0.710345

What has happened here? The top function is called on each row group from the DataFrame (similar to an RDD), and then the results are glued together using pandas.concat, labeling the pieces with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.

If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

			total_bill	tip	smoker	day	time	size	tip_pct
smoker	day
No	Fri	94	22.75	3.25	No	Fri	Dinner	2	0.142857
	Sat	212	48.33	9.00	No	Sat	Dinner	4	0.186220
	Sun	156	48.17	5.00	No	Sun	Dinner	6	0.103799
	Thur	142	41.19	5.00	No	Thur	Lunch	5	0.121389
Yes	Fri	95	40.17	4.73	Yes	Fri	Dinner	4	0.117750
	Sat	170	50.81	10.00	Yes	Sat	Dinner	3	0.196812
	Sun	182	45.35	3.50	Yes	Sun	Dinner	3	0.077178
	Thur	197	43.11	5.00	Yes	Thur	Lunch	4	0.115982

Beyond these basic usage mechanics, getting the most out of apply may require some creativity. What occurs inside the function passed is up to you; it only needs to return a pandas object or a scalar value. The rest of this chapter will mainly consist of examples showing you how to solve various problems using groupby.

You can define whatever function you like, as long as it returns a pandas object (such as a DataFrame) or a scalar; the result can then be grouped again.

You may recall that I earlier called describe on a GroupBy object:

result = tips.groupby('smoker')['tip_pct'].describe()

result

	count	mean	std	min	25%	50%	75%	max
smoker
No	151.0	0.159328	0.039910	0.056797	0.136906	0.155625	0.185014	0.291990
Yes	93.0	0.163196	0.085119	0.035638	0.106771	0.153846	0.195059	0.710345

result.unstack('smoker')

       smoker

count  No        151.000000

       Yes        93.000000

mean   No          0.159328

       Yes         0.163196

std    No          0.039910

       Yes         0.085119

min    No          0.056797

       Yes         0.035638

25%    No          0.136906

       Yes         0.106771

50%    No          0.155625

       Yes         0.153846

75%    No          0.185014

       Yes         0.195059

max    No          0.291990

       Yes         0.710345

dtype: float64

Inside GroupBy, when you invoke a method like describe, it's actually just a shortcut for:

f = lambda x: x.describe()

grouped.apply(f)
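To check that equivalence on something small, here is a minimal sketch with a made-up six-row frame (the names key and value are assumptions for the example). The apply version comes back as one long Series, so it is unstacked before the numbers are compared.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are illustrative only
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 4.0, 10.0, 20.0, 40.0],
})
grouped = df.groupby("key")["value"]

# The built-in shortcut
direct = grouped.describe()

# apply returns one long Series; unstack turns the statistics into columns
via_apply = grouped.apply(lambda x: x.describe()).unstack()

# Align the column order before comparing the numbers
same = np.allclose(direct.to_numpy(dtype=float),
                   via_apply[direct.columns].to_numpy(dtype=float))
```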

Suppressing the group keys

group_keys=False

In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby.

tips.groupby('smoker', group_keys=False).apply(top)

	total_bill	tip	smoker	day	time	size	tip_pct
88	24.71	5.85	No	Thur	Lunch	2	0.236746
185	20.69	5.00	No	Sun	Dinner	5	0.241663
51	10.29	2.60	No	Sun	Dinner	2	0.252672
149	7.51	2.00	No	Thur	Lunch	2	0.266312
232	11.61	3.39	No	Sat	Dinner	2	0.291990
109	14.31	4.00	Yes	Sat	Dinner	2	0.279525
183	23.17	6.50	Yes	Sun	Dinner	4	0.280535
67	3.07	1.00	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Yes	Sun	Dinner	2	0.710345
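The effect on the index is easier to see on a tiny frame. This sketch uses made-up data (team, score) and a two-row variant of top; only the score column is selected before apply so that the behaviour does not depend on how a given pandas version treats the grouping column.

```python
import pandas as pd

# Toy data; the column names are illustrative only
df = pd.DataFrame({
    "team": ["x", "x", "x", "y", "y", "y"],
    "score": [3, 9, 5, 7, 1, 4],
})


def top(group, n=2, column="score"):
    """Return the n rows with the largest values in column."""
    return group.sort_values(by=column)[-n:]


# Default: the group key becomes the outer level of a hierarchical index
keyed = df.groupby("team")[["score"]].apply(top)

# group_keys=False: only the original row labels are kept
flat = df.groupby("team", group_keys=False)[["score"]].apply(top)
```

Here keyed has a two-level index (team, original row label), while flat keeps only the original row labels 2, 1, 5, 3.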

Quantile and bucket analysis

cut, qcut

As you may recall from Chapter 8, pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles. Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a dataset. Consider a simple random dataset and an equal-length bucket categorization using cut:

frame = pd.DataFrame({

    'data1': np.random.randn(1000),

    'data2': np.random.randn(1000)

})

quartiles = pd.cut(frame.data1, 4)

quartiles[:10]

0    (-1.672, 0.361]

1    (-1.672, 0.361]

2    (-1.672, 0.361]

3    (-1.672, 0.361]

4     (0.361, 2.395]

5    (-1.672, 0.361]

6    (-1.672, 0.361]

7     (0.361, 2.395]

8    (-1.672, 0.361]

9     (0.361, 2.395]

Name: data1, dtype: category

Categories (4, interval[float64]): [(-3.714, -1.672] < (-1.672, 0.361] < (0.361, 2.395] < (2.395, 4.429]]

The Categorical object returned by cut can be passed directly to groupby. So we could compute a set of statistics for the data2 column like so:

def get_stats(group):

    return {'min': group.min(), 'max': group.max(),

           'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(quartiles)

grouped.apply(get_stats).unstack()

	count	max	mean	min
data1
(-3.714, -1.672]	49	…	-0.2432	-2.16709
(-1.672, 0.361]	601	…	-0.0253114	-2.90659
(0.361, 2.395]	340	…	0.024466	-3.14779
(2.395, 4.429]	10	…	-0.267874	-0.835444
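For simple statistics like these, the same kind of table can probably be produced without a custom function by passing a list of aggregation names to agg. The sketch below is an alternative, not the method used above; it assumes a seeded NumPy generator (rng) so the numbers are reproducible, and passes observed=True only to skip bins that happen to be empty.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same data (an assumption of this sketch)
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "data1": rng.standard_normal(1000),
    "data2": rng.standard_normal(1000),
})

# Four equal-width bins over the range of data1
bins = pd.cut(frame["data1"], 4)

# One row per bin, one column per named aggregation
stats = frame["data2"].groupby(bins, observed=True).agg(["min", "max", "count", "mean"])
```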

These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. I'll pass labels=False to just get quantile numbers:

grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)

grouped.apply(get_stats).unstack()

	count	max	mean	min
data1
0	100	…	-0.069347	-2.25593
1	100	…	-0.0408363	-2.75307
2	100	…	-0.212456	-2.88498
3	100	…	0.0688246	-2.82311
4	100	…	0.0401668	-2.69601
5	100	…	-0.12863	-2.90659
6	100	…	0.108924	-3.14779
7	100	…	0.0391474	-1.8324
8	100	…	-0.00849982	-2.19997
9	100	…	-0.0121871	-2.40748
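The difference between the two functions shows up clearly on a small, deliberately skewed series. The values below are made up for illustration: seven small numbers and one outlier.

```python
import pandas as pd

# Seven small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut: four bins of equal width across the range 1..100
width_counts = pd.cut(values, 4).value_counts(sort=False).tolist()

# qcut: four bins holding the same number of observations
quantile_counts = pd.qcut(values, 4).value_counts(sort=False).tolist()

# labels=False returns the bin number instead of an interval
quantile_codes = pd.qcut(values, 4, labels=False).tolist()
```

With cut, seven values land in the first bin and the outlier sits alone in the last one; with qcut, every bin holds two values.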

Example: filling missing values with group-specific values

When cleaning up missing data, in some cases you will filter out data observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data (for example, a prediction from a random forest model). fillna is the right tool to use; for example, here I fill in NA values with the mean:

s = pd.Series(np.random.randn(6))

s[::2] = np.nan  # set every other value to NaN

s

0         NaN

1   -0.661528

2         NaN

3    0.144512

4         NaN

5    1.096004

dtype: float64

# Fill the missing values with the mean

s.fillna(s.mean())

0    0.192996

1   -0.661528

2    0.192996

3    0.144512

4    0.192996

5    1.096004

dtype: float64

Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk. Here is some sample data on US states divided into eastern and western regions:

states = ['Ohio', 'New York', 'Vermont', 'Florida',

    'Oregon', 'Nevada', 'California', 'Idaho']

group_key = ['East'] * 4 + ['West'] * 4 

data = pd.Series(np.random.randn(8), index=states)

data

Ohio          0.508352

New York     -1.029373

Vermont      -0.506223

Florida      -0.128709

Oregon        0.445320

Nevada        2.064584

California   -0.795793

Idaho        -1.115522

dtype: float64

Note that the syntax ['East'] * 4 produces a list containing four copies of the elements in ['East']. Adding lists together concatenates them.

Let's set some values in the data to be missing:

data[['Vermont', 'Nevada', 'Idaho']] = np.nan 

data

Ohio          0.508352

New York     -1.029373

Vermont            NaN

Florida      -0.128709

Oregon        0.445320

Nevada             NaN

California   -0.795793

Idaho              NaN

dtype: float64

data.groupby(group_key).mean()  # missing values are ignored by default

East   -0.216577

West   -0.175236

dtype: float64

We can fill the NA values using the group means like so:

fill_mean = lambda g: g.fillna(g.mean())

data.groupby(group_key).apply(fill_mean)

Ohio          0.508352

New York     -1.029373

Vermont      -0.216577

Florida      -0.128709

Oregon        0.445320

Nevada       -0.175236

California   -0.795793

Idaho        -0.175236

dtype: float64
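Another way to get the same group-mean fill, sketched here as an alternative to apply, is to broadcast each group's mean back to the original rows with transform and hand that to fillna. The data and labels below are made up so the result can be checked by eye.

```python
import numpy as np
import pandas as pd

# Toy data: one missing value in each region
data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0], index=list("abcdef"))
group_key = ["East"] * 3 + ["West"] * 3

# transform returns a Series aligned with data, holding each row's group mean
group_means = data.groupby(group_key).transform("mean")

# fillna only uses the aligned values where data is missing
filled = data.fillna(group_means)
```

Because transform keeps the original index, the result has the same flat index as data regardless of the group_keys setting.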

In another case, you might have predefined fill values in your code that vary by group. Since the groups have a name attribute set internally, we can use that:

fill_values = {'East': 0.5, 'West': -1}

fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)

Ohio          0.508352

New York     -1.029373

Vermont       0.500000

Florida      -0.128709

Oregon        0.445320

Nevada       -1.000000

California   -0.795793

Idaho        -1.000000

dtype: float64
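If the group labels are already available as a list, a predefined fill can also be built without groupby by mapping each row's label to its fill value. This is a sketch of that alternative on made-up data; per_row_fill is a name introduced only for the example.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per region
data = pd.Series([1.0, np.nan, 3.0, np.nan],
                 index=["Ohio", "Vermont", "Oregon", "Idaho"])
group_key = ["East", "East", "West", "West"]
fill_values = {"East": 0.5, "West": -1}

# Look up the fill value for every row from its group label
per_row_fill = pd.Series(group_key, index=data.index).map(fill_values)

# Only the missing entries take the looked-up value
filled = data.fillna(per_row_fill)
```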

Example: random sampling

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the "draws"; here we use the sample method for Series.

To demonstrate, here's a way to construct a deck of English-style playing cards:

# Hearts, Spades, Clubs, Diamonds

suits = 'H S C D'.split()

card_val = (list(range(1, 11)) + [10]*3) * 4

base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']

cards = []

for suit in ['H', 'S', 'C', 'D']:

    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)

So now we have a Series of length 52 whose index contains card names and whose values are the ones used in Blackjack and other games:

deck[:13]

AH      1

2H      2

3H      3

4H      4

5H      5

6H      6

7H      7

8H      8

9H      9

10H    10

JH     10

KH     10

QH     10

dtype: int64

Now, based on what I said before, drawing a hand of five cards from the deck could be written as:

def draw(deck, n=5):

    return deck.sample(n)

draw(deck)

3H     3

5C     5

JD    10

4H     4

JH    10

dtype: int64
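Two options of sample are worth knowing for simulation work: random_state, which makes a draw repeatable, and replace, which switches between sampling without and with replacement. The sketch below rebuilds the same deck and tries both; the seed 42 is arbitrary.

```python
import pandas as pd

# Same deck construction as above, written as a comprehension
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(name) + suit for suit in ['H', 'S', 'C', 'D'] for name in base_names]
deck = pd.Series(card_val, index=cards)

# A fixed random_state makes the draw repeatable
hand1 = deck.sample(5, random_state=42)
hand2 = deck.sample(5, random_state=42)

# Without replacement (the default) a card cannot appear twice in a hand
no_repeats = hand1.index.is_unique

# With replacement the sample may be larger than the deck itself
big_sample = deck.sample(60, replace=True, random_state=42)
```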

Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply:

get_suit = lambda card: card[-1]  # last letter is suit

deck.groupby(get_suit).apply(draw, n=2)

C  3C      3

   8C      8

D  4D      4

   7D      7

H  4H      4

   3H      3

S  2S      2

   10S    10

dtype: int64

Alternatively, we could write:

deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

KC     10

3C      3

9D      9

KD     10

9H      9

6H      6

10S    10

7S      7

dtype: int64
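Assuming pandas 1.1 or later, GroupBy objects also have their own sample method, which draws from every group without going through apply. A minimal sketch on the same deck (the seed is arbitrary):

```python
import pandas as pd

# Same deck construction as above, written as a comprehension
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(name) + suit for suit in ['H', 'S', 'C', 'D'] for name in base_names]
deck = pd.Series(card_val, index=cards)

# Group on the last character of the card name (the suit) and draw two per group
two_per_suit = deck.groupby(lambda card: card[-1]).sample(n=2, random_state=0)

# Count how many cards were drawn from each suit
suit_counts = two_per_suit.index.str[-1].value_counts()
```

The result keeps the flat card-name index, much like the group_keys=False version.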

Example: group weighted average and correlation

Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or two Series, such as a group weighted average, are possible. As an example, take this dataset containing group keys, values, and some weights:

df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',

    'b', 'b', 'b', 'b'],

    'data': np.random.randn(8),

    'weights': np.random.rand(8)})

df

	category	data	weights
0	a	0.434777	0.486455
1	a	-2.414575	0.374778
2	a	-0.682643	0.651142
3	a	0.538472	0.238194
4	b	1.001960	0.724147
5	b	-2.006634	0.770404
6	b	0.162167	0.262188
7	b	0.924946	0.723322

The group weighted average by category would then be:

grouped = df.groupby('category')

get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

grouped.apply(get_wavg)

category

a   -0.576765

b   -0.043870

dtype: float64
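np.average with weights computes sum(w * x) / sum(w). To make that concrete, this sketch writes the formula out by hand on made-up numbers small enough to verify mentally; only the data and weights columns are selected before apply, so the grouping column never reaches the function.

```python
import pandas as pd

# Toy data chosen so the weighted averages are easy to check
df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "data": [1.0, 3.0, 10.0, 20.0],
    "weights": [1.0, 3.0, 1.0, 1.0],
})


def wavg(g):
    """Weighted average written out as sum(w * x) / sum(w)."""
    return (g["data"] * g["weights"]).sum() / g["weights"].sum()


result = df.groupby("category")[["data", "weights"]].apply(wavg)
```

For group a this gives (1*1 + 3*3) / 4 = 2.5, and for group b (10 + 20) / 2 = 15.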

As another example, consider a financial dataset originally obtained from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index.

close_px = pd.read_csv('../examples/stock_px_2.csv',

                       parse_dates=True, index_col=0)

close_px.info()

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14

Data columns (total 4 columns):

AAPL    2214 non-null float64

MSFT    2214 non-null float64

XOM     2214 non-null float64

SPX     2214 non-null float64

dtypes: float64(4)

memory usage: 86.5 KB

close_px[-4:]  # select the last 4 rows

	AAPL	MSFT	XOM	SPX
2011-10-11	400.29	27.00	76.27	1195.54
2011-10-12	402.19	26.96	77.16	1207.25
2011-10-13	408.43	27.18	76.37	1203.66
2011-10-14	422.00	27.27	78.11	1224.58

One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns with SPX. As one way to do this, we first create a function that computes the pairwise correlation of each column with the 'SPX' column:

spx_corr = lambda x: x.corrwith(x['SPX'])

Next, we compute percent change on close_px using pct_change:

rets = close_px.pct_change().dropna()

Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:

get_year = lambda x: x.year  

by_year = rets.groupby(get_year)  # a function used as the grouping key

by_year.apply(spx_corr)

	AAPL	MSFT	XOM	SPX
2003	0.541124	0.745174	0.661265	1.0
2004	0.374283	0.588531	0.557742	1.0
2005	0.467540	0.562374	0.631010	1.0
2006	0.428267	0.406126	0.518514	1.0
2007	0.508118	0.658770	0.786264	1.0
2008	0.681434	0.804626	0.828303	1.0
2009	0.707103	0.654902	0.797921	1.0
2010	0.710105	0.730118	0.839057	1.0
2011	0.691931	0.800996	0.859975	1.0

You could also compute inter-column correlations. Here we compute the annual correlation between Apple and Microsoft:

by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

2003    0.480868

2004    0.259024

2005    0.300093

2006    0.161735

2007    0.417738

2008    0.611901

2009    0.432738

2010    0.571946

2011    0.581987

dtype: float64
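The grouping-by-year pattern can be tried without the stock file. The sketch below uses six made-up return values (column A is a placeholder, not a real ticker) arranged so that A moves exactly with SPX in 2020 and exactly against it in 2021.

```python
import pandas as pd

# Synthetic daily returns spanning two calendar years
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                      "2021-01-01", "2021-01-02", "2021-01-03"])
rets = pd.DataFrame({
    "A": [0.01, 0.02, 0.03, 0.03, 0.02, 0.01],
    "SPX": [0.02, 0.04, 0.06, 0.01, 0.02, 0.03],
}, index=idx)

# The function is called on each index label and its return value is the group key
by_year = rets.groupby(lambda ts: ts.year)

# One correlation per year
yearly_corr = by_year.apply(lambda g: g["A"].corr(g["SPX"]))
```

Passing rets.index.year to groupby instead of the lambda should give the same groups.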

Example: group-wise linear regression

In the same theme as the previous example, you can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value.

For example, I can define the following regress function, which executes an ordinary least squares (OLS) regression on each chunk of data:

import statsmodels.api as sm

def regress(data, yvar, xvars):

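    """Run an ordinary least squares (OLS) regression of yvar on xvars."""

    Y = data[yvar]

    X = data[xvars].copy()  # copy so the new column is not written into a slice of data

    X['intercept'] = 1.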

    result = sm.OLS(Y, X).fit()

    return result.params

Now, to run a yearly linear regression of AAPL on SPX returns, execute:

%time by_year.apply(regress, 'AAPL', ['SPX'])

Wall time: 277 ms

	SPX	intercept
2003	1.195406	0.000710
2004	1.363463	0.004201
2005	1.766415	0.003246
2006	1.645496	0.000080
2007	1.198761	0.003438
2008	0.968016	-0.001110
2009	0.879103	0.002954
2010	1.052608	0.001261
2011	0.806605	0.001514
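If statsmodels is not installed, the same group-wise fit can be sketched with NumPy's least-squares solver. This is a stand-in for sm.OLS, not the function used above: regress_np is a hypothetical helper, and the data is synthetic with a known slope and intercept in each year so the output can be checked.

```python
import numpy as np
import pandas as pd


def regress_np(data, yvar, xvars):
    """Least-squares fit of yvar on xvars plus an intercept, using NumPy only."""
    X = data[xvars].to_numpy(dtype=float)
    X = np.column_stack([X, np.ones(len(X))])  # append the intercept column
    y = data[yvar].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=list(xvars) + ["intercept"])


# Synthetic data: y = 2x + 1 in 2020 and y = -x + 0.5 in 2021
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                      "2021-01-01", "2021-01-02", "2021-01-03"])
frame = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0, 0.5, -0.5, -1.5],
}, index=idx)

# One row of fitted parameters per year
params = frame.groupby(lambda ts: ts.year).apply(regress_np, "y", ["x"])
```

Because the function returns a Series with the same labels for every group, apply stacks the results into a DataFrame with one row per year.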