Advanced pandas

import numpy as np

import pandas as pd

Categorical Data

This section introduces the pandas Categorical type.Using it will achieve better performance and memory use in some pandas operations.

Background and motivation

Frequently,a column in a table may contain repeated instances of a smaller set of distinct values. We have already seen functions like unique and value_counts,which enable us to extract the distinct values from an array and compute their freuencies respectively.

values=pd.Series(['apple','orange','apple','apple']*2)

values

0     apple

1    orange

2     apple

3     apple

4     apple

5    orange

6     apple

7     apple

dtype: object

pd.unique(values)

array(['apple', 'orange'], dtype=object)

pd.value_counts(values)

apple     6

orange    2

dtype: int64

Many data systems have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best pratice is to use so-called dimension tables containing the distinct values and storing the primary observations as integer keys referencing the dimension tatble.

values=pd.Series([0,1,0,0]*2)

dim=pd.Series(['apple','orange'])

dim.take(values)

0     apple

1    orange

0     apple

0     apple

0     apple

1    orange

0     apple

0     apple

dtype: object

We can use the take method to restore the original Series of strings.

This representation as integers is called the categorical or dictionary-encoded representation.The array of distinct values can be called the categories,dictionary or levels of the data.In this blog, we will use the terms categorical and categories.The integer values that reference the categories are called the category codes or simply codes.

The categorical representation can yield significant performance improvements when you are doing analytics.You can also perform transformations on the categories while leaving the codes unmodified.Some example transformations that can be made at relatively low cost are:

Renaming categories.
Appending a new category without changing the order or position of the existing categories.

Categorical type in pandas

Pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding.

fruits=['apple','orange','apple','apple']*2

N=len(fruits)

df=pd.DataFrame({'fruit':fruits,

                'basket_id':np.arange(N),

                'count':np.random.randint(3,15,size=N),

                'weight':np.random.uniform(0,4,size=N)},columns=['basket_id','fruit','count','weight'])

df

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	basket_id	fruit	count	weight
0	0	apple	3	3.224342
1	1	orange	8	2.625838
2	2	apple	4	1.285304
3	3	apple	11	1.510722
4	4	apple	4	1.560894
5	5	orange	13	3.138222
6	6	apple	12	3.994037
7	7	apple	10	0.644615

df['fruit'] is an array of Python string objects.We can convert it to categorical by calling:

fruit_cat=df['fruit'].astype('category')

fruit_cat

0     apple

1    orange

2     apple

3     apple

4     apple

5    orange

6     apple

7     apple

Name: fruit, dtype: category

Categories (2, object): [apple, orange]

df['fruit']

0     apple

1    orange

2     apple

3     apple

4     apple

5    orange

6     apple

7     apple

Name: fruit, dtype: object

The values for fruit_cat are not a Numpy array,but an instance of pandas.Categorical:

c=fruit_cat.values

[apple, orange, apple, apple, apple, orange, apple, apple]

Categories (2, object): [apple, orange]

type(c)

pandas.core.arrays.categorical.Categorical

df['fruit'].values  # df['fruit']'s  values are Numpy array.

array(['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple',

       'apple'], dtype=object)

df.values # df.values are also Numpy array

array([[0, 'apple', 3, 3.224342140640482],

       [1, 'orange', 8, 2.625837611614019],

       [2, 'apple', 4, 1.2853036490436986],

       [3, 'apple', 11, 1.5107221991867759],

       [4, 'apple', 4, 1.5608944958834634],

       [5, 'orange', 13, 3.1382222188914577],

       [6, 'apple', 12, 3.9940366021092872],

       [7, 'apple', 10, 0.64461515110633]], dtype=object)

The Categorical object has categories and codes attributes:

c.categories

Index(['apple', 'orange'], dtype='object')

c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

c.categories.take(c.codes)

Index(['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple',

       'apple'],

      dtype='object')

You can also create pandas.Categorical directly from other types of Python sequence:

my_categories=pd.Categorical(['foo','bar','baz','foo','bar'])

my_categories

[foo, bar, baz, foo, bar]

Categories (3, object): [bar, baz, foo]

If you have obtained categorical encoded data from another source,you can use the alternative from_codes constructor:

categories=['foo','bar','baz','baz']

codes=[0,1,2,0,0,1,3]

my_cat_2=pd.Categorical.from_codes(codes,categories)  # categories have to be unique

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-29-5368c7cbd53a> in <module>()

----> 1 my_cat_2=pd.Categorical.from_codes(codes,categories)  # categories have to be unique

D:\Anaconda\lib\site-packages\pandas\core\arrays\categorical.py in from_codes(cls, codes, categories, ordered)

    584                 "codes need to be convertible to an arrays of integers")

    585

--> 586         categories = CategoricalDtype.validate_categories(categories)

    587

    588         if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):

D:\Anaconda\lib\site-packages\pandas\core\dtypes\dtypes.py in validate_categories(categories, fastpath)

    322

    323             if not categories.is_unique:

--> 324                 raise ValueError('Categorical categories must be unique')

    325

    326         if isinstance(categories, ABCCategoricalIndex):

ValueError: Categorical categories must be unique

categories=['foo','bar','baz','baxz']

codes=[0,1,2,0,0,1,3]

pd.Categorical.from_codes(codes,categories)

[foo, bar, baz, foo, foo, bar, baxz]

Categories (4, object): [foo, bar, baz, baxz]

Unless explicitly specified,categorical conversion assume no specific ordering of the categories,so the categories array may be in a different order depending on the ordering of the input data.When using from_codes or any of the other constructors,you can indicate that the categories have a meaningful ordering.

ordered_cat=pd.Categorical.from_codes(codes,categories,ordered=True)

ordered_cat

[foo, bar, baz, foo, foo, bar, baxz]

Categories (4, object): [foo < bar < baz < baxz]

ordered_cat.codes

array([0, 1, 2, 0, 0, 1, 3], dtype=int8)

ordered_cat.categories

Index(['foo', 'bar', 'baz', 'baxz'], dtype='object')

Unordered categorical instance can be made ordered with as_ordered:

As a last note,categorical data need not be strings,a categorical array can consist of any immutable value types:

pd.Categorical([(1,2),'a',3])

[(1, 2), a, 3]

Categories (3, object): [(1, 2), a, 3]

Computations with Categorical

Using Categorical in pandas compared with the non-encoded version(like an array of strings) generally behaves the same way.Some parts of pandas,like the groupby function,perform better when working with categories.There are also some functions that can utilize the ordered flag.

Let's consider some random numeric data,and use the panda.qcut binning function.This return pandas.Categorical;We used pandas.cut ealier but glossed over the details of how categoricals work.

np.random.seed(12345)

draws=np.random.randn(1000)

bins=pd.qcut(draws,4)

bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]

Length: 1000

Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

bins.value_counts()

(-2.9499999999999997, -0.684]    250

(-0.684, -0.0101]                250

(-0.0101, 0.63]                  250

(0.63, 3.928]                    250

dtype: int64

pd.value_counts(bins)

(0.63, 3.928]                    250

(-0.0101, 0.63]                  250

(-0.684, -0.0101]                250

(-2.9499999999999997, -0.684]    250

dtype: int64

type(bins)

pandas.core.arrays.categorical.Categorical

While useful,the exact sample quartiles may be less useful for producing a report than quartile names.We can achieve this with the labels argument to qcut:

bins=pd.qcut(draws,4,labels=['Q1','Q2','Q3','Q4'])

bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]

Length: 1000

Categories (4, object): [Q1 < Q2 < Q3 < Q4]

bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

bins.categories

Index(['Q1', 'Q2', 'Q3', 'Q4'], dtype='object')

The labeled bins categorical does not contain information about the bin edges in the data,so we can use groupby to extract some summary statistics:

bins=pd.Series(bins,name='quartile')

pd.Series(draws).groupby(bins).mean()

quartile

Q1   -1.215981

Q2   -0.362423

Q3    0.307840

Q4    1.261160

dtype: float64

results=(pd.Series(draws).groupby(bins).agg(['count','min','max','mean']).reset_index())

results

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	quartile	count	min	max	mean
0	Q1	250	-2.949343	-0.685484	-1.215981
1	Q2	250	-0.683066	-0.010115	-0.362423
2	Q3	250	-0.010032	0.628894	0.307840
3	Q4	250	0.634238	3.927528	1.261160

bins

0      Q2

1      Q3

2      Q2

3      Q2

4      Q4

5      Q4

6      Q3

7      Q3

8      Q4

9      Q4

10     Q4

11     Q1

12     Q3

13     Q3

14     Q4

15     Q4

16     Q1

17     Q2

18     Q4

19     Q2

20     Q2

21     Q3

22     Q4

23     Q1

24     Q2

25     Q3

26     Q3

27     Q3

28     Q3

29     Q4

       ..

970    Q2

971    Q1

972    Q2

973    Q4

974    Q3

975    Q1

976    Q1

977    Q2

978    Q2

979    Q3

980    Q3

981    Q1

982    Q3

983    Q4

984    Q2

985    Q4

986    Q1

987    Q4

988    Q1

989    Q3

990    Q1

991    Q4

992    Q1

993    Q4

994    Q2

995    Q3

996    Q2

997    Q1

998    Q3

999    Q4

Name: quartile, Length: 1000, dtype: category

Categories (4, object): [Q1 < Q2 < Q3 < Q4]

Better performance with categoricals

If you do a lot of analytics on a particular dataset,converting to categorical can yield substantial overall performance gains.A categorical version of a DataFrame column will often use significantly less memory,too.

N=10000000

draws=pd.Series(np.random.randn(N))

labels=pd.Series(['foo','bar','baz','qux']*(N//4))

12//5

12%5

categories=labels.astype('category')

labels.memory_usage()

80000080

categories.memory_usage()

10000272

So we can see that,Categorical ueses less memory than Series.While the conversion to category is not free,of course,but it is a one-time cost:

%time _=labels.astype('category')

Wall time: 400 ms

GroupBy operations can be significantly faster with categoricals because the underlying algoriths use the integer-based codes array instead of an array of strings.

Categorical methods

Series containing categorical data have several special methods similar to the Series.str specialized string methods.This also provides convenient access to the categories and codes.

s=pd.Series(['a','b','c','d']*2)

cat_s=s.astype('category')

cat_s

0    a

1    b

2    c

3    d

4    a

5    b

6    c

7    d

dtype: category

Categories (4, object): [a, b, c, d]

The special attribute cat provides access to categorical methods:

cat_s.cat.codes

0    0

1    1

2    2

3    3

4    0

5    1

6    2

7    3

dtype: int8

cat_s.values.codes

array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int8)

cat_s.values.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data.We often use the set_categories method to change them:

actual_categories=['a','b','c','d','e']

cat_s2=cat_s.cat.set_categories(actual_categories)

cat_s2

0    a

1    b

2    c

3    d

4    a

5    b

6    c

7    d

dtype: category

Categories (5, object): [a, b, c, d, e]

While it appears that the data is unchanged,the new categories will be reflected in operations that use them.For example,value_counts respects the categories,if present:

cat_s.value_counts()

d    2

c    2

b    2

a    2

dtype: int64

cat_s2.value_counts()

d    2

c    2

b    2

a    2

e    0

dtype: int64

In large datasets,categoricals are often used as a convenient fool for memory savings and better performance.After you filter a large DataFrame or Series,many of the categories may not appear in the data.To help with this,we can use the remove_unused_categories method to trim unobserved categories:

cat_s

0    a

1    b

2    c

3    d

4    a

5    b

6    c

7    d

dtype: category

Categories (4, object): [a, b, c, d]

cat_s3=cat_s[cat_s.isin(['a','b'])]

cat_s.isin(['a','b'])

0     True

1     True

2    False

3    False

4     True

5     True

6    False

7    False

dtype: bool

cat_s3

0    a

1    b

4    a

5    b

dtype: category

Categories (4, object): [a, b, c, d]

cat_s[[True,False,False,False,False,False,True,False]]

0    a

6    c

dtype: category

Categories (4, object): [a, b, c, d]

cat_s3.cat.remove_unused_categories()

0    a

1    b

4    a

5    b

dtype: category

Categories (2, object): [a, b]

Categorical methods for Series in pandas:

add_categories---> Append new(unused) categories at end of existing categories
as_ordered-------->Make categories ordered
as_unordered----->Make categories unordered
remove_categories--> Remove categories,setting any removed values to null
remove_unused_categories-->Remove any category values which do not appear in the data
rename_categories---->Replace categories with indicated set of new category names;cannot change the number of categories
reorder_categories--->Behaves like rename_categories,but can also change the result to have ordered categories
set_categories------->Replcace the categories with the indicated set of new categories;can add or remove categories

Creating dummy variables for modeling

When you are using statistics or machine learning tools,you will often transform categorical data into dummy variables,also known as one-hot encoding.This involves creating a DataFrame with a column for each distinct category;these columns contain 1s for occurrences of a given category and 0 otherwise;

cat_s=pd.Series(['a','b','c','d']*2,dtype='category')

The pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variable:

pd.get_dummies(cat_s)

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	a	b	c	d
0	1	0	0	0
1	0	1	0	0
2	0	0	1	0
3	0	0	0	1
4	1	0	0	0
5	0	1	0	0
6	0	0	1	0
7	0	0	0	1

Advanced GroupBy use

df

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	time	value
0	2020-05-23 00:00:00	0
1	2020-05-23 00:01:00	1
2	2020-05-23 00:02:00	2
3	2020-05-23 00:03:00	3
4	2020-05-23 00:04:00	4
5	2020-05-23 00:05:00	5
6	2020-05-23 00:06:00	6
7	2020-05-23 00:07:00	7
8	2020-05-23 00:08:00	8
9	2020-05-23 00:09:00	9
10	2020-05-23 00:10:00	10
11	2020-05-23 00:11:00	11
12	2020-05-23 00:12:00	12
13	2020-05-23 00:13:00	13
14	2020-05-23 00:14:00	14

df.groupby('fruit').max()

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-113-79d22643c2e2> in <module>()

----> 1 df.groupby('fruit').max()

D:\Anaconda\lib\site-packages\pandas\core\generic.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, **kwargs)

   6663         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,

   6664                        sort=sort, group_keys=group_keys, squeeze=squeeze,

-> 6665                        observed=observed, **kwargs)

   6666

   6667     def asfreq(self, freq, method=None, how=None, normalize=False,

D:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in groupby(obj, by, **kwds)

   2150         raise TypeError('invalid type: %s' % type(obj))

   2151

-> 2152     return klass(obj, by, **kwds)

   2153

   2154 

D:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, **kwargs)

    597                                                     sort=sort,

    598                                                     observed=observed,

--> 599                                                     mutated=self.mutated)

    600

    601         self.obj = obj

D:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _get_grouper(obj, key, axis, level, sort, observed, mutated, validate)

   3289                 in_axis, name, level, gpr = False, None, gpr, None

   3290             else:

-> 3291                 raise KeyError(gpr)

   3292         elif isinstance(gpr, Grouper) and gpr.key is not None:

   3293             # Add key to exclusions

KeyError: 'fruit'

Groupby transforms and 'unwrapped' groupbys

There is a built-in method called transform,which is similar to apply but imposes more constraints on the kind of function you can use:

It can produce a scalar value to be broadcast to the shape of the group
It can produce an object of the same shape as the input group
It must not mutate its input

df=pd.DataFrame({'key':['a','b','c']*4,'value':np.arange(12)})

df

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	key	value
0	a	0
1	b	1
2	c	2
3	a	3
4	b	4
5	c	5
6	a	6
7	b	7
8	c	8
9	a	9
10	b	10
11	c	11

g=df.groupby('key').value

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x00000277954F00B8>

g.apply(np.mean)

key

a    4.5

b    5.5

c    6.5

Name: value, dtype: float64

g.apply(lambda x:x.mean())

key

a    4.5

b    5.5

c    6.5

Name: value, dtype: float64

g.apply(lambda x:x.max()-x.min()) # x represents the subgroup in group.

key

a    9

b    9

c    9

Name: value, dtype: int64

Suppose instead we wanted to produce a Series of the same shape as df['value']but with values replaced by the average grouped by 'key',we can pass the function lambda x:x.mean() to transform:

g.transform(lambda x:x.mean())

0     4.5

1     5.5

2     6.5

3     4.5

4     5.5

5     6.5

6     4.5

7     5.5

8     6.5

9     4.5

10    5.5

11    6.5

Name: value, dtype: float64

For built-in aggregation functions,we can pass a string alias as with the groupby agg method:

g.transform('mean')

0     4.5

1     5.5

2     6.5

3     4.5

4     5.5

5     6.5

6     4.5

7     5.5

8     6.5

9     4.5

10    5.5

11    6.5

Name: value, dtype: float64

Like apply,transform works with functions that return Series,but the result must be the same size as the input.

g.transform(lambda x:x*2)

0      0

1      2

2      4

3      6

4      8

5     10

6     12

7     14

8     16

9     18

10    20

11    22

Name: value, dtype: int32

def normalize(x):

    return (x-x.mean())/x.std()

g.transform(normalize)

0    -1.161895

1    -1.161895

2    -1.161895

3    -0.387298

4    -0.387298

5    -0.387298

6     0.387298

7     0.387298

8     0.387298

9     1.161895

10    1.161895

11    1.161895

Name: value, dtype: float64

g.apply(normalize)

0    -1.161895

1    -1.161895

2    -1.161895

3    -0.387298

4    -0.387298

5    -0.387298

6     0.387298

7     0.387298

8     0.387298

9     1.161895

10    1.161895

11    1.161895

Name: value, dtype: float64

Built-in aggregate functions like 'mean' or 'sum' are often much faster than a general apply function.Thses also have a fast patst when used with transform.This allows us to perform a so-called unwrapped group operation:

g.transform('mean')

0     4.5

1     5.5

2     6.5

3     4.5

4     5.5

5     6.5

6     4.5

7     5.5

8     6.5

9     4.5

10    5.5

11    6.5

Name: value, dtype: float64

(df['value']-g.transform('mean'))/g.transform('std')

0    -1.161895

1    -1.161895

2    -1.161895

3    -0.387298

4    -0.387298

5    -0.387298

6     0.387298

7     0.387298

8     0.387298

9     1.161895

10    1.161895

11    1.161895

Name: value, dtype: float64

Grouped time resampling

For time series,the resamplemethod is semantically a group operation based on a time intervalization.

N=15

times=pd.date_range('2020--5-23 00:00',freq='1min',periods=N)

df=pd.DataFrame({'time':times,

                'value':np.arange(N)})

df

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	time	value
0	2020-05-23 00:00:00	0
1	2020-05-23 00:01:00	1
2	2020-05-23 00:02:00	2
3	2020-05-23 00:03:00	3
4	2020-05-23 00:04:00	4
5	2020-05-23 00:05:00	5
6	2020-05-23 00:06:00	6
7	2020-05-23 00:07:00	7
8	2020-05-23 00:08:00	8
9	2020-05-23 00:09:00	9
10	2020-05-23 00:10:00	10
11	2020-05-23 00:11:00	11
12	2020-05-23 00:12:00	12
13	2020-05-23 00:13:00	13
14	2020-05-23 00:14:00	14

Here,we can index by 'time' and then resample:

df.set_index('time').resample('5min').count()

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	value
time
2020-05-23 00:00:00	5
2020-05-23 00:05:00	5
2020-05-23 00:10:00	5

help(pd.DataFrame.resample)

Help on function resample in module pandas.core.generic:

resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)

    Convenience method for frequency conversion and resampling of time

    series.  Object must have a datetime-like index (DatetimeIndex,

    PeriodIndex, or TimedeltaIndex), or pass datetime-like values

    to the on or level keyword.

    Parameters

    ----------

    rule : string

        the offset string or object representing target conversion

    axis : int, optional, default 0

    closed : {'right', 'left'}

        Which side of bin interval is closed. The default is 'left'

        for all frequency offsets except for 'M', 'A', 'Q', 'BM',

        'BA', 'BQ', and 'W' which all have a default of 'right'.

    label : {'right', 'left'}

        Which bin edge label to label bucket with. The default is 'left'

        for all frequency offsets except for 'M', 'A', 'Q', 'BM',

        'BA', 'BQ', and 'W' which all have a default of 'right'.

    convention : {'start', 'end', 's', 'e'}

        For PeriodIndex only, controls whether to use the start or end of

        `rule`

    kind: {'timestamp', 'period'}, optional

        Pass 'timestamp' to convert the resulting index to a

        ``DateTimeIndex`` or 'period' to convert it to a ``PeriodIndex``.

        By default the input representation is retained.

    loffset : timedelta

        Adjust the resampled time labels

    base : int, default 0

        For frequencies that evenly subdivide 1 day, the "origin" of the

        aggregated intervals. For example, for '5min' frequency, base could

        range from 0 through 4. Defaults to 0

    on : string, optional

        For a DataFrame, column to use instead of index for resampling.

        Column must be datetime-like.

        .. versionadded:: 0.19.0

    level : string or int, optional

        For a MultiIndex, level (name or number) to use for

        resampling.  Level must be datetime-like.

        .. versionadded:: 0.19.0

    Returns

    -------

    Resampler object

    Notes

    -----

    See the `user guide

    <http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling>`_

    for more.

    To learn more about the offset strings, please see `this link

    <http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases>`__.

    Examples

    --------

    Start by creating a series with 9 one minute timestamps.

    >>> index = pd.date_range('1/1/2000', periods=9, freq='T')

    >>> series = pd.Series(range(9), index=index)

    >>> series

    2000-01-01 00:00:00    0

    2000-01-01 00:01:00    1

    2000-01-01 00:02:00    2

    2000-01-01 00:03:00    3

    2000-01-01 00:04:00    4

    2000-01-01 00:05:00    5

    2000-01-01 00:06:00    6

    2000-01-01 00:07:00    7

    2000-01-01 00:08:00    8

    Freq: T, dtype: int64

    Downsample the series into 3 minute bins and sum the values

    of the timestamps falling into a bin.

    >>> series.resample('3T').sum()

    2000-01-01 00:00:00     3

    2000-01-01 00:03:00    12

    2000-01-01 00:06:00    21

    Freq: 3T, dtype: int64

    Downsample the series into 3 minute bins as above, but label each

    bin using the right edge instead of the left. Please note that the

    value in the bucket used as the label is not included in the bucket,

    which it labels. For example, in the original series the

    bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed

    value in the resampled bucket with the label ``2000-01-01 00:03:00``

    does not include 3 (if it did, the summed value would be 6, not 3).

    To include this value close the right side of the bin interval as

    illustrated in the example below this one.

    >>> series.resample('3T', label='right').sum()

    2000-01-01 00:03:00     3

    2000-01-01 00:06:00    12

    2000-01-01 00:09:00    21

    Freq: 3T, dtype: int64

    Downsample the series into 3 minute bins as above, but close the right

    side of the bin interval.

    >>> series.resample('3T', label='right', closed='right').sum()

    2000-01-01 00:00:00     0

    2000-01-01 00:03:00     6

    2000-01-01 00:06:00    15

    2000-01-01 00:09:00    15

    Freq: 3T, dtype: int64

    Upsample the series into 30 second bins.

    >>> series.resample('30S').asfreq()[0:5] #select first 5 rows

    2000-01-01 00:00:00   0.0

    2000-01-01 00:00:30   NaN

    2000-01-01 00:01:00   1.0

    2000-01-01 00:01:30   NaN

    2000-01-01 00:02:00   2.0

    Freq: 30S, dtype: float64

    Upsample the series into 30 second bins and fill the ``NaN``

    values using the ``pad`` method.

    >>> series.resample('30S').pad()[0:5]

    2000-01-01 00:00:00    0

    2000-01-01 00:00:30    0

    2000-01-01 00:01:00    1

    2000-01-01 00:01:30    1

    2000-01-01 00:02:00    2

    Freq: 30S, dtype: int64

    Upsample the series into 30 second bins and fill the

    ``NaN`` values using the ``bfill`` method.

    >>> series.resample('30S').bfill()[0:5]

    2000-01-01 00:00:00    0

    2000-01-01 00:00:30    1

    2000-01-01 00:01:00    1

    2000-01-01 00:01:30    2

    2000-01-01 00:02:00    2

    Freq: 30S, dtype: int64

    Pass a custom function via ``apply``

    >>> def custom_resampler(array_like):

    ...     return np.sum(array_like)+5

    >>> series.resample('3T').apply(custom_resampler)

    2000-01-01 00:00:00     8

    2000-01-01 00:03:00    17

    2000-01-01 00:06:00    26

    Freq: 3T, dtype: int64

    For a Series with a PeriodIndex, the keyword `convention` can be

    used to control whether to use the start or end of `rule`.

    >>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',

                                                    freq='A',

                                                    periods=2))

    >>> s

    2012    1

    2013    2

    Freq: A-DEC, dtype: int64

    Resample by month using 'start' `convention`. Values are assigned to

    the first month of the period.

    >>> s.resample('M', convention='start').asfreq().head()

    2012-01    1.0

    2012-02    NaN

    2012-03    NaN

    2012-04    NaN

    2012-05    NaN

    Freq: M, dtype: float64

    Resample by month using 'end' `convention`. Values are assigned to

    the last month of the period.

    >>> s.resample('M', convention='end').asfreq()

    2012-12    1.0

    2013-01    NaN

    2013-02    NaN

    2013-03    NaN

    2013-04    NaN

    2013-05    NaN

    2013-06    NaN

    2013-07    NaN

    2013-08    NaN

    2013-09    NaN

    2013-10    NaN

    2013-11    NaN

    2013-12    2.0

    Freq: M, dtype: float64

    For DataFrame objects, the keyword ``on`` can be used to specify the

    column instead of the index for resampling.

    >>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])

    >>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')

    >>> df.resample('3T', on='time').sum()

                         a  b  c  d

    time

    2000-01-01 00:00:00  0  3  6  9

    2000-01-01 00:03:00  0  3  6  9

    2000-01-01 00:06:00  0  3  6  9

    For a DataFrame with MultiIndex, the keyword ``level`` can be used to

    specify on level the resampling needs to take place.

    >>> time = pd.date_range('1/1/2000', periods=5, freq='T')

    >>> df2 = pd.DataFrame(data=10*[range(4)],

                           columns=['a', 'b', 'c', 'd'],

                           index=pd.MultiIndex.from_product([time, [1, 2]])

                           )

    >>> df2.resample('3T', level=0).sum()

                         a  b   c   d

    2000-01-01 00:00:00  0  6  12  18

    2000-01-01 00:03:00  0  4   8  12

    See also

    --------

    groupby : Group by mapping, function, label, or list of labels.

help(np.tile)

Help on function tile in module numpy.lib.shape_base:

tile(A, reps)

    Construct an array by repeating A the number of times given by reps.

    If `reps` has length ``d``, the result will have dimension of

    ``max(d, A.ndim)``.

    If ``A.ndim < d``, `A` is promoted to be d-dimensional by prepending new

    axes. So a shape (3,) array is promoted to (1, 3) for 2-D replication,

    or shape (1, 1, 3) for 3-D replication. If this is not the desired

    behavior, promote `A` to d-dimensions manually before calling this

    function.

    If ``A.ndim > d``, `reps` is promoted to `A`.ndim by pre-pending 1's to it.

    Thus for an `A` of shape (2, 3, 4, 5), a `reps` of (2, 2) is treated as

    (1, 1, 2, 2).

    Note : Although tile may be used for broadcasting, it is strongly

    recommended to use numpy's broadcasting operations and functions.

    Parameters

    ----------

    A : array_like

        The input array.

    reps : array_like

        The number of repetitions of `A` along each axis.

    Returns

    -------

    c : ndarray

        The tiled output array.

    See Also

    --------

    repeat : Repeat elements of an array.

    broadcast_to : Broadcast an array to a new shape

    Examples

    --------

    >>> a = np.array([0, 1, 2])

    >>> np.tile(a, 2)

    array([0, 1, 2, 0, 1, 2])

    >>> np.tile(a, (2, 2))

    array([[0, 1, 2, 0, 1, 2],

           [0, 1, 2, 0, 1, 2]])

    >>> np.tile(a, (2, 1, 2))

    array([[[0, 1, 2, 0, 1, 2]],

           [[0, 1, 2, 0, 1, 2]]])

    >>> b = np.array([[1, 2], [3, 4]])

    >>> np.tile(b, 2)

    array([[1, 2, 1, 2],

           [3, 4, 3, 4]])

    >>> np.tile(b, (2, 1))

    array([[1, 2],

           [3, 4],

           [1, 2],

           [3, 4]])

    >>> c = np.array([1,2,3,4])

    >>> np.tile(c,(4,1))

    array([[1, 2, 3, 4],

           [1, 2, 3, 4],

           [1, 2, 3, 4],

           [1, 2, 3, 4]])

df2=pd.DataFrame({'time':times.repeat(3),

                 'key':np.tile(['a','b','c'],N),

                 'value':np.arange(N*3)})

df2

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

	time	key	value
0	2020-05-23 00:00:00	a	0
1	2020-05-23 00:00:00	b	1
2	2020-05-23 00:00:00	c	2
3	2020-05-23 00:01:00	a	3
4	2020-05-23 00:01:00	b	4
5	2020-05-23 00:01:00	c	5
6	2020-05-23 00:02:00	a	6
7	2020-05-23 00:02:00	b	7
8	2020-05-23 00:02:00	c	8
9	2020-05-23 00:03:00	a	9
10	2020-05-23 00:03:00	b	10
11	2020-05-23 00:03:00	c	11
12	2020-05-23 00:04:00	a	12
13	2020-05-23 00:04:00	b	13
14	2020-05-23 00:04:00	c	14
15	2020-05-23 00:05:00	a	15
16	2020-05-23 00:05:00	b	16
17	2020-05-23 00:05:00	c	17
18	2020-05-23 00:06:00	a	18
19	2020-05-23 00:06:00	b	19
20	2020-05-23 00:06:00	c	20
21	2020-05-23 00:07:00	a	21
22	2020-05-23 00:07:00	b	22
23	2020-05-23 00:07:00	c	23
24	2020-05-23 00:08:00	a	24
25	2020-05-23 00:08:00	b	25
26	2020-05-23 00:08:00	c	26
27	2020-05-23 00:09:00	a	27
28	2020-05-23 00:09:00	b	28
29	2020-05-23 00:09:00	c	29
30	2020-05-23 00:10:00	a	30
31	2020-05-23 00:10:00	b	31
32	2020-05-23 00:10:00	c	32
33	2020-05-23 00:11:00	a	33
34	2020-05-23 00:11:00	b	34
35	2020-05-23 00:11:00	c	35
36	2020-05-23 00:12:00	a	36
37	2020-05-23 00:12:00	b	37
38	2020-05-23 00:12:00	c	38
39	2020-05-23 00:13:00	a	39
40	2020-05-23 00:13:00	b	40
41	2020-05-23 00:13:00	c	41
42	2020-05-23 00:14:00	a	42
43	2020-05-23 00:14:00	b	43
44	2020-05-23 00:14:00	c	44

To do the same resampling for each value of 'key' for each value of 'key',we introduce the pandas.TimeGrouper object:

resample=df2.groupby(['key','time']).sum()

resample

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

		value
key	time
a	2020-05-23 00:00:00	0
	2020-05-23 00:01:00	3
	2020-05-23 00:02:00	6
	2020-05-23 00:03:00	9
	2020-05-23 00:04:00	12
	2020-05-23 00:05:00	15
	2020-05-23 00:06:00	18
	2020-05-23 00:07:00	21
	2020-05-23 00:08:00	24
	2020-05-23 00:09:00	27
	2020-05-23 00:10:00	30
	2020-05-23 00:11:00	33
	2020-05-23 00:12:00	36
	2020-05-23 00:13:00	39
	2020-05-23 00:14:00	42
b	2020-05-23 00:00:00	1
	2020-05-23 00:01:00	4
	2020-05-23 00:02:00	7
	2020-05-23 00:03:00	10
	2020-05-23 00:04:00	13
	2020-05-23 00:05:00	16
	2020-05-23 00:06:00	19
	2020-05-23 00:07:00	22
	2020-05-23 00:08:00	25
	2020-05-23 00:09:00	28
	2020-05-23 00:10:00	31
	2020-05-23 00:11:00	34
	2020-05-23 00:12:00	37
	2020-05-23 00:13:00	40
	2020-05-23 00:14:00	43
c	2020-05-23 00:00:00	2
	2020-05-23 00:01:00	5
	2020-05-23 00:02:00	8
	2020-05-23 00:03:00	11
	2020-05-23 00:04:00	14
	2020-05-23 00:05:00	17
	2020-05-23 00:06:00	20
	2020-05-23 00:07:00	23
	2020-05-23 00:08:00	26
	2020-05-23 00:09:00	29
	2020-05-23 00:10:00	32
	2020-05-23 00:11:00	35
	2020-05-23 00:12:00	38
	2020-05-23 00:13:00	41
	2020-05-23 00:14:00	44

time_key=pd.TimeGrouper('5min')

D:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: pd.TimeGrouper is deprecated and will be removed; Please use pd.Grouper(freq=...)

  """Entry point for launching an IPython kernel.

resample=(df2.set_index('time').groupby(['key',time_key]).sum())

resample

.dataframe tbody tr th:only-of-type { vertical-align: middle }
\3c pre>\3c code>.dataframe tbody tr th { vertical-align: top }
.dataframe thead th { text-align: right }

		value
key	time
a	2020-05-23 00:00:00	30
	2020-05-23 00:05:00	105
	2020-05-23 00:10:00	180
b	2020-05-23 00:00:00	35
	2020-05-23 00:05:00	110
	2020-05-23 00:10:00	185
c	2020-05-23 00:00:00	40
	2020-05-23 00:05:00	115
	2020-05-23 00:10:00	190

Techniques for method chainning

When applying a sequence of transformations to a dataset,you may find yourself creatingf numerous temporary variables that are never used in your analysis.

pseudocode here:

df=load_data()

df2=df[df['col2']<0]

df2['col1_demeaned']=df2['col1']-df2['col1'].mean()

result=df2.groupby('key').col1_demeaned.std()

While we are not using any real data here,this example highlights some new methods.First,the DataFrame.assign method is a functional alternative to column assignments of the form df[k]=v.Rather than modifying the object in-place,it returns a new DataFrame with the indicated modifications.So these statements are equivalent:

Usual non-functional way

df2=df.copy()

df2['k']=v

Functional assign way

df2=df.assign(k=v)

Assigning in-place may execute faster than using assign,but assign enables easier method chaining:

result=(df2.assign(col1_demeqaned=df2.col1-df2.col2.mean())

  .groupby('key')

  .col1_demeaned.std())

Version2

result=(load_data()[load_data()['col2']<0]

  .assign(col1_demeaned=df2.col1-df2.col2.mean())   # have to refering to df2....

  .groupby('key')

  .col1_demeaned.std()

  )

We use the outer parentheses to make it more convenient to add line breaks.

One thing to keep in mind when doing method chaining is that you may need to refer to temporary objects.In the preceding example, we cannot refer to the result of load_data until it has been assigning to the temporary variable df.To help with this assign and many other pandas functions accept function-like arguments,also known as callables.

To show callables in action, consider a fragment of the example from before:

df=load_data()

df2=df[df['col2']<0]

version 2

df=(load_data()

[lambda x:x['col2']<0]) # alternative method of refering to temporary objects using callables

result=(load_data()

[lambda x:x.col2<0]

.assign(col1_demeaned=lambda x:x.col1-x.col1.mean())

.groupby('key')

.col1_demeaned.std())

The pipe method

You can accomplish a lot with built-in pandas functions and the approaches to method chaining with callables taht we just looked at.However,sometimes you need to use your own functions or funcitons from third-party libraries.This is where the pipe method comes in.

Consider a sequence of functions calls:

a=f(df,arg1=v1)

b=g(a,v2,arg3=v3)

c=h(b,arg4=v4)

When using functions that accept and returns Series or DataFrame objects,you can rewrite this using calls to pipe:

result=(df.pipe(f,arg1=v1)

.pipe(g,v2,arg3=v3)

.pipe(h,arg4=v4))

The statement f(df) and df.pipe(f) are equivalent,but pipe makes chained inocation easier.

def group_demean(df,by,cols):

    result=df.copy()

    g=df.groupby(by)

    for c in cols:

        result[c]=df[c]-g[c].transform('mean')

    return result

result=(df[df.col1[<0]

.pipe(group_demean,['key1','key2'],['col1']])

Advanced pandas的更多相关文章

10分钟学习pandas
10 Minutes to pandas This is a short introduction to pandas, geared mainly for new users. You can se ...
【原】十分钟搞定pandas
http://www.cnblogs.com/chaosimple/p/4153083.html 本文是对pandas官方网站上<10 Minutes to pandas>的一个简单的翻译 ...
pandas入门
[原]十分钟搞定pandas 本文是对pandas官方网站上<10 Minutes to pandas>的一个简单的翻译,原文在这里.这篇文章是对pandas的一个简单的介绍,详细的介 ...
（转）十分钟入门pandas
本文是对pandas官方网站上<10 Minutes to pandas>的一个简单的翻译,原文在这里.这篇文章是对pandas的一个简单的介绍,详细的介绍请参考:Cookbook . 习 ...
10分钟快速搞定pandas
本文是对pandas官方网站上<10 Minutes to pandas>的一个简单的翻译,原文在这里.这篇文章是对pandas的一个简单的介绍,详细的介绍请参考:Cookbook .习惯 ...
pandas库的学习笔记
Environment pandas 0.21.0 python 3.6 jupyter notebook 开始习惯上,我们导入如下: import pandas as pd import nump ...
Python 数据处理库 pandas 入门教程
Python 数据处理库 pandas 入门教程2018/04/17 · 工具与框架 · Pandas, Python 原文出处: 强波的技术博客 pandas是一个Python语言的软件包,在我们使 ...
Kaggle：House Prices: Advanced Regression Techniques 数据预处理
本博客是博主在学习了两篇关于 "House Prices: Advanced Regression Techniques" 的教程 (House Prices EDA 和 Comp ...
十分钟搞定pandas
转至:http://www.cnblogs.com/chaosimple/p/4153083.html 本文是对pandas官方网站上<10 Minutes to pandas>的一个简单 ...
pandas category数据类型
实际应用pandas过程中,经常会用到category数据类型,通常以string的形式显示,包括颜色(红,绿,蓝),尺寸的大小(大,中,小),还有地理信息等(国家,省份),这些数据的处理经常会有各种 ...

随机推荐

ZLMediaKit: 快速入门
目录 ZLMediaKit是什么编译安装依赖库构建项目问题处理问题1: srtp 未找到, WebRTC 相关功能打开失败问题2:依赖库问题安装配置和运行问题处理问题1: 无权限监 ...
winform控件 datagridview分页功能界面实现需要有上一页下一页等操作控件 dataGridView1 控件的数据绑定方式如何实现分页中的数据修改然后进行保存请列出详细例子特别保存部分
以下提供一个示例来说明如何在 WinForms 中实现分页功能,并在分页中实现数据修改并保存的操作. 首先,我们需要一个包含数据源的 DataGridView 控件,并添加上一页.下一页等操作控件来实 ...
Linux 通过docker安装nginx，.net core sdk或运行时安装到Linux
1.Linux docker通过yum安装 https://blog.csdn.net/GMingZhou/article/details/94024453 https://qizhanming.co ...
红日复现为什么失败之struct-046流量分析加msf特征总结
struts2漏洞一.指纹识别 s2的url路径组成(详见struts.xml配置文件):name工程名+namespace命名空间+atcion名称+extends拓展名部署在根目录下,工程名可 ...
带大家做了个 AI 项目，没想到这么简单！
大家好,我是程序员鱼皮,现在已经是全民 AI 时代了,咱们程序员更要想办法榨干 AI,把 AI 利用起来.前几天我一时兴起,直播用 2 多个小时的时间,从需求分析开始,带大家做了一个 AI 海龟汤游戏 ...
Web前端入门第4问：HTML、CSS、JavaScript 的作用分别是什么？
HTML.CSS.JavaScript 的核心作用 HTML:网页的骨架功能:定义页面的内容结构(如按钮.表格.图片). 示例:<button>提交</button> 创建一 ...
Windows编程----结束进程
进程有启动就有终止,通过CreateProcess函数可以启动一个新的子进程,但是如何终结子进程呢?主要有四种方法: 通过主线程的入口函数(main函数.WinMain函数)的return关键字终止进 ...
Golang 入门 : 文件名、关键字与标识符
Go 的源文件以 .go 为后缀名存储在计算机中,这些文件名均由小写字母组成,如 scanner.go .如果文件名由多个部分组成,则使用下划线 _ 对它们进行分隔,如 scanner_test.go ...
关于能否用DeepSeek做危险的事情，DeepSeek本身给出了答案
AI教父辛顿说DeepSeek允许本地部署的话可能会导致用户用DeepSeek来做一些危险的事情(https://t.cj.sina.com.cn/articles/view/7879923924/m ...
Windows下PostgreSQL设置远程连接以及备份和恢复数据库
一.设置远程连接修改安装路径下的postgresql.conf,定位到listen_address = '*',确保其值为'*'(Windows下默认是这样的,可不用修改) 修改安装路径下的pg_ ...

Advanced pandas

Advanced pandas

Categorical Data

Background and motivation

Categorical type in pandas

Computations with Categorical

Better performance with categoricals

Categorical methods

Creating dummy variables for modeling

Advanced GroupBy use

Groupby transforms and 'unwrapped' groupbys

Grouped time resampling

Techniques for method chainning

pseudocode here:

Usual non-functional way

Functional assign way

Version2

version 2

The pipe method

Advanced pandas的更多相关文章

随机推荐

热门专题