
Data aggregation
As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built-in. We already saw the describe
function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:
>>> df.head() country city date hours 0 Germany Hamburg 2015-06-01 8 1 Germany Hamburg 2015-06-02 10 2 Germany Hamburg 2015-06-03 9 3 Germany Hamburg 2015-06-04 7 4 Germany Hamburg 2015-06-05 3
To view a summary per city
, we use the describe
function on the grouped data set:
>>> df.groupby("city").describe() hours city Berlin count 10.000000 mean 6.000000 std 3.741657 min 0.000000 25% 4.000000 50% 6.000000 75% 9.750000 max 10.000000 Birmingham count 10.000000 mean 5.100000 std 2.078995 min 2.000000 25% 4.000000 50% 5.500000 75% 6.750000 max 8.000000
On certain data sets
, it can be useful to group by more than one attribute. We can get an overview about the sunny hours per country and date by passing in two column names:
>>> df.groupby(["country", "date"]).describe() hours country date France 2015-06-01 count 5.000000 mean 6.200000 std 1.095445 min 5.000000 25% 5.000000 50% 7.000000 75% 7.000000 max 7.000000 2015-06-02 count 5.000000 mean 3.600000 std 3.577709 min 0.000000 25% 0.000000 50% 4.000000 75% 6.000000 max 8.000000 UK 2015-06-07 std 3.872983 min 0.000000 25% 2.000000 50% 6.000000 75% 8.000000 max 9.000000
We can compute single statistics as well:
>>> df.groupby("city").mean() hours city Berlin 6.0 Birmingham 5.1 Bordeax 4.7 Edinburgh 7.5 Frankfurt 5.8 Glasgow 4.8 Hamburg 5.5 Leipzig 5.0 London 4.8 Lyon 5.0 Manchester 5.2 Marseille 6.2 Munich 6.6 Nice 3.9 Paris 6.3
Finally, we can define any function to be applied on the groups with the agg
method. The above could have been written in terms of agg
like this:
>>> df.groupby("city").agg(np.mean) hours city Berlin 6.0 Birmingham 5.1 Bordeax 4.7 Edinburgh 7.5 Frankfurt 5.8 Glasgow 4.8 ...
But arbitrary functions are possible. As a last example, we define a custom
function, which takes an input of a series object and computes the difference between the smallest and the largest element:
>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s))) hours city Berlin 10 Birmingham 6 Bordeax 10 Edinburgh 8 Frankfurt 9 Glasgow 10 Hamburg 10 Leipzig 9 London 10 Lyon 8 Manchester 10 Marseille 10 Munich 9 Nice 10 Paris 9