Chapter 2 — Groupby-Aggregate
In this chapter, you’ll learn how to calculate statistics like the average number of kills per game for each champion. These statistics summarize (or aggregate) information about groups of observations in a dataset. Computing these statistics is easy with the groupby-aggregate operation, which acts in two steps:
- Similar observations are placed in groups (e.g. observations in match_player that correspond to the same champion).
- Within each group, a function is applied to the observations which aggregates all the values into a single value (e.g. by taking the mean of the values within the group).
We’ll get into the implementation details shortly, but before that, it will be helpful to start with the concept of data levels.
Data levels
The level of a table is a conceptual unit that a row represents. So, if you stored your match history in a table, that would be a match-level table, since each row corresponds to a match. Here are a few examples of tables from the same data source, where each contains rows representing different objects.
matchId | participantId | teamId | kills | deaths |
---|---|---|---|---|
3321609215 | 1 | 100 | 0 | 3 |
3321609215 | 6 | 200 | 4 | 1 |
3321609215 | 2 | 100 | 0 | 4 |
3321609215 | 7 | 200 | 1 | 1 |
3321710988 | 1 | 100 | 11 | 6 |
3321710988 | 6 | 100 | 13 | 11 |
3321710988 | 2 | 200 | 4 | 0 |
3321710988 | 7 | 200 | 7 | 8 |
3321820661 | 1 | 100 | 2 | 4 |
3321820661 | 6 | 100 | 0 | 5 |
3321820661 | 2 | 200 | 6 | 2 |
3321820661 | 7 | 200 | 3 | 6 |
matchId | teamId | kills | deaths |
---|---|---|---|
3321609215 | 100 | 0 | 7 |
3321609215 | 200 | 5 | 2 |
3321710988 | 100 | 24 | 17 |
3321710988 | 200 | 11 | 8 |
3321820661 | 100 | 2 | 9 |
3321820661 | 200 | 9 | 8 |
matchId | kills | deaths |
---|---|---|
3321609215 | 5 | 9 |
3321710988 | 35 | 25 |
3321820661 | 11 | 17 |
Table A stores players (or “participants”) in matches, with variables corresponding to a player’s kills and deaths in a match. Therefore, it is a (match,player)-level table. Table B stores the total kills and deaths for each team in a match, so it is a (match,team)-level table. Table C stores the total kills and deaths for each match, so it is a match-level table.
If you like, you can refer to Table A as a (match,team,player)-level table, since each row corresponds to a player on a particular team in a match. However, it is not necessary to include the team variable because each player can only be on one team.
In the last chapter, we learned to create new variables as functions of old ones, and to subset DataFrames. None of these operations change the data level of a table, but groupby-aggregate does.
Groupby-aggregate
The groupby-aggregate command allows us to calculate averages, totals, and much more. This can be a great way to summarize data to produce insights.
Suppose you are given a DataFrame that contains observations of players in matches, and you want to compute the average wards placed for each team. The original DataFrame is shown in Table 1, and the desired output is in Table 2.
matchId | teamId | wardsPlaced |
---|---|---|
3321609215 | 100 | 5 |
3321609215 | 100 | 11 |
3321609215 | 100 | 4 |
3321609215 | 100 | 5 |
3321609215 | 100 | 5 |
3321609215 | 200 | 6 |
3321609215 | 200 | 8 |
3321609215 | 200 | 8 |
3321609215 | 200 | 5 |
3321609215 | 200 | 5 |
matchId | teamId | (mean) wardsPlaced |
---|---|---|
3321609215 | 100 | 6 |
3321609215 | 200 | 6.4 |
Using groupby-aggregate, the code to create Table 2 df2 from Table 1 table1 is as simple as:
The operation works in two steps:
STEP 1: Groupby step
Observations are assigned to groups defined by the groupby variables, as shown in color. Observations are put in the same group if they have the same value for every groupby variable. In the example, the groupby variables are matchId and teamId. In the table below, groups are shown with colors and the groupby variables are shown in bold.
matchId | teamId | wardsPlaced |
---|---|---|
3321609215 | 100 | 5 |
3321609215 | 100 | 11 |
3321609215 | 100 | 4 |
3321609215 | 100 | 5 |
3321609215 | 100 | 5 |
3321609215 | 200 | 6 |
3321609215 | 200 | 8 |
3321609215 | 200 | 8 |
3321609215 | 200 | 5 |
3321609215 | 200 | 5 |
STEP 2: Aggregate step
The specified aggregation function is applied to each group. In the example, the aggregation function computes the mean of wardsPlaced within each group of observations.
matchId | teamId | wardsPlaced |
---|---|---|
3321609215 | 100 | 5 |
3321609215 | 100 | 11 |
3321609215 | 100 | 4 |
3321609215 | 100 | 5 |
3321609215 | 100 | 5 |
3321609215 | 200 | 6 |
3321609215 | 200 | 8 |
3321609215 | 200 | 8 |
3321609215 | 200 | 5 |
3321609215 | 200 | 5 |
The .reset_index() command is optional; we’ll explain what it does later in the Indexes section.
Parameters in groupby-aggregate
There are two components that must be specified for any groupby-aggregate command: the grouping variable or variables (by) and the aggregating function or functions (func).
- by: a variable name (string) or a list of variable names, defining the level of the new dataset
- func: aggregation function(s) for combining observations within each group.
Pandas is flexible about how you specify aggregation functions. Each of the following is a valid input for the func parameter.
- A name or reference to a single function, such as 'mean', np.max, or a user-defined function like my_range_function. All variables not listed in the groupby step will be aggregated based on the given function. Variables that can’t be aggregated will be discarded (e.g., the champion column will be discarded because its values are strings).
- A list of names or references to functions such as ['mean', 'max', 'min'] or ['mean', np.max, my_custom_agg_function]. All variables that can be aggregated will be according to each function passed, resulting in multiple variables in the resulting table.
- A dictionary that assigns aggregation function names or references to variables. Given Table 3 (shown below), the following code illustrates how to use a dictionary in the aggregation part, and the
result is shown in Table 4.
df3 = table3.groupby(by=['matchId', 'teamId']).agg(
{'wardsPlaced':'mean', 'wardsKilled':'max'}
).reset_index()
matchId | teamId | wardsPlaced | wardsKilled |
---|---|---|---|
3321609215 | 100 | 5 | 1 |
3321609215 | 100 | 11 | 4 |
3321609215 | 100 | 4 | 1 |
3321609215 | 100 | 5 | 2 |
3321609215 | 100 | 5 | 0 |
3321609215 | 200 | 6 | 1 |
3321609215 | 200 | 8 | 1 |
3321609215 | 200 | 8 | 3 |
3321609215 | 200 | 5 | 1 |
3321609215 | 200 | 5 | 0 |
matchId | teamId | (mean) wardsPlaced | (max) wardsKilled |
---|---|---|---|
3321609215 | 100 | 6 | 4 |
3321609215 | 200 | 6.4 | 3 |
Note: It’s important to give variables accurate names, especially after a groupby-aggregate command, which by default leaves variable names unchanged. For example, in Table 4, we should rename 'wardsPlaced' to 'avgWardsPlaced' to reflect the meaning of the new variable. The following command will rename the variables.
Alternatively, you can rename all of the variables in your DataFrame by assigning a list of new column names directly to the DataFrame.
If you do this, make sure you’ve got your variables in the same order as your names, or else they’ll be mismatched!
A noteworthy detail: Indexes
Every Pandas DataFrame has an Index, accessible via df.index, which is essentially just a list of row names. As we have seen in the previous chapter, we can use .iloc or .loc to slice a DataFrame based on the Index or variable names. For now, it’s not important to know the details of Indexes, but it’s good to know they exist so you don’t get confused when they show up.
When applying groupby-aggregate, the groups defined by the operation become the row labels for the resulting DataFrame in the form of an Index. (In the case of multiple grouping variables, it is a MultiIndex object.) This is a nice way of reminding us of the level of our new data, but can be inconvenient if you prefer working with regular columns. To convert this Index of a Dataframe back to columns, you can simply use .reset_index() on the DataFrame.
Let’s go back to Table 1. After this line of code
we get the following table.
matchId | teamId | wardsPlaced |
---|---|---|
3321609215 | 100 | 6 |
3321609215 | 200 | 6.4 |
matchId | teamId | wardsPlaced |
---|---|---|
3321609215 | 100 | 6 |
3321609215 | 200 | 6.4 |
Compared to the three variables of Table 2, Table 5 has only one variable (wardsPlaced), but with a MultiIndex with two keys.
Leaving the Indexes in, however, allows for easy selection or subsetting of the DataFrame using the .loc command. The syntax for slicing a DataFrame with a MultiIndex is also df.loc[row_labels, column_names] as we have seen in Chapter 1. The only difference is that we need to pass row_labels a tuple of row labels.
To subset this table, use df.loc[row_labels,col_labels] like we saw in Chapter 1. Instead of a string or number, row_labels will be a tuple (or list of tuples) corresponding to the desired row(s). For example, if we want to know the average wards placed by team 100 in this match, we can use:
Data
The tasks given here use the same data as the previous chapter, match_player_basic.csv.
Tasks
- Data levels. What is the data level of df2 after the following code:
df2 = df.groupby([A,B]).agg({C:my_agg_function})
- Champion stats. Create a table that shows the average kills, deaths, assists, and game duration for each champion.
- Teams with the most gold. Create a table that shows the 10 teams that earned the most gold. Note that each game has two teams. Teams are uniquely identified by both matchId and teamId.
- The deadliest matches. Create a table of the 10 matches that had the most total kills.
- When kills don’t lead to a win. Consider two types of teams: teams that have more kills than deaths but lose, and teams that have more deaths than kills but win.
Create a table that shows the average gold earned by these two types of teams. Your final result should look like this:
teamType averageGold moreKillsButLose - moreDeathsButWin -