Chapter 1 — DataFrames

A convention for tabular data

The purpose of this course is to impart the core skills you need to manipulate data stored in tables. Although “manipulating data stored in tables” may seem limited in scope, you will soon discover how flexible and potent this paradigm is.

We start with a convention for how to represent data stored in a table. The idea is that a table stores information about a collection of objects with attributes: a row in a table corresponds to an object, and a column corresponds to an attribute of the objects in the table. We will refer to these objects (rows) as observations and their attributes (columns) as variables. This convention is so useful and common in data analytics that we will take it for granted here and only work with data in this format. If you are interested in reading more, see Wickham on Tidy data.

Pandas DataFrames

In Pandas, a table of data is called a DataFrame. Technically, a DataFrame is a dictionary of Pandas Series objects, each corresponding to a column in the table. A Series is similar to a list or a Numpy array, but with row names and a label indicating the name of the Series. Because of this, commands you learn for manipulating arrays can be applied to Series and DataFrames as well. (In fact, Pandas Series, Numpy arrays, and regular Python lists are sometimes referred to collectively as array-likes, and can often be used interchangeably in Pandas.)
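For instance, here is a minimal sketch (with made-up values) showing the relationship between the two:

import pandas as pd
kills = pd.Series([4, 1, 7], name='kills') # a Series: one labeled column
df = pd.DataFrame({'kills': kills, 'deaths': [8, 6, 3]}) # a DataFrame: a dictionary of Series
type(df['kills']) # pandas.core.series.Series, since each column is itself a Series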

When you load data into a Pandas DataFrame, it lives in Python’s memory, so it is important to have a way to actually look at that data. If you’re using Anaconda Spyder, the Variable Explorer lets you do this directly. For the following examples, we’ll assume a DataFrame has been loaded into memory and stored in a variable called df, containing rows from the match_player_basic table. Each row contains end-of-match statistics for one of the participants in a match.

matchId teamId participantId champion win gameDuration kills deaths assists wardsPlaced wardsKilled goldEarned
2874719842 100 5 Lucian 0 30.33 4 8 6 9 2 10728
2874719842 100 3 Aatrox 0 30.33 1 6 8 8 1 10492
2874719842 200 9 Akali 1 30.33 7 3 8 16 5 12242
2874719842 200 7 Janna 1 30.33 4 4 14 25 3 9630
2874719842 200 6 Riven 1 30.33 9 5 8 14 0 14570
2874719842 200 10 Miss Fortune 1 30.33 6 4 8 14 8 13254
2874719842 100 4 Blitzcrank 0 30.33 1 5 7 24 7 6635
2874719842 100 2 Katarina 0 30.33 4 6 4 11 2 9059
2874719842 200 8 Rengar 1 30.33 4 4 9 13 5 13936
2874719842 100 1 Xin Zhao 0 30.33 10 5 4 9 4 14217

Here are a few useful Python commands that will help you when you’re starting to get familiar with a new dataset.

  • df.head() — returns the first 5 rows of df
  • df.tail() — returns the last 5 rows of df
  • df.describe() — gives summary statistics on each variable in df
  • df['var_name'] — accesses the column named var_name in df (for example, df['teamId'])
  • df['var_name'].value_counts() — returns a frequency table for var_name in df (for example, df['teamId'].value_counts())
  • df.sort_values(by='var_name') — sorts by var_name and returns a new DataFrame (for example, df.sort_values(by='kills'))
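
For instance, a first look at df (the table shown above) might use:

df.head() # first 5 rows
df.describe() # summary statistics for each numeric variable
df['teamId'].value_counts() # how many rows have teamId 100 vs. 200
df.sort_values(by='kills') # a new DataFrame, ordered from fewest to most kills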

Each of these commands returns a DataFrame or Series based on df. So, if you want to store a new DataFrame in memory that consists of the first 5 rows of df, you could use new_df = df.head().

On the other hand, there are times when you’ll only want to take a look at the data and don’t need to store it in memory. In that case, all you have to do is enter your command directly into the IPython console. When you do this, Python will print a human-readable string representation of the returned object. So, if you aren’t interested in storing the top 5 rows of your DataFrame but do want to take a quick look at them, just type df.head() into the console. In fact, you can type almost any object into the console (including simply df) and Python will print a string representation of the object for you! With that in mind, here are a few more commands you may find useful.

  • list(df) — returns the column names of the DataFrame as a list
  • df.T — returns a transposed version of df (rows become columns, columns become rows)
  • df.dtypes — gives the data type of each variable in the DataFrame
  • df.shape — returns a tuple with (# of observations, # of variables)
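
For example, with the df described above (the output comments are illustrative):

list(df) # ['matchId', 'teamId', 'participantId', 'champion', ...]
df.dtypes # e.g. champion is object (text), kills is int64
df.shape # e.g. (2000, 12) for 2000 observations of 12 variables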

Operations on variables

In keeping with our new paradigm for representing tabular data, we will typically think about operations on our table in terms of operations on variables. Pandas (and Numpy) makes it easy and efficient to do these kinds of operations. For example, suppose we have a DataFrame df that stores kill and assist information for players in a game, and we want to calculate each player’s involvement in kills in that game. To do this, we create a new variable called 'kill_participation', which is equal to the number of kills plus the number of assists a player got in a game. Pandas makes this kind of calculation convenient: instead of having to add the kill and assist values for each game individually, we can do it all at once with one command.

kills  assists  kill_participation
8      1        9
6      2        8
3      8        11
4      6        10
5      14       19

The syntax to create this new column and include it in df is simply:

df['kill_participation'] = df['kills'] + df['assists']

If you don’t want to store the Series as a new column in the DataFrame df, you could instead write

my_series = df['kills'] + df['assists']

which would store the resulting Series in a new Series object called my_series.

Other arithmetic operations like multiplication and division work similarly. In fact, you can apply all kinds of functions to DataFrame columns, taking advantage of Numpy’s fast vectorized operations: operations that act on all elements of an array simultaneously.
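
For example, using the goldEarned and gameDuration columns from our sample table, we could create a gold-per-minute variable (the name 'gold_per_min' is just for illustration) with a single vectorized command:

df['gold_per_min'] = df['goldEarned'] / df['gameDuration'] # row-by-row division, no loop needed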

Suppose we wanted to create a new variable that measures if a player died a lot in a game. Using df['deaths'].describe(), we find that the median number of deaths in our dataset is 5. Let’s create a variable that is 1 if a player dies more than 5 times, and is 0 otherwise.

df['deaths']  df['deaths']>5  df['over_5_deaths']
8             True            1
6             True            1
3             False           0
4             False           0
5             False           0

The syntax for this transformation is

df['over_5_deaths'] = (df['deaths']>5).astype(int)

and it works in two steps. First, df['deaths']>5 creates a Series of Boolean (True/False) values corresponding to the rows where the number of deaths was greater than 5, then .astype(int) says to interpret these Booleans as integers, which recasts True to 1 and False to 0. An equivalent way to do this would be to multiply by 1:

df['over_5_deaths'] = (df['deaths']>5)*1

Even though 1 is a single integer value and (df['deaths']>5) is a Series with many values, Pandas and Numpy interpret this command as multiplying every entry of (df['deaths']>5) by 1, just as we want. (This behavior is known as broadcasting.)

Of course, if we wanted to do this in two steps we could, like so:

tmp = df['deaths']>5 # Series that is True for rows where 'deaths' > 5
df['over_5_deaths'] = tmp.astype(int) # convert to int and store in df

Finally, there is another way: We can write our own function and then apply it to each element of the Series with .apply(). It’s possible to do this with a regular Python function, as shown here:

def check_if_over_five(x):
    return int(x > 5)

df['over_5_deaths'] = df['deaths'].apply(check_if_over_five)

Notice that none of these operations involve writing a for loop that iterates over rows in the DataFrame. If you ever find yourself writing code that iterates through individual entries of a DataFrame, stop and think carefully about whether there is a vectorized way to do it instead!

There is also a more succinct way to write the .apply() version, using Python’s lambda functions. A lambda lets us define a function in line, without having to give it a name. The syntax is simply:

df['over_5_deaths'] = df['deaths'].apply(lambda x: int(x>5))

Note: You are welcome to store a lambda function if you like. The syntax would look like check_if_over_five = lambda x: int(x>5). However, the point is that you don’t have to.

Subsetting DataFrames

Often, we will want to work with only part of our data, such as a specific column or collection of rows. In Pandas, subsetting (or “slicing”) DataFrames is done using the .loc and .iloc commands, which provide flexible ways to specify which rows and columns to keep.

  • df.loc[row_labels, col_labels] — returns a subset of df with specified row and column names given by row_labels and col_labels
  • df.iloc[row_indices, col_indices] — returns a subset of df with specified row and column indices given by row_indices and col_indices

To apply no restriction, use the : character. So df is the same as df.loc[:,:], which is the same as df.iloc[:,:]. Furthermore, .loc is flexible in that it lets us enter a Series of True/False values indicating which rows to keep, rather than row names themselves.

To put these all together, suppose we wanted to see the kill, death, and assist values for all Teemo games.

Our data, df:

matchid     participantid  champion  win  kills  deaths  assists
2874721684  5              Ashe      1    2      7       8
2874726164  9              Anivia    0    3      7       1
2874765289  1              Teemo     1    9      4       10
2874870994  7              Teemo     0    8      7       1
2874870994  7              Ashe      0    3      10      5
2874765810  2              Teemo     0    3      8       3

The subset we want, my_subset:

kills  deaths  assists
9      4       10
8      7       1
3      8       3

To subset our DataFrame accordingly, we would use:

rows_to_keep = df['champion']=='Teemo' # a Series of Booleans
cols_to_keep = ['kills', 'deaths', 'assists'] # a list
my_subset = df.loc[rows_to_keep, cols_to_keep]

STEP 1: Determine rows to keep

rows_to_keep = df['champion'] =='Teemo'

This line creates a Series of values that are True when Teemo is the champion played, and False otherwise.

champion  df['champion'] == 'Teemo'
Ashe      False
Anivia    False
Teemo     True
Teemo     True
Ashe      False
Teemo     True

STEP 2: Determine columns to keep

cols_to_keep = ['kills','deaths','assists']

The second line creates a list of strings corresponding to column names to keep.

These are the last three columns of df:

kills  deaths  assists
2      7       8
3      7       1
9      4       10
8      7       1
3      10      5
3      8       3

STEP 3: Return only those rows and columns

my_subset = df.loc[rows_to_keep, cols_to_keep]

Then, .loc is used to return only those rows and columns.

my_subset:

kills  deaths  assists
9      4       10
8      7       1
3      8       3

This last step could be done in one line, like so:

my_subset = df.loc[df['champion']=='Teemo', ['kills', 'deaths', 'assists']]

Alternatively, this could be done with .iloc by specifying the indices of the desired columns and rows instead. Remembering that Python starts counting at 0, we specify:

rows_to_keep = [2,3,5]
cols_to_keep = [4,5,6]
my_subset = df.iloc[rows_to_keep, cols_to_keep]

Or, more succinctly:

my_subset = df.iloc[[2,3,5], [4,5,6]]

Pandas is flexible in how it lets us filter our data. If we are only filtering by rows or columns (not both), we can omit the .loc part. Here are two examples of equivalent commands:

  • df.loc[:, 'my_col'] and df['my_col'] — returns the entire column 'my_col' from df
  • df.loc[df.champion=='Zed', :] and df[df.champion=='Zed'] — returns rows where the champion is Zed
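
In code, assuming the df from before:

df.loc[:, 'kills'] # full column, using .loc
df['kills'] # equivalent shorthand
df.loc[df.champion == 'Zed', :] # Zed rows, using .loc
df[df.champion == 'Zed'] # equivalent shorthand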

Technical note on subsetting data

The command my_subset = df.loc[df.champion=='Zed', :] creates a new DataFrame my_subset, but the data stored in that DataFrame is not necessarily distinct from the data in df. Instead, my_subset may be merely a reference to some portion of the data located in df. If you modify my_subset, Pandas will warn you: “A value is trying to be set on a copy of a slice from a DataFrame”, because the values in my_subset may be the same values as in df (they may occupy the same location in your computer’s memory).

The appropriate way to deal with this warning depends on your goal. If you want to modify a subset of the values in df, you can do so directly by using:

df.loc[df.champion=='Zed', :] = my_new_values

If you want to work with a new table that’s not linked to df, then create a copy like so:

my_subset = df.loc[df.champion=='Zed', :].copy()

For more on this, see the Pandas documentation for returning a view versus a copy.

The Pandas documentation for indexing and selecting data provides more details.

Data

The accompanying dataset for the first two chapters is match_player_basic.csv, a modified subset of match_player.csv to get you started. It stores end-of-game statistics on each player for the first 200 matches in match_player.csv. This is the same dataset that we have been using so far in this chapter.

Five rows of the dataset are shown below.

matchId teamId participantId champion win gameDuration kills deaths assists wardsPlaced wardsKilled goldEarned
3321609215 100 1 Garen 0 18.76 0 3 1 5 1 4590
3321609215 100 2 Morgana 0 18.76 0 4 3 11 4 4212
3321609215 200 8 Sett 1 18.76 3 2 7 8 3 5308
3321609215 200 9 Kaisa 1 18.76 4 2 7 5 1 6385
3321609215 200 10 Yasuo 1 18.76 9 1 4 5 0 9489

In this table, the observations are players-in-games because each row corresponds to a particular player in a particular game. The variables are different properties of those players in games, like which champion they were playing, whether or not they won, etc.

Tasks

Starting with the provided sample code, load the dataset in match_player_basic.csv into a DataFrame called df. Use the commands given earlier in this document to get an idea of what data is stored in the table, then try the following exercises.
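
If you prefer to load the file yourself, a minimal sketch (assuming match_player_basic.csv is in your working directory):

import pandas as pd
df = pd.read_csv('match_player_basic.csv')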

  1. Play rates: Create a table of the top 10 most played champions using the following commands:

    1. .value_counts() — to count the number of times each champion was played
    2. .sort_values(ascending = False) — to sort the result; note that you don’t need to specify the by= argument because .value_counts() returns a Series (one column), not a DataFrame
    3. .head(10) — to get the top 10 rows

    You will need to select the column corresponding to champions played in order to apply the methods.

  2. Kill-death differentials: Add a new column to df called 'kd_diff' containing the kill-death differential for a player in a game, defined as kills minus deaths, then:

    1. Subset df using .loc to return only rows corresponding to your favorite champion (e.g. Cassiopeia), then use .describe() to get summary statistics on their kill differential.
    2. Use .describe() to find the maximum kill differential in the dataset, then use .loc to subset the DataFrame to only those games in which the maximum kill differential occurred. Take a look at the data for these games. Can you get a sense of what happened?