The source for this notebook is the fantastic Kaggle kernel by python10pm, which compiles Kevin Markham's pandas tricks: https://www.kaggle.com/python10pm/pandas-100-tricks

I just converted it into a blog post so that it renders faster.

Welcome to this Kernel

  • This kernel is a compilation of the pandas tricks that Kevin Markham publishes weekly.

You can find the original 100 pandas tricks (created by Kevin Markham from Data School) on this page:

https://www.dataschool.io/python-pandas-tips-and-tricks/

* If you want to learn **sklearn**, check this kernel with tips and tricks:

https://www.kaggle.com/python10pm/sklearn-24-best-tips-and-tricks

Importing libraries and defining some helper functions

Go back to the Table of Contents

# basic libraries
import os
import numpy as np
import pandas as pd

# this will allow us to print all the files as we generate more in the kernel
def print_files():
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

# check Trick 91 for an example
def generate_sample_data(): # creates a fake df for testing
    number_of_rows = 20
    num_cols = 7
    cols = list("ABCDEFG")
    df = pd.DataFrame(np.random.randint(1, 20, size = (number_of_rows, num_cols)), columns=cols)
    df.index = pd.util.testing.makeIntIndex(number_of_rows)
    return df

# check Trick 91 for an example
def generate_sample_data_datetime(): # creates a fake df for testing
    number_of_rows = 365*24
    num_cols = 2
    cols = ["sales", "customers"]
    df = pd.DataFrame(np.random.randint(1, 20, size = (number_of_rows, num_cols)), columns=cols)
    df.index = pd.util.testing.makeDateIndex(number_of_rows, freq="H")
    return df

# show several prints in one cell. This will allow us to condense every trick into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print_files()
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv

Trick 100: Loading a sample of big data

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/us-accidents/US_Accidents_Dec19.csv")
print("The shape of the df is {}".format(df.shape))

del df

df = pd.read_csv("/kaggle/input/us-accidents/US_Accidents_Dec19.csv", skiprows = lambda x: x>0 and np.random.rand() > 0.01)
print("The shape of the df is {}. It has been reduced 10 times!".format(df.shape))


'''
How it works:
skiprows accepts a function that is evaluated against the integer row index.
x > 0 makes sure that the header row is not skipped.
np.random.rand() > 0.01 returns True 99% of the time, so 99% of the data rows
are skipped and roughly 1% are kept.
'''
The shape of the df is (2974335, 49)
The shape of the df is (29501, 49). It has been reduced ~100 times!
'\nHow it works:\nskiprows accepts a function that is evaluated against the integer row index.\nx > 0 makes sure that the header row is not skipped.\nnp.random.rand() > 0.01 returns True 99% of the time, so 99% of the data rows\nare skipped and roughly 1% are kept.\n'
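One caveat: np.random.rand() draws a fresh number on every read, so the sample changes each time. A minimal sketch of a reproducible variant, assuming a seeded random.Random instance (the seed value is arbitrary):

import random

rng = random.Random(42) # fixing the seed makes the sample identical across reads
df = pd.read_csv("/kaggle/input/us-accidents/US_Accidents_Dec19.csv",
                 skiprows = lambda x: x > 0 and rng.random() > 0.01) # keep ~1% of the data rows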

Trick 99: How to avoid Unnamed: 0 columns

Go back to the Table of Contents

d = {
    "zip_code": [12345, 56789, 101112, 131415],
    "factory": [100, 400, 500, 600],
    "warehouse": [200, 300, 400, 500],
    "retail": [1, 2, 3, 4]
}

df = pd.DataFrame(d)
df

# save to csv
df.to_csv("trick99data.csv")

df = pd.read_csv("trick99data.csv")
df
# To avoid Unnamed: 0

df = pd.read_csv("trick99data.csv", index_col=0)
# or avoid writing the index in the first place: df.to_csv("trick99data.csv", index = False)
df
zip_code factory warehouse retail
0 12345 100 200 1
1 56789 400 300 2
2 101112 500 400 3
3 131415 600 500 4
Unnamed: 0 zip_code factory warehouse retail
0 0 12345 100 200 1
1 1 56789 400 300 2
2 2 101112 500 400 3
3 3 131415 600 500 4
zip_code factory warehouse retail
0 12345 100 200 1
1 56789 400 300 2
2 101112 500 400 3
3 131415 600 500 4

Trick 98: Convert a wide DF into a long one

Go back to the Table of Contents

d = {
    "zip_code": [12345, 56789, 101112, 131415],
    "factory": [100, 400, 500, 600],
    "warehouse": [200, 300, 400, 500],
    "retail": [1, 2, 3, 4]
}

df = pd.DataFrame(d)
df

# we have to reassign the result: melt does not modify in place

# location_type is generated automatically from the columns left after specifying id_vars (you can pass a list too)
df = df.melt(id_vars = "zip_code", var_name = "location_type", value_name = "distance")
df
zip_code factory warehouse retail
0 12345 100 200 1
1 56789 400 300 2
2 101112 500 400 3
3 131415 600 500 4
zip_code location_type distance
0 12345 factory 100
1 56789 factory 400
2 101112 factory 500
3 131415 factory 600
4 12345 warehouse 200
5 56789 warehouse 300
6 101112 warehouse 400
7 131415 warehouse 500
8 12345 retail 1
9 56789 retail 2
10 101112 retail 3
11 131415 retail 4
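For the reverse direction (long back to wide), pivot undoes the melt; a minimal sketch using the df above (df_wide is just an illustrative name):

# invert the melt: one row per zip_code, one column per location_type
df_wide = df.pivot(index = "zip_code", columns = "location_type", values = "distance").reset_index()
df_wide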

Trick 97: Convert year and day of year into a single datetime column

Go back to the Table of Contents

# Trick 97
# Convert year and day-of-year pairs into dates
d = {
    "year": [2019, 2019, 2020],
    "day_of_year": [350, 365, 1]
}

df = pd.DataFrame(d)
df

# Step 1: create a combined column
df["combined"] = df["year"]*1000 + df["day_of_year"]
df

# Step 2: convert to datetime
df["date"] = pd.to_datetime(df["combined"], format = "%Y%j")
df
year day_of_year
0 2019 350
1 2019 365
2 2020 1
year day_of_year combined
0 2019 350 2019350
1 2019 365 2019365
2 2020 1 2020001
year day_of_year combined date
0 2019 350 2019350 2019-12-16
1 2019 365 2019365 2019-12-31
2 2020 1 2020001 2020-01-01
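The two steps can also be collapsed into a single expression; a sketch that skips the intermediate column:

# same result in one line: %Y is the 4-digit year, %j the day of the year
df["date"] = pd.to_datetime(df["year"]*1000 + df["day_of_year"], format = "%Y%j")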

Trick 96: Interactive plots out of the box in pandas

Go back to the Table of Contents

print(pd.__version__)
# Pandas version 0.25 or higher is required, and you need hvplot

import pandas as pd
df = pd.read_csv("../input/drinks-by-country/drinksbycountry.csv")
df

# this one is not interactive
df.plot(kind = "scatter", x = "spirit_servings", y = "wine_servings")

# run !pip install hvplot
#pd.options.plotting.backend = "hvplot"
#df.plot(kind = "scatter", x = "spirit_servings", y = "wine_servings", c = "continent")
0.25.3
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
... ... ... ... ... ... ...
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

193 rows × 6 columns

<matplotlib.axes._subplots.AxesSubplot at 0x7f325e201630>

Trick 95: Count the missing values

Go back to the Table of Contents

d = {
    "col1": [2019, 2019, 2020],
    "col2": [350, 365, 1],
    "col3": [np.nan, 365, None]
}

df = pd.DataFrame(d)
df

# Solution 1
df.isnull().sum().sum()

# Solution 2
df.isna().sum()

# Solution 3
df.isna().any()

# Solution 4:
df.isna().any(axis = None)
col1 col2 col3
0 2019 350 NaN
1 2019 365 365.0
2 2020 1 NaN
2
col1    0
col2    0
col3    2
dtype: int64
col1    False
col2    False
col3     True
dtype: bool
True

Trick 94: Save memory by fixing your data types

Go back to the Table of Contents

df = pd.read_csv("../input/titanic/train.csv", usecols = ["Pclass", "Sex", "Parch", "Cabin"])
df

# let's see how much our df occupies in memory
df.memory_usage(deep = True)

# convert to smaller datatypes
df = df.astype({"Pclass":"int8",
                "Sex":"category", 
                "Parch": "Sparse[int]", # most values are 0
                "Cabin":"Sparse[str]"}) # most values are NaN

df.memory_usage(deep = True)
Pclass Sex Parch Cabin
0 3 male 0 NaN
1 1 female 0 C85
2 3 female 0 NaN
3 1 female 0 C123
4 3 male 0 NaN
... ... ... ... ...
886 2 male 0 NaN
887 1 female 0 B42
888 3 female 2 NaN
889 1 male 0 C148
890 3 male 0 NaN

891 rows × 4 columns

Index       128
Pclass     7128
Sex       54979
Parch      7128
Cabin     34376
dtype: int64
Index       128
Pclass      891
Sex        1095
Parch      2556
Cabin     29696
dtype: int64
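To see the overall saving at a glance, you can sum the per-column usage; a minimal sketch, assuming you keep an unconverted copy around as df_raw:

df_raw = pd.read_csv("../input/titanic/train.csv", usecols = ["Pclass", "Sex", "Parch", "Cabin"])
before = df_raw.memory_usage(deep = True).sum() # total bytes before the dtype conversion
after = df.memory_usage(deep = True).sum()      # total bytes after
print("Reduced from {} to {} bytes ({:.1%} of the original)".format(before, after, after/before))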

Trick 93: Combine the small categories into a single category named "Other" (using frequencies)

Go back to the Table of Contents

d = {"genre": ["A", "A", "A", "A", "A", "B", "B", "C", "D", "E", "F"]}
df = pd.DataFrame(d)
df

# Step 1: count the frequencies
frequencies = df["genre"].value_counts(normalize = True)
frequencies

# Step 2: establish your threshold and filter the smaller categories
threshold = 0.1
small_categories = frequencies[frequencies < threshold].index
small_categories

# Step 3: replace the values
df["genre"] = df["genre"].replace(small_categories, "Other")
df["genre"].value_counts(normalize = True)
genre
0 A
1 A
2 A
3 A
4 A
5 B
6 B
7 C
8 D
9 E
10 F
A    0.454545
B    0.181818
D    0.090909
F    0.090909
C    0.090909
E    0.090909
Name: genre, dtype: float64
Index(['D', 'F', 'C', 'E'], dtype='object')
A        0.454545
Other    0.363636
B        0.181818
Name: genre, dtype: float64

Trick 92: Clean an object column with mixed data using regex

Go back to the Table of Contents

d = {"customer": ["A", "B", "C", "D"], "sales":[1100, 950.75, "$400", "$1250.35"]}
df = pd.DataFrame(d)
df

# Step 1: check the data types
df["sales"].apply(type)

# Step 2: use regex
df["sales"] = df["sales"].replace("[$,]", "", regex = True).astype("float")
df
df["sales"].apply(type)
customer sales
0 A 1100
1 B 950.75
2 C $400
3 D $1250.35
0      <class 'int'>
1    <class 'float'>
2      <class 'str'>
3      <class 'str'>
Name: sales, dtype: object
customer sales
0 A 1100.00
1 B 950.75
2 C 400.00
3 D 1250.35
0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
3    <class 'float'>
Name: sales, dtype: object

Trick 91: Creating a time series dataset for testing

Go back to the Table of Contents

# Solution 1
number_of_rows = 365*24 # hours in a year
pd.util.testing.makeTimeDataFrame(number_of_rows, freq="H")

# Solution 2
num_cols = 2
cols = ["sales", "customers"]
df = pd.DataFrame(np.random.randint(1, 20, size = (number_of_rows, num_cols)), columns=cols)
df.index = pd.util.testing.makeDateIndex(number_of_rows, freq="H")
df
A B C D
2000-01-01 00:00:00 -0.585670 -1.137078 -0.631975 -1.054320
2000-01-01 01:00:00 -0.237215 -1.816933 0.202014 0.660456
2000-01-01 02:00:00 -1.957172 -0.956697 -1.197462 -0.739047
2000-01-01 03:00:00 -1.241711 0.945880 -0.330081 -2.054438
2000-01-01 04:00:00 -0.455008 -0.445184 0.480131 -0.990698
... ... ... ... ...
2000-12-30 19:00:00 -1.305448 0.357798 1.445436 -0.914071
2000-12-30 20:00:00 -0.265488 0.216962 -0.792367 -0.547871
2000-12-30 21:00:00 1.881420 0.226771 2.184607 -0.091759
2000-12-30 22:00:00 -1.913806 -1.952609 -1.957035 -0.238062
2000-12-30 23:00:00 -0.835225 -1.202716 1.199405 -0.116859

8760 rows × 4 columns

sales customers
2000-01-01 00:00:00 13 11
2000-01-01 01:00:00 5 14
2000-01-01 02:00:00 3 8
2000-01-01 03:00:00 8 2
2000-01-01 04:00:00 16 1
... ... ...
2000-12-30 19:00:00 12 10
2000-12-30 20:00:00 7 9
2000-12-30 21:00:00 13 3
2000-12-30 22:00:00 7 1
2000-12-30 23:00:00 1 14

8760 rows × 2 columns

Trick 90: Moving columns to a specific location

Go back to the Table of Contents

d = {"A":[15, 20], "B":[20, 25], "C":[30 ,40], "D":[50, 60]}
df = pd.DataFrame(d)
df

# Using insert
df.insert(3, "C2", df["C"]*2)
df

# Other solution
df["C3"] = df["C"]*3 # create a new column; it will be added at the end
columns = df.columns.to_list() # create a list with all columns
location = 4 # specify the location where you want your new column
columns = columns[:location] + ["C3"] + columns[location:-1] # rearrange the list (the slice drops the trailing duplicate of C3)
df = df[columns] # reindex the dataframe with the column order you like
df
A B C D
0 15 20 30 50
1 20 25 40 60
A B C C2 D
0 15 20 30 60 50
1 20 25 40 80 60
A B C C2 C3 D
0 15 20 30 60 90 50
1 20 25 40 80 120 60

Trick 89: Split names into first and last name

Go back to the Table of Contents

df = pd.Series(["Geordi La Forge", "Deanna Troi", "Data"]).to_frame()
df.rename({0:"names"}, inplace = True, axis = 1)
df
#                              split on first space  
df["first_name"] = df["names"].str.split(n = 1).str[0]
df["last_name"] = df["names"].str.split(n = 1).str[1]
df
names
0 Geordi La Forge
1 Deanna Troi
2 Data
names first_name last_name
0 Geordi La Forge Geordi La Forge
1 Deanna Troi Deanna Troi
2 Data Data NaN
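An equivalent one-liner splits once with expand = True and assigns both columns at the same time; a sketch:

# split on the first space and assign both parts in one step
df[["first_name", "last_name"]] = df["names"].str.split(n = 1, expand = True)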

Trick 88: Rearrange columns in a df

Go back to the Table of Contents

df = generate_sample_data()
df.head()

# Solution 1
df[["A", "C", "D", "F", "E", "G", "B"]].head() # doesn't modify in place

# Solution 2
cols_to_move = ["A", "G", "B"]

new_order = cols_to_move + [c for c in df.columns if c not in cols_to_move] # generate your new order
df[new_order].head()

# Solution 3: using index
cols = df.columns[[0, 5, 3, 4, 2, 1, 6]] # df.columns is an Index; passing a list of positions reorders it, and therefore the columns
df[cols].head()
A B C D E F G
0 18 18 2 4 17 19 18
1 4 3 2 7 19 9 19
2 10 10 4 7 12 1 3
3 15 6 19 5 14 17 16
4 13 5 2 11 15 15 9
A C D F E G B
0 18 2 4 19 17 18 18
1 4 2 7 9 19 19 3
2 10 4 7 1 12 3 10
3 15 19 5 17 14 16 6
4 13 2 11 15 15 9 5
A G B C D E F
0 18 18 18 2 4 17 19
1 4 19 3 2 7 19 9
2 10 3 10 4 7 12 1
3 15 16 6 19 5 14 17
4 13 9 5 2 11 15 15
A F D E C B G
0 18 19 4 17 2 18 18
1 4 9 7 19 2 3 19
2 10 1 7 12 4 10 3
3 15 17 5 14 19 6 16
4 13 15 11 15 2 5 9

Trick 87: Aggregate your datetime data by day and filter weekends

Go back to the Table of Contents

df = generate_sample_data_datetime()
df.shape
df.head()

# Step 1: resample by D. Basically aggregate by day and use to_frame() to convert the result to a frame
daily_sales = df.resample("D")["sales"].sum().to_frame()
daily_sales

# Step 2: filter weekends
weekends_sales = daily_sales[daily_sales.index.dayofweek.isin([5, 6])]
weekends_sales

'''
dayofweek day
0         Monday
1         Tuesday
2         Wednesday
3         Thursday
4         Friday
5         Saturday
6         Sunday
'''
(8760, 2)
sales customers
2000-01-01 00:00:00 13 10
2000-01-01 01:00:00 9 10
2000-01-01 02:00:00 18 16
2000-01-01 03:00:00 8 16
2000-01-01 04:00:00 11 8
sales
2000-01-01 257
2000-01-02 261
2000-01-03 265
2000-01-04 270
2000-01-05 267
... ...
2000-12-26 239
2000-12-27 229
2000-12-28 231
2000-12-29 220
2000-12-30 217

365 rows × 1 columns

sales
2000-01-01 257
2000-01-02 261
2000-01-08 224
2000-01-09 227
2000-01-15 249
... ...
2000-12-16 207
2000-12-17 232
2000-12-23 285
2000-12-24 257
2000-12-30 217

105 rows × 1 columns

'\ndayofweek day\n0         Monday\n1         Tuesday\n2         Wednesday\n3         Thursday\n4         Friday\n5         Saturday\n6         Sunday\n'

Trick 86: Named aggregations - avoids multiindex

Go back to the Table of Contents

print_files()

df = pd.read_csv("/kaggle/input/titanic/train.csv")
df.head()

# Problem 1
print("The problem is that we don't get to choose the resulting column names")
df.groupby("Pclass")["Age"].agg(["mean", "max"])

# Problem 2
print("The problem is that we end up with a MultiIndex")
df.groupby("Pclass").agg({"Age":["mean", "max"]})

# Solution, new in pandas 0.25 and higher
print("Now we have solved the previous problems by specifying the final column names we want.")
print("BUT IT ONLY WORKS ON A SINGLE COLUMN. FOR THIS KIND OF OPERATION ON MULTIPLE COLUMNS, CHECK THE NEXT CELL")
df.groupby("Pclass")["Age"].agg(age_mean = "mean", age_max = "max")
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
The problem is that we don't get to choose the resulting column names
mean max
Pclass
1 38.233441 80.0
2 29.877630 70.0
3 25.140620 74.0
The problem is that we end up with a MultiIndex
Age
mean max
Pclass
1 38.233441 80.0
2 29.877630 70.0
3 25.140620 74.0
Now we have solved the previous problems by specifying the final column names we want.
BUT IT ONLY WORKS ON A SINGLE COLUMN. FOR THIS KIND OF OPERATION ON MULTIPLE COLUMNS, CHECK THE NEXT CELL
age_mean age_max
Pclass
1 38.233441 80.0
2 29.877630 70.0
3 25.140620 74.0

Trick 86bis: Named aggregations on multiple columns - avoids multiindex

Go back to the Table of Contents

def my_agg(x):
    names = {
        'age_mean': x['Age'].mean(),
        'age_max':  x['Age'].max(),
        'fare_mean': x['Fare'].mean(),
        'fare_max': x['Fare'].max()
    } # define your custom column names and operations

    return pd.Series(names, index=list(names.keys())) # every column created in the dictionary above will appear in the result

df.groupby('Pclass').apply(my_agg)

# reference
# https://stackoverflow.com/questions/44635626/rename-result-columns-from-pandas-aggregation-futurewarning-using-a-dict-with
age_mean age_max fare_mean fare_max
Pclass
1 38.233441 80.0 84.154687 512.3292
2 29.877630 70.0 20.662183 73.5000
3 25.140620 74.0 13.675550 69.5500

Trick 85: Convert values of one type to another

Go back to the Table of Contents

# Do some fast feature eng on the DF
d = {"gender":["male", "female", "male"], "color":["red", "green", "blue"], "age":[25, 30, 15]}
df = pd.DataFrame(d)
df

# Solution
map_dict = {"male":"M", "female":"F"}
df["gender_mapped"] = df["gender"].map(map_dict) # using dictionaries to map values
df["color_factorized"] = df["color"].factorize()[0] # using factorize: returns a tuple of arrays (array([0, 1, 2]), Index(['red', 'green', 'blue'], dtype='object')) that's why we select [0]
df["age_compared_boolean"] = df["age"] < 18 # return a True False boolean value

df
gender color age
0 male red 25
1 female green 30
2 male blue 15
gender color age gender_mapped color_factorized age_compared_boolean
0 male red 25 M 0 False
1 female green 30 F 1 False
2 male blue 15 M 2 True

Trick 84: Show fewer rows in a df

Go back to the Table of Contents

print("This df occupies way too much space")
df = generate_sample_data()
df

print("using set_option to save some screen space")
pd.set_option("display.max_rows", 6)
df

print("use reset_option all to reset to default")
pd.reset_option("all")
df
This df occupies way too much space
A B C D E F G
0 19 16 10 18 4 17 16
1 6 2 9 12 13 5 16
2 16 16 12 18 9 12 14
3 13 19 2 11 11 9 16
4 11 17 12 10 16 11 12
5 13 2 19 9 9 9 7
6 18 16 6 12 19 16 8
7 10 15 16 19 2 12 7
8 19 3 11 1 10 12 16
9 1 7 12 19 8 15 1
10 3 19 1 16 12 2 8
11 7 1 19 3 3 3 4
12 11 18 18 5 6 11 4
13 4 3 18 11 9 10 15
14 1 18 9 6 19 2 4
15 15 11 19 17 5 4 4
16 7 14 6 15 18 17 17
17 18 13 17 8 7 6 17
18 1 9 13 14 12 17 11
19 18 7 11 14 11 7 4
using set_option to save some screen space
A B C D E F G
0 19 16 10 18 4 17 16
1 6 2 9 12 13 5 16
2 16 16 12 18 9 12 14
... ... ... ... ... ... ... ...
17 18 13 17 8 7 6 17
18 1 9 13 14 12 17 11
19 18 7 11 14 11 7 4

20 rows × 7 columns

use reset_option("all") to reset to the defaults

/opt/conda/lib/python3.6/site-packages/pandas/_config/config.py:622: FutureWarning: 
: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

  warnings.warn(d.msg, FutureWarning)
A B C D E F G
0 19 16 10 18 4 17 16
1 6 2 9 12 13 5 16
2 16 16 12 18 9 12 14
3 13 19 2 11 11 9 16
4 11 17 12 10 16 11 12
5 13 2 19 9 9 9 7
6 18 16 6 12 19 16 8
7 10 15 16 19 2 12 7
8 19 3 11 1 10 12 16
9 1 7 12 19 8 15 1
10 3 19 1 16 12 2 8
11 7 1 19 3 3 3 4
12 11 18 18 5 6 11 4
13 4 3 18 11 9 10 15
14 1 18 9 6 19 2 4
15 15 11 19 17 5 4 4
16 7 14 6 15 18 17 17
17 18 13 17 8 7 6 17
18 1 9 13 14 12 17 11
19 18 7 11 14 11 7 4
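If the warnings from resetting everything are a nuisance, you can also reset only the option you changed; a minimal sketch:

# reset just the one option we touched, instead of all of them
pd.reset_option("display.max_rows")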

Trick 83: Correct the data types while importing the df

Go back to the Table of Contents

print_files()
df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv")

# Step 1: Let's check the data types of the columns
col_types = df.dtypes.to_frame()
col_types.rename({0:"type"}, inplace = True, axis = 1)
col_types
col_types.to_csv("trick83data.csv")

# Step 2: Let's import the previous data and convert it to a dictionary
col_dict = pd.read_csv("trick83data.csv", index_col = 0)["type"].to_dict()

# Step 3: Edit the dictionary with the correct data types
print("Original dictionary")
col_dict
col_dict["country"] = "category"
col_dict["continent"] = "category"
print("Modified dictionary")
col_dict

# Step 4: Use the dictionary to import the data
df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv", dtype=col_dict)
df.dtypes

# Note: you can also take the dict from Step 1 and paste it in directly, like this
df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv",
dtype=
{'country': 'category',
 'beer_servings': 'int64',
 'spirit_servings': 'int64',
 'wine_servings': 'int64',
 'total_litres_of_pure_alcohol': 'float64',
 'continent': 'category'})
# However, if you have many columns, this can be confusing
df.dtypes
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
type
country object
beer_servings int64
spirit_servings int64
wine_servings int64
total_litres_of_pure_alcohol float64
continent object
Original dictionary
{'country': 'object',
 'beer_servings': 'int64',
 'spirit_servings': 'int64',
 'wine_servings': 'int64',
 'total_litres_of_pure_alcohol': 'float64',
 'continent': 'object'}
Modified dictionary
{'country': 'category',
 'beer_servings': 'int64',
 'spirit_servings': 'int64',
 'wine_servings': 'int64',
 'total_litres_of_pure_alcohol': 'float64',
 'continent': 'category'}
country                         category
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object
country                         category
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

Trick 82: Select data by label and position (chained iloc and loc)

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv", index_col="country")
df.iloc[15:20, :].loc[:, "beer_servings":"wine_servings"]
# iloc is used to filter the rows and loc the columns
beer_servings spirit_servings wine_servings
country
Belarus 142 373 42
Belgium 295 84 212
Belize 263 114 8
Benin 34 4 13
Bhutan 23 0 0

Trick 81: Use apply(type) to see if you have mixed data types

Go back to the Table of Contents

d = {"customer":["A", "B", "C", "D", "E"], "sales":[100, "100", 50, 550.20, "375.25"]}
df = pd.DataFrame(d)
# everything seems fine, but df["sales"].sum() crashes: we have mixed data types
df.dtypes
df["sales"].apply(type) # we have int, str and float mixed in one column
df["sales"].apply(type).value_counts() # see the count of each type

df["sales"] = df["sales"].astype(float) # convert the data to float
df["sales"].sum()
df["sales"].apply(type).value_counts()
customer    object
sales       object
dtype: object
0      <class 'int'>
1      <class 'str'>
2      <class 'int'>
3    <class 'float'>
4      <class 'str'>
Name: sales, dtype: object
<class 'int'>      2
<class 'str'>      2
<class 'float'>    1
Name: sales, dtype: int64
1175.45
<class 'float'>    5
Name: sales, dtype: int64

Trick 80: Select multiple slices of columns from a df

Go back to the Table of Contents

df = generate_sample_data().T
cols_str = list(map(str, list(df.columns))) # so that we can do df["0"] as string for the example
df.columns = cols_str

# Using pandas concatenation
# if you are ever confused about axis = 1 or axis = 0, just put axis = "columns" or axis = "rows"
pd.concat([df.loc[:, "0":"2"], df.loc[:, "6":"10"], df.loc[:, "16":"19"]], axis = "columns") # ------------------> here we are selecting columns converted to strings

# Using lists
# please note that df.columns is an Index, so we are using positions to filter # -------------------------> here we are selecting the columns by index position
df[list(df.columns[0:3]) + list(df.columns[6:11]) + list(df.columns[16:20])]

# Using numpy
df.iloc[:, np.r_[0:3, 6:11, 16:20]] # probably the most beautiful solution
0 1 2 6 7 8 9 10 16 17 18 19
A 13 3 7 10 10 3 10 18 15 13 17 10
B 2 10 3 14 15 15 8 9 6 18 17 4
C 9 12 3 17 16 4 18 9 3 5 10 1
D 7 17 6 16 4 8 12 13 10 1 7 1
E 16 12 13 19 4 5 14 16 13 6 6 16
F 11 1 1 2 3 13 9 13 17 2 15 3
G 16 13 3 4 14 9 13 1 6 6 7 15
0 1 2 6 7 8 9 10 16 17 18 19
A 13 3 7 10 10 3 10 18 15 13 17 10
B 2 10 3 14 15 15 8 9 6 18 17 4
C 9 12 3 17 16 4 18 9 3 5 10 1
D 7 17 6 16 4 8 12 13 10 1 7 1
E 16 12 13 19 4 5 14 16 13 6 6 16
F 11 1 1 2 3 13 9 13 17 2 15 3
G 16 13 3 4 14 9 13 1 6 6 7 15
0 1 2 6 7 8 9 10 16 17 18 19
A 13 3 7 10 10 3 10 18 15 13 17 10
B 2 10 3 14 15 15 8 9 6 18 17 4
C 9 12 3 17 16 4 18 9 3 5 10 1
D 7 17 6 16 4 8 12 13 10 1 7 1
E 16 12 13 19 4 5 14 16 13 6 6 16
F 11 1 1 2 3 13 9 13 17 2 15 3
G 16 13 3 4 14 9 13 1 6 6 7 15

Trick 79: Count of rows that match a condition

Go back to the Table of Contents

df = generate_sample_data()
df.head()
df.shape

# absolute values
(df["A"] < 5).sum()
print("In column A, {} rows have values below 5".format((df["A"] < 5).sum()))

# percentage
(df["A"] < 5).mean()
print("In column A, the values below 5 represent {:.0%} of the rows".format((df["A"] < 5).mean()))
A B C D E F G
0 16 5 8 6 12 10 19
1 8 2 9 12 7 14 11
2 10 6 6 5 8 2 18
3 16 7 3 17 1 9 19
4 17 7 2 18 7 9 16
(20, 7)
3
In column A, 3 rows have values below 5
0.15
In column A, the values below 5 represent 15% of the rows

Trick 78: Keep track of where your data is coming from when using multiple sources

Go back to the Table of Contents

# let's generate some fake data
df1 = generate_sample_data()
df2 = generate_sample_data()
df3 = generate_sample_data()
# df1.head()
# df2.head()
# df3.head()
df1.to_csv("trick78data1.csv")
df2.to_csv("trick78data2.csv")
df3.to_csv("trick78data3.csv")

# Step 1 generate list with the file name
lf = []
for _,_, files in os.walk("/kaggle/working/"):
    for f in files:
        if "trick78" in f:
            lf.append(f)
            
lf

# You can use this on your local machine
#from glob import glob
#files = glob("trick78*.csv")

# Step 2: assign creates a new column named filename whose value is the file name
# other than that, we are just concatenating the different dataframes
df = pd.concat((pd.read_csv(file).assign(filename = file) for file in lf), ignore_index = True)
df.sample(10)
['trick78data2.csv', 'trick78data1.csv', 'trick78data3.csv']
Unnamed: 0 A B C D E F G filename
30 10 12 12 11 16 2 4 5 trick78data1.csv
50 10 15 6 14 4 4 5 5 trick78data3.csv
19 19 19 13 13 13 10 14 19 trick78data2.csv
42 2 10 4 16 14 16 10 6 trick78data3.csv
59 19 5 10 17 6 18 9 5 trick78data3.csv
57 17 6 17 3 9 19 14 11 trick78data3.csv
25 5 18 2 4 17 19 10 18 trick78data1.csv
39 19 5 11 1 2 2 6 16 trick78data1.csv
18 18 12 5 3 14 7 3 4 trick78data2.csv
26 6 16 6 2 14 4 5 7 trick78data1.csv

Trick 77: Combine the small categories into a single category named "Other" (using where)

Go back to the Table of Contents

d = {"genre": ["A", "A", "A", "A", "A", "B", "B", "C", "D", "E", "F"]}
df = pd.DataFrame(d)
df["genre"].value_counts()

# Step 1: find the top four categories
top_four = df["genre"].value_counts().nlargest(4).index
top_four

# Step 2: update the df
df_updated = df.where(df["genre"].isin(top_four), other = "Other")
df_updated["genre"].value_counts()
A    5
B    2
D    1
F    1
C    1
E    1
Name: genre, dtype: int64
Index(['A', 'B', 'D', 'F'], dtype='object')
A        5
Other    2
B        2
D        1
F        1
Name: genre, dtype: int64

Trick 76: Filter only the largest categories

Go back to the Table of Contents

df = pd.read_csv("../input/imdb-data/IMDB-Movie-Data.csv")
df.columns = map(str.lower, list(df.columns)) # convert headers to lowercase
df.shape
# select the top 3 genres
top_genre = df["genre"].value_counts().to_frame()[0:3].index

# now let's filter the df with the top genre
df_top = df[df["genre"].isin(top_genre)]
df_top
df_top.shape
df_top["genre"].unique()
(1000, 12)
rank title genre description director actors year runtime (minutes) rating votes revenue (millions) metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
12 13 Rogue One Action,Adventure,Sci-Fi The Rebel Alliance makes a risky move to steal... Gareth Edwards Felicity Jones, Diego Luna, Alan Tudyk, Donnie... 2016 133 7.9 323118 532.17 65.0
21 22 Manchester by the Sea Drama A depressed uncle is asked to take care of his... Kenneth Lonergan Casey Affleck, Michelle Williams, Kyle Chandle... 2016 137 7.9 134213 47.70 96.0
24 25 Independence Day: Resurgence Action,Adventure,Sci-Fi Two decades after the first Independence Day i... Roland Emmerich Liam Hemsworth, Jeff Goldblum, Bill Pullman,Ma... 2016 120 5.3 127553 103.14 32.0
32 33 X-Men: Apocalypse Action,Adventure,Sci-Fi After the re-emergence of the world's first mu... Bryan Singer James McAvoy, Michael Fassbender, Jennifer Law... 2016 144 7.1 275510 155.33 52.0
... ... ... ... ... ... ... ... ... ... ... ... ...
948 949 After Earth Action,Adventure,Sci-Fi A crash landing leaves Kitai Raige and his fat... M. Night Shyamalan Jaden Smith, David Denman, Will Smith,Sophie O... 2013 100 4.9 166512 60.52 33.0
952 953 Sex and the City 2 Comedy,Drama,Romance While wrestling with the pressures of life, lo... Michael Patrick King Sarah Jessica Parker, Kim Cattrall, Kristin Da... 2010 146 4.3 62403 95.33 27.0
964 965 Custody Drama The lives of three women are unexpectedly chan... James Lapine Viola Davis, Hayden Panettiere, Catalina Sandi... 2016 104 6.9 280 NaN 72.0
978 979 It's Only the End of the World Drama Louis (Gaspard Ulliel), a terminally ill write... Xavier Dolan Nathalie Baye, Vincent Cassel, Marion Cotillar... 2016 97 7.0 10658 NaN 48.0
992 993 Take Me Home Tonight Comedy,Drama,Romance Four years after graduation, an awkward high s... Michael Dowse Topher Grace, Anna Faris, Dan Fogler, Teresa P... 2011 97 6.3 45419 6.92 NaN

133 rows × 12 columns

(133, 12)
array(['Action,Adventure,Sci-Fi', 'Drama', 'Comedy,Drama,Romance'],
      dtype=object)

Trick 75: Count the number of words in a pandas series

Go back to the Table of Contents

df = pd.read_csv("../input/imdb-data/IMDB-Movie-Data.csv", usecols=["Title"])
df["Words"] = df["Title"].str.count(" ") + 1
df
Title Words
0 Guardians of the Galaxy 4
1 Prometheus 1
2 Split 1
3 Sing 1
4 Suicide Squad 2
... ... ...
995 Secret in Their Eyes 4
996 Hostel: Part II 3
997 Step Up 2: The Streets 5
998 Search Party 2
999 Nine Lives 2

1000 rows × 2 columns
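Note that str.count(" ") + 1 over-counts titles that contain consecutive spaces. Splitting on runs of whitespace is more robust; a sketch:

# split() without arguments splits on any whitespace run, so double spaces don't inflate the count
df["Words"] = df["Title"].str.split().str.len()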

Trick 74: Web scraping using read_html() and the match parameter

Go back to the Table of Contents

# Run this on your local machine
# url = "https://es.wikipedia.org/wiki/Twitter"
# tables = pd.read_html(url)
# len(tables)

# matching_tables = pd.read_html(url, match = "Followers")
# matching_tables[0]

Trick 73: Remove a column and store it as a separate series

Go back to the Table of Contents

df = pd.read_csv("../input/imdb-data/IMDB-Movie-Data.csv")
df.head()

meta = df.pop("Metascore").to_frame()
df.head()
meta.head()
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions)
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02
Metascore
0 76.0
1 65.0
2 62.0
3 59.0
4 40.0

Trick 72: Convert a continuous variable to categorical (cut and qcut)

Go back to the Table of Contents

df = pd.read_csv("../input/imdb-data/IMDB-Movie-Data.csv")
df.head()

# Using cut you can specify the bin edges
pd.cut(df["Metascore"], bins = [0, 25, 50, 75, 99]).head()

# Using qcut you can specify the number of bins, and it will generate bins of approximately equal size
pd.qcut(df["Metascore"], q = 3).head()

# cut and qcut also accept labels for the bins
pd.qcut(df["Metascore"], q = 4, labels = ["awful", "bad", "average", "good"]).head()
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
0    (75, 99]
1    (50, 75]
2    (50, 75]
3    (50, 75]
4    (25, 50]
Name: Metascore, dtype: category
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 99]]
0     (68.0, 100.0]
1      (51.0, 68.0]
2      (51.0, 68.0]
3      (51.0, 68.0]
4    (10.999, 51.0]
Name: Metascore, dtype: category
Categories (3, interval[float64]): [(10.999, 51.0] < (51.0, 68.0] < (68.0, 100.0]]
0       good
1    average
2    average
3        bad
4      awful
Name: Metascore, dtype: category
Categories (4, object): [awful < bad < average < good]

Trick 71: Read data from a PDF (tabula-py)

Go back to the Table of Contents

# you will have to run this on your local machine
#from tabula import read_pdf
# df = read_pdf("test.pdf", pages = "all")

Trick 70: Print the current version of pandas and its dependencies

Go back to the Table of Contents

print(pd.__version__)
print(pd.show_versions())
0.25.3

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.112+
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : C.UTF-8
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.17.4
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.3.1
setuptools       : 42.0.1.post20191125
Cython           : 0.29.14
pytest           : 5.0.1
hypothesis       : 4.50.6
sphinx           : 2.2.1
blosc            : None
feather          : 0.4.0
xlsxwriter       : 1.2.6
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.9.0
pandas_datareader: 0.8.1
bs4              : 4.8.1
bottleneck       : 1.3.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.1
matplotlib       : 3.0.3
numexpr          : 2.6.9
odfpy            : None
openpyxl         : 3.0.1
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
s3fs             : 0.4.0
scipy            : 1.3.3
sqlalchemy       : 1.3.11
tables           : 3.5.1
xarray           : 0.14.1
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.6
None

Trick 69: Check if 2 series are "similar"

Go back to the Table of Contents

d = {"A":[1, 2, 3, 4,], "B":[1.0, 2.0, 3.0, 4.0], "C":[1.00000, 2.00000, 3.00000, 4.000003], "D":[1.0, 2.0, 3.0, 4.0], "E":[4.0, 2.0, 3.0, 1.0]}
df = pd.DataFrame(d)
df

df["A"].equals(df["B"]) # they requiere identical datatypes
df["B"].equals(df["C"])
df["B"].equals(df["D"])
df["B"].equals(df["E"]) # and the same order

print(pd.testing.assert_series_equal(df["A"], df["B"], check_names=False, check_dtype=False)) # assertion passes
A B C D E
0 1 1.0 1.000000 1.0 4.0
1 2 2.0 2.000000 2.0 2.0
2 3 3.0 3.000000 3.0 3.0
3 4 4.0 4.000003 4.0 1.0
False
False
True
False
None
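equals demands exact equality, which is why B and C fail on a 3e-06 difference. For "similar within a tolerance", numpy's allclose works on series too; a sketch:

# True: 4.000003 vs 4.0 is within the default relative tolerance
np.allclose(df["B"], df["C"])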

Trick 68: Web scraping using read_html()

Go back to the Table of Contents

# You will have to run this on your local machine
#apple_stocks = pd.read_html("https://finance.yahoo.com/quote/AAPL/history?p=AAPL")
#pd.concat([apple_stocks[0], apple_stocks[1]])

Trick 67: Create new columns or overwrite existing ones using assign, and set a title for the df

Go back to the Table of Contents

print_files()

df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv", usecols=["continent", "beer_servings"])
df.head()

(df.assign(continent = df["continent"].str.title(),
           beer_ounces = df["beer_servings"]*12,
           beer_galons = lambda df: df["beer_ounces"]/128)
   .query("beer_galons > 30")
   .style.set_caption("Average beer consumption")) # set_caption is what sets the title
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
beer_servings continent
0 0 Asia
1 89 Europe
2 25 Africa
3 245 Europe
4 217 Africa
Average beer consumption
beer_servings continent beer_ounces beer_galons
45 361 Europe 4332 33.8438
62 347 Africa 4164 32.5312
65 346 Europe 4152 32.4375
98 343 Europe 4116 32.1562
117 376 Africa 4512 35.25
135 343 Europe 4116 32.1562
188 333 South America 3996 31.2188

Trick 66: Create a bunch of new columns using a for loop and f-strings df[f'{col}_new']

Go back to the Table of Contents

d = {"state":["ny", "CA", "Tx", "FI"], "country":["USA", "usa", "UsA", "uSa"], "pop":[1000000, 2000000, 30000, 40000]}
df = pd.DataFrame(d)
df

int_types = ["int64"]
# creating new columns
for col in df.columns:
    ctype = str(df[col].dtype)
    if ctype in int_types:
        df[f'{col}_millions'] = df[col]/1000000
    elif ctype == "object":
        df[f'{col}_new'] = df[col].str.upper()
        # you can also drop the original column
        df.drop(col, inplace = True, axis = "columns")
        
df
state country pop
0 ny USA 1000000
1 CA usa 2000000
2 Tx UsA 30000
3 FI uSa 40000
pop state_new country_new pop_millions
0 1000000 NY USA 1.00
1 2000000 CA USA 2.00
2 30000 TX USA 0.03
3 40000 FI USA 0.04

Trick 65: Select columns using f-strings (requires Python 3.6+)

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv")
df

drink = "wine"

# allows us to iterate fast over columns
df[f'{drink}_servings'].to_frame()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
... ... ... ... ... ... ...
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

193 rows × 6 columns

wine_servings
0 0
1 54
2 14
3 312
4 45
... ...
188 3
189 1
190 0
191 4
192 4

193 rows × 1 columns

Trick 64: Fixing "SettingWithCopyWarning" when creating a new column

Go back to the Table of Contents

df = pd.DataFrame({"gender":["Male", "Female", "Female", "Male"]})
df

# Getting this nasty warning
males = df[df["gender"] == "Male"]
males["abbreviation"] = "M"

# Fixing the error
print("Fixing the warning with print")
males = df[df["gender"] == "Male"].copy()
males["abbreviation"] = "M"
males
gender
0 Male
1 Female
2 Female
3 Male
Fixing the warning with copy()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
gender abbreviation
0 Male M
3 Male M

Trick 63: Calculate running count with groups using cumcount() + 1

Go back to the Table of Contents

d = {"salesperson":["Nico", "Carlos", "Juan", "Nico", "Nico", "Juan", "Maria", "Carlos"], "item":["Car", "Truck", "Car", "Truck", "cAr", "Car", "Truck", "Moto"]}
df = pd.DataFrame(d)
df

# Fixing columns
df["salesperson"] = df["salesperson"].str.title()
df["item"] = df["item"].str.title()

df["count_by_person"] = df.groupby("salesperson").cumcount() + 1
df["count_by_item"] = df.groupby("item").cumcount() + 1
df["count_by_both"] = df.groupby(["salesperson","item"]).cumcount() + 1
df
salesperson item
0 Nico Car
1 Carlos Truck
2 Juan Car
3 Nico Truck
4 Nico cAr
5 Juan Car
6 Maria Truck
7 Carlos Moto
salesperson item count_by_person count_by_item count_by_both
0 Nico Car 1 1 1
1 Carlos Truck 1 1 1
2 Juan Car 1 2 1
3 Nico Truck 2 2 1
4 Nico Car 3 3 2
5 Juan Car 2 4 2
6 Maria Truck 1 3 1
7 Carlos Moto 2 1 1

Trick 62: Fixing "SettingWithCopyWarning" when changing columns using loc

Go back to the Table of Contents

df = pd.DataFrame({"gender":["Male", "Female", "Female", "Male"]})
df

# Getting this nasty warning
df[df["gender"] == "Male"]["gender"] = 1
df[df["gender"] == "Female"]["gender"] = 0


print("Fix using loc")
df.loc[df["gender"] == "Male", "gender"] = 1
df.loc[df["gender"] == "Female", "gender"] = 0
df
gender
0 Male
1 Female
2 Female
3 Male
Fix using loc
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
gender
0 1
1 0
2 0
3 1
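For a plain recode like this one, map sidesteps the indexing question entirely; a sketch:

df = pd.DataFrame({"gender":["Male", "Female", "Female", "Male"]})
df["gender"] = df["gender"].map({"Male":1, "Female":0}) # values missing from the dict would become NaN
df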

Trick 61: Reading JSON from the web into a df

Go back to the Table of Contents

url = "https://github.com/justmarkham?tab=repositories"

# run it on your local machine
# df = pd.read_json(url)
# df = df[df["fork"] == False]
# df.shape
# df.head()

# lc = ["name", "stargazers_count", "forks_count"]
# df[lc].sort_values("stargazers_count", ascending = False).head(10)

Trick 60: Creating running totals with cumsum function

Go back to the Table of Contents

d = {"salesperson":["Nico", "Carlos", "Juan", "Nico", "Nico", "Juan", "Maria", "Carlos"], "item":[10, 120, 130, 200, 300, 550, 12.3, 200]}
df = pd.DataFrame(d)
df

df["running_total"] = df["item"].cumsum()
df["running_total_by_person"] = df.groupby("salesperson")["item"].cumsum()
df

# other useful functions are cummax(), cummin(), cumprod()
salesperson item
0 Nico 10.0
1 Carlos 120.0
2 Juan 130.0
3 Nico 200.0
4 Nico 300.0
5 Juan 550.0
6 Maria 12.3
7 Carlos 200.0
salesperson item running_total running_total_by_person
0 Nico 10.0 10.0 10.0
1 Carlos 120.0 130.0 120.0
2 Juan 130.0 260.0 130.0
3 Nico 200.0 460.0 210.0
4 Nico 300.0 760.0 510.0
5 Juan 550.0 1310.0 680.0
6 Maria 12.3 1322.3 12.3
7 Carlos 200.0 1522.3 320.0

Trick 59: Combine the output of an aggregation with the original df using transform

Go back to the Table of Contents

d = {"orderid":[1, 1, 1, 2, 2, 3, 4, 5], "item":[10, 120, 130, 200, 300, 550, 12.3, 200]}
df = pd.DataFrame(d)
df

print("This is the output we want to aggregate to the original df")
df.groupby("orderid")["item"].sum().to_frame()

df["total_items_sold"] = df.groupby("orderid")["item"].transform(sum)
df
orderid item
0 1 10.0
1 1 120.0
2 1 130.0
3 2 200.0
4 2 300.0
5 3 550.0
6 4 12.3
7 5 200.0
This is the output we want to attach to the original df
item
orderid
1 260.0
2 500.0
3 550.0
4 12.3
5 200.0
orderid item total_items_sold
0 1 10.0 260.0
1 1 120.0 260.0
2 1 130.0 260.0
3 2 200.0 500.0
4 2 300.0 500.0
5 3 550.0 550.0
6 4 12.3 12.3
7 5 200.0 200.0
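The same result can be obtained with an explicit groupby plus merge, which makes the intermediate totals table visible; a sketch (the name total_items_sold_v2 is just illustrative, to avoid clobbering the column created above):

# aggregate, then join the totals back onto each row via orderid
totals = df.groupby("orderid")["item"].sum().to_frame("total_items_sold_v2")
df.merge(totals, left_on = "orderid", right_index = True)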

Trick 58: Use header and skiprows to get rid of bad data or empty rows while importing

Go back to the Table of Contents

# we have empty rows and bad data
df = pd.read_csv("/kaggle/input/trick58data/trick58data.csv")
df

# importing correct data
df = pd.read_csv("/kaggle/input/trick58data/trick58data.csv", header = 2, skiprows = [3,4])
df
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 c1 c2 c3
2 NaN NaN NaN
3 NaN NaN NaN
4 10 20 30
5 40 50 60
6 70 80 90
c1 c2 c3
0 10 20 30
1 40 50 60
2 70 80 90

Trick 57: Accessing the groups of a groupby object (get_group())

Go back to the Table of Contents

print_files()

df = pd.read_csv("/kaggle/input/imdb-data/IMDB-Movie-Data.csv")
df

gbdf = df.groupby("Genre")
gbdf.get_group("Horror")
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
1 2 Prometheus Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
2 3 Split Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
3 4 Sing Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
4 5 Suicide Squad Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0
... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 Secret in Their Eyes Crime,Drama,Mystery A tight-knit team of rising investigators, alo... Billy Ray Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... 2015 111 6.2 27585 NaN 45.0
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0
997 998 Step Up 2: The Streets Drama,Music,Romance Romantic sparks occur between two dance studen... Jon M. Chu Robert Hoffman, Briana Evigan, Cassie Ventura,... 2008 98 6.2 70699 58.01 50.0
998 999 Search Party Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
999 1000 Nine Lives Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0

1000 rows × 12 columns

Rank Title Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
42 43 Don't Fuck in the Woods Horror A group of friends are going on a camping trip... Shawn Burkett Brittany Blanton, Ayse Howard, Roman Jossart,N... 2016 73 2.7 496 NaN NaN
258 259 Lights Out Horror Rebecca must unlock the terror behind her litt... David F. Sandberg Teresa Palmer, Gabriel Bateman, Maria Bello,Bi... 2016 81 6.4 69823 67.24 58.0
269 270 Satanic Horror Four friends on their way to Coachella stop of... Jeffrey G. Hunt Sarah Hyland, Steven Krueger, Justin Chon, Cla... 2016 85 3.7 2384 NaN NaN
302 303 The Cabin in the Woods Horror Five friends go for a break at a remote cabin,... Drew Goddard Kristen Connolly, Chris Hemsworth, Anna Hutchi... 2012 95 7.0 295554 42.04 72.0
401 402 The Black Room Horror PAUL and JENNIFER HEMDALE have just moved into... Rolfe Kanefsky Natasha Henstridge, Lukas Hassel, Lin Shaye,Do... 2016 91 3.9 240 NaN 71.0
623 624 Friday the 13th Horror A group of young adults discover a boarded up ... Marcus Nispel Jared Padalecki, Amanda Righetti, Derek Mears,... 2009 97 5.6 78631 65.00 34.0
675 676 The Mist Horror A freak storm unleashes a species of bloodthir... Frank Darabont Thomas Jane, Marcia Gay Harden, Laurie Holden,... 2007 126 7.2 233346 25.59 58.0
858 859 The Hills Have Eyes Horror A suburban American family is being stalked by... Alexandre Aja Ted Levine, Kathleen Quinlan, Dan Byrd, Emilie... 2006 107 6.4 136642 41.78 52.0
936 937 The Human Centipede (First Sequence) Horror A mad scientist kidnaps and mutilates a trio o... Tom Six Dieter Laser, Ashley C. Williams, Ashlynn Yenn... 2009 92 4.4 60655 0.18 33.0
988 989 Martyrs Horror A young woman's quest for revenge against the ... Pascal Laugier Morjana Alaoui, Mylène Jampanoï, Catherine Bég... 2008 99 7.1 63785 NaN 89.0
996 997 Hostel: Part II Horror Three American college students studying abroa... Eli Roth Lauren German, Heather Matarazzo, Bijou Philli... 2007 94 5.5 73152 17.54 46.0

Trick 56: Apply mappings or functions to the whole df (applymap)

Go back to the Table of Contents

df = pd.DataFrame({"A":["Male", "Female", "Female", "Male"], "B":["x", "y", "z", "A"], "C":["male", "female", "male", "female"], "D":[1, 2, 3, 4]})
df

# first let's use applymap to standardize the text
df = df.applymap(lambda x: x.lower() if type(x) == str else x)

mapping = {"male":0, "female":1}

print("PROBLEM: Applies to the whole df but returns None for unmapped values")
df.applymap(mapping.get)

print("Get the correct result but you have to specify the columns. If you don't want to do this, check the next result")
df[["A", "C"]].applymap(mapping.get)

print("Conditional applymap: map the value if possible, else return it unchanged")
df = df.applymap(lambda x: mapping[x] if x in mapping.keys() else x)
df
A B C D
0 Male x male 1
1 Female y female 2
2 Female z male 3
3 Male A female 4
PROBLEM: Applies to the whole df but returns None for unmapped values
A B C D
0 0 None 0 None
1 1 None 1 None
2 1 None 0 None
3 0 None 1 None
Get the correct result but you have to specify the columns. If you don't want to do this, check the next result
A C
0 0 0
1 1 1
2 1 0
3 0 1
Conditional applymap: map the value if possible, else return it unchanged
A B C D
0 0 x 0 1
1 1 y 1 2
2 1 z 0 3
3 0 a 1 4

Trick 55: Filtering a df with multiple criteria using reduce

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv")
df

print("Classical filter hard to read and mantain.")
df[(df["continent"] == "Europe") & (df["beer_servings"] > 150) & (df["wine_servings"] > 50) & (df["spirit_servings"] < 60)]

print("You can split it across multiple lines to make it more readable. But it's still hard to read.")
df[
    (df["continent"] == "Europe") & 
    (df["beer_servings"] > 150) & 
    (df["wine_servings"] > 50) & 
    (df["spirit_servings"] < 60)
]

print("Solution saving criteria as objects")

cr1 = df["continent"] == "Europe"
cr2 = df["beer_servings"] > 150
cr3 = df["wine_servings"] > 50
cr4 = df["spirit_servings"] < 60

df[cr1 & cr2 & cr3 & cr4]

print("Solution using reduce")
from functools import reduce

# combine our criteria using a lambda
# the lambda takes 2 parameters, x and y, and returns x & y
# reduce applies it pairwise across (cr1, cr2, cr3, cr4), AND-ing them all together
criteria = reduce(lambda x, y: x & y, (cr1, cr2, cr3, cr4))
df[criteria]
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
... ... ... ... ... ... ...
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

193 rows × 6 columns

Classical filter: hard to read and maintain.
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
156 Slovenia 270 51 276 10.6 Europe
You can split it across multiple lines to make it more readable. But it's still hard to read.
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
156 Slovenia 270 51 276 10.6 Europe
Solution saving criteria as objects
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
156 Slovenia 270 51 276 10.6 Europe
Solution using reduce
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
156 Slovenia 270 51 276 10.6 Europe
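numpy offers the same pairwise reduction without functools; a sketch:

# elementwise AND across all four boolean series
df[np.logical_and.reduce([cr1, cr2, cr3, cr4])]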

Trick 54: Calculate the difference between each row and the previous (diff())

Go back to the Table of Contents

df = generate_sample_data()
df["A_diff"] = df["A"].diff() # calculate the difference between 2 rows
df["A_diff_pct"] = df["A"].pct_change()*100 # calculates the porcentual variation between 2 rows

# add some style
df.style.format({"A_diff_pct":'{:.2f}%'})
A B C D E F G A_diff A_diff_pct
0 6 14 5 8 11 4 2 nan nan%
1 9 11 19 7 12 9 18 3 50.00%
2 8 13 9 15 19 13 11 -1 -11.11%
3 7 10 6 13 7 9 10 -1 -12.50%
4 17 19 1 8 19 7 10 10 142.86%
5 6 6 15 3 8 13 1 -11 -64.71%
6 7 14 18 6 3 13 7 1 16.67%
7 16 16 17 7 12 13 14 9 128.57%
8 12 1 8 4 17 17 16 -4 -25.00%
9 10 3 10 14 9 1 3 -2 -16.67%
10 18 7 9 1 15 1 3 8 80.00%
11 1 5 13 7 1 1 10 -17 -94.44%
12 2 12 7 11 11 10 19 1 100.00%
13 14 10 17 5 15 5 19 12 600.00%
14 1 14 7 3 10 3 16 -13 -92.86%
15 17 17 1 10 7 10 5 16 1600.00%
16 11 12 7 18 12 15 16 -6 -35.29%
17 6 11 3 5 7 13 17 -5 -45.45%
18 8 14 10 7 9 12 7 2 33.33%
19 6 12 11 12 16 6 9 -2 -25.00%

Trick 53: Shuffle rows of a df (df.sample())

Go back to the Table of Contents

df = generate_sample_data()

df.sample(frac = 0.5, random_state = 2)
df.sample(frac = 0.5, random_state = 2).reset_index(drop = True) # reset index after shuffling
A B C D E F G
12 13 11 16 6 3 15 9
4 3 2 16 2 7 2 1
18 5 11 10 5 5 14 2
0 15 19 4 5 6 7 11
9 13 4 14 5 14 1 9
5 17 2 1 11 19 11 6
3 4 17 10 17 6 11 9
10 11 16 8 6 1 14 19
1 9 7 7 8 1 16 1
17 11 10 17 4 6 9 3
A B C D E F G
0 13 11 16 6 3 15 9
1 3 2 16 2 7 2 1
2 5 11 10 5 5 14 2
3 15 19 4 5 6 7 11
4 13 4 14 5 14 1 9
5 17 2 1 11 19 11 6
6 4 17 10 17 6 11 9
7 11 16 8 6 1 14 19
8 9 7 7 8 1 16 1
9 11 10 17 4 6 9 3
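
Note that frac = 0.5 keeps only half of the rows. For a true shuffle of the whole df, a sketch with frac = 1:

df.sample(frac = 1, random_state = 2).reset_index(drop = True) # all rows, new random order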

Trick 52: Making plots with pandas

Go back to the Table of Contents

df = generate_sample_data()

df.plot(kind = "line")
df.plot(kind = "bar")
df.plot(kind = "barh")
df.plot(kind = "hist")
df.plot(kind = "box")
df.plot(kind = "kde")
df.plot(kind = "area")

# the following plots require x and y
df.plot(x = "A", y = "B", kind = "scatter")
df.plot(x = "A", y = "B", kind = "hexbin")
df.plot(x = "A", y = "B", kind = "pie") # here you can pass only x but you need to add subplots = True

# other plots are available through pd.plotting
# more about plotting https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
<matplotlib.axes._subplots.AxesSubplot at 0x7f3291f694e0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f3291a6fb00>
<matplotlib.axes._subplots.AxesSubplot at 0x7f3291876198>
<matplotlib.axes._subplots.AxesSubplot at 0x7f3291688be0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f3291f226a0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f3291510a20>
<matplotlib.axes._subplots.AxesSubplot at 0x7f328ee9ebe0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f328e9a9550>
<matplotlib.axes._subplots.AxesSubplot at 0x7f328e8f3e48>
<matplotlib.axes._subplots.AxesSubplot at 0x7f328e873e10>

Trick 51: Concatenate 2 column strings

Go back to the Table of Contents

print_files()

df = pd.read_csv("/kaggle/input/titanic/train.csv")

# Solution 1: using str.cat 
df["Name"].str.cat(df["Sex"], sep = ", ").head()

# Solution 2: using the + sign
# note: .head() below is applied only to df["Sex"], so rows past index 4 align against missing values and become NaN (see the output)
df["Name"] + ", " + df["Sex"].head()
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
0                        Braund, Mr. Owen Harris, male
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                       Heikkinen, Miss. Laina, female
3    Futrelle, Mrs. Jacques Heath (Lily May Peel), ...
4                       Allen, Mr. William Henry, male
Name: Name, dtype: object
0                          Braund, Mr. Owen Harris, male
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                         Heikkinen, Miss. Laina, female
3      Futrelle, Mrs. Jacques Heath (Lily May Peel), ...
4                         Allen, Mr. William Henry, male
                             ...                        
886                                                  NaN
887                                                  NaN
888                                                  NaN
889                                                  NaN
890                                                  NaN
Length: 891, dtype: object
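
The NaN tail above comes from that alignment. A sketch that takes the head of the combined result instead:

(df["Name"] + ", " + df["Sex"]).head() # same result as str.cat, no NaN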

Trick 50: Named aggregation with multiple columns passing tuples (new in pandas 0.25)

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/titanic/train.csv")

# Typical groupby
print("Problem: MultiIndex")
df.groupby("Pclass").agg({"Age":["mean", "max"], "Survived": "mean"})

# Please note that this has been covered in 86 and 86 bis.
# This is just one more way to do it.
print("Named aggregation")
df.groupby("Pclass").agg(avg_age = ("Age", "mean"),
                        max_age = ("Age", "max"), 
                        survival_rate = ("Survived", "mean"))
Problem: MultiIndex
Age Survived
mean max mean
Pclass
1 38.233441 80.0 0.629630
2 29.877630 70.0 0.472826
3 25.140620 74.0 0.242363
Named aggregation
avg_age max_age survival_rate
Pclass
1 38.233441 80.0 0.629630
2 29.877630 70.0 0.472826
3 25.140620 74.0 0.242363

Trick 49: Sampling with pandas (with replacement and weights)

Go back to the Table of Contents

d = {"A": [100, 200, 300, 400, 100], "W":[10, 5, 0, 3, 8]}
df = pd.DataFrame(d)
df

# with replacement
df.sample(n = 5, replace = True, random_state = 2)

# adding weights
df.sample(n = 5, replace = True, random_state = 2, weights = "W")
A W
0 100 10
1 200 5
2 300 0
3 400 3
4 100 8
A W
0 100 10
0 100 10
3 400 3
2 300 0
3 400 3
A W
1 200 5
0 100 10
1 200 5
1 200 5
1 200 5

Trick 48: Useful parameters when using pd.read_csv()

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv")
df.head()
df.dtypes

# Let's import only the country and beer_servings columns, converting them to category and float64 respectively
# Import only the first 5 rows and treat 0 as NaN
df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv",
                    usecols=["country", "beer_servings"],
                    dtype={"country":"category", "beer_servings":"float64"},
                    nrows = 5,
                    na_values = 0.0)
df.head()
df.dtypes

# more about read_csv on https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
country beer_servings
0 Afghanistan NaN
1 Albania 89.0
2 Algeria 25.0
3 Andorra 245.0
4 Angola 217.0
country          category
beer_servings     float64
dtype: object

Trick 47: Create one row for each item in a list (explode)

Go back to the Table of Contents

d = {"Team":["FC Barcelona", "FC Real Madrid"], 
    "Players":[["Ter Stegen", "Semedo", "Piqué", "Lenglet", "Alba", "Rakitic", "De Jong", "Sergi Roberto", "Messi", "Suárez", "Griezmann"], \
               ["Courtois", "Carvajal", "Varane", "Sergio Ramos", "Mendy", "Kroos", "Valverde", "Casemiro", "Isco", "Benzema", "Bale"]]}

print("Notice that we have a list of players for each team. Let's generate a row for each player.")
df = pd.DataFrame(d)
df

print("Using explode to generate new rows for each player.")
df1 = df.explode("Players")
df1

print("Reverse this operation with groupby and agg")
df["Imploded"] = df1.groupby(df1.index)["Players"].agg(list)
df
Notice that we have a list of players for each team. Let's generate a row for each player.
Team Players
0 FC Barcelona [Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rak...
1 FC Real Madrid [Courtois, Carvajal, Varane, Sergio Ramos, Men...
Using explode to generate new rows for each player.
Team Players
0 FC Barcelona Ter Stegen
0 FC Barcelona Semedo
0 FC Barcelona Piqué
0 FC Barcelona Lenglet
0 FC Barcelona Alba
0 FC Barcelona Rakitic
0 FC Barcelona De Jong
0 FC Barcelona Sergi Roberto
0 FC Barcelona Messi
0 FC Barcelona Suárez
0 FC Barcelona Griezmann
1 FC Real Madrid Courtois
1 FC Real Madrid Carvajal
1 FC Real Madrid Varane
1 FC Real Madrid Sergio Ramos
1 FC Real Madrid Mendy
1 FC Real Madrid Kroos
1 FC Real Madrid Valverde
1 FC Real Madrid Casemiro
1 FC Real Madrid Isco
1 FC Real Madrid Benzema
1 FC Real Madrid Bale
Reverse this operation with groupby and agg
Team Players Imploded
0 FC Barcelona [Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rak... [Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rak...
1 FC Real Madrid [Courtois, Carvajal, Varane, Sergio Ramos, Men... [Courtois, Carvajal, Varane, Sergio Ramos, Men...

Trick 46: Store NaN in an integer type with Int64 (not int64)

Go back to the Table of Contents

print("Default series")
ser1 = pd.Series([10, 20])
ser1

print("Let's add a NaN to an int64 series")
ser1 = pd.Series([10, 20, np.nan])
ser1 # Notice it has been converted to float64

print("But if we use Int64 than everything will work")
ser1 = pd.Series([10, 20, np.nan], dtype = "Int64")
ser1
Default series
0    10
1    20
dtype: int64
Let's add a NaN to an int64 series
0    10.0
1    20.0
2     NaN
dtype: float64
But if we use Int64 then everything works
0     10
1     20
2    NaN
dtype: Int64
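
An existing float column can also be converted after the fact. A minimal sketch:

ser1 = pd.Series([10, 20, np.nan]) # float64 because of the NaN
ser1.astype("Int64") # back to nullable integers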

Trick 45: Create rows for values separated by commas in a cell (assign and explode)

Go back to the Table of Contents

d = {"Team":["FC Barcelona", "FC Real Madrid"], 
    "Players":["Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rakitic, De Jong, Sergi Roberto, Messi, Suárez, Griezmann",
               "Courtois, Carvajal, Varane, Sergio Ramos, Mendy, Kroos, Valverde, Casemiro, Isco, Benzema, Bale"]}

print("Notice that we have a list of players for each team separated by commas. Let's generate a row for each player.")
df = pd.DataFrame(d)
df

print("Notice that we have converted to something similar seen in example 47.")
df.assign(Players = df["Players"].str.split(","))

print("Now add explode and done.")
df.assign(Players = df["Players"].str.split(",")).explode("Players")
Notice that we have a list of players for each team separated by commas. Let's generate a row for each player.
Team Players
0 FC Barcelona Ter Stegen, Semedo, Piqué, Lenglet, Alba, Raki...
1 FC Real Madrid Courtois, Carvajal, Varane, Sergio Ramos, Mend...
Notice that we have converted it to something similar to what we saw in Trick 47.
Team Players
0 FC Barcelona [Ter Stegen, Semedo, Piqué, Lenglet, Alba,...
1 FC Real Madrid [Courtois, Carvajal, Varane, Sergio Ramos, ...
Now add explode and done.
Team Players
0 FC Barcelona Ter Stegen
0 FC Barcelona Semedo
0 FC Barcelona Piqué
0 FC Barcelona Lenglet
0 FC Barcelona Alba
0 FC Barcelona Rakitic
0 FC Barcelona De Jong
0 FC Barcelona Sergi Roberto
0 FC Barcelona Messi
0 FC Barcelona Suárez
0 FC Barcelona Griezmann
1 FC Real Madrid Courtois
1 FC Real Madrid Carvajal
1 FC Real Madrid Varane
1 FC Real Madrid Sergio Ramos
1 FC Real Madrid Mendy
1 FC Real Madrid Kroos
1 FC Real Madrid Valverde
1 FC Real Madrid Casemiro
1 FC Real Madrid Isco
1 FC Real Madrid Benzema
1 FC Real Madrid Bale

Trick 44: Use a local variable within a query in pandas (using @)

Go back to the Table of Contents

df = generate_sample_data()
df

# create a local variable mean
mean = df["A"].mean()

# now let's use it inside a pandas query using @
df.query("A > @mean")
A B C D E F G
0 1 16 18 4 9 18 4
1 13 4 7 17 17 19 8
2 15 3 4 8 14 17 3
3 17 17 12 4 4 6 10
4 2 15 3 10 10 8 17
5 11 8 5 10 14 16 17
6 19 15 7 13 7 3 12
7 15 16 17 17 7 17 3
8 13 8 15 8 18 1 8
9 7 1 4 11 1 10 6
10 2 14 6 12 3 16 17
11 10 13 19 15 17 18 4
12 5 3 19 12 10 6 18
13 10 18 12 3 11 12 11
14 13 3 19 6 7 7 2
15 2 15 12 7 9 1 4
16 14 14 19 6 2 13 9
17 7 13 5 7 10 4 3
18 15 2 1 15 19 9 12
19 14 13 3 6 10 5 17
A B C D E F G
1 13 4 7 17 17 19 8
2 15 3 4 8 14 17 3
3 17 17 12 4 4 6 10
5 11 8 5 10 14 16 17
6 19 15 7 13 7 3 12
7 15 16 17 17 7 17 3
8 13 8 15 8 18 1 8
14 13 3 19 6 7 7 2
16 14 14 19 6 2 13 9
18 15 2 1 15 19 9 12
19 14 13 3 6 10 5 17

Trick 43: Create one row for each item in a list (explode) !!!duplicated Trick 47!!!

Go back to the Table of Contents

# It seems that this trick is duplicated, skip to the next one
# I decided to keep it in, so in the future there will be no confusion if you consult the original material
# and this kernel
d = {"Team":["FC Barcelona", "FC Real Madrid"], 
    "Players":[["Ter Stegen", "Semedo", "Piqué", "Lenglet", "Alba", "Rakitic", "De Jong", "Sergi Roberto", "Messi", "Suárez", "Griezmann"], \
               ["Courtois", "Carvajal", "Varane", "Sergio Ramos", "Mendy", "Kroos", "Valverde", "Casemiro", "Isco", "Benzema", "Bale"]]}

print("Notice that we have a list of players for each team. Let's generate a row for each player.")
df = pd.DataFrame(d)
df

print("Using explode to generate new rows for each player.")
df1 = df.explode("Players")
df1

print("Reverse this operation with groupby and agg")
df["Imploded"] = df1.groupby(df1.index)["Players"].agg(list)
df
Notice that we have a list of players for each team. Let's generate a row for each player.
Team Players
0 FC Barcelona [Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rak...
1 FC Real Madrid [Courtois, Carvajal, Varane, Sergio Ramos, Men...
Using explode to generate new rows for each player.
Team Players
0 FC Barcelona Ter Stegen
0 FC Barcelona Semedo
0 FC Barcelona Piqué
0 FC Barcelona Lenglet
0 FC Barcelona Alba
0 FC Barcelona Rakitic
0 FC Barcelona De Jong
0 FC Barcelona Sergi Roberto
0 FC Barcelona Messi
0 FC Barcelona Suárez
0 FC Barcelona Griezmann
1 FC Real Madrid Courtois
1 FC Real Madrid Carvajal
1 FC Real Madrid Varane
1 FC Real Madrid Sergio Ramos
1 FC Real Madrid Mendy
1 FC Real Madrid Kroos
1 FC Real Madrid Valverde
1 FC Real Madrid Casemiro
1 FC Real Madrid Isco
1 FC Real Madrid Benzema
1 FC Real Madrid Bale
Reverse this operation with groupby and agg
Team Players Imploded
0 FC Barcelona [Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rak... [Ter Stegen, Semedo, Piqué, Lenglet, Alba, Rak...
1 FC Real Madrid [Courtois, Carvajal, Varane, Sergio Ramos, Men... [Courtois, Carvajal, Varane, Sergio Ramos, Men...

Trick 42: New aggregation function --> last()

Go back to the Table of Contents

d = {"patient":[1, 2, 3, 1, 1, 2], "visit":[2015, 2016, 2014, 2016, 2017, 2020]}
df = pd.DataFrame(d)
df.sort_values("visit")

print("Let's get the last visit for each patient")
df.groupby("patient")["visit"].last().to_frame()
patient visit
2 3 2014
0 1 2015
1 2 2016
3 1 2016
4 1 2017
5 2 2020
Let's get the last visit for each patient
visit
patient
1 2017
2 2020
3 2014

Trick 41: Ordered categories (from pandas.api.types import CategoricalDtype)

Go back to the Table of Contents

import pandas as pd
from pandas.api.types import CategoricalDtype
d = {"ID":[100, 101, 102, 103], "quality":["bad", "very good", "good", "excellent"]}
df = pd.DataFrame(d)
df

print("Let's create our own categorical order.")
cat_type = CategoricalDtype(["bad", "good", "very good", "excellent"], ordered = True)
df["quality"] = df["quality"].astype(cat_type)
df

print("Now we can use logical sorting.")
df = df.sort_values("quality", ascending = True)
df

print("We can also filter this as if they are numbers. AMAZING.")
df[df["quality"] > "bad"]
ID quality
0 100 bad
1 101 very good
2 102 good
3 103 excellent
Let's create our own categorical order.
ID quality
0 100 bad
1 101 very good
2 102 good
3 103 excellent
Now we can use logical sorting.
ID quality
0 100 bad
2 102 good
1 101 very good
3 103 excellent
We can also filter this as if they are numbers. AMAZING.
ID quality
2 102 good
1 101 very good
3 103 excellent

Trick 40: Style your df fast with hide_index() and set_caption()

Go back to the Table of Contents

df = generate_sample_data()
print("Original df")
df

df.style.hide_index().set_caption("Styled df with no index and a caption")
Original df
A B C D E F G
0 7 7 5 14 11 14 3
1 1 16 18 12 5 5 19
2 11 3 12 15 11 19 10
3 12 13 11 5 4 9 15
4 13 1 19 16 18 5 6
5 12 15 11 12 19 5 5
6 8 13 3 16 14 4 19
7 2 9 18 12 2 7 11
8 17 2 10 1 13 2 11
9 3 11 1 5 18 11 2
10 13 1 17 8 1 14 8
11 17 5 10 13 1 11 7
12 8 13 9 4 18 1 8
13 14 17 17 16 18 7 3
14 15 19 9 15 16 13 19
15 12 18 1 7 3 12 11
16 6 10 7 4 3 1 18
17 17 9 5 19 17 5 11
18 2 9 18 6 1 2 3
19 6 18 13 9 5 8 11
Styled df with no index and a caption
A B C D E F G
7 7 5 14 11 14 3
1 16 18 12 5 5 19
11 3 12 15 11 19 10
12 13 11 5 4 9 15
13 1 19 16 18 5 6
12 15 11 12 19 5 5
8 13 3 16 14 4 19
2 9 18 12 2 7 11
17 2 10 1 13 2 11
3 11 1 5 18 11 2
13 1 17 8 1 14 8
17 5 10 13 1 11 7
8 13 9 4 18 1 8
14 17 17 16 18 7 3
15 19 9 15 16 13 19
12 18 1 7 3 12 11
6 10 7 4 3 1 18
17 9 5 19 17 5 11
2 9 18 6 1 2 3
6 18 13 9 5 8 11
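
In newer pandas (1.4+) hide_index() is deprecated in favour of hide(). A hedged equivalent of the styled df above:

df.style.hide(axis = "index").set_caption("Styled df with no index and a caption")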

Trick 39: One hot encoding (get_dummies())

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/titanic/train.csv", usecols = [2, 4, 5, 11], nrows = 10)
df

pd.get_dummies(df) # Notice that we can eliminate one column of each since this information is contained in the others

pd.get_dummies(df, drop_first=True)
Pclass Sex Age Embarked
0 3 male 22.0 S
1 1 female 38.0 C
2 3 female 26.0 S
3 1 female 35.0 S
4 3 male 35.0 S
5 3 male NaN Q
6 1 male 54.0 S
7 3 male 2.0 S
8 3 female 27.0 S
9 2 female 14.0 C
Pclass Age Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 3 22.0 0 1 0 0 1
1 1 38.0 1 0 1 0 0
2 3 26.0 1 0 0 0 1
3 1 35.0 1 0 0 0 1
4 3 35.0 0 1 0 0 1
5 3 NaN 0 1 0 1 0
6 1 54.0 0 1 0 0 1
7 3 2.0 0 1 0 0 1
8 3 27.0 1 0 0 0 1
9 2 14.0 1 0 1 0 0
Pclass Age Sex_male Embarked_Q Embarked_S
0 3 22.0 1 0 1
1 1 38.0 0 0 0
2 3 26.0 0 0 1
3 1 35.0 0 0 1
4 3 35.0 1 0 1
5 3 NaN 1 1 0
6 1 54.0 1 0 1
7 3 2.0 1 0 1
8 3 27.0 0 0 1
9 2 14.0 0 0 0

Trick 38: Pandas datetime (lots of examples)

Go back to the Table of Contents

df = generate_sample_data_datetime().reset_index()
df = df.sample(500)
df["Year"] = df["index"].dt.year
df["Month"] = df["index"].dt.month
df["Day"] = df["index"].dt.day
df["Hour"] = df["index"].dt.hour
df["Minute"] = df["index"].dt.minute
df["Second"] = df["index"].dt.second
df["Nanosecond"] = df["index"].dt.nanosecond
df["Date"] = df["index"].dt.date
df["Time"] = df["index"].dt.time
df["Time_Time_Zone"] = df["index"].dt.timetz
df["Day_Of_Year"] = df["index"].dt.dayofyear
df["Week_Of_Year"] = df["index"].dt.weekofyear
df["Week"] = df["index"].dt.week
df["Day_Of_week"] = df["index"].dt.dayofweek
df["Week_Day"] = df["index"].dt.weekday
df["Week_Day_Name"] = df["index"].dt.weekday_name
df["Quarter"] = df["index"].dt.quarter
df["Days_In_Month"] = df["index"].dt.days_in_month
df["Is_Month_Start"] = df["index"].dt.is_month_start
df["Is_Month_End"] = df["index"].dt.is_month_end
df["Is_Quarter_Start"] = df["index"].dt.is_quarter_start
df["Is_Quarter_End"] = df["index"].dt.is_quarter_end
df["Is_Leap_Year"] = df["index"].dt.is_leap_year
df
index sales customers Year Month Day Hour Minute Second Nanosecond ... Day_Of_week Week_Day Week_Day_Name Quarter Days_In_Month Is_Month_Start Is_Month_End Is_Quarter_Start Is_Quarter_End Is_Leap_Year
8540 2000-12-21 20:00:00 5 5 2000 12 21 20 0 0 0 ... 3 3 Thursday 4 31 False False False False True
2428 2000-04-11 04:00:00 16 9 2000 4 11 4 0 0 0 ... 1 1 Tuesday 2 30 False False False False True
6820 2000-10-11 04:00:00 14 10 2000 10 11 4 0 0 0 ... 2 2 Wednesday 4 31 False False False False True
1374 2000-02-27 06:00:00 5 9 2000 2 27 6 0 0 0 ... 6 6 Sunday 1 29 False False False False True
648 2000-01-28 00:00:00 3 7 2000 1 28 0 0 0 0 ... 4 4 Friday 1 31 False False False False True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1205 2000-02-20 05:00:00 13 19 2000 2 20 5 0 0 0 ... 6 6 Sunday 1 29 False False False False True
4572 2000-07-09 12:00:00 6 5 2000 7 9 12 0 0 0 ... 6 6 Sunday 3 31 False False False False True
2757 2000-04-24 21:00:00 3 17 2000 4 24 21 0 0 0 ... 0 0 Monday 2 30 False False False False True
822 2000-02-04 06:00:00 5 14 2000 2 4 6 0 0 0 ... 4 4 Friday 1 29 False False False False True
1364 2000-02-26 20:00:00 17 17 2000 2 26 20 0 0 0 ... 5 5 Saturday 1 29 False False False False True

500 rows × 26 columns
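
Some of the accessors above (weekofyear, week, weekday_name) are deprecated or removed in newer pandas, as flagged in the comments. Hedged modern equivalents (the new column names are just illustrative):

df["Week_iso"] = df["index"].dt.isocalendar().week # replaces weekofyear / week (pandas 1.1+)
df["Day_Name"] = df["index"].dt.day_name() # replaces weekday_name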

Trick 37: Pandas slicing loc and iloc (6 examples)

Go back to the Table of Contents

df = generate_sample_data()
df

# using loc --> labels
df.loc[0, "A"]

# using iloc --> position
df.iloc[0, 0]

# mixing labels and position with loc
df.loc[0, df.columns[0]]

# mixing labels and position with loc
df.loc[df.index[0], "A"]

# mixing labels and position with iloc
df.iloc[0, df.columns.get_loc("A")]

# mixing labels and position with iloc
df.iloc[df.index.get_loc(0), 0]
A B C D E F G
0 10 13 4 17 11 3 17
1 18 16 18 16 3 17 18
2 9 12 1 13 19 15 11
3 5 7 6 7 8 18 7
4 3 1 12 7 7 14 19
5 14 12 2 1 19 10 15
6 16 2 17 7 8 6 12
7 9 15 8 4 12 3 10
8 1 17 14 12 8 2 10
9 14 15 15 1 14 18 18
10 2 5 5 1 4 1 19
11 14 11 4 3 4 12 3
12 8 18 11 12 18 10 18
13 3 11 8 7 5 11 7
14 3 7 15 10 3 9 2
15 1 4 5 4 1 7 6
16 7 5 18 18 4 4 3
17 19 10 6 13 13 18 13
18 11 13 12 16 2 18 3
19 3 8 13 7 5 13 5
10
10
10
10
10
10

Trick 36: Convert from UTC to another timezone

Go back to the Table of Contents

s = pd.Series(range(1552194000, 1552212001, 3600))
s = pd.to_datetime(s, unit = "s")
s

# set timezome to current time zone (UTC)
s = s.dt.tz_localize("UTC")
s

# set timezome to another time zone (Chicago)
s = s.dt.tz_convert("America/Chicago")
s
0   2019-03-10 05:00:00
1   2019-03-10 06:00:00
2   2019-03-10 07:00:00
3   2019-03-10 08:00:00
4   2019-03-10 09:00:00
5   2019-03-10 10:00:00
dtype: datetime64[ns]
0   2019-03-10 05:00:00+00:00
1   2019-03-10 06:00:00+00:00
2   2019-03-10 07:00:00+00:00
3   2019-03-10 08:00:00+00:00
4   2019-03-10 09:00:00+00:00
5   2019-03-10 10:00:00+00:00
dtype: datetime64[ns, UTC]
0   2019-03-09 23:00:00-06:00
1   2019-03-10 00:00:00-06:00
2   2019-03-10 01:00:00-06:00
3   2019-03-10 03:00:00-05:00
4   2019-03-10 04:00:00-05:00
5   2019-03-10 05:00:00-05:00
dtype: datetime64[ns, America/Chicago]
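
To go the other way and drop the timezone information again (keeping the local wall-clock time), a small sketch:

s.dt.tz_localize(None) # naive datetimes in Chicago local time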

Trick 35: Query a column that has spaces in the name (using backticks)

Go back to the Table of Contents

d = {"colum_without_space":np.array([1, 2, 3, 4, 5, 6]), "column with space":np.array([1, 2, 3, 4, 5, 6])*2}
df = pd.DataFrame(d)
df

print("Query a column without space")
df.query("colum_without_space > 4")
print("Query a column with space using backticks ``")
print("This is a backtick ``")
df.query("`column with space` > 8")
colum_without_space column with space
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Query a column without space
colum_without_space column with space
4 5 10
5 6 12
Query a column with space using backticks ``
This is a backtick ``
colum_without_space column with space
4 5 10
5 6 12

Trick 34: Explore a dataset with profiling

Go back to the Table of Contents

import pandas_profiling

df = generate_sample_data()

df

print("Generating report with pandas profiling")
df.profile_report()
A B C D E F G
0 6 18 9 11 5 17 14
1 3 16 8 11 15 1 15
2 2 3 18 19 7 13 3
3 11 7 4 19 4 7 13
4 10 7 14 3 2 5 3
5 11 10 10 1 5 2 6
6 11 8 6 12 19 16 6
7 1 8 5 7 1 15 4
8 18 7 19 10 17 14 17
9 16 14 5 19 18 19 8
10 13 7 10 15 13 7 4
11 7 13 17 6 8 19 2
12 12 19 13 3 10 1 10
13 9 18 14 13 4 15 2
14 3 2 16 13 7 7 16
15 17 18 14 4 14 19 10
16 8 16 4 18 11 9 1
17 13 2 14 2 12 12 8
18 15 18 1 11 12 5 13
19 3 15 15 16 4 5 12
Generating report with pandas profiling

Trick 33: Pandas display options

Go back to the Table of Contents

# use pd.describe_option() to see all
# max_rows
# max_columns
# max_colwidth
# precision
# date_dayfirst
# date_yearfirst

df = generate_sample_data_datetime()[:10].reset_index()
df["sales"] = df["sales"].astype("float")
df

pd.set_option("display.max_rows",5)
pd.set_option("display.max_columns",3)
pd.set_option('display.width', 1000)
pd.set_option('display.date_dayfirst', True)
pd.describe_option()

pd.reset_option('^display.', silent=True) # restore to default
#pd.reset_option('display.width') # restore one by one
index sales customers
0 2000-01-01 00:00:00 13.0 7
1 2000-01-01 01:00:00 4.0 16
2 2000-01-01 02:00:00 17.0 18
3 2000-01-01 03:00:00 7.0 6
4 2000-01-01 04:00:00 10.0 9
5 2000-01-01 05:00:00 2.0 3
6 2000-01-01 06:00:00 2.0 1
7 2000-01-01 07:00:00 13.0 7
8 2000-01-01 08:00:00 8.0 8
9 2000-01-01 09:00:00 12.0 19
compute.use_bottleneck : bool
    Use the bottleneck library to accelerate if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]
compute.use_numexpr : bool
    Use the numexpr library to accelerate computation if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]
display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]
display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]
display.column_space No description available.
    [default: 12] [currently: 12]
display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: True]
display.date_yearfirst : boolean
    When True, prints and parses dates with the year first, eg 2005/01/20
    [default: False] [currently: False]
display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]
display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across
    multiple lines, `max_columns` is still respected, but the output will
    wrap-around across multiple "pages" if its width exceeds `display.width`.
    [default: True] [currently: True]
display.float_format : callable
    The callable should accept a floating point number and return
    a string with the desired format of the number. This is used
    in some places like SeriesFormatter.
    See formats.format.EngFormatter for an example.
    [default: None] [currently: None]
display.html.border : int
    A ``border=value`` attribute is inserted in the ``<table>`` tag
    for the DataFrame HTML repr.
    [default: 1] [currently: 1]
display.html.table_schema : boolean
    Whether to publish a Table Schema representation for frontends
    that support it.
    (default: False)
    [default: False] [currently: False]
display.html.use_mathjax : boolean
    When True, Jupyter notebook will process table contents using MathJax,
    rendering mathematical expressions enclosed by the dollar symbol.
    (default: True)
    [default: True] [currently: True]
display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
    show a truncated table (the default from 0.13), or switch to the view from
    df.info() (the behaviour in earlier versions of pandas).
    [default: truncate] [currently: truncate]
display.latex.escape : bool
    This specifies if the to_latex method of a Dataframe uses escapes special
    characters.
    Valid values: False,True
    [default: True] [currently: True]
display.latex.longtable :bool
    This specifies if the to_latex method of a Dataframe uses the longtable
    format.
    Valid values: False,True
    [default: False] [currently: False]
display.latex.multicolumn : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns
    to pretty-print MultiIndex columns.
    Valid values: False,True
    [default: True] [currently: True]
display.latex.multicolumn_format : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns
    to pretty-print MultiIndex columns.
    Valid values: False,True
    [default: l] [currently: l]
display.latex.multirow : bool
    This specifies if the to_latex method of a Dataframe uses multirows
    to pretty-print MultiIndex rows.
    Valid values: False,True
    [default: False] [currently: False]
display.latex.repr : boolean
    Whether to produce a latex DataFrame representation for jupyter
    environments that support it.
    (default: False)
    [default: False] [currently: False]
display.max_categories : int
    This sets the maximum number of categories pandas should output when
    printing out a `Categorical` or a Series of dtype "category".
    [default: 8] [currently: 8]
display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the width of the terminal and print a truncated object which fits
    the screen width. The IPython notebook, IPython qtconsole, or IDLE
    do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 20] [currently: 3]
display.max_colwidth : int
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output.
    [default: 50] [currently: 50]
display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if
    per column information will be printed.
    [default: 100] [currently: 100]
display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]
display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 5]
display.max_seq_items : int or None
    when pretty-printing a long sequence, no more then `max_seq_items`
    will be printed. If items are omitted, they will be denoted by the
    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.
    [default: 100] [currently: 100]
display.memory_usage : bool, string or None
    This specifies if the memory usage of a DataFrame should be displayed when
    df.info() is called. Valid values True,False,'deep'
    [default: True] [currently: True]
display.min_rows : int
    The numbers of rows to show in a truncated view (when `max_rows` is
    exceeded). Ignored when `max_rows` is set to None or 0. When set to
    None, follows the value of `max_rows`.
    [default: 10] [currently: 10]
display.multi_sparse : boolean
    "sparsify" MultiIndex display (don't display repeated
    elements in outer levels within groups)
    [default: True] [currently: True]
display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: True]
display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing
    [default: 3] [currently: 3]
display.precision : int
    Floating point output precision (number of significant digits). This is
    only a suggestion
    [default: 6] [currently: 6]
display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr.
    If 'truncate' is specified, only print out the dimensions if the
    frame is truncated (e.g. not display all rows and/or columns)
    [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide : boolean
    Whether to use the Unicode East Asian Width to calculate the display text
    width.
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]
display.unicode.east_asian_width : boolean
    Whether to use the Unicode East Asian Width to calculate the display text
    width.
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]
display.width : int
    Width of the display in characters. In case python/IPython is running in
    a terminal this can be set to None and pandas will correctly auto-detect
    the width.
    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
    terminal and hence it is not possible to correctly detect the width.
    [default: 80] [currently: 1000]
io.excel.ods.reader : string
    The default Excel reader engine for 'ods' files. Available options:
    auto, odf.
    [default: auto] [currently: auto]
io.excel.xls.reader : string
    The default Excel reader engine for 'xls' files. Available options:
    auto, xlrd.
    [default: auto] [currently: auto]
io.excel.xls.writer : string
    The default Excel writer engine for 'xls' files. Available options:
    auto, xlwt.
    [default: auto] [currently: auto]
io.excel.xlsm.reader : string
    The default Excel reader engine for 'xlsm' files. Available options:
    auto, xlrd, openpyxl.
    [default: auto] [currently: auto]
io.excel.xlsm.writer : string
    The default Excel writer engine for 'xlsm' files. Available options:
    auto, openpyxl.
    [default: auto] [currently: auto]
io.excel.xlsx.reader : string
    The default Excel reader engine for 'xlsx' files. Available options:
    auto, xlrd, openpyxl.
    [default: auto] [currently: auto]
io.excel.xlsx.writer : string
    The default Excel writer engine for 'xlsx' files. Available options:
    auto, openpyxl, xlsxwriter.
    [default: auto] [currently: auto]
io.hdf.default_format : format
    default format writing format, if None, then
    put will default to 'fixed' and append will default to 'table'
    [default: None] [currently: None]
io.hdf.dropna_table : boolean
    drop ALL nan rows when appending to a table
    [default: False] [currently: False]
io.parquet.engine : string
    The default parquet reader/writer engine. Available options:
    'auto', 'pyarrow', 'fastparquet', the default is 'auto'
    [default: auto] [currently: auto]
mode.chained_assignment : string
    Raise an exception, warn, or no action if trying to use chained assignment,
    The default is warn
    [default: warn] [currently: warn]
mode.sim_interactive : boolean
    Whether to simulate interactive mode for purposes of testing
    [default: False] [currently: False]
mode.use_inf_as_na : boolean
    True means treat None, NaN, INF, -INF as NA (old way),
    False means None and NaN are null, but INF, -INF are not NA
    (new way).
    [default: False] [currently: False]
mode.use_inf_as_null : boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.
    [default: False] [currently: False]
    (Deprecated, use `mode.use_inf_as_na` instead.)
plotting.backend : str
    The plotting backend to use. The default value is "matplotlib", the
    backend provided with pandas. Other backends can be specified by
    prodiving the name of the module that implements the backend.
    [default: matplotlib] [currently: matplotlib]
plotting.matplotlib.register_converters : bool
    Whether to register converters with matplotlib's units registry for
    dates, times, datetimes, and Periods. Toggling to False will remove
    the converters, restoring any converters that pandas overwrote.
    [default: True] [currently: True]

Trick 32: Filter a df with query and avoid intermediate variables

Go back to the Table of Contents

df = generate_sample_data()[:10]
df["A"] = pd.Series(["APP", "GOO", "APP", "GOO", "MIC", "MIC", "APP", "GOO", "MIC", "APP"])
df.rename(columns = {"A":"stock"}, inplace = True)
print("Original df")
df

print("Filter data using intermediate variables")
temp = df.groupby("stock").mean()
temp 

fv = temp["B"].sort_values(ascending = False)[1] # filter by the second greates. This way every time we generate sample data we will have a result
temp[temp["B"] < fv]

print("Filter using query")
df.groupby("stock").mean().query("B < {}".format(fv))
df.groupby("stock").mean().query("B < @fv")
df.groupby("stock").mean().query("B < 10")
Original df
stock B C D E F G
0 APP 12 19 2 18 13 14
1 GOO 8 2 17 18 8 15
2 APP 13 4 10 19 11 11
3 GOO 2 8 3 15 11 6
4 MIC 5 5 18 3 12 11
5 MIC 8 18 9 14 15 11
6 APP 3 3 7 14 1 13
7 GOO 9 13 8 13 16 9
8 MIC 17 6 8 1 12 9
9 APP 15 18 11 17 10 11
Filter data using intermediate variables
B C D E F G
stock
APP 10.750000 11.000000 7.500000 17.000000 8.750000 12.250000
GOO 6.333333 7.666667 9.333333 15.333333 11.666667 10.000000
MIC 10.000000 9.666667 11.666667 6.000000 13.000000 10.333333
B C D E F G
stock
GOO 6.333333 7.666667 9.333333 15.333333 11.666667 10.0
Filter using query
B C D E F G
stock
GOO 6.333333 7.666667 9.333333 15.333333 11.666667 10.0
B C D E F G
stock
GOO 6.333333 7.666667 9.333333 15.333333 11.666667 10.0
B C D E F G
stock
GOO 6.333333 7.666667 9.333333 15.333333 11.666667 10.0

Trick 31: See all the columns of a big df

Go back to the Table of Contents

pd.reset_option('^display.', silent=True) # restore to default

df = generate_sample_data()
df1 = df.copy(deep = True)
df = df.append(df1) # note: append is removed in pandas 2.0; pd.concat([df, df1]) is the modern equivalent

print("Imagine we have a big df where we can see all the columns ...")
df.T.head() # we are trasposing JUST TO CREATE A GIANT DF

# Solution 1
print("Solution 1 using pd.set_option display.max_columns")
pd.set_option("display.max_columns", None)
df.T.head()
pd.reset_option('^display.', silent=True) # restore to default

# Solution 2
print("Another clever solution using Traspose")
df.T.head().T
Imagine we have a big df where we can't see all the columns ...
0 1 2 3 4 5 6 7 8 9 ... 10 11 12 13 14 15 16 17 18 19
A 18 7 13 8 11 5 13 14 2 5 ... 11 7 4 19 6 15 1 4 5 19
B 15 18 5 6 8 15 4 2 12 12 ... 16 15 4 10 19 2 2 10 5 2
C 14 3 8 6 5 6 4 19 18 19 ... 9 15 1 13 6 5 4 3 7 17
D 9 3 16 8 2 15 11 15 6 3 ... 15 3 14 16 8 9 16 9 5 1
E 15 10 19 6 6 9 10 1 4 18 ... 11 14 5 5 6 11 10 7 7 6

5 rows × 40 columns

Solution 1 using pd.set_option display.max_columns
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
A 18 7 13 8 11 5 13 14 2 5 11 7 4 19 6 15 1 4 5 19 18 7 13 8 11 5 13 14 2 5 11 7 4 19 6 15 1 4 5 19
B 15 18 5 6 8 15 4 2 12 12 16 15 4 10 19 2 2 10 5 2 15 18 5 6 8 15 4 2 12 12 16 15 4 10 19 2 2 10 5 2
C 14 3 8 6 5 6 4 19 18 19 9 15 1 13 6 5 4 3 7 17 14 3 8 6 5 6 4 19 18 19 9 15 1 13 6 5 4 3 7 17
D 9 3 16 8 2 15 11 15 6 3 15 3 14 16 8 9 16 9 5 1 9 3 16 8 2 15 11 15 6 3 15 3 14 16 8 9 16 9 5 1
E 15 10 19 6 6 9 10 1 4 18 11 14 5 5 6 11 10 7 7 6 15 10 19 6 6 9 10 1 4 18 11 14 5 5 6 11 10 7 7 6
Another clever solution using Transpose
A B C D E
0 18 15 14 9 15
1 7 18 3 3 10
2 13 5 8 16 19
3 8 6 6 8 6
4 11 8 5 2 6
5 5 15 6 15 9
6 13 4 4 11 10
7 14 2 19 15 1
8 2 12 18 6 4
9 5 12 19 3 18
10 11 16 9 15 11
11 7 15 15 3 14
12 4 4 1 14 5
13 19 10 13 16 5
14 6 19 6 8 6
15 15 2 5 9 11
16 1 2 4 16 10
17 4 10 3 9 7
18 5 5 7 5 7
19 19 2 17 1 6
0 18 15 14 9 15
1 7 18 3 3 10
2 13 5 8 16 19
3 8 6 6 8 6
4 11 8 5 2 6
5 5 15 6 15 9
6 13 4 4 11 10
7 14 2 19 15 1
8 2 12 18 6 4
9 5 12 19 3 18
10 11 16 9 15 11
11 7 15 15 3 14
12 4 4 1 14 5
13 19 10 13 16 5
14 6 19 6 8 6
15 15 2 5 9 11
16 1 2 4 16 10
17 4 10 3 9 7
18 5 5 7 5 7
19 19 2 17 1 6

Trick 30: Pandas merge --> see where the columns are coming from (indicator = True)

Go back to the Table of Contents

df = generate_sample_data()
df1 = df.copy(deep = True)
df1 = df1.drop([0, 1, 2], axis = "rows") # drop some rows just to see how the example works
df.head()
df1.head()

pd.merge(df, df1, how = "left", indicator = True)
A B C D E F G
0 1 4 11 8 2 6 13
1 7 10 3 1 14 15 6
2 8 14 1 13 7 19 1
3 11 17 16 18 17 18 11
4 4 12 1 10 3 11 2
A B C D E F G
3 11 17 16 18 17 18 11
4 4 12 1 10 3 11 2
5 8 5 1 4 13 18 4
6 14 1 2 5 5 9 6
7 8 6 1 3 11 8 12
A B C D E F G _merge
0 1 4 11 8 2 6 13 left_only
1 7 10 3 1 14 15 6 left_only
2 8 14 1 13 7 19 1 left_only
3 11 17 16 18 17 18 11 both
4 4 12 1 10 3 11 2 both
5 8 5 1 4 13 18 4 both
6 14 1 2 5 5 9 6 both
7 8 6 1 3 11 8 12 both
8 12 13 12 7 19 4 8 both
9 17 16 10 6 8 13 10 both
10 12 2 5 16 3 15 16 both
11 1 18 9 9 6 12 16 both
12 2 3 18 9 19 3 19 both
13 17 19 9 10 16 12 8 both
14 19 19 18 14 4 9 9 both
15 17 7 11 8 2 5 10 both
16 15 15 14 6 5 10 8 both
17 4 3 5 1 17 11 17 both
18 4 6 6 4 12 11 12 both
19 11 16 17 7 3 3 18 both

Trick 29: Access numpy within pandas (without importing numpy as np)

Go back to the Table of Contents

# Pandas is built upon numpy, so we can access much of numpy's functionality from pandas
pd.np.random.rand(2, 3)
pd.np.nan
array([[0.18376862, 0.31772758, 0.29151369],
       [0.67558827, 0.56642693, 0.12838044]])
nan
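
Note that pd.np was deprecated in pandas 1.0 and later removed, so importing numpy directly is the safer habit:

import numpy as np
np.random.rand(2, 3)
np.nan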

Trick 28: Aggregating by multiple columns (using agg)

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/drinks-by-country/drinksbycountry.csv")
print("Original df")
df

print("Groupby continent beer_servings")
df.groupby("continent")["beer_servings"].mean()

print("Using agg to pass multiple functions")
df.groupby("continent")["beer_servings"].agg(["mean", "count"])

print("Using describe over a groupby object")
df.groupby("continent")["beer_servings"].describe()
Original df
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
... ... ... ... ... ... ...
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

193 rows × 6 columns

Groupby continent beer_servings
continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64
Using agg to pass multiple functions
mean count
continent
Africa 61.471698 53
Asia 37.045455 44
Europe 193.777778 45
North America 145.434783 23
Oceania 89.687500 16
South America 175.083333 12
Using describe over a groupby object
count mean std min 25% 50% 75% max
continent
Africa 53.0 61.471698 80.557816 0.0 15.00 32.0 76.00 376.0
Asia 44.0 37.045455 49.469725 0.0 4.25 17.5 60.50 247.0
Europe 45.0 193.777778 99.631569 0.0 127.00 219.0 270.00 361.0
North America 23.0 145.434783 79.621163 1.0 80.00 143.0 198.00 285.0
Oceania 16.0 89.687500 96.641412 0.0 21.00 52.5 125.75 306.0
South America 12.0 175.083333 65.242845 93.0 129.50 162.5 198.00 333.0

Trick 27: Aggregation over timeseries (resample)

Go back to the Table of Contents

df = generate_sample_data_datetime()

print("Original df")
df
print("Let's resample/groupby by month")
df.resample("M")["sales"].sum()

print("Let's resample/groupby by day")
df.resample("D")["sales"].sum()
Original df
sales customers
2000-01-01 00:00:00 2 13
2000-01-01 01:00:00 10 9
2000-01-01 02:00:00 13 7
2000-01-01 03:00:00 19 7
2000-01-01 04:00:00 9 5
... ... ...
2000-12-30 19:00:00 10 1
2000-12-30 20:00:00 14 16
2000-12-30 21:00:00 11 16
2000-12-30 22:00:00 9 16
2000-12-30 23:00:00 1 14

8760 rows × 2 columns

Let's resample/groupby by month
2000-01-31    7580
2000-02-29    6953
2000-03-31    7389
2000-04-30    7328
2000-05-31    7340
2000-06-30    7114
2000-07-31    7503
2000-08-31    7475
2000-09-30    6990
2000-10-31    7544
2000-11-30    7075
2000-12-31    6959
Freq: M, Name: sales, dtype: int64
Let's resample/groupby by day
2000-01-01    248
2000-01-02    272
2000-01-03    258
2000-01-04    262
2000-01-05    253
             ... 
2000-12-26    254
2000-12-27    202
2000-12-28    186
2000-12-29    200
2000-12-30    227
Freq: D, Name: sales, Length: 365, dtype: int64
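
resample also combines nicely with agg when each column needs a different function. A minimal sketch:

df.resample("M").agg({"sales": "sum", "customers": "mean"}) # monthly totals and average customers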

Trick 26: Formatting different columns of a df (using dictionaries)

Go back to the Table of Contents

df = generate_sample_data_datetime().reset_index()[:10]
df.rename(columns = {"index":"time"}, inplace = True)
df["sales_100"] = df["sales"]*100
print("Original df")
df.head()

# declare a formatting dict: individual for each column
fd = {"time":"{:%d/%m/%y}", "sales":"${:.2f}", "customers":"{:,}"}
df.style.format(fd)
df

# add some more formatting
(df.style.format(fd)
 .hide_index()
 .highlight_min("sales", color ="red")
 .highlight_max("sales", color ="green")
 .background_gradient(subset = "sales_100", cmap ="Blues")
 .bar("customers", color = "lightblue", align = "zero")
 .set_caption("A df with different stylings")
)
Original df
time sales customers sales_100
0 2000-01-01 00:00:00 11 19 1100
1 2000-01-01 01:00:00 4 1 400
2 2000-01-01 02:00:00 14 14 1400
3 2000-01-01 03:00:00 17 19 1700
4 2000-01-01 04:00:00 12 10 1200
time sales customers sales_100
0 01/01/00 $11.00 19 1100
1 01/01/00 $4.00 1 400
2 01/01/00 $14.00 14 1400
3 01/01/00 $17.00 19 1700
4 01/01/00 $12.00 10 1200
5 01/01/00 $19.00 1 1900
6 01/01/00 $14.00 2 1400
7 01/01/00 $4.00 12 400
8 01/01/00 $9.00 13 900
9 01/01/00 $7.00 4 700
time sales customers sales_100
0 2000-01-01 00:00:00 11 19 1100
1 2000-01-01 01:00:00 4 1 400
2 2000-01-01 02:00:00 14 14 1400
3 2000-01-01 03:00:00 17 19 1700
4 2000-01-01 04:00:00 12 10 1200
5 2000-01-01 05:00:00 19 1 1900
6 2000-01-01 06:00:00 14 2 1400
7 2000-01-01 07:00:00 4 12 400
8 2000-01-01 08:00:00 9 13 900
9 2000-01-01 09:00:00 7 4 700
A df with different stylings
time sales customers sales_100
01/01/00 $11.00 19 1100
01/01/00 $4.00 1 400
01/01/00 $14.00 14 1400
01/01/00 $17.00 19 1700
01/01/00 $12.00 10 1200
01/01/00 $19.00 1 1900
01/01/00 $14.00 2 1400
01/01/00 $4.00 12 400
01/01/00 $9.00 13 900
01/01/00 $7.00 4 700

Trick 25: 3 ways of renaming columns names

Go back to the Table of Contents

df = generate_sample_data()
df.head(2)

# Solution 1
df.rename({"A":"col_1", "B":"col_2"}, axis = "columns", inplace = True)
df.head(2)

# Solution 2
df.columns = ["col1", "col2", "col3", "col4","col5", "col6", "col7"] # list must be equal to the columns number
df.head(2)

# Solution 3
df.columns = df.columns.str.title() # apply any string method to the columns names
df.head(2)
A B C D E F G
0 7 9 8 13 1 7 14
1 4 2 13 5 18 13 18
col_1 col_2 C D E F G
0 7 9 8 13 1 7 14
1 4 2 13 5 18 13 18
col1 col2 col3 col4 col5 col6 col7
0 7 9 8 13 1 7 14
1 4 2 13 5 18 13 18
Col1 Col2 Col3 Col4 Col5 Col6 Col7
0 7 9 8 13 1 7 14
1 4 2 13 5 18 13 18

Trick 24: Copy data from Excel into pandas quickly (read_clipboard())

Go back to the Table of Contents

# You will have to check this on your local machine
# Useful for fast importing
# Step 1: copy a table from excel sheet using ctrl + c (to the clipboard)
# Step 2: run this command
# df = pd.read_clipboard()
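
A minimal round trip you can try on a local machine (it needs a real clipboard, so it won't run on Kaggle):

# df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# df.to_clipboard() # copy the df to the clipboard
# df = pd.read_clipboard() # paste it back into pandas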

Trick 23: Fill missing values in time series data (interpolate())

Go back to the Table of Contents

d = {"col1":[100, 120 ,140, np.nan, 160], "col2":[9, 10, np.nan, 7.5, 6.5]}
df = pd.DataFrame(d)
df.index = pd.util.testing.makeDateIndex()[0:5]
print("Original df")
df
print("DataFrame after interpolate")
df.interpolate()
Original df
col1 col2
2000-01-03 100.0 9.0
2000-01-04 120.0 10.0
2000-01-05 140.0 NaN
2000-01-06 NaN 7.5
2000-01-07 160.0 6.5
DataFrame after interpolate
col1 col2
2000-01-03 100.0 9.00
2000-01-04 120.0 10.00
2000-01-05 140.0 8.75
2000-01-06 150.0 7.50
2000-01-07 160.0 6.50
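
By default interpolate() is linear in row position. With a DatetimeIndex you can weight by the actual time gaps instead, which matters for irregularly spaced timestamps. A sketch:

df.interpolate(method = "time") # same result here because the index is evenly spaced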

Trick 22: Create DataFrames for testing

Go back to the Table of Contents

print("Contains random values")
df1 = pd.util.testing.makeDataFrame() # contains random values (note: pd.util.testing is deprecated in newer pandas)
df1
print("Contains missing values")
df2 = pd.util.testing.makeMissingDataframe() # contains missing values
df2
print("Contains datetime values")
df3 = pd.util.testing.makeTimeDataFrame() # contains datetime values
df3
print("Contains mixed values")
df4 = pd.util.testing.makeMixedDataFrame() # contains mixed values
df4
Contains random values
A B C D
iemziQOmnA -0.018835 0.138304 0.054042 1.347041
kQxAqdccCl 0.629247 0.492528 -0.685268 0.552325
541KYBxvN7 0.454780 0.630357 -0.399344 1.110901
Rf4CpjrRIa -0.865854 0.435139 0.948668 1.495367
jHhGV8qbPb -0.049998 0.533395 -0.800003 -0.936772
Yos9mpCj7Z -0.844179 1.693645 0.031442 -0.632623
ECmyb6wxAb 0.069292 0.916040 0.501867 -2.263605
GYAMtDSfV8 0.249819 -0.336130 0.035940 0.082028
fWobLLoKEo 1.725233 0.819056 0.283991 0.001558
c6dWDciSsQ -0.178713 1.707434 -0.080673 0.062964
XjhBuC4glO 0.064985 0.024078 -0.459211 -0.166176
VCYC3okhHU -0.996718 1.041660 -1.031159 -1.583670
BQEqYnsI4b -0.068889 -0.647761 -0.003470 2.163698
1BJ3ULTLFE -1.181160 0.806627 -1.821844 0.270477
Yc1VJxXgG0 1.095399 -0.666123 0.249561 0.625758
Ls8su4dTdq 1.004251 0.362740 -1.442212 -2.688180
KBVJfbVH9J 0.077480 0.420453 0.587073 -1.756041
9sQdMDQGdK 2.249203 0.251991 -0.954003 0.631213
E7uBPqGNsp -1.171347 0.465484 0.248735 -0.007709
9YUSqGCikF -1.309513 1.787180 0.597656 1.180486
WFnrmLgK1A 1.503773 -0.728682 -0.049142 -0.365630
aCx7ZO7m7q 0.472976 -0.115539 0.231383 0.977117
Ox9UJ86eFh -0.421978 0.738787 0.906320 -0.286335
ZClI5hjW4j 0.900205 -0.537221 0.607072 1.235179
q3sShulVIS -0.929941 -0.411518 0.469776 0.593687
cYwQEzoQET 1.560887 -2.219262 0.594793 0.148106
oTWQjDwszf 0.662286 -0.014289 0.336011 0.193085
jbJZIFxpqn -0.120235 0.171354 -0.871320 -0.688293
Kw8DrihRT0 0.091881 -0.067312 1.849581 0.674779
lVlMXoYW0S 0.601484 -0.529806 -0.628734 0.074307
Contains missing values
A B C D
XdokncblhW 2.035548 1.094053 0.708857 0.669957
Jy27Od3vTR 0.633053 0.341701 0.816031 1.966359
yFmRYOJ6oc NaN 0.582203 NaN 0.657928
n6pB3fhG3n -0.771161 0.842254 0.990680 NaN
vrx7gKTXHg -0.914512 NaN 0.485505 NaN
yRfNopdIEl -0.102205 -1.117124 0.968440 NaN
dnwQFwnUfC -1.721469 1.367850 0.293670 0.443552
De5lPaeu43 1.271096 0.389973 -0.471623 0.350967
BVOJHHYoHL 0.437732 0.150793 1.308058 -0.439796
kW75AXL1tX -0.709431 -1.276799 -0.390954 0.832945
7Y2LsJb8uF 1.503275 0.400263 -0.610125 0.110761
woftUva0nl 1.362725 -0.400364 -0.016175 1.387835
JbAqtIagoU -0.474429 0.428467 0.152137 1.221926
jOFO8kelsV -1.219965 1.520225 -0.285234 0.730180
ybKbi7uZZN NaN -0.114221 NaN -1.180047
X0DPXEHMz8 -0.010234 0.421133 1.041333 -2.095775
JF99rZD7Pz NaN -0.330715 1.370903 0.864329
m5Mx5fsMk6 2.080326 -0.761478 0.256339 -0.621514
EHCUd5Tt80 NaN 1.905951 0.403925 -1.531046
QyuHLKaUy5 -0.698399 -0.799211 -1.657850 -1.147409
a6w2FMwlCb -1.585317 1.625140 -0.555646 0.040598
uXOhvhUrPk -0.525261 -1.068731 0.562528 2.716433
eRyU0GEfxt 0.796820 0.978389 0.064504 0.039413
62D97OZh91 1.389901 1.033826 0.084123 -0.784459
yVpjwdbPUs -1.378581 0.023519 0.232445 0.997671
9Uxq635Qoh -0.688115 -0.501850 0.448501 1.499331
cVnhCAJDOC -0.818257 0.604812 1.212600 -0.193976
4DtcFnbS7R -0.271737 -0.271426 -1.550968 NaN
Qx3NYRcSiM NaN 1.633423 -0.065663 0.106926
QTa8C71jES 0.023088 -0.670918 0.433008 0.808237
Contains datetime values
A B C D
2000-01-03 1.445706 -1.353331 1.165386 0.880761
2000-01-04 -0.149897 -0.372328 1.897515 0.881319
2000-01-05 -0.636263 0.111965 0.058803 -0.483541
2000-01-06 -1.266426 0.928829 0.581432 -0.425007
2000-01-07 -1.899367 -0.150057 -0.059025 -0.690115
2000-01-10 -0.698875 -0.204040 0.627140 -0.057270
2000-01-11 -0.794645 0.541554 0.218404 -1.947169
2000-01-12 1.333345 0.705871 -1.433714 0.104247
2000-01-13 -0.635115 -1.035685 1.368574 -0.024654
2000-01-14 -0.251699 1.876108 -0.124366 1.607618
2000-01-17 0.231413 0.379985 -0.655714 0.110139
2000-01-18 -0.603751 0.752618 1.424388 -0.595827
2000-01-19 -0.592408 1.333954 1.142796 0.137626
2000-01-20 0.713400 -1.682703 1.626093 0.759561
2000-01-21 -1.168555 -0.058453 -1.032029 -0.270963
2000-01-24 -0.096425 0.012959 0.713009 2.383044
2000-01-25 -0.576389 0.731732 -1.250500 -0.341308
2000-01-26 0.783341 1.192012 2.497985 0.983259
2000-01-27 -0.173925 1.226357 0.443635 2.187678
2000-01-28 0.563522 -0.032797 -0.009738 -0.097024
2000-01-31 -0.749619 0.145713 -0.536961 0.467059
2000-02-01 1.658369 0.033691 -1.061166 0.146395
2000-02-02 -0.870733 0.522847 2.163658 -0.361528
2000-02-03 -1.493919 -0.061150 0.240106 -0.688362
2000-02-04 -1.464016 1.015647 -0.069368 1.447646
2000-02-07 1.351198 0.012420 -0.195001 1.128699
2000-02-08 -0.328917 1.738869 -0.639959 0.247021
2000-02-09 0.650489 0.788300 -0.064558 -0.024048
2000-02-10 -0.232675 0.146647 -0.581446 -0.627599
2000-02-11 0.731539 2.195488 -1.130289 1.443619
Contains mixed values
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07

Trick 21: Split a string column into multiple columns

Go back to the Table of Contents

d = {"name":["John Artur Doe", "Jane Ann Smith", "Nico P"], "location":["Los Angeles, CA", "Washington, DC", "Barcelona, Spain"]}
df = pd.DataFrame(d)
df

df[["first", "middle", "last"]] = df["name"].str.split(" ", expand = True)
df["city"] = df["location"].str.split(",", expand = True)[0]
df
name location
0 John Artur Doe Los Angeles, CA
1 Jane Ann Smith Washington, DC
2 Nico P Barcelona, Spain
name location first middle last city
0 John Artur Doe Los Angeles, CA John Artur Doe Los Angeles
1 Jane Ann Smith Washington, DC Jane Ann Smith Washington
2 Nico P Barcelona, Spain Nico P None Barcelona
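
If some names have more than three parts, splitting on every space breaks the three-column assignment. Limiting the split with n is a hedged way around it:

df["name"].str.split(" ", n = 1, expand = True) # column 0 = first name, column 1 = everything else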

Trick 20: Create a datetime column from multiple columns

Go back to the Table of Contents

d = {"day":[1, 2, 10 ,25, 12], "month":[1, 2, 4, 5, 6], "year":[2000, 2001, 2010, 2015, 2020]}
df = pd.DataFrame(d)
df["date"] = pd.to_datetime(df[["day", "month", "year"]])
df
df.dtypes
day month year date
0 1 1 2000 2000-01-01
1 2 2 2001 2001-02-02
2 10 4 2010 2010-04-10
3 25 5 2015 2015-05-25
4 12 6 2020 2020-06-12
day               int64
month             int64
year              int64
date     datetime64[ns]
dtype: object

Trick 19: Show memory usage of a df and every column

Go back to the Table of Contents

df = generate_sample_data_datetime().reset_index()
df.columns = ["date", "sales", "customers"]
df

print("Show the global usage of memory of the df")
df.info(memory_usage = "deep")
print()
print("Show the usage of memory of every column")
df.memory_usage(deep = True)
date sales customers
0 2000-01-01 00:00:00 10 12
1 2000-01-01 01:00:00 15 9
2 2000-01-01 02:00:00 16 11
3 2000-01-01 03:00:00 6 9
4 2000-01-01 04:00:00 8 18
... ... ... ...
8755 2000-12-30 19:00:00 11 18
8756 2000-12-30 20:00:00 18 16
8757 2000-12-30 21:00:00 7 1
8758 2000-12-30 22:00:00 13 18
8759 2000-12-30 23:00:00 10 11

8760 rows × 3 columns

Show the global usage of memory of the df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 3 columns):
date         8760 non-null datetime64[ns]
sales        8760 non-null int64
customers    8760 non-null int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 205.4 KB

Show the usage of memory of every column
Index          128
date         70080
sales        70080
customers    70080
dtype: int64
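
Since sales and customers only hold small integers, downcasting is an easy memory win. A minimal sketch:

df["sales"] = pd.to_numeric(df["sales"], downcast = "integer") # int64 -> int8 here
df["customers"] = pd.to_numeric(df["customers"], downcast = "integer")
df.memory_usage(deep = True) # both columns drop from 70080 to 8760 bytes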

Trick 18: Read and write to a compressed file (csv.zip)

Go back to the Table of Contents

df = generate_sample_data()
df.head()

print("Writing data to a csv.zip file")
df.to_csv("trick18data.csv.zip")

print("Deleting df")
del df

print("Importing data from a csv.zip file")
df = pd.read_csv("/kaggle/working/trick18data.csv.zip", index_col=0)
df.head()

# other compression formats are supported too: .gz, .bz2, .xz
A B C D E F G
0 3 14 18 7 15 4 11
1 13 12 14 18 10 4 11
2 11 14 6 4 7 9 18
3 19 7 17 5 16 1 14
4 16 17 6 6 9 14 15
Writing data to a csv.zip file
Deleting df
Importing data from a csv.zip file
A B C D E F G
0 3 14 18 7 15 4 11
1 13 12 14 18 10 4 11
2 11 14 6 4 7 9 18
3 19 7 17 5 16 1 14
4 16 17 6 6 9 14 15

Trick 17: Select multiple rows and columns with loc

Go back to the Table of Contents

df = generate_sample_data()
print("Original df")
df

print("Using a slice (inclusive)")
df.loc[0:4, "A":"E"]

print("Using a list")
df.loc[[0,4], ["A","E"]]

print("Using a condition")
df.loc[df["A"] > 10, ["A","E"]]
Original df
A B C D E F G
0 3 1 11 12 5 16 9
1 7 16 15 15 5 1 5
2 14 5 3 11 9 17 9
3 9 5 3 3 5 13 2
4 14 18 6 10 17 19 16
5 8 3 1 16 5 7 4
6 11 4 17 16 12 17 5
7 10 18 7 8 7 1 10
8 18 15 5 10 13 17 4
9 4 7 19 11 8 4 14
10 13 16 13 9 1 13 8
11 14 1 6 17 15 14 18
12 2 14 13 19 19 13 3
13 10 5 19 4 4 6 14
14 17 15 1 10 13 9 1
15 1 17 14 2 14 12 18
16 14 18 11 4 8 6 8
17 8 15 11 2 17 2 9
18 9 9 19 9 6 18 1
19 9 9 17 10 12 10 17
Using a slice (inclusive)
A B C D E
0 3 1 11 12 5
1 7 16 15 15 5
2 14 5 3 11 9
3 9 5 3 3 5
4 14 18 6 10 17
Using a list
A E
0 3 5
4 14 17
Using a condition
A E
2 14 9
4 14 17
6 11 12
8 18 13
10 13 1
11 14 15
14 17 13
16 14 8

Trick 16: Convert continuous values to categorical (cut())

Go back to the Table of Contents

df = generate_sample_data()
df["A"] = df["A"] + 5
df.rename(columns = {"A":"age"}, inplace = True)
df.sample(5)

df["age_groups"] = pd.cut(df["age"], bins = [0, 18, 65, 99], labels = ["kids", "adult", "elderly"])
df
age B C D E F G
12 14 8 1 7 9 4 5
14 15 3 12 15 10 6 12
17 18 11 4 19 14 12 5
4 22 10 12 13 8 14 18
16 15 12 4 5 5 9 13
age B C D E F G age_groups
0 16 17 5 9 9 18 14 kids
1 10 10 10 8 11 11 16 kids
2 23 5 11 4 4 15 8 adult
3 22 16 8 19 9 19 14 adult
4 22 10 12 13 8 14 18 adult
5 17 4 17 10 9 13 11 kids
6 6 2 2 6 8 14 16 kids
7 8 5 11 9 3 12 6 kids
8 13 19 1 5 12 19 6 kids
9 12 13 13 16 1 15 14 kids
10 11 8 5 8 7 17 15 kids
11 14 4 10 15 19 1 13 kids
12 14 8 1 7 9 4 5 kids
13 22 2 2 12 19 11 12 adult
14 15 3 12 15 10 6 12 kids
15 7 16 6 5 16 14 15 kids
16 15 12 4 5 5 9 13 kids
17 18 11 4 19 14 12 5 kids
18 6 8 8 1 7 3 19 kids
19 8 12 15 3 13 9 11 kids
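
cut() uses the bin edges you give it; qcut() instead picks the edges from quantiles so the bins have (roughly) equal counts. A sketch:

pd.qcut(df["age"], q = 4) # four quantile-based age groups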

Trick 15: Reshape a MultiIndex df (unstack())

Go back to the Table of Contents

df = pd.read_csv("/kaggle/input/titanic/train.csv")
print("Original df")
df.head()

print("Groupby and create a MultiIndex df")
print("Notice we have a df with MultiIndex (Sex and Pclass)")
df.groupby(["Sex", "Pclass"])["Survived"].mean().to_frame()

print("Reshaping using unstack")
print("Now we can interact with it like with a normal df")
df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()
Original df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Groupby and create a MultiIndex df
Notice we have a df with MultiIndex (Sex and Pclass)
Survived
Sex Pclass
female 1 0.968085
2 0.921053
3 0.500000
male 1 0.368852
2 0.157407
3 0.135447
Reshaping using unstack
Now we can interact with it like a normal df
Pclass 1 2 3
Sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
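
Side note: the same reshaped table can be produced in one step with pivot_table. A minimal equivalent sketch:

df.pivot_table(index = "Sex", columns = "Pclass", values = "Survived", aggfunc = "mean")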

Trick 14: Creating toy df (3 methods)

Go back to the Table of Contents

# Method 1: from a dict
pd.DataFrame({"A":[10 ,20], "B":[30, 40]})

# Method 2: using numpy
pd.DataFrame(np.random.rand(2, 3), columns = list("ABC"))

# Method 3: using pandas builtin functionalities
pd.util.testing.makeMixedDataFrame()
A B
0 10 30
1 20 40
A B C
0 0.850267 0.996769 0.995248
1 0.331817 0.009783 0.261080
A B C D
0 0.0 0.0 foo1 2009-01-01
1 1.0 1.0 foo2 2009-01-02
2 2.0 0.0 foo3 2009-01-05
3 3.0 1.0 foo4 2009-01-06
4 4.0 0.0 foo5 2009-01-07
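
Side note: pd.util.testing ships a few more generators (these helpers are deprecated in newer pandas versions, so treat this as a sketch):

pd.util.testing.makeDataFrame().head()     # random floats with a string index
pd.util.testing.makeTimeDataFrame().head() # random floats with a DatetimeIndex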

Trick 13: Avoid the series of lists TRAP

Go back to the Table of Contents

d = {"A":[1, 2, 3], "B":[[10, 20], [40, 50], [60, 70]]}
df = pd.DataFrame(d)
print("Notice that the column B has as values lists")
df
print("Convert it to normal series")
df_ = df["B"].apply(pd.Series)
df_

print("Join the 2 df")
pd.merge(df, df_, left_index = True, right_index = True)
Notice that column B contains lists as values
A B
0 1 [10, 20]
1 2 [40, 50]
2 3 [60, 70]
Expand the lists into separate columns
0 1
0 10 20
1 40 50
2 60 70
Join the 2 df
A B 0 1
0 1 [10, 20] 10 20
1 2 [40, 50] 40 50
2 3 [60, 70] 60 70
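
Side note: apply(pd.Series) is slow on large dfs; building the new columns from a list of lists is usually much faster. A minimal alternative sketch (df_fast and the B1/B2 names are made up):

df_fast = pd.DataFrame(df["B"].tolist(), index = df.index, columns = ["B1", "B2"])
df.join(df_fast)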

Trick 12: Merging datasets and check uniqueness

Go back to the Table of Contents

df = generate_sample_data()[:10]
df1 = df.copy(deep = True)
df = df.drop([0, 1, 2])
df1 = df1.drop([8, 9])
df
df1

df_one_to_one = pd.merge(df, df1, validate = "one_to_one")
df_one_to_one

df_one_to_many = pd.merge(df, df1, validate = "one_to_many")
df_one_to_many

df_many_to_one = pd.merge(df, df1, validate = "many_to_one")
df_many_to_one
A B C D E F G
3 1 10 7 15 17 3 18
4 7 1 4 3 6 8 9
5 3 15 5 6 10 18 18
6 19 2 9 14 13 7 7
7 3 2 14 19 5 5 15
8 5 1 5 3 8 8 12
9 14 12 1 16 17 3 11
A B C D E F G
0 2 18 8 3 7 18 6
1 11 17 11 2 17 11 6
2 7 14 13 2 17 9 14
3 1 10 7 15 17 3 18
4 7 1 4 3 6 8 9
5 3 15 5 6 10 18 18
6 19 2 9 14 13 7 7
7 3 2 14 19 5 5 15
A B C D E F G
0 1 10 7 15 17 3 18
1 7 1 4 3 6 8 9
2 3 15 5 6 10 18 18
3 19 2 9 14 13 7 7
4 3 2 14 19 5 5 15
A B C D E F G
0 1 10 7 15 17 3 18
1 7 1 4 3 6 8 9
2 3 15 5 6 10 18 18
3 19 2 9 14 13 7 7
4 3 2 14 19 5 5 15
A B C D E F G
0 1 10 7 15 17 3 18
1 7 1 4 3 6 8 9
2 3 15 5 6 10 18 18
3 19 2 9 14 13 7 7
4 3 2 14 19 5 5 15
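
Side note: when the uniqueness check fails, pandas raises a MergeError instead of silently duplicating rows. A minimal sketch with deliberately duplicated keys (the toy frames below are made up for illustration):

left = pd.DataFrame({"key": [1, 1, 2], "val_left": [10, 11, 20]})
right = pd.DataFrame({"key": [1, 2], "val_right": [100, 200]})

try:
    pd.merge(left, right, on = "key", validate = "one_to_one")
except pd.errors.MergeError as e:
    print("Validation failed:", e)

# many_to_one passes: keys may repeat on the left but must be unique on the right
pd.merge(left, right, on = "key", validate = "many_to_one")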

Trick 11: Rename all columns with the same pattern

Go back to the Table of Contents

print_files()
df = pd.read_csv("/kaggle/input/titanic/train.csv")
df.columns = ["Passenger ID", "Survived", "Pclass", "Name         ", "Sex", "Age", "Sib SP", "Parch", "Ticket", "Fare", "Cabin", "Embarked"] # creating column names for the example
df
df1 = df.copy(deep = True)

print("Replace all spaces with undescore and convert to lower")
print("Notice the Passenger and Sib SP column now has underscore")
df.columns = df.columns.str.replace(" ", "_").str.lower()
df.head()

print("Remove trailing (at the end) whitesapce and convert to lower")
print("Notice the Passenger and Sib SP column now has underscore")
df1.columns = df1.columns.str.lower().str.rstrip()
df1.head()
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
Passenger ID Survived Pclass Name Sex Age Sib SP Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Replace all spaces with underscore and convert to lower
Notice the Passenger ID and Sib SP columns now have underscores
passenger_id survived pclass name_________ sex age sib_sp parch ticket fare cabin embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Remove trailing (at the end) whitespace and convert to lower
Notice the trailing whitespace after Name is gone
passenger id survived pclass name sex age sib sp parch ticket fare cabin embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
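
Side note: both clean-ups can be combined in a single pass by giving rename a function. A minimal sketch (df2 is just a throwaway frame for illustration):

df2 = pd.DataFrame(columns = ["Passenger ID", "Name         ", "Sib SP"])
df2.rename(columns = lambda c: c.strip().lower().replace(" ", "_")).columns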

Trick 10: Check the equality of 2 series

Go back to the Table of Contents

df = generate_sample_data()[["A", "B"]][:5]
df["A"] = pd.Series([15, 15, 18, np.nan, 12])
df["B"] = pd.Series([15, 15, 18, np.nan, 12])
df

print("Don't use ==, it does not handle NaN properly")
print("Notice that element 4 of each list is np.nan but == still returns False")
df["A"] == df["B"]

print("Using equals. Now we get True, so the 2 series are equal")
df["A"].equals(df["B"])

print("Equals also works for df")
df1 = df.copy(deep = True)
df.equals(df1)

print("== of df has the same issue as for series")
df == df1
A B
0 15.0 15.0
1 15.0 15.0
2 18.0 18.0
3 NaN NaN
4 12.0 12.0
Don't use ==, it does not handle NaN properly
Notice that the element at index 3 of each series is np.nan, but == still returns False
0     True
1     True
2     True
3    False
4     True
dtype: bool
Using equals. Now we get True, so the 2 series are equal
True
Equals also works for df
True
== on a df has the same issue as with series
A B
0 True True
1 True True
2 True True
3 False False
4 True True
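
Side note: for tests you may prefer an assertion with a readable diff over a plain boolean; pandas ships testing helpers for that. A minimal sketch:

from pandas.testing import assert_frame_equal, assert_series_equal

assert_series_equal(df["A"], df["B"], check_names = False) # passes: NaN positions match; series names are ignored
assert_frame_equal(df, df1) # would raise AssertionError with a diff if they differed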

Trick 9: Reduce memory usage of a df while importing !!!duplicated Trick 83!!!

Go back to the Table of Contents

print_files()

df = pd.read_csv("/kaggle/input/imdb-data/IMDB-Movie-Data.csv", \
                 usecols = ["Title", "Genre", "Year", "Metascore", "Revenue (Millions)"])
df.dtypes
df.memory_usage(deep = True)

print("Importing only a few columns and converting to proper dtype")
df = pd.read_csv("/kaggle/input/imdb-data/IMDB-Movie-Data.csv", \
                 usecols = ["Title", "Genre", "Year", "Metascore", "Revenue (Millions)"], \
                dtype = {"Genre":"category", "Metascore":"Int64", "Year":"int8"})
df.dtypes
df.memory_usage(deep = True) # notice how Genre and Year are consuming now less memory
/kaggle/input/imdb-data/IMDB-Movie-Data.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv
/kaggle/input/us-accidents/US_Accidents_Dec19.csv
/kaggle/input/drinks-by-country/drinksbycountry.csv
/kaggle/input/imdbmovies/imdb.csv
/kaggle/input/trick58data/trick58data.csv
Title                  object
Genre                  object
Year                    int64
Revenue (Millions)    float64
Metascore             float64
dtype: object
Index                   128
Title                 71731
Genre                 75095
Year                   8000
Revenue (Millions)     8000
Metascore              8000
dtype: int64
Importing only a few columns and converting to proper dtype
Title                   object
Genre                 category
Year                      int8
Revenue (Millions)     float64
Metascore                Int64
dtype: object
Index                   128
Title                 71731
Genre                 28020
Year                   1000
Revenue (Millions)     8000
Metascore              9000
dtype: int64
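
Side note: if you don't know good dtypes upfront, you can also downcast after loading with pd.to_numeric. A minimal sketch:

df = pd.read_csv("/kaggle/input/imdb-data/IMDB-Movie-Data.csv", usecols = ["Year", "Revenue (Millions)"])
df["Year"] = pd.to_numeric(df["Year"], downcast = "integer") # picks the smallest integer type that fits the data
df["Revenue (Millions)"] = pd.to_numeric(df["Revenue (Millions)"], downcast = "float")
df.dtypes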

Trick 8: Using glob to generate a df from multiple files !!!duplicated Trick 78!!!

Go back to the Table of Contents

# let's generate some fake data
df1 = generate_sample_data()
df2 = generate_sample_data()
df3 = generate_sample_data()
# df1.head()
# df2.head()
# df3.head()
df1.to_csv("trick8data1.csv", index = False)
df2.to_csv("trick8data2.csv", index = False)
df3.to_csv("trick8data3.csv", index = False)

# Step 1: generate a list with the file names
lf = []
for _,_, files in os.walk("/kaggle/working/"):
    for f in files:
        if "trick8data" in f:
            lf.append(f)
            
lf

# You can use this on your local machine
# from glob import glob
# files = glob("trick8data*.csv")

# Step 2: same as in trick 78, except we don't add a column recording each row's origin (the file it came from)
df = pd.concat((pd.read_csv(file) for file in lf), ignore_index = True)
df
['trick8data3.csv', 'trick8data2.csv', 'trick8data1.csv']
A B C D E F G
0 7 18 15 6 5 3 13
1 17 8 9 19 15 18 5
2 19 11 17 18 1 10 11
3 17 9 1 13 19 11 5
4 18 14 2 11 13 17 5
5 17 5 8 7 16 16 7
6 9 1 6 16 8 4 9
7 7 12 11 13 17 19 18
8 3 1 3 16 18 14 7
9 16 17 17 9 16 18 12
10 8 7 4 11 7 13 10
11 13 18 1 17 2 19 8
12 15 2 2 15 1 10 16
13 9 13 3 10 1 1 19
14 14 3 4 17 1 2 8
15 17 5 6 6 3 7 7
16 13 7 15 6 9 5 18
17 18 11 9 19 8 16 8
18 12 19 1 15 7 18 7
19 14 8 2 8 19 16 8
20 3 3 14 1 1 14 15
21 17 14 1 13 1 16 9
22 8 11 19 13 17 3 10
23 4 17 6 9 5 17 15
24 8 11 5 3 8 11 2
25 13 17 14 8 9 10 4
26 16 19 8 10 9 10 9
27 8 10 10 11 18 19 4
28 2 13 13 19 10 16 14
29 15 8 14 10 5 6 19
30 1 12 19 1 11 10 4
31 12 15 13 3 7 13 1
32 12 8 10 9 18 11 3
33 10 3 3 5 3 7 15
34 16 6 6 8 7 7 4
35 1 12 13 5 18 6 15
36 11 19 19 19 16 6 18
37 18 9 4 12 16 1 12
38 1 10 13 1 16 4 1
39 11 3 18 9 16 10 8
40 19 17 16 14 5 16 12
41 8 16 5 10 5 10 11
42 12 16 12 5 5 8 5
43 1 13 18 2 19 2 7
44 7 11 18 8 14 17 7
45 6 13 4 5 11 4 14
46 2 12 4 13 6 12 13
47 11 15 3 19 9 13 2
48 18 13 11 7 17 13 5
49 17 15 10 5 8 18 10
50 8 10 13 17 10 15 7
51 12 11 9 1 6 12 7
52 6 13 18 4 7 17 4
53 14 1 19 4 18 2 13
54 18 19 5 10 4 3 3
55 15 17 6 14 7 7 11
56 6 11 9 2 15 1 1
57 10 8 12 16 17 4 17
58 18 5 5 6 18 8 6
59 13 8 16 19 17 3 4
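
Side note: on a local machine, the glob version with the origin column from trick 78 would look roughly like this (a sketch; the origin column name is made up):

from glob import glob

files = sorted(glob("trick8data*.csv")) # sorted for a deterministic row order
df = pd.concat((pd.read_csv(f).assign(origin = f) for f in files), ignore_index = True)
df["origin"].value_counts()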

Trick 7: Dealing with missing values (NaN)

Go back to the Table of Contents

df = pd.util.testing.makeMissingDataframe().reset_index() # contains missing values
df.rename(columns = {"index":"A"})
df1 = df.copy(deep = True)
df

print("Calculate the % of missing values in each row")
df.isna().mean() # calculate the % of missing values in each row
print("Droping any columns that have missing values. Only column A wil remain")
df.dropna(axis = "columns") # drop any column that has missing values
print("Droping any rows that have missing values.")
df1.dropna(axis = "rows") # drop any row that has missing values
print("Droping column where missing values are above a threshold")
df.dropna(thresh = len(df)*0.95, axis = "columns") # drop any row that has missing values
A A B C D
0 FTwK90BRwl 1.042050 0.408007 -1.147120 -0.564567
1 VC1JaJR8Tr -0.604667 -0.395873 1.437012 -1.066924
2 4ldakZDlTn NaN 1.650980 0.187050 1.146064
3 88YTk4a4fT 1.984666 -0.634435 -0.002925 0.792842
4 dOKulhCyER 2.003698 0.430636 0.420581 -0.334050
5 2JiimW6kNU NaN 0.488798 -0.059070 -1.221668
6 tQcgfLGwWf 0.950519 0.634764 -0.030417 -0.068450
7 px36j6UN0c 0.594503 -0.587349 -0.116367 0.653159
8 cis6GIEh7N -0.616240 -0.282388 -0.893210 -0.614391
9 WFdRebhDqQ -0.975334 0.975863 1.612233 -0.145750
10 gawAWG9prf -0.055484 1.177242 -1.193360 -1.703635
11 a2nzw9dyCo -0.770340 -1.239756 NaN -0.221478
12 QlULxltAJa 1.761485 NaN 1.611870 -1.225177
13 uaoJBS9Vdz -0.262542 0.621016 0.515486 0.903102
14 zZ7selauoP 0.191567 -0.470315 0.670099 0.929469
15 qKTv7Kuvfe -1.212089 -1.268654 -1.704230 -0.456514
16 c7znWElcfJ -0.063340 -0.404707 -0.886941 -1.474963
17 xVv1LU9Bo1 1.060293 NaN 0.593362 NaN
18 MnPExNoCwN 1.108309 NaN 1.876407 0.367690
19 yqeGxDHWtf -1.745035 1.579823 1.271711 NaN
20 1PVdW9klzR -2.025657 NaN 0.110363 NaN
21 sDx6W9Gyz0 1.772609 0.157053 0.633882 0.873281
22 xzivmL9h6q 1.506744 -0.058896 0.100016 -0.606444
23 rREE9R8Ji9 -0.518568 0.658888 1.523436 0.199041
24 ZeQO22I6H2 1.578719 0.271076 0.139141 NaN
25 JfyBs5Nscx -0.829749 1.999536 -0.576630 NaN
26 M8OWlHU7gM -0.913917 -1.393292 -0.315702 -0.183039
27 Mc6NPdg9wK 0.346991 0.120889 0.790559 -0.690845
28 Rgm1pCHg0b -0.307749 0.711548 1.807490 -0.391109
29 OUmCvUHky6 -1.200078 -1.327379 -0.792885 -0.447456
index A B C D
0 FTwK90BRwl 1.042050 0.408007 -1.147120 -0.564567
1 VC1JaJR8Tr -0.604667 -0.395873 1.437012 -1.066924
2 4ldakZDlTn NaN 1.650980 0.187050 1.146064
3 88YTk4a4fT 1.984666 -0.634435 -0.002925 0.792842
4 dOKulhCyER 2.003698 0.430636 0.420581 -0.334050
5 2JiimW6kNU NaN 0.488798 -0.059070 -1.221668
6 tQcgfLGwWf 0.950519 0.634764 -0.030417 -0.068450
7 px36j6UN0c 0.594503 -0.587349 -0.116367 0.653159
8 cis6GIEh7N -0.616240 -0.282388 -0.893210 -0.614391
9 WFdRebhDqQ -0.975334 0.975863 1.612233 -0.145750
10 gawAWG9prf -0.055484 1.177242 -1.193360 -1.703635
11 a2nzw9dyCo -0.770340 -1.239756 NaN -0.221478
12 QlULxltAJa 1.761485 NaN 1.611870 -1.225177
13 uaoJBS9Vdz -0.262542 0.621016 0.515486 0.903102
14 zZ7selauoP 0.191567 -0.470315 0.670099 0.929469
15 qKTv7Kuvfe -1.212089 -1.268654 -1.704230 -0.456514
16 c7znWElcfJ -0.063340 -0.404707 -0.886941 -1.474963
17 xVv1LU9Bo1 1.060293 NaN 0.593362 NaN
18 MnPExNoCwN 1.108309 NaN 1.876407 0.367690
19 yqeGxDHWtf -1.745035 1.579823 1.271711 NaN
20 1PVdW9klzR -2.025657 NaN 0.110363 NaN
21 sDx6W9Gyz0 1.772609 0.157053 0.633882 0.873281
22 xzivmL9h6q 1.506744 -0.058896 0.100016 -0.606444
23 rREE9R8Ji9 -0.518568 0.658888 1.523436 0.199041
24 ZeQO22I6H2 1.578719 0.271076 0.139141 NaN
25 JfyBs5Nscx -0.829749 1.999536 -0.576630 NaN
26 M8OWlHU7gM -0.913917 -1.393292 -0.315702 -0.183039
27 Mc6NPdg9wK 0.346991 0.120889 0.790559 -0.690845
28 Rgm1pCHg0b -0.307749 0.711548 1.807490 -0.391109
29 OUmCvUHky6 -1.200078 -1.327379 -0.792885 -0.447456
Calculate the % of missing values in each column
index    0.000000
A        0.066667
B        0.133333
C        0.033333
D        0.166667
dtype: float64
Dropping any columns that have missing values. Only the index column will remain
index
0 FTwK90BRwl
1 VC1JaJR8Tr
2 4ldakZDlTn
3 88YTk4a4fT
4 dOKulhCyER
5 2JiimW6kNU
6 tQcgfLGwWf
7 px36j6UN0c
8 cis6GIEh7N
9 WFdRebhDqQ
10 gawAWG9prf
11 a2nzw9dyCo
12 QlULxltAJa
13 uaoJBS9Vdz
14 zZ7selauoP
15 qKTv7Kuvfe
16 c7znWElcfJ
17 xVv1LU9Bo1
18 MnPExNoCwN
19 yqeGxDHWtf
20 1PVdW9klzR
21 sDx6W9Gyz0
22 xzivmL9h6q
23 rREE9R8Ji9
24 ZeQO22I6H2
25 JfyBs5Nscx
26 M8OWlHU7gM
27 Mc6NPdg9wK
28 Rgm1pCHg0b
29 OUmCvUHky6
Dropping any rows that have missing values.
index A B C D
0 FTwK90BRwl 1.042050 0.408007 -1.147120 -0.564567
1 VC1JaJR8Tr -0.604667 -0.395873 1.437012 -1.066924
3 88YTk4a4fT 1.984666 -0.634435 -0.002925 0.792842
4 dOKulhCyER 2.003698 0.430636 0.420581 -0.334050
6 tQcgfLGwWf 0.950519 0.634764 -0.030417 -0.068450
7 px36j6UN0c 0.594503 -0.587349 -0.116367 0.653159
8 cis6GIEh7N -0.616240 -0.282388 -0.893210 -0.614391
9 WFdRebhDqQ -0.975334 0.975863 1.612233 -0.145750
10 gawAWG9prf -0.055484 1.177242 -1.193360 -1.703635
13 uaoJBS9Vdz -0.262542 0.621016 0.515486 0.903102
14 zZ7selauoP 0.191567 -0.470315 0.670099 0.929469
15 qKTv7Kuvfe -1.212089 -1.268654 -1.704230 -0.456514
16 c7znWElcfJ -0.063340 -0.404707 -0.886941 -1.474963
21 sDx6W9Gyz0 1.772609 0.157053 0.633882 0.873281
22 xzivmL9h6q 1.506744 -0.058896 0.100016 -0.606444
23 rREE9R8Ji9 -0.518568 0.658888 1.523436 0.199041
26 M8OWlHU7gM -0.913917 -1.393292 -0.315702 -0.183039
27 Mc6NPdg9wK 0.346991 0.120889 0.790559 -0.690845
28 Rgm1pCHg0b -0.307749 0.711548 1.807490 -0.391109
29 OUmCvUHky6 -1.200078 -1.327379 -0.792885 -0.447456
Dropping columns where missing values are above a threshold
index C
0 FTwK90BRwl -1.147120
1 VC1JaJR8Tr 1.437012
2 4ldakZDlTn 0.187050
3 88YTk4a4fT -0.002925
4 dOKulhCyER 0.420581
5 2JiimW6kNU -0.059070
6 tQcgfLGwWf -0.030417
7 px36j6UN0c -0.116367
8 cis6GIEh7N -0.893210
9 WFdRebhDqQ 1.612233
10 gawAWG9prf -1.193360
11 a2nzw9dyCo NaN
12 QlULxltAJa 1.611870
13 uaoJBS9Vdz 0.515486
14 zZ7selauoP 0.670099
15 qKTv7Kuvfe -1.704230
16 c7znWElcfJ -0.886941
17 xVv1LU9Bo1 0.593362
18 MnPExNoCwN 1.876407
19 yqeGxDHWtf 1.271711
20 1PVdW9klzR 0.110363
21 sDx6W9Gyz0 0.633882
22 xzivmL9h6q 0.100016
23 rREE9R8Ji9 1.523436
24 ZeQO22I6H2 0.139141
25 JfyBs5Nscx -0.576630
26 M8OWlHU7gM -0.315702
27 Mc6NPdg9wK 0.790559
28 Rgm1pCHg0b 1.807490
29 OUmCvUHky6 -0.792885
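
Side note: dropping is not the only option, missing values can also be filled. A minimal sketch of common strategies on column B of the df above:

print("Fill with a constant")
df["B"].fillna(0).head()

print("Fill with the column mean")
df["B"].fillna(df["B"].mean()).head()

print("Forward fill from the previous row")
df["B"].fillna(method = "ffill").head()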

Trick 6: Split a df into 2 random subsets

Go back to the Table of Contents

df = generate_sample_data()
df_1 = df.sample(frac = 0.7)
df_2 = df.drop(df_1.index) # only works if the df index is unique

df.shape
df_1.shape
df_2.shape
(20, 7)
(14, 7)
(6, 7)
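
Side note: a quick sanity check that the two subsets really partition the original df (this relies on the unique index mentioned above):

assert len(df_1) + len(df_2) == len(df)
assert df_1.index.intersection(df_2.index).empty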

Trick 5: Convert numbers stored as strings (coerce)

Go back to the Table of Contents

d = {"col1":["1", "2", "3", "stuff"], "col2":["1", "2", "3", "4"]}
df = pd.DataFrame(d)
df.astype({"col2":"int"}) # this will fail for col1 --> ValueError: invalid literal for int() with base 10: 'stuff'

print("Notice that now stuff got converted to NaN")
df.apply(pd.to_numeric, errors = "coerce")
col1 col2
0 1 1
1 2 2
2 3 3
3 stuff 4
Notice that now stuff got converted to NaN
col1 col2
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
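
Side note: if only some columns need the conversion, target them directly instead of applying it to the whole df. A minimal sketch:

df["col1"] = pd.to_numeric(df["col1"], errors = "coerce")
df.dtypes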

Trick 4: Select columns by dtype

Go back to the Table of Contents

df = generate_sample_data_datetime()[:10].reset_index()
df["string_col"] = list("ABCDEABCDE")
df["sales"] = df["sales"].astype("float")
print("Original df")
df

print("Select numerical columns")
df.select_dtypes(include = "number")

print("Select string columns")
df.select_dtypes(include = "object")

print("Select datetime columns")
df.select_dtypes(include = ["datetime", "timedelta"])

print("Select miscelaneous")
df.select_dtypes(include = ["number", "object", "datetime", "timedelta"])

print("Select by passing the dtypes you need")
df.select_dtypes(include = ["int8", "int16", "int32", "int64", "float"])
Original df
index sales customers string_col
0 2000-01-01 00:00:00 12.0 17 A
1 2000-01-01 01:00:00 5.0 3 B
2 2000-01-01 02:00:00 3.0 19 C
3 2000-01-01 03:00:00 12.0 12 D
4 2000-01-01 04:00:00 11.0 3 E
5 2000-01-01 05:00:00 12.0 6 A
6 2000-01-01 06:00:00 5.0 10 B
7 2000-01-01 07:00:00 17.0 17 C
8 2000-01-01 08:00:00 13.0 3 D
9 2000-01-01 09:00:00 19.0 5 E
Select numerical columns
sales customers
0 12.0 17
1 5.0 3
2 3.0 19
3 12.0 12
4 11.0 3
5 12.0 6
6 5.0 10
7 17.0 17
8 13.0 3
9 19.0 5
Select string columns
string_col
0 A
1 B
2 C
3 D
4 E
5 A
6 B
7 C
8 D
9 E
Select datetime columns
index
0 2000-01-01 00:00:00
1 2000-01-01 01:00:00
2 2000-01-01 02:00:00
3 2000-01-01 03:00:00
4 2000-01-01 04:00:00
5 2000-01-01 05:00:00
6 2000-01-01 06:00:00
7 2000-01-01 07:00:00
8 2000-01-01 08:00:00
9 2000-01-01 09:00:00
Select miscellaneous
index sales customers string_col
0 2000-01-01 00:00:00 12.0 17 A
1 2000-01-01 01:00:00 5.0 3 B
2 2000-01-01 02:00:00 3.0 19 C
3 2000-01-01 03:00:00 12.0 12 D
4 2000-01-01 04:00:00 11.0 3 E
5 2000-01-01 05:00:00 12.0 6 A
6 2000-01-01 06:00:00 5.0 10 B
7 2000-01-01 07:00:00 17.0 17 C
8 2000-01-01 08:00:00 13.0 3 D
9 2000-01-01 09:00:00 19.0 5 E
Select by passing the dtypes you need
sales customers
0 12.0 17
1 5.0 3
2 3.0 19
3 12.0 12
4 11.0 3
5 12.0 6
6 5.0 10
7 17.0 17
8 13.0 3
9 19.0 5
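
Side note: select_dtypes also takes an exclude argument, which is handy for "everything but" selections. A short sketch:

print("Select everything except numerical columns")
df.select_dtypes(exclude = "number")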

Trick 3: Filter a df by multiple conditions (isin and inverse using ~)

Go back to the Table of Contents

df = generate_sample_data()[:5]
df["A"] = [1, 2, 3, 4, 5]

print("Filter using multiple |")
df[(df["A"] == 1) | (df["A"] == 3)]

print("Filter using isin")
df[df["A"].isin([1, 3])]

print("Invert using ~ (ctrl + alt + 4)")
df[~df["A"].isin([1, 3])]
Filter using multiple |
A B C D E F G
0 1 10 10 17 1 8 17
2 3 3 4 19 18 15 8
Filter using isin
A B C D E F G
0 1 10 10 17 1 8 17
2 3 3 4 19 18 15 8
Invert using ~ (ctrl + alt + 4)
A B C D E F G
1 2 17 14 9 17 5 2
3 4 17 3 2 8 12 1
4 5 2 17 12 2 10 17
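
Side note: the same filters can be written with query(), which some find more readable. A minimal sketch:

df.query("A in [1, 3]")     # same as isin
df.query("A not in [1, 3]") # same as the ~ inversion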

Trick 2: Reverse order of a df

Go back to the Table of Contents

df = generate_sample_data()[:5]
df

print("Reverse column order")
df.loc[:, ::-1]

print("Reverse row order")
df.loc[::-1]

print("Reverse row order and reset index")
df.loc[::-1].reset_index(drop = True)
A B C D E F G
0 3 8 12 2 16 1 19
1 17 12 18 11 9 17 2
2 13 1 2 18 3 2 9
3 11 10 9 4 16 8 14
4 3 5 9 6 5 19 6
Reverse column order
G F E D C B A
0 19 1 16 2 12 8 3
1 2 17 9 11 18 12 17
2 9 2 3 18 2 1 13
3 14 8 16 4 9 10 11
4 6 19 5 6 9 5 3
Reverse row order
A B C D E F G
4 3 5 9 6 5 19 6
3 11 10 9 4 16 8 14
2 13 1 2 18 3 2 9
1 17 12 18 11 9 17 2
0 3 8 12 2 16 1 19
Reverse row order and reset index
A B C D E F G
0 3 5 9 6 5 19 6
1 11 10 9 4 16 8 14
2 13 1 2 18 3 2 9
3 17 12 18 11 9 17 2
4 3 8 12 2 16 1 19

Trick 1: Add a prefix or suffix to all columns

Go back to the Table of Contents

df = generate_sample_data()[:5]
print("Original df")
df

print("Add prefix")
df.add_prefix("1_")

print("Add suffix")
df.add_suffix("_Z")
Original df
A B C D E F G
0 7 19 9 18 5 1 18
1 6 17 4 17 18 3 2
2 7 1 14 2 2 13 9
3 15 15 16 2 14 2 7
4 17 19 16 19 19 8 19
Add prefix
1_A 1_B 1_C 1_D 1_E 1_F 1_G
0 7 19 9 18 5 1 18
1 6 17 4 17 18 3 2
2 7 1 14 2 2 13 9
3 15 15 16 2 14 2 7
4 17 19 16 19 19 8 19
Add suffix
A_Z B_Z C_Z D_Z E_Z F_Z G_Z
0 7 19 9 18 5 1 18
1 6 17 4 17 18 3 2
2 7 1 14 2 2 13 9
3 15 15 16 2 14 2 7
4 17 19 16 19 19 8 19

The End

Thanks a lot. If you made it to the end, you have learned a lot of pandas!