Pandas Cheat Sheet

Essential commands for Python Pandas: DataFrames, Series, data manipulation, filtering, grouping, merging, pivoting, and I/O operations. Quick reference for data analysis.

Creating DataFrames

import pandas as pd
import numpy as np

# From a dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# From a list of lists
df = pd.DataFrame(
    [['Alice', 25], ['Bob', 30]],
    columns=['name', 'age']
)

# From a list of dicts
df = pd.DataFrame([
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
])

# From a NumPy array
df = pd.DataFrame(
    np.random.randn(5, 3),
    columns=['a', 'b', 'c']
)

# From files
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_sql('SELECT * FROM users', connection)

# Create a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')

Inspecting Data

| Command | Description |
| --- | --- |
| `df.head(n)` | First n rows (default 5) |
| `df.tail(n)` | Last n rows (default 5) |
| `df.shape` | Tuple of (rows, columns): `(100, 5)` |
| `df.info()` | Column names, dtypes, non-null counts, memory usage |
| `df.describe()` | Summary statistics (count, mean, std, min, max, quartiles) |
| `df.dtypes` | Data type of each column |
| `df.columns` | Index of column names |
| `df.index` | Index (row labels) |
| `df.values` | Underlying NumPy array |
| `df.nunique()` | Number of unique values per column |
| `df['col'].value_counts()` | Frequency count of unique values |
| `df.sample(n)` | Random sample of n rows |
| `df.memory_usage(deep=True)` | Memory usage per column in bytes |
| `len(df)` | Number of rows |
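As a quick sanity check, the most common inspection calls can be run on a toy frame (the sample names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
})

print(df.shape)               # (3, 2)
print(df.head(2))             # first two rows
print(df['name'].nunique())   # 3
print(df['age'].value_counts())
```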

Selecting Data

| Command | Description |
| --- | --- |
| `df['col']` | Select single column (returns Series) |
| `df[['col1', 'col2']]` | Select multiple columns (returns DataFrame) |
| `df.loc[label]` | Select row by label |
| `df.loc[label, 'col']` | Select specific value by row label and column name |
| `df.loc['a':'c']` | Slice rows by label (inclusive on both ends) |
| `df.loc[:, 'col1':'col3']` | Slice columns by label |
| `df.iloc[0]` | Select row by integer position |
| `df.iloc[0:3]` | Slice rows by position (exclusive end) |
| `df.iloc[0, 2]` | Select value by row and column position |
| `df.iloc[:, 1:4]` | Slice columns by position |
| `df.at['row', 'col']` | Fast scalar access by label |
| `df.iat[0, 2]` | Fast scalar access by position |
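A small sketch contrasting the label-based and position-based accessors above (the index labels are chosen to make the slicing difference visible):

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 30, 35], 'city': ['NYC', 'LA', 'Chicago']},
    index=['a', 'b', 'c'],
)

print(df.loc['b', 'city'])   # 'LA' -- label-based
print(df.iloc[1, 1])         # 'LA' -- position-based
print(len(df.loc['a':'b']))  # 2 rows: label slice is inclusive
print(len(df.iloc[0:2]))     # 2 rows: position slice excludes the end
```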

Filtering

# Boolean condition
df[df['age'] > 30]

# Multiple conditions (use & for AND, | for OR, ~ for NOT)
df[(df['age'] > 25) & (df['city'] == 'NYC')]
df[(df['age'] < 20) | (df['age'] > 60)]
df[~df['city'].isin(['LA', 'Chicago'])]

# isin — match against a list
df[df['city'].isin(['NYC', 'LA', 'Chicago'])]

# between — inclusive range
df[df['age'].between(25, 35)]

# query — string expression (cleaner syntax)
df.query('age > 25 and city == "NYC"')
df.query('age in [25, 30, 35]')

# String filtering
df[df['name'].str.contains('ali', case=False)]
df[df['name'].str.startswith('A')]
df[df['email'].str.endswith('.com')]
df[df['name'].str.match(r'^[A-C]')]   # regex match

# Null filtering
df[df['col'].isna()]        # rows where col is NaN
df[df['col'].notna()]       # rows where col is not NaN

# nlargest / nsmallest (filter + sort in one step)
df.nlargest(10, 'salary')
df.nsmallest(5, 'age')

Modifying Data

# Add a new column
df['full_name'] = df['first'] + ' ' + df['last']
df['tax'] = df['salary'] * 0.2

# Conditional column (np.where)
df['senior'] = df['age'] >= 60                               # boolean column
df['level'] = np.where(df['age'] >= 60, 'senior', 'regular')

# assign — returns a new DataFrame (chainable)
df = df.assign(bonus=df['salary'] * 0.1, active=True)

# Drop columns
df = df.drop(columns=['temp_col', 'unused'])

# Drop rows by index
df = df.drop(index=[0, 5, 10])

# Rename columns
df = df.rename(columns={'old_name': 'new_name', 'col2': 'column_two'})
df.columns = ['a', 'b', 'c']  # rename all at once

# Replace values
df['status'] = df['status'].replace({'Y': 'Yes', 'N': 'No'})

# Fill missing values
df['col'] = df['col'].fillna(0)
df['col'] = df['col'].fillna(df['col'].mean())
df = df.ffill()   # forward fill (fillna(method='ffill') is deprecated)
df = df.bfill()   # backward fill

# Change data types
df['age'] = df['age'].astype(int)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Apply a function
df['name'] = df['name'].apply(str.upper)
df['category'] = df['score'].apply(lambda x: 'High' if x > 80 else 'Low')

Sorting

| Command | Description |
| --- | --- |
| `df.sort_values('col')` | Sort by column ascending |
| `df.sort_values('col', ascending=False)` | Sort descending |
| `df.sort_values(['a', 'b'], ascending=[True, False])` | Multi-column sort |
| `df.sort_index()` | Sort by row index |
| `df.sort_index(axis=1)` | Sort columns alphabetically |
| `df.nlargest(10, 'col')` | Top 10 rows by column value |
| `df.nsmallest(5, 'col')` | Bottom 5 rows by column value |
| `df.rank()` | Rank values (1 = smallest by default) |
| `df.rank(ascending=False, method='dense')` | Dense rank descending (no gaps) |
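The ranking calls in particular are easy to misread; a minimal sketch on invented data shows how a dense descending rank handles ties:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Alice', 'Cara'], 'score': [80, 95, 80]})

top = df.sort_values('score', ascending=False)        # Alice first
df['rank'] = df['score'].rank(ascending=False, method='dense')
print(df)   # Alice ranks 1.0; Bob and Cara tie at 2.0 with no gap after
```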

Grouping & Aggregation

# Basic groupby
df.groupby('city')['salary'].mean()
df.groupby('dept')['salary'].agg(['mean', 'median', 'count'])

# Multiple groupby columns
df.groupby(['dept', 'role'])['salary'].sum()

# Named aggregations (clean output)
df.groupby('dept').agg(
    avg_salary=('salary', 'mean'),
    headcount=('name', 'count'),
    max_age=('age', 'max')
)

# Custom aggregation functions
df.groupby('dept')['salary'].agg(lambda x: x.max() - x.min())

# transform — returns same-shaped result (for broadcasting)
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')
df['pct_of_dept'] = df['salary'] / df.groupby('dept')['salary'].transform('sum')

# filter — keep groups matching a condition
df.groupby('dept').filter(lambda g: g['salary'].mean() > 50000)

# Pivot table
pd.pivot_table(df,
    values='salary',
    index='dept',
    columns='role',
    aggfunc='mean',
    fill_value=0
)

# Crosstab
pd.crosstab(df['dept'], df['role'], margins=True)

Merging & Joining

# merge — SQL-style joins on columns
pd.merge(df1, df2, on='id')                          # inner join
pd.merge(df1, df2, on='id', how='left')               # left join
pd.merge(df1, df2, on='id', how='outer')              # full outer join
pd.merge(df1, df2, left_on='emp_id', right_on='id')  # different column names

# join — merge on index
df1.join(df2, how='left')
df1.join(df2, on='key_col')  # join df2 index to df1 column

# concat — stack DataFrames
pd.concat([df1, df2])                        # vertical stack (axis=0)
pd.concat([df1, df2], ignore_index=True)     # reset index after concat
pd.concat([df1, df2], axis=1)                # horizontal stack (side by side)

# Merge indicator — see where rows came from
pd.merge(df1, df2, on='id', how='outer', indicator=True)
# _merge column: 'left_only', 'right_only', 'both'

String Operations

| Command | Description |
| --- | --- |
| `df['col'].str.lower()` | Convert to lowercase |
| `df['col'].str.upper()` | Convert to uppercase |
| `df['col'].str.strip()` | Remove leading/trailing whitespace |
| `df['col'].str.replace('a', 'b')` | Replace substring |
| `df['col'].str.contains('pattern')` | Boolean: contains substring/regex |
| `df['col'].str.startswith('A')` | Boolean: starts with prefix |
| `df['col'].str.split(',', expand=True)` | Split into separate columns |
| `df['col'].str.extract(r'(\d+)')` | Extract regex capture group |
| `df['col'].str.len()` | Length of each string |
| `df['col'].str.slice(0, 5)` | Substring by position |
| `df['col'].str.cat(sep=', ')` | Concatenate all values into one string |
| `df['col'].str.pad(10, side='left')` | Pad strings to a width |
| `df['col'].str.get_dummies(sep=',')` | One-hot encode delimited strings |
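Several of the `str` methods above chain naturally; a short sketch on invented data:

```python
import pandas as pd

s = pd.Series(['  Alice ', 'bob', 'CHARLIE'])

print(s.str.strip().str.lower().tolist())   # ['alice', 'bob', 'charlie']
print(s.str.strip().str.len().tolist())     # [5, 3, 7]

tags = pd.Series(['a,b', 'b,c'])
print(tags.str.get_dummies(sep=','))        # one-hot columns a, b, c
```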

DateTime Operations

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Extract components via dt accessor
df['year']    = df['date'].dt.year
df['month']   = df['date'].dt.month
df['day']     = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()     # 'Monday', 'Tuesday', ...
df['hour']    = df['date'].dt.hour
df['quarter'] = df['date'].dt.quarter

# Date arithmetic
df['days_ago'] = pd.Timestamp.now() - df['date']
df['next_week'] = df['date'] + pd.Timedelta(days=7)

# Filter by date range
df[df['date'].between('2024-01-01', '2024-12-31')]

# Resample time series (requires DatetimeIndex)
df.set_index('date').resample('ME').sum()    # monthly totals ('M' before pandas 2.2)
df.set_index('date').resample('W').mean()    # weekly averages

# Rolling window
df['7d_avg'] = df['value'].rolling(window=7).mean()

# Date range generation
dates = pd.date_range('2024-01-01', periods=12, freq='MS')  # month starts

I/O Operations

| Command | Description |
| --- | --- |
| `pd.read_csv('file.csv')` | Read CSV file |
| `pd.read_csv('f.csv', usecols=['a','b'])` | Read only specific columns |
| `pd.read_csv('f.csv', nrows=1000)` | Read first 1000 rows |
| `pd.read_csv('f.csv', dtype={'id': str})` | Specify column dtypes |
| `pd.read_csv('f.csv', parse_dates=['date'])` | Parse date columns |
| `df.to_csv('out.csv', index=False)` | Write CSV (no index column) |
| `pd.read_json('data.json')` | Read JSON file |
| `df.to_json('out.json', orient='records')` | Write JSON as list of records |
| `pd.read_excel('data.xlsx')` | Read Excel file (needs openpyxl) |
| `df.to_excel('out.xlsx', index=False)` | Write Excel file |
| `pd.read_parquet('data.parquet')` | Read Parquet (fast columnar format) |
| `df.to_parquet('out.parquet')` | Write Parquet file |
| `pd.read_clipboard()` | Read data from system clipboard |
| `pd.read_sql(query, connection)` | Read from SQL database |

Common Patterns

# Duplicates
df.duplicated()                        # Boolean series
df.drop_duplicates()                   # Drop duplicate rows (keeps first occurrence)
df.drop_duplicates(subset=['name'])    # Deduplicate by specific columns
df.drop_duplicates(keep='last')        # Keep last occurrence

# Missing data
df.isna().sum()                 # Count NaN per column
df.dropna()                     # Drop rows with any NaN
df.dropna(subset=['age'])       # Drop rows where 'age' is NaN
df.dropna(thresh=3)             # Keep rows with at least 3 non-NaN values

# apply / map
df['col'].apply(lambda x: x ** 2)              # Apply to each element
df.apply(lambda row: row['a'] + row['b'], axis=1)  # Apply to each row
df['grade'] = df['score'].map({90: 'A', 80: 'B', 70: 'C'})

# Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
                          labels=['Youth', 'Young Adult', 'Adult', 'Senior'])

# One-hot encoding
pd.get_dummies(df, columns=['city'], drop_first=True)

# Reset and set index
df = df.reset_index(drop=True)
df = df.set_index('id')

# Chaining operations
result = (
    df
    .query('age > 25')
    .assign(bonus=lambda x: x['salary'] * 0.1)
    .groupby('dept')
    .agg(total_bonus=('bonus', 'sum'))
    .sort_values('total_bonus', ascending=False)
)

# Pipe — pass DataFrame to a function
def clean_data(df):
    return df.dropna().drop_duplicates()

df = df.pipe(clean_data)

Frequently Asked Questions

What is the difference between a Pandas DataFrame and a Series?

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). A DataFrame is a two-dimensional labeled data structure with columns that can be of different types -- essentially a table or spreadsheet. A DataFrame is made up of multiple Series objects, one per column. You can extract a single column from a DataFrame as a Series using df['column_name'].
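In code, the relationship looks like this (toy data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

col = df['name']                 # extracting one column yields a Series
print(type(col).__name__)        # Series
print(type(df).__name__)         # DataFrame
print(col.dtype, df.dtypes.tolist())   # one dtype vs one dtype per column
```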

When should I use loc vs iloc in Pandas?

Use loc for label-based indexing -- it selects rows and columns by their names or boolean conditions (e.g., df.loc['row_label', 'col_name']). Use iloc for integer position-based indexing -- it selects by numerical index positions starting from 0 (e.g., df.iloc[0, 2]). A simple rule: loc uses labels, iloc uses integers.
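One way to see the rule in action is an index whose labels differ from the positions (contrived data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=[100, 200, 300])

print(df.loc[100, 'x'])    # 10 -- 100 is an index label
print(df.iloc[0, 0])       # 10 -- 0 is a position
print(df.loc[df['x'] > 15, 'x'].tolist())   # [20, 30] -- loc accepts boolean masks
```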

How do I handle missing data (NaN) in Pandas?

Pandas provides several methods: df.isna() or df.isnull() detects missing values, df.dropna() removes rows or columns with missing values, and df.fillna(value) replaces NaN with a specified value. You can also use df.ffill() to forward-fill or df.interpolate() for numeric interpolation. Always check with df.isna().sum() first to understand the scope.
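A compact sketch of that workflow (values invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 35], 'city': ['NYC', 'LA', None]})

print(df.isna().sum())                          # age: 1, city: 1
filled = df.fillna({'age': df['age'].mean()})   # mean of [25, 35] is 30
dropped = df.dropna(subset=['age'])             # 2 rows survive
```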

What is the difference between merge, join, and concat in Pandas?

pd.concat() stacks DataFrames vertically (axis=0) or horizontally (axis=1) -- use it to combine datasets with similar structures. df.merge() performs SQL-style joins on columns or indexes using keys (inner, left, right, outer joins). df.join() is a convenience method that merges on indexes by default. Use concat for simple stacking, merge for relational-style joins on specific columns, and join when combining on index values.
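The distinction in one example (toy tables):

```python
import pandas as pd

emp = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Cara']})
dept = pd.DataFrame({'id': [1, 2, 4], 'dept': ['Eng', 'Sales', 'HR']})

merged = emp.merge(dept, on='id', how='left')        # 3 rows; Cara's dept is NaN
stacked = pd.concat([emp, emp], ignore_index=True)   # 6 rows, fresh 0..5 index
```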