Pandas Cheat Sheet
Essential commands for Python Pandas: DataFrames, Series, data manipulation, filtering, grouping, merging, pivoting, and I/O operations. Quick reference for data analysis.
Creating DataFrames
import pandas as pd
import numpy as np
# From a dictionary
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago']
})
# From a list of lists
df = pd.DataFrame(
[['Alice', 25], ['Bob', 30]],
columns=['name', 'age']
)
# From a list of dicts
df = pd.DataFrame([
{'name': 'Alice', 'age': 25},
{'name': 'Bob', 'age': 30}
])
# From a NumPy array
df = pd.DataFrame(
np.random.randn(5, 3),
columns=['a', 'b', 'c']
)
# From files
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_sql('SELECT * FROM users', connection)
# Create a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')
Inspecting Data
| Command | Description |
|---|---|
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.shape | Tuple of (rows, columns), e.g. (100, 5) |
| df.info() | Column names, dtypes, non-null counts, memory usage |
| df.describe() | Summary statistics (count, mean, std, min, max, quartiles) |
| df.dtypes | Data type of each column |
| df.columns | Index of column names |
| df.index | Index (row labels) |
| df.values | Underlying NumPy array (prefer df.to_numpy()) |
| df.nunique() | Number of unique values per column |
| df['col'].value_counts() | Frequency count of unique values |
| df.sample(n) | Random sample of n rows |
| df.memory_usage(deep=True) | Memory usage per column in bytes |
| len(df) | Number of rows |
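A quick first-look sequence putting these together (a minimal sketch, assuming the data.csv and 'city' column from the examples above):
# Load and take a first look
df = pd.read_csv('data.csv')
print(df.shape)                   # (rows, columns)
df.info()                         # dtypes and non-null counts
print(df.describe())              # numeric summary statistics
print(df['city'].value_counts())  # frequency of each city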
Selecting Data
| Command | Description |
|---|---|
| df['col'] | Select single column (returns Series) |
| df[['col1', 'col2']] | Select multiple columns (returns DataFrame) |
| df.loc[label] | Select row by label |
| df.loc[label, 'col'] | Select specific value by row label and column name |
| df.loc['a':'c'] | Slice rows by label (inclusive on both ends) |
| df.loc[:, 'col1':'col3'] | Slice columns by label |
| df.iloc[0] | Select row by integer position |
| df.iloc[0:3] | Slice rows by position (exclusive end) |
| df.iloc[0, 2] | Select value by row and column position |
| df.iloc[:, 1:4] | Slice columns by position |
| df.at['row', 'col'] | Fast scalar access by label |
| df.iat[0, 2] | Fast scalar access by position |
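To make the loc/iloc contrast concrete, a small sketch with an illustrative labeled index:
df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]},
    index=['a', 'b', 'c']
)
df.loc['b', 'age']   # 30 -- by label
df.iloc[1, 1]        # 30 -- by position
df.loc['a':'b']      # rows 'a' and 'b' (label slice keeps both ends)
df.iloc[0:2]         # rows at positions 0 and 1 (end excluded)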
Filtering
# Boolean condition
df[df['age'] > 30]
# Multiple conditions (use & for AND, | for OR, ~ for NOT)
df[(df['age'] > 25) & (df['city'] == 'NYC')]
df[(df['age'] < 20) | (df['age'] > 60)]
df[~df['city'].isin(['LA', 'Chicago'])]
# isin — match against a list
df[df['city'].isin(['NYC', 'LA', 'Chicago'])]
# between — inclusive range
df[df['age'].between(25, 35)]
# query — string expression (cleaner syntax)
df.query('age > 25 and city == "NYC"')
df.query('age in [25, 30, 35]')
# String filtering
df[df['name'].str.contains('ali', case=False)]
df[df['name'].str.startswith('A')]
df[df['email'].str.endswith('.com')]
df[df['name'].str.match(r'^[A-C]')] # regex match
# Null filtering
df[df['col'].isna()] # rows where col is NaN
df[df['col'].notna()] # rows where col is not NaN
# nlargest / nsmallest (filter + sort in one step)
df.nlargest(10, 'salary')
df.nsmallest(5, 'age')
Modifying Data
# Add a new column
df['full_name'] = df['first'] + ' ' + df['last']
df['tax'] = df['salary'] * 0.2
# Conditional column (np.where acts as a vectorized if/else)
df['senior'] = df['age'] >= 60  # boolean column
df['level'] = np.where(df['age'] >= 60, 'senior', 'regular')
# assign — returns a new DataFrame (chainable)
df = df.assign(bonus=df['salary'] * 0.1, active=True)
# Drop columns
df = df.drop(columns=['temp_col', 'unused'])
# Drop rows by index
df = df.drop(index=[0, 5, 10])
# Rename columns
df = df.rename(columns={'old_name': 'new_name', 'col2': 'column_two'})
df.columns = ['a', 'b', 'c'] # rename all at once
# Replace values
df['status'] = df['status'].replace({'Y': 'Yes', 'N': 'No'})
# Fill missing values
df['col'] = df['col'].fillna(0)
df['col'] = df['col'].fillna(df['col'].mean())
df = df.ffill() # forward fill (fillna(method='ffill') is deprecated)
df = df.bfill() # backward fill
# Change data types
df['age'] = df['age'].astype(int)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Apply a function
df['name'] = df['name'].apply(str.upper)
df['category'] = df['score'].apply(lambda x: 'High' if x > 80 else 'Low')
Sorting
| Command | Description |
|---|---|
| df.sort_values('col') | Sort by column ascending |
| df.sort_values('col', ascending=False) | Sort descending |
| df.sort_values(['a', 'b'], ascending=[True, False]) | Multi-column sort |
| df.sort_index() | Sort by row index |
| df.sort_index(axis=1) | Sort columns alphabetically |
| df.nlargest(10, 'col') | Top 10 rows by column value |
| df.nsmallest(5, 'col') | Bottom 5 rows by column value |
| df.rank() | Rank values (1 = smallest by default) |
| df.rank(ascending=False, method='dense') | Dense rank descending (no gaps) |
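For example, a multi-column sort combined with a dense rank (a sketch reusing the 'dept' and 'salary' columns from elsewhere on this page):
# Department ascending, salary descending within each department
df = df.sort_values(['dept', 'salary'], ascending=[True, False])
# Highest salary gets rank 1; ties share a rank with no gaps
df['salary_rank'] = df['salary'].rank(ascending=False, method='dense')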
Grouping & Aggregation
# Basic groupby
df.groupby('city')['salary'].mean()
df.groupby('dept')['salary'].agg(['mean', 'median', 'count'])
# Multiple groupby columns
df.groupby(['dept', 'role'])['salary'].sum()
# Named aggregations (clean output)
df.groupby('dept').agg(
avg_salary=('salary', 'mean'),
headcount=('name', 'count'),
max_age=('age', 'max')
)
# Custom aggregation functions
df.groupby('dept')['salary'].agg(lambda x: x.max() - x.min())
# transform — returns same-shaped result (for broadcasting)
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')
df['pct_of_dept'] = df['salary'] / df.groupby('dept')['salary'].transform('sum')
# filter — keep groups matching a condition
df.groupby('dept').filter(lambda g: g['salary'].mean() > 50000)
# Pivot table
pd.pivot_table(df,
values='salary',
index='dept',
columns='role',
aggfunc='mean',
fill_value=0
)
# Crosstab
pd.crosstab(df['dept'], df['role'], margins=True)
Merging & Joining
# merge — SQL-style joins on columns
pd.merge(df1, df2, on='id') # inner join
pd.merge(df1, df2, on='id', how='left') # left join
pd.merge(df1, df2, on='id', how='outer') # full outer join
pd.merge(df1, df2, left_on='emp_id', right_on='id') # different column names
# join — merge on index
df1.join(df2, how='left')
df1.join(df2, on='key_col') # join df2 index to df1 column
# concat — stack DataFrames
pd.concat([df1, df2]) # vertical stack (axis=0)
pd.concat([df1, df2], ignore_index=True) # reset index after concat
pd.concat([df1, df2], axis=1) # horizontal stack (side by side)
# Merge indicator — see where rows came from
pd.merge(df1, df2, on='id', how='outer', indicator=True)
# _merge column: 'left_only', 'right_only', 'both'
String Operations
| Command | Description |
|---|---|
| df['col'].str.lower() | Convert to lowercase |
| df['col'].str.upper() | Convert to uppercase |
| df['col'].str.strip() | Remove leading/trailing whitespace |
| df['col'].str.replace('a', 'b') | Replace substring |
| df['col'].str.contains('pattern') | Boolean: contains substring/regex |
| df['col'].str.startswith('A') | Boolean: starts with prefix |
| df['col'].str.split(',', expand=True) | Split into separate columns |
| df['col'].str.extract(r'(\d+)') | Extract regex capture group |
| df['col'].str.len() | Length of each string |
| df['col'].str.slice(0, 5) | Substring by position |
| df['col'].str.cat(sep=', ') | Concatenate all values into one string |
| df['col'].str.pad(10, side='left') | Pad strings to a width |
| df['col'].str.get_dummies(sep=',') | One-hot encode delimited strings |
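A sketch chaining several of these; 'full_name' and 'code' are hypothetical columns, and the split assumes each value contains at least one space:
# Normalize a messy text column
df['name'] = df['name'].str.strip().str.title()
# Split 'full_name' on the first space into two columns
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)
# Pull the digits out of a code like 'ID-1234'
df['id_num'] = df['code'].str.extract(r'(\d+)')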
DateTime Operations
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# Extract components via dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name() # 'Monday', 'Tuesday', ...
df['hour'] = df['date'].dt.hour
df['quarter'] = df['date'].dt.quarter
# Date arithmetic
df['days_ago'] = (pd.Timestamp.now() - df['date']).dt.days
df['next_week'] = df['date'] + pd.Timedelta(days=7)
# Filter by date range
df[df['date'].between('2024-01-01', '2024-12-31')]
# Resample time series (requires DatetimeIndex)
df.set_index('date').resample('ME').sum() # monthly totals ('ME' replaces 'M' in pandas 2.2+)
df.set_index('date').resample('W').mean() # weekly averages
# Rolling window
df['7d_avg'] = df['value'].rolling(window=7).mean()
# Date range generation
dates = pd.date_range('2024-01-01', periods=12, freq='MS') # month starts
I/O Operations
| Command | Description |
|---|---|
| pd.read_csv('file.csv') | Read CSV file |
| pd.read_csv('f.csv', usecols=['a','b']) | Read only specific columns |
| pd.read_csv('f.csv', nrows=1000) | Read first 1000 rows |
| pd.read_csv('f.csv', dtype={'id': str}) | Specify column dtypes |
| pd.read_csv('f.csv', parse_dates=['date']) | Parse date columns |
| df.to_csv('out.csv', index=False) | Write CSV (no index column) |
| pd.read_json('data.json') | Read JSON file |
| df.to_json('out.json', orient='records') | Write JSON as list of records |
| pd.read_excel('data.xlsx') | Read Excel file (needs openpyxl) |
| df.to_excel('out.xlsx', index=False) | Write Excel file |
| pd.read_parquet('data.parquet') | Read Parquet (fast columnar format; needs pyarrow or fastparquet) |
| df.to_parquet('out.parquet') | Write Parquet file |
| pd.read_clipboard() | Read data from system clipboard |
| pd.read_sql(query, connection) | Read from SQL database |
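A minimal read-process-write round trip (file and column names are illustrative):
# Read only the needed columns with proper dtypes, then write results
df = pd.read_csv('sales.csv', usecols=['date', 'region', 'amount'],
                 parse_dates=['date'], dtype={'region': str})
summary = df.groupby('region')['amount'].sum().reset_index()
summary.to_csv('summary.csv', index=False)
summary.to_parquet('summary.parquet')  # needs pyarrow or fastparquet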
Common Patterns
# Duplicates
df.duplicated() # Boolean series
df.drop_duplicates() # Remove all duplicate rows
df.drop_duplicates(subset=['name']) # Deduplicate by specific columns
df.drop_duplicates(keep='last') # Keep last occurrence
# Missing data
df.isna().sum() # Count NaN per column
df.dropna() # Drop rows with any NaN
df.dropna(subset=['age']) # Drop rows where 'age' is NaN
df.dropna(thresh=3) # Keep rows with at least 3 non-NaN values
# apply / map
df['col'].apply(lambda x: x ** 2) # Apply to each element
df.apply(lambda row: row['a'] + row['b'], axis=1) # Apply to each row
df['grade'] = df['score'].map({90: 'A', 80: 'B', 70: 'C'}) # exact-match lookup; unmapped values become NaN
# Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
labels=['Youth', 'Young Adult', 'Adult', 'Senior'])
# One-hot encoding
pd.get_dummies(df, columns=['city'], drop_first=True)
# Reset and set index
df = df.reset_index(drop=True)
df = df.set_index('id')
# Chaining operations
result = (
df
.query('age > 25')
.assign(bonus=lambda x: x['salary'] * 0.1)
.groupby('dept')
.agg(total_bonus=('bonus', 'sum'))
.sort_values('total_bonus', ascending=False)
)
# Pipe — pass DataFrame to a function
def clean_data(df):
return df.dropna().drop_duplicates()
df = df.pipe(clean_data)
Frequently Asked Questions
What is the difference between a Pandas DataFrame and a Series?
A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). A DataFrame is a two-dimensional labeled data structure with columns that can be of different types -- essentially a table or spreadsheet. A DataFrame is made up of multiple Series objects, one per column. You can extract a single column from a DataFrame as a Series using df['column_name'].
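A minimal illustration using the example df from the top of this page:
ages = df['age']       # Series: one-dimensional, single dtype
subset = df[['age']]   # DataFrame: note the double brackets
type(ages)             # pandas.core.series.Series
type(subset)           # pandas.core.frame.DataFrame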
When should I use loc vs iloc in Pandas?
Use loc for label-based indexing -- it selects rows and columns by their names or boolean conditions (e.g., df.loc['row_label', 'col_name']). Use iloc for integer position-based indexing -- it selects by numerical index positions starting from 0 (e.g., df.iloc[0, 2]). A simple rule: loc uses labels, iloc uses integers.
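loc also accepts boolean masks, which is where it differs most visibly from iloc in practice (a sketch using the example columns above):
df.loc[df['age'] > 28, 'name']               # names of people older than 28
df.loc[df['city'] == 'NYC', ['name', 'age']] # subset of rows and columns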
How do I handle missing data (NaN) in Pandas?
Pandas provides several methods: df.isna() or df.isnull() detects missing values, df.dropna() removes rows or columns with missing values, and df.fillna(value) replaces NaN with a specified value. You can also use df.ffill() to forward-fill (the older fillna(method='ffill') is deprecated) or df.interpolate() for numeric interpolation. Always check with df.isna().sum() first to understand the scope.
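A typical inspect-then-fill sequence (column names are illustrative):
df.isna().sum()                                   # NaN count per column
df['age'] = df['age'].fillna(df['age'].median())  # fill with a statistic
df = df.dropna(subset=['name'])                   # drop rows missing a key field
df['price'] = df['price'].interpolate()           # numeric interpolation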
What is the difference between merge, join, and concat in Pandas?
pd.concat() stacks DataFrames vertically (axis=0) or horizontally (axis=1) -- use it to combine datasets with similar structures. df.merge() performs SQL-style joins on columns or indexes using keys (inner, left, right, outer joins). df.join() is a convenience method that merges on indexes by default. Use concat for simple stacking, merge for relational-style joins on specific columns, and join when combining on index values.
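A side-by-side sketch with two hypothetical frames:
employees = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
salaries = pd.DataFrame({'id': [1, 2], 'salary': [70000, 80000]})
pd.merge(employees, salaries, on='id')                     # relational join on a key
pd.concat([employees, employees], ignore_index=True)       # stack rows
employees.set_index('id').join(salaries.set_index('id'))   # combine on index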