Useful Pandas functions

- `pd.read_csv('file.csv')` for loading a `DataFrame` from a CSV
- `apply`
  - `df.apply(f)` applies `f` to every column (a `Series` object) of `df` by default; pass `axis=1` to apply `f` to every row instead
  - `df["col"].apply(f)` applies `f` to every individual value inside `df["col"]`
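
A minimal sketch of the `apply` forms, using a made-up two-column frame (the column names `a` and `b` are illustrative assumptions). Note that `DataFrame.apply` hands the function one column at a time unless `axis=1` is passed, in which case it hands over one row at a time.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default (axis=0): the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function receives each individual value.
doubled = df["a"].apply(lambda x: x * 2)
```
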
- `groupby`: `df.groupby("col")` or `df.groupby(fn)`
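
The two `groupby` forms might look like this in practice. The data, index labels, and the choice of `sum()` as the aggregation are assumptions for illustration. When a function is passed, pandas calls it on each index label and groups rows by the returned value.

```python
import pandas as pd

# Hypothetical sales data with string index labels.
df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Oslo", "Rome"], "sales": [1, 2, 3, 4]},
    index=["a1", "a2", "b1", "b2"],
)

# Group by the values in a column.
by_city = df.groupby("city")["sales"].sum()

# Group by a function: it is called on each index label,
# and rows whose labels map to the same result share a group.
by_prefix = df.groupby(lambda label: label[0])["sales"].sum()
```
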
- Filtering: Pandas allows filtering with masks, like APL
  - `df[<some boolean Series>]` will select only rows of `df` where the corresponding item in `<some boolean Series>` is true
  - e.g. `df[df["Age"] > 30]` to only select rows where people’s ages are greater than 30
  - Use `&` and `|` for multiple conditions
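
A short sketch of mask filtering, with invented `Name`, `Age`, and `City` columns. One detail worth keeping in mind: `&` and `|` bind more tightly than comparison operators in Python, so each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cy"],
        "Age": [25, 35, 45],
        "City": ["Oslo", "Rome", "Oslo"],
    }
)

# A comparison on a column produces a boolean Series (the mask).
mask = df["Age"] > 30
over_30 = df[mask]

# Combine conditions with & (and) and | (or); parenthesize each one.
over_30_in_oslo = df[(df["Age"] > 30) & (df["City"] == "Oslo")]
young_or_rome = df[(df["Age"] < 30) | (df["City"] == "Rome")]
```
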
- `value_counts`: Both `DataFrame` and `Series` have `value_counts()` methods
  - With a `Series`, it returns another `Series` containing counts of unique values (with the index being the original values)
  - With a `DataFrame`, it returns a `Series` containing the count of each unique row
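
To make the difference between the two `value_counts()` variants concrete, here is a small sketch on invented data. For the `DataFrame` case, the resulting `Series` is indexed by a `MultiIndex` of the row values, so individual counts can be looked up with a tuple.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})

# Series.value_counts: index holds the original values, data holds the counts.
color_counts = df["color"].value_counts()

# DataFrame.value_counts: one entry per unique row, keyed by (color, size).
row_counts = df.value_counts()
```
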
- `astype` (can be applied to both `DataFrame` and `Series`)
  - The data type given can be a NumPy dtype, a Python type, or a pandas type such as `'category'`
  - e.g. `df.astype('int32')` to turn all columns into `int32`
  - e.g. `df.astype({'col1': 'int32'})` to turn only `col1` into `int32`
  - e.g. `ser.astype('category')` to turn into categorical type
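
The three `astype` calls above can be tried on a small frame like the following. The column `col2` and the sample values are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical example data: one integer column, one float column.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})

# Convert every column to int32.
all_int32 = df.astype("int32")

# Convert only col1; col2 keeps its original dtype.
only_col1 = df.astype({"col1": "int32"})

# Convert a Series of repeated strings to the categorical type.
ser = pd.Series(["low", "high", "low"])
cat = ser.astype("category")
```
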
- `drop_duplicates()` to drop duplicate rows (on a `Series`, `unique()` returns the distinct values)
- `dropna()` to drop any rows with any `NaN` values
- `fillna('foo')` to replace `NaN` values with `'foo'`
- `len(df)` to get the number of rows
- `hist()` can generate a histogram of a `Series`
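
A brief sketch of the cleanup helpers on invented data with one duplicated row and two missing values. `fillna` is applied to the string column only here, as an illustrative choice, so that the fill value matches the column's type.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: row 1 duplicates row 0; rows 2 and 3 have gaps.
df = pd.DataFrame(
    {
        "name": ["Ann", "Ann", "Bob", None],
        "score": [1.0, 1.0, np.nan, 4.0],
    }
)

deduped = df.drop_duplicates()        # drops the repeated ("Ann", 1.0) row
distinct = df["score"].unique()       # distinct values of one column, NaN included
complete = df.dropna()                # keeps only rows with no missing values
filled = df["name"].fillna("foo")     # replaces the missing name with "foo"
n_rows = len(df)                      # number of rows in the original frame
```
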