Useful Pandas functions
- `pd.read_csv('file.csv')` for loading a `DataFrame` from a CSV
- `apply`
  - `df.apply(f)` applies `f` to every column of `df` by default (each column is passed as a `Series`); use `df.apply(f, axis=1)` to apply `f` to every row instead
  - `df["col"].apply(f)` applies `f` to every individual value inside `df["col"]`
- `groupby`: `df.groupby("col")` or `df.groupby(fn)`, where `fn` is called on each index label to pick its group
- Filtering: Pandas allows filtering with boolean masks, like APL
  - `df[<some boolean Series>]` selects only the rows of `df` where the corresponding item in `<some boolean Series>` is true
    - e.g. `df[df["Age"] > 30]` to select only rows where people’s ages are greater than 30
  - Use `&` and `|` for multiple conditions, with each condition in parentheses, e.g. `df[(df["Age"] > 30) & (df["Age"] < 50)]`
- `value_counts`: Both `DataFrame` and `Series` have `value_counts()` methods
  - With a `Series`, it returns another `Series` containing counts of unique values (with the index being the original values)
  - With a `DataFrame`, it returns a `Series` containing the count of each unique row
- `astype` (can be applied to both `DataFrame` and `Series`)
  - The data type given has to be a NumPy dtype, a Python type, or a pandas dtype such as `'category'`
    - e.g. `df.astype('int32')` to turn all columns into `int32`
    - e.g. `df.astype({'col1': 'int32'})` to turn only `col1` into `int32`
    - e.g. `ser.astype('category')` to turn into categorical type
- `unique()` / `drop_duplicates()`: `ser.unique()` returns the distinct values of a `Series`; `drop_duplicates()` drops duplicate rows
- `dropna()` to drop any rows with any `NaN` values
- `fillna('foo')` to replace `NaN` values with `'foo'`
- `len(df)` to get the number of rows
- `hist()` can generate a histogram of a `Series`
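
The functions above can be tried together on a small table. This is a minimal sketch, not a reference: the CSV text, the column names (`Name`, `Age`, `City`), and the variable names are all made up for illustration, and the CSV is read from an in-memory string rather than a real file. `hist()` is left out because it needs matplotlib installed.

```python
import io

import pandas as pd

# Hypothetical data: four people, one missing city, one duplicated row.
csv_text = """Name,Age,City
Ann,25,Oslo
Bob,41,Rome
Cy,35,
Bob,41,Rome
"""
df = pd.read_csv(io.StringIO(csv_text))

# apply: the default (axis=0) passes each column; axis=1 passes each row.
col_max = df[["Age"]].apply(lambda col: col.max())
labels = df.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)
doubled = df["Age"].apply(lambda age: age * 2)

# Filtering with a boolean mask; each condition gets its own parentheses.
over_30 = df[df["Age"] > 30]
thirties = df[(df["Age"] > 30) & (df["Age"] < 40)]

# groupby on a column name, followed by an aggregation.
mean_age_by_city = df.groupby("City")["Age"].mean()

# value_counts on a Series (per value) and on a DataFrame (per row).
name_counts = df["Name"].value_counts()
row_counts = df.value_counts()

# astype with a NumPy dtype name and with the pandas categorical dtype.
ages32 = df.astype({"Age": "int32"})
city_cat = df["City"].astype("category")

# Distinct values, duplicate rows, and missing values.
names = df["Name"].unique()
deduped = df.drop_duplicates()
complete = df.dropna()
filled = df.fillna("foo")

print(len(df), len(over_30), len(deduped), len(complete))
```

Each call returns a new object, so `df` itself still has all four rows at the end.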