Pandas Basics 2
Pandas is an open source library that allows you to explore, analyze and manipulate data. One main use of pandas is to transform our data for easier use with machine learning algorithms.
import pandas as pd
# use .describe() to get summary statistics for the Boston Housing training data
boston_housing.describe()
ID CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000
mean 203.500000 3.827366 11.623153 11.316305 0.076355 0.557776 6.249941 69.139409 3.763943 9.770936 412.322660 18.529310 357.324286 12.984926 22.087192
Pandas makes it easy to check how many rows a dataset has and whether any values are missing. Calling the .info() method on the dataset also reveals the dtype of each column and the range of the index. Note that if there are no missing values, the number of entries will be the same as the non-null count for each column.
boston_housing.info()
RangeIndex: 406 entries, 0 to 405
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 406 non-null int64
1 CRIM 406 non-null float64
2 ZN 406 non-null float64
3 INDUS 406 non-null float64
4 CHAS 406 non-null int64
...
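To make the missing-value check concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its column names are hypothetical and are not part of the Boston Housing data). It counts the missing values per column with `isna().sum()` and compares them with the non-null counts that `.info()` reports.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps; not the Boston Housing set.
toy = pd.DataFrame({
    "rooms": [6.5, np.nan, 5.9, 7.1],
    "age": [65.2, 78.9, np.nan, np.nan],
    "price": [24.0, 21.6, 34.7, 33.4],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Non-null count per column, the same figure .info() shows.
non_null_per_column = toy.count()
print(non_null_per_column)

# A column has no missing values when its non-null count equals the row count.
complete_columns = non_null_per_column[non_null_per_column == len(toy)].index.tolist()
print(complete_columns)
```

Here only the price column is complete, which is why its non-null count matches the number of rows.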
With pandas, calling statistical methods on the dataset becomes straightforward.
# Mean
boston_housing.mean()
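Other summary statistics follow the same pattern as .mean(). The sketch below uses a small made-up DataFrame (the `toy` frame and its values are hypothetical) to show a few of them, both on the whole frame and on a single column.

```python
import pandas as pd

# Hypothetical toy data; not the Boston Housing set.
toy = pd.DataFrame({
    "rooms": [6.0, 7.0, 5.0, 6.0],
    "price": [20.0, 30.0, 10.0, 20.0],
})

col_means = toy.mean()      # mean of every column
col_medians = toy.median()  # median of every column
col_std = toy.std()         # sample standard deviation (ddof=1) of every column

price_max = toy["price"].max()                  # statistic on a single column
rooms_price_corr = toy["rooms"].corr(toy["price"])  # Pearson correlation of two columns

print(col_means)
print(col_medians)
print(col_std)
print(price_max, rooms_price_corr)
```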
There are several ways to select particular data from the dataset.
- head(): shows the first 5 rows by default; pass in an integer to display that many rows
- tail(): shows the last 5 rows by default; pass in an integer to display that many rows
- iloc: retrieves data by integer position, regardless of the index labels
- loc: retrieves data by index label
- crosstab: pd.crosstab can be used to view two columns together and compare them
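As a rough illustration of the selectors listed above, here is a sketch on a small made-up DataFrame whose index labels (10, 20, ...) deliberately differ from the row positions (0, 1, ...). The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; index labels differ from positions on purpose.
toy = pd.DataFrame(
    {"city": ["A", "B", "C", "D", "E", "F"],
     "price": [10, 20, 30, 40, 50, 60]},
    index=[10, 20, 30, 40, 50, 60],
)

first_two = toy.head(2)  # first two rows
last_two = toy.tail(2)   # last two rows

by_label = toy.loc[20, "price"]  # row with index label 20, column "price"
by_position = toy.iloc[1, 1]     # second row, second column (0-based positions)

label_slice = toy.loc[20:40]     # label slices include both endpoints
position_slice = toy.iloc[1:3]   # position slices exclude the end, like Python lists

print(first_two)
print(last_two)
print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note the slicing difference: loc[20:40] returns three rows because the end label is included, while iloc[1:3] returns two.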
# a sample series
sports = pd.Series(['NBA', 'Football', 'Hockey', 'Volleyball'], index=[0, 3, 6, 1])
sports
0           NBA
3      Football
6        Hockey
1    Volleyball
dtype: object
# view the element with index label 1
sports.loc[1]
'Volleyball'
Unlike loc, which looks up the index label, iloc retrieves the element at an exact position. Label 1 sits at position 3, so both calls return the same element.
# view the element at position 3
sports.iloc[3]
'Volleyball'
Use pd.crosstab to compare columns and .groupby to compare a column in context with other columns.
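Since crosstab is mentioned but not shown, here is a minimal sketch of pd.crosstab on a small made-up DataFrame (the `toy` frame and its "river" and "zone" columns are hypothetical). It counts how often each combination of values from the two columns occurs.

```python
import pandas as pd

# Hypothetical toy data; not the Boston Housing set.
toy = pd.DataFrame({
    "river": [0, 0, 0, 1, 1, 0],
    "zone": ["urban", "urban", "rural", "rural", "rural", "urban"],
})

# Rows are the values of "river", columns are the values of "zone",
# and each cell is the number of rows with that combination.
counts = pd.crosstab(toy["river"], toy["zone"])
print(counts)
```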
# group by AGE and take the mean of the other columns (df is the Boston Housing DataFrame)
df.groupby(df['AGE']).mean()
ID CRIM ZN INDUS CHAS NOX RM DIS RAD TAX PTRATIO B LSTAT MEDV
AGE
2.9 378.0 0.1 0.0 6.9 0.0 0.4 6.8 5.7 3.0 233.0 17.9 385.4 4.8 26.6
6.0 275.0 0.1 0.0 12.8 0.0 0.4 6.3 4.3 5.0 398.0 18.7 394.9 6.8 24.1
6.2 68.0 0.2 0.0 10.8 0.0 0.4 6.2 5.3 4.0 305.0 19.2 377.2 7.5 23.4
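Grouping tends to be easier to read when the grouping column has only a few distinct values. As a sketch under that assumption, the example below groups a small made-up DataFrame by a 0/1 column (the `toy` frame and its values are hypothetical; the column names are borrowed from the dataset only for familiarity).

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "CHAS": [0, 1, 0, 1, 0],
    "MEDV": [20.0, 30.0, 22.0, 34.0, 24.0],
    "RM": [6.0, 7.0, 6.2, 7.4, 6.4],
})

# Mean of every other column within each CHAS group.
mean_by_chas = toy.groupby("CHAS").mean()
print(mean_by_chas)

# Several statistics for one column within each group.
medv_summary = toy.groupby("CHAS")["MEDV"].agg(["mean", "count"])
print(medv_summary)
```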
Pandas also allows for quick plotting of the columns in the data.
import matplotlib.pyplot as plt
# tells jupyter notebook to show your plots
%matplotlib inline
df['AGE'].plot()
# Plotting a histogram
df['AGE'].hist()
This is the end of our basics tutorial. There is a lot more to pandas, and I encourage you to take a look at the pandas documentation.