Pandas Basics 2


Pandas is an open source library that allows you to explore, analyze and manipulate data. One main use of pandas is to transform our data for easier use with machine learning algorithms.

import pandas as pd

# boston_housing is assumed to be a DataFrame that already holds the Boston Housing training data
# use .describe() to get summary statistics for every numeric column
boston_housing.describe()
ID    CRIM    ZN    INDUS    CHAS    NOX    RM    AGE    DIS    RAD    TAX    PTRATIO    B    LSTAT    MEDV
count    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000    406.000000
mean    203.500000    3.827366    11.623153    11.316305    0.076355    0.557776    6.249941    69.139409    3.763943    9.770936    412.322660    18.529310    357.324286    12.984926    22.087192
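To make the shape of that output concrete, here is a minimal sketch on a made-up DataFrame. The name `toy` and the columns `rooms` and `price` are hypothetical and are not part of the Boston Housing data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "rooms": [5.0, 6.0, 7.0, 8.0],
    "price": [20.0, 24.0, 30.0, 34.0],
})

# One row per statistic (count, mean, std, min, quartiles, max), one column per numeric column
summary = toy.describe()
print(summary)

# The summary is itself a DataFrame, so a single statistic can be looked up by label
mean_rooms = summary.loc["mean", "rooms"]
max_price = summary.loc["max", "price"]
print(mean_rooms, max_price)
```

Because the result is an ordinary DataFrame, it can be sliced, rounded, or saved like any other table.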

Pandas makes it easy to check how many rows the dataset has and whether any values are missing. Calling the .info() method on the dataset shows the index range, the number of non-null values in each column and the dtype of each column. Note that if there are no missing values, the non-null count of every column equals the total number of entries; a column with a lower count contains missing values.

boston_housing.info()
RangeIndex: 406 entries, 0 to 405
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ID       406 non-null    int64  
 1   CRIM     406 non-null    float64
 2   ZN       406 non-null    float64
 3   INDUS    406 non-null    float64
 4   CHAS     406 non-null    int64 
...
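The Boston Housing columns shown above have no gaps, so here is a small sketch of how a missing value would show up. The DataFrame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the "zn" column
toy = pd.DataFrame({
    "crim": [0.1, 0.2, 0.3],
    "zn": [18.0, np.nan, 12.5],
})

# info() reports 3 entries overall but only 2 non-null values for "zn"
toy.info()

# isna().sum() counts the missing values in each column directly
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole table
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

Counting with `isna().sum()` is often easier to read than comparing the non-null counts by eye, especially on wide tables.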

With pandas, calling statistical methods on the dataset becomes straightforward.

# Mean
boston_housing.mean()
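The same pattern works for other statistics. The sketch below uses a made-up DataFrame (`toy`, `age` and `medv` are illustrative values, not the real data) to show a few of them side by side.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "age": [10.0, 20.0, 60.0],
    "medv": [30.0, 25.0, 20.0],
})

col_means = toy.mean()        # a Series with one mean per column
col_medians = toy.median()    # a Series with one median per column
age_std = toy["age"].std()    # sample standard deviation of a single column

print(col_means)
print(col_medians)
print(age_std)

# If a table also holds text columns, numeric_only=True restricts the
# calculation to the numeric ones
numeric_means = toy.mean(numeric_only=True)
```

Calling a method on the whole DataFrame returns one value per column, while calling it on a single column returns one number.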

There are several ways to select particular data from the dataset.

  • head(): shows the first 5 rows by default; pass an integer to change how many rows are displayed
  • tail(): shows the last 5 rows by default; pass an integer to change how many rows are displayed
  • iloc: selects by integer position (starting at 0), regardless of the index labels
  • loc: selects by index label, which may or may not match the position
  • crosstab: tabulates two or more columns against each other so they can be compared
    # a sample series
    sports = pd.Series(['NBA', 'Football', 'Hockey', 'Volleyball'], index =[0,3,6,1])
    sports
    0           NBA
    3      Football
    6        Hockey
    1    Volleyball
    dtype: object
    
    # select the entry whose index label is 1
    sports.loc[1]
    'Volleyball'
    
    Unlike loc, iloc selects by position: position 3 is the fourth entry, whatever its label.
    sports.iloc[3]
    'Volleyball'
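As a further sketch with the same sample series, the lines below put loc and iloc side by side for the same number, and also show how slices and head()/tail() behave. The variable names other than `sports` are introduced here for illustration.

```python
import pandas as pd

sports = pd.Series(['NBA', 'Football', 'Hockey', 'Volleyball'], index=[0, 3, 6, 1])

by_label = sports.loc[3]       # the entry labelled 3
by_position = sports.iloc[3]   # the fourth entry
print(by_label, by_position)

# Slices differ as well: loc includes the end label, iloc excludes the end position
label_slice = sports.loc[0:6]       # entries from label 0 through label 6
position_slice = sports.iloc[0:2]   # positions 0 and 1 only

# head() and tail() take the number of rows to return
first_two = sports.head(2)
last_one = sports.tail(1)
```

With the index `[0, 3, 6, 1]`, label 3 and position 3 point at different entries, which is why the two accessors should not be used interchangeably.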
    
Use pd.crosstab to compare columns against each other, and .groupby to aggregate the other columns for each value of a chosen column.
# group by age and compare with the mean of other columns
df.groupby(df['AGE']).mean()

ID    CRIM    ZN    INDUS    CHAS    NOX    RM    DIS    RAD    TAX    PTRATIO    B    LSTAT    MEDV
AGE                                                        
2.9    378.0    0.1    0.0    6.9    0.0    0.4    6.8    5.7    3.0    233.0    17.9    385.4    4.8    26.6
6.0    275.0    0.1    0.0    12.8    0.0    0.4    6.3    4.3    5.0    398.0    18.7    394.9    6.8    24.1
6.2    68.0    0.2    0.0    10.8    0.0    0.4    6.2    5.3    4.0    305.0    19.2    377.2    7.5    23.4
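Since crosstab is mentioned but not shown, here is a minimal sketch of it next to groupby. The DataFrame `toy` is made up; its column names only borrow the lowercase spelling of three Boston Housing columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "chas": [0, 0, 1, 1, 0],
    "rad":  [4, 5, 4, 4, 5],
    "medv": [20.0, 22.0, 30.0, 34.0, 24.0],
})

# crosstab counts how often each (chas, rad) combination occurs
counts = pd.crosstab(toy["chas"], toy["rad"])
print(counts)

# groupby aggregates another column within each group
mean_by_chas = toy.groupby("chas")["medv"].mean()
print(mean_by_chas)
```

Grouping works best on columns with a small number of distinct values. Grouping on a continuous column such as AGE tends to produce one group per row.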

Pandas also allows for quick plotting of the columns in the data.

import matplotlib.pyplot as plt
# tells jupyter notebook to show your plots
%matplotlib inline
df['AGE'].plot()

[Figure: line plot of the AGE column (plot.png)]

# Plotting a histogram
df['AGE'].hist()

[Figure: histogram of the AGE column (histo.png)]
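Outside a notebook, the same two plots can be drawn on explicit axes. The sketch below assumes matplotlib is installed and uses made-up values; `toy`, the titles and the bin count are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"age": [2.9, 6.0, 6.2, 45.0, 70.5, 88.0, 95.0, 100.0]})

# Two axes side by side: a line plot and a histogram of the same column
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
toy["age"].plot(ax=ax_line, title="age by row")
toy["age"].hist(ax=ax_hist, bins=5)
ax_hist.set_title("age distribution")
fig.tight_layout()
```

Passing `ax=` keeps both plots in one figure. To see the result, call `fig.savefig(...)` with a file name of your choice, or drop the `matplotlib.use("Agg")` line and call `plt.show()`.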

This is the end of our basics tutorial. There is a lot more to pandas, and I encourage you to take a look at the pandas documentation.