DataFrames from CSV - Python Notebook

Import libs¶

First let’s import both pandas and numpy.

import pandas as pd
import numpy as np

The read_csv method will read in a CSV file and returns to us a DataFrame

CSV read¶

states = pd.read_csv("data/states.csv")
states

King County Home Sales Data Set¶

This dataset includes the home sales from 2014-2015 in King County, WA (the county Seattle is located in)

id - house’s unique id
date - sale date
price - sale price
bedrooms - number of bedrooms
bathrooms - numbers of bathrooms
sqft_living - living space square footage
sqft_lot - total lot square footage
floors - numbers of floors
waterfront - is the house waterfront (1) or not (0)
view - rating from 0 to 4 of how good the view from the house is
condition - rating from 1 (poor) to 5 (very good) of the condition of the house
grade - rating from 1-13 representing the construction quality of improvements. 1-3 Falls short of minimum building standards (cabins, etc.) 7 is avg grade, 11-13 have high-quality design & construction
sqft_above - square footage of the interior that is above ground level
sqft_basement - square footage of the interior that is below ground level
yr_built - year the house was initially built
yr_renovated - The year of the house’s last renovation (if any)
zipcode - zipcode that the house is located in
lat - the property’s latitude
long - the property’s longitude
sqft_living15 - average interior space square footage of the nearest 15 neighbors
sqft_lot15 - average lot square footage of the nearest 15 neighbors

houses = pd.read_csv("data/kc_house_data.csv")

Printing data¶

houses.head()

houses

type(houses)

pandas.core.frame.DataFrame

API reference of dataframe

https://pandas.pydata.org/docs/reference/frame.html

houses.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

states.columns

Index(['State', 'Abbrev', 'Code'], dtype='object')

len(houses)

21613

houses.shape

(21613, 21)

houses.size

453873

# Defines how many hourses to be rendered while printing
pd.options.display.min_rows = 15

houses

The head() method returns the first n rows in a NEW DataFrame

first_5 = houses.head()
first_5

type(first_5)

pandas.core.frame.DataFrame

houses.head(200)

The tail() method is like head() but it works from the END of a DataFrame. By default it returns the last 5 rows in a new DataFrame.

houses.tail()

houses.tail(9)

houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB

states.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   State   51 non-null     object
 1   Abbrev  51 non-null     object
 2   Code    51 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB

houses.dtypes

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

states.dtypes

State     object
Abbrev    object
Code      object
dtype: object

houses.describe()

states.describe()

Introducing The Titanic Dataset¶

titanic = pd.read_csv("data/titanic.csv")
titanic.head()

titanic.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survived - Survival (0 = No; 1 = Yes)
name - Name
sex - Sex
age - Age
sibsp - Number of Siblings/Spouses Aboard
parch - Number of Parents/Children Aboard
ticket - Ticket Number
fare - Passenger Fare
cabin - Cabin
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat - Lifeboat (if survived)
body - Body number (if did not survive and body was recovered)
home.dest - Home/Destination

titanic.tail(10)

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB

titanic.head()

read_csv with non-comma separators¶

The netflix_titles.csv file is actually pipe (|) separated. We can tell read_csv to expect this using the sep="|" option.

Additionally, the netflix dataset includes an index column as the very first column. We can let read_csv know using index_col=0

# If we try without defining `index_col`, then,
pd.read_csv("data/netflix_titles.csv", sep="|")

netflix = pd.read_csv("data/netflix_titles.csv", sep="|", index_col=0)

netflix.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 894.5+ KB

netflix.head()

Providing Column Names¶

We can provide a list of custom column names to read_csv using the names parameter. Additionally, we need to specify header=0 to tell read_csv that the file still contains the original headers on the first line and that it should ignore them!

names = ['sumlev', 'region', 'division', 'state', 'name', 'census2010pop', 'estimatesbase2010', 'popestimate2010', 'popestimate2011', 'popestimate2012', 'popestimate2013', 'popestimate2014', 'popestimate2015', 'popestimate2016', 'popestimate2017', 'popestimate2018', 'popestimate2019', 'popestimate042020', 'popestimate2020']
state_pops = pd.read_csv("data/nst-est2020.csv", names=names, header=0)
state_pops