Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

DataFrames from CSV

Import libs

First let’s import both pandas and numpy.

import pandas as pd
import numpy as np

The read_csv method will read in a CSV file and returns to us a DataFrame

CSV read

states = pd.read_csv("data/states.csv")
states
Loading...

King County Home Sales Data Set

This dataset includes the home sales from 2014-2015 in King County, WA (the county Seattle is located in)

  • id - house’s unique id

  • date - sale date

  • price - sale price

  • bedrooms - number of bedrooms

  • bathrooms - numbers of bathrooms

  • sqft_living - living space square footage

  • sqft_lot - total lot square footage

  • floors - numbers of floors

  • waterfront - is the house waterfront (1) or not (0)

  • view - rating from 0 to 4 of how good the view from the house is

  • condition - rating from 1 (poor) to 5 (very good) of the condition of the house

  • grade - rating from 1-13 representing the construction quality of improvements. 1-3 Falls short of minimum building standards (cabins, etc.) 7 is avg grade, 11-13 have high-quality design & construction

  • sqft_above - square footage of the interior that is above ground level

  • sqft_basement - square footage of the interior that is below ground level

  • yr_built - year the house was initially built

  • yr_renovated - The year of the house’s last renovation (if any)

  • zipcode - zipcode that the house is located in

  • lat - the property’s latitude

  • long - the property’s longitude

  • sqft_living15 - average interior space square footage of the nearest 15 neighbors

  • sqft_lot15 - average lot square footage of the nearest 15 neighbors

houses = pd.read_csv("data/kc_house_data.csv")

Printing data

houses.head()
Loading...
houses
Loading...
type(houses)
pandas.core.frame.DataFrame
houses.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'], dtype='object')
states.columns
Index(['State', 'Abbrev', 'Code'], dtype='object')
len(houses)
21613
houses.shape
(21613, 21)
houses.size
453873
# Defines how many hourses to be rendered while printing
pd.options.display.min_rows = 15
houses
Loading...

The head() method returns the first n rows in a NEW DataFrame

first_5 = houses.head()
first_5
Loading...
type(first_5)
pandas.core.frame.DataFrame
houses.head(200)
Loading...

The tail() method is like head() but it works from the END of a DataFrame. By default it returns the last 5 rows in a new DataFrame.

houses.tail()
Loading...
houses.tail(9)
Loading...
houses.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
states.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   State   51 non-null     object
 1   Abbrev  51 non-null     object
 2   Code    51 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB
houses.dtypes
id int64 date object price float64 bedrooms int64 bathrooms float64 sqft_living int64 sqft_lot int64 floors float64 waterfront int64 view int64 condition int64 grade int64 sqft_above int64 sqft_basement int64 yr_built int64 yr_renovated int64 zipcode int64 lat float64 long float64 sqft_living15 int64 sqft_lot15 int64 dtype: object
states.dtypes
State object Abbrev object Code object dtype: object
houses.describe()
Loading...
states.describe()
Loading...

Introducing The Titanic Dataset

titanic = pd.read_csv("data/titanic.csv")
titanic.head()
Loading...
titanic.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], dtype='object')
  • pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

  • survived - Survival (0 = No; 1 = Yes)

  • name - Name

  • sex - Sex

  • age - Age

  • sibsp - Number of Siblings/Spouses Aboard

  • parch - Number of Parents/Children Aboard

  • ticket - Ticket Number

  • fare - Passenger Fare

  • cabin - Cabin

  • embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

  • boat - Lifeboat (if survived)

  • body - Body number (if did not survive and body was recovered)

  • home.dest - Home/Destination

titanic.tail(10)
Loading...
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB
titanic.head()
Loading...

read_csv with non-comma separators

The netflix_titles.csv file is actually pipe (|) separated. We can tell read_csv to expect this using the sep="|" option.

Additionally, the netflix dataset includes an index column as the very first column. We can let read_csv know using index_col=0

# If we try without defining `index_col`, then,
pd.read_csv("data/netflix_titles.csv", sep="|")
Loading...
netflix = pd.read_csv("data/netflix_titles.csv", sep="|", index_col=0)
netflix.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 894.5+ KB
netflix.head()
Loading...

Providing Column Names

We can provide a list of custom column names to read_csv using the names parameter. Additionally, we need to specify header=0 to tell read_csv that the file still contains the original headers on the first line and that it should ignore them!

names = ['sumlev', 'region', 'division', 'state', 'name', 'census2010pop', 'estimatesbase2010', 'popestimate2010', 'popestimate2011', 'popestimate2012', 'popestimate2013', 'popestimate2014', 'popestimate2015', 'popestimate2016', 'popestimate2017', 'popestimate2018', 'popestimate2019', 'popestimate042020', 'popestimate2020']
state_pops = pd.read_csv("data/nst-est2020.csv", names=names, header=0)
state_pops
Loading...