Import libs¶
First let’s import both pandas and numpy.
import pandas as pd
import numpy as npThe read_csv method will read in a CSV file and returns to us a DataFrame
CSV read¶
states = pd.read_csv("data/states.csv")
statesKing County Home Sales Data Set¶
This dataset includes the home sales from 2014-2015 in King County, WA (the county Seattle is located in)
id- house’s unique iddate- sale dateprice- sale pricebedrooms- number of bedroomsbathrooms- numbers of bathroomssqft_living- living space square footagesqft_lot- total lot square footagefloors- numbers of floorswaterfront- is the house waterfront (1) or not (0)view- rating from 0 to 4 of how good the view from the house iscondition- rating from 1 (poor) to 5 (very good) of the condition of the housegrade- rating from 1-13 representing the construction quality of improvements. 1-3 Falls short of minimum building standards (cabins, etc.) 7 is avg grade, 11-13 have high-quality design & constructionsqft_above- square footage of the interior that is above ground levelsqft_basement- square footage of the interior that is below ground levelyr_built- year the house was initially builtyr_renovated- The year of the house’s last renovation (if any)zipcode- zipcode that the house is located inlat- the property’s latitudelong- the property’s longitudesqft_living15- average interior space square footage of the nearest 15 neighborssqft_lot15- average lot square footage of the nearest 15 neighbors
houses = pd.read_csv("data/kc_house_data.csv")Printing data¶
houses.head()housestype(houses)pandas.core.frame.DataFrameAPI reference of dataframe
houses.columnsIndex(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15'],
dtype='object')states.columnsIndex(['State', 'Abbrev', 'Code'], dtype='object')len(houses)21613houses.shape(21613, 21)houses.size453873# Defines how many hourses to be rendered while printing
pd.options.display.min_rows = 15housesThe head() method returns the first n rows in a NEW DataFrame
first_5 = houses.head()
first_5type(first_5)pandas.core.frame.DataFramehouses.head(200)The tail() method is like head() but it works from the END of a DataFrame. By default it returns the last 5 rows in a new DataFrame.
houses.tail()houses.tail(9)houses.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
states.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State 51 non-null object
1 Abbrev 51 non-null object
2 Code 51 non-null object
dtypes: object(3)
memory usage: 1.3+ KB
houses.dtypesid int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
floors float64
waterfront int64
view int64
condition int64
grade int64
sqft_above int64
sqft_basement int64
yr_built int64
yr_renovated int64
zipcode int64
lat float64
long float64
sqft_living15 int64
sqft_lot15 int64
dtype: objectstates.dtypesState object
Abbrev object
Code object
dtype: objecthouses.describe()states.describe()Introducing The Titanic Dataset¶
titanic = pd.read_csv("data/titanic.csv")
titanic.head()titanic.columnsIndex(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')pclass- Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)survived- Survival (0 = No; 1 = Yes)name- Namesex- Sexage- Agesibsp- Number of Siblings/Spouses Aboardparch- Number of Parents/Children Aboardticket- Ticket Numberfare- Passenger Farecabin- Cabinembarked- Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)boat- Lifeboat (if survived)body- Body number (if did not survive and body was recovered)home.dest- Home/Destination
titanic.tail(10)titanic.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1309 non-null object
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1309 non-null object
9 cabin 1309 non-null object
10 embarked 1309 non-null object
11 boat 1309 non-null object
12 body 1309 non-null object
13 home.dest 1309 non-null object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB
titanic.head()read_csv with non-comma separators¶
The netflix_titles.csv file is actually pipe (|) separated. We can tell read_csv to expect this using the sep="|" option.
Additionally, the netflix dataset includes an index column as the very first column. We can let read_csv know using index_col=0
# If we try without defining `index_col`, then,
pd.read_csv("data/netflix_titles.csv", sep="|")netflix = pd.read_csv("data/netflix_titles.csv", sep="|", index_col=0)netflix.info()<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 894.5+ KB
netflix.head()Providing Column Names¶
We can provide a list of custom column names to read_csv using the names parameter. Additionally, we need to specify header=0 to tell read_csv that the file still contains the original headers on the first line and that it should ignore them!
names = ['sumlev', 'region', 'division', 'state', 'name', 'census2010pop', 'estimatesbase2010', 'popestimate2010', 'popestimate2011', 'popestimate2012', 'popestimate2013', 'popestimate2014', 'popestimate2015', 'popestimate2016', 'popestimate2017', 'popestimate2018', 'popestimate2019', 'popestimate042020', 'popestimate2020']
state_pops = pd.read_csv("data/nst-est2020.csv", names=names, header=0)
state_pops