Introduction to Statistical Learning Using Python

Table of Contents

  • LAB1: Introduction to Python
    • Basic Elements of Python
    • Working with Lists
    • IF Statement
    • FOR and WHILE Loops
    • Functions
  • LAB2: Dealing With Data
    • Importing and Manipulating Data
    • Dealing with Missing Values
    • Summary Statistics
    • Understanding Data
    • Data Interpolation
    • Visualizing Data
  • LAB3: Real World Data
    • Logistic Regression
    • Linear and Quadratic Discriminant Analysis
    • Classification Trees, Random Forest

Getting Started with Pandas

pandas is a library providing high-level data structures designed to make data analysis fast and easy in Python.
  • Two main data structures:
    • Series
    • DataFrame

Series

  • A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index
In [2]:
from pandas import Series, DataFrame
import pandas as pd
import os 
In [4]:
obj = Series ([4,7,-5,3])
In [5]:
print(obj)
0    4
1    7
2   -5
3    3
dtype: int64
In [6]:
obj.values
Out[6]:
array([ 4,  7, -5,  3], dtype=int64)
In [7]:
obj.index
Out[7]:
RangeIndex(start=0, stop=4, step=1)
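The default index above is a `RangeIndex`, but a Series can also carry custom labels. A small sketch (not part of the lab code) illustrating label-based lookup:

```python
from pandas import Series

# A Series with string labels instead of the default RangeIndex
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(obj2['b'])         # label-based lookup -> 7
print(obj2[['c', 'a']])  # select several values by a list of labels
```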

Selecting Values by Index

In [8]:
obj[1]
Out[8]:
7
In [10]:
obj[[1,3]]
Out[10]:
1    7
3    3
dtype: int64

Operations with Series

In [11]:
obj[obj>0]
Out[11]:
0    4
1    7
3    3
dtype: int64
In [14]:
obj[obj>3]
Out[14]:
0    4
1    7
dtype: int64
In [15]:
obj*2
Out[15]:
0     8
1    14
2   -10
3     6
dtype: int64
In [19]:
4 in obj.values
Out[19]:
True
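Note that `.values` matters here: `in` applied directly to a Series checks the index labels, not the data. A quick check (toy example, not from the lab):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])  # default index labels are 0..3

print(3 in obj)         # True: 3 is an index label
print(7 in obj)         # False: 7 is a value, not an index label
print(7 in obj.values)  # True: membership test on the underlying array
```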

DataFrame

  • A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, etc.)
In [56]:
data= {
       'Student': ["Muhammad Abdullah","Kayli Abernathy","Jacqueline Alvarez",
                    "Daniel Chavez","Aramayis Dallakyan"],
       'UIN' : [925024924,925024924,925024924,915140123,925124214],
       'Graduation_Year' : [2017,2016,2018,2020,2018]
}
frame = DataFrame(data)
In [57]:
frame
Out[57]:
   Graduation_Year             Student        UIN
0             2017   Muhammad Abdullah  925024924
1             2016     Kayli Abernathy  925024924
2             2018  Jacqueline Alvarez  925024924
3             2020       Daniel Chavez  915140123
4             2018  Aramayis Dallakyan  925124214

Separating First and Last Names

In [61]:
frame['Last_name'] = frame['Student'].str.split().str[-1]
frame['First_name'] = frame['Student'].str.split().str[0]
In [62]:
frame
Out[62]:
   Graduation_Year             Student        UIN  Last_name  First_name
0             2017   Muhammad Abdullah  925024924   Abdullah    Muhammad
1             2016     Kayli Abernathy  925024924  Abernathy       Kayli
2             2018  Jacqueline Alvarez  925024924    Alvarez  Jacqueline
3             2020       Daniel Chavez  915140123     Chavez      Daniel
4             2018  Aramayis Dallakyan  925124214  Dallakyan    Aramayis
In [72]:
frame.drop(["Student"],axis=1,inplace= True)
In [75]:
frame["First_name"]
Out[75]:
0      Muhammad
1         Kayli
2    Jacqueline
3        Daniel
4      Aramayis
Name: First_name, dtype: object
In [77]:
frame[frame["First_name"]=="Aramayis"]
Out[77]:
Graduation_Year UIN Last_name First_name
4 2018 925124214 Dallakyan Aramayis

Importing Data

  • Set the working directory
In [335]:
os.chdir(r"C:\Users\dallakyan1988\Documents\Python_Presentation")  # raw string so the backslashes are not treated as escapes
  • Import data using the read_csv() function

read_csv(filepath, sep=', ', header='infer', names=None, index_col=None, usecols=None,skiprows=None, nrows=None, na_values=None)
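The lab's CSV files are not bundled with this document, so here is a self-contained sketch using `io.StringIO` as a stand-in file (the column names are made up) showing how the `sep` and `na_values` parameters behave:

```python
import io
import pandas as pd

# A tiny in-memory CSV; the Value field in row 2 is a single space
csv_text = "Month,Year,Value\n1,1995,0.28\n2,1995, \n3,1995,0.27\n"

# na_values=" " tells read_csv to treat a bare space as missing,
# exactly like na_values=' ' in the Orange_data.csv call below
df = pd.read_csv(io.StringIO(csv_text), sep=",", na_values=" ")
print(df["Value"].isnull().sum())  # the blank cell becomes NaN -> 1
```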

In [336]:
data1 = pd.read_csv("Orange_data.csv",sep = ",",na_values=' ')
data2 = pd.read_csv("Income.csv",sep = ",")
In [337]:
data2.set_index(pd.DatetimeIndex(data2['Date']),inplace = True)
In [338]:
data1.head()
Out[338]:
   Month  Year  Per_Capita_Cons  Real_Price
0      1  1995         0.280531    2.512980
1      2  1995         0.245859    2.541516
2      3  1995         0.270097    2.580534
3      4  1995         0.246311    2.647361
4      5  1995         0.248911    2.681170
In [339]:
data2.head()
Out[339]:
            Quarter  Year  Per_Capita_Income        Date
1995-03-31        1  1995           13409.87   3/31/1995
1995-06-30        2  1995           13239.51   6/30/1995
1995-09-30        3  1995           13342.37   9/30/1995
1995-12-31        4  1995           13406.83  12/31/1995
1996-03-31        1  1996           13344.17   3/31/1996
In [340]:
data2.index.name = None  # clear the index name inherited from the 'Date' column
  • It is always a good idea to check the data types of the variables
In [341]:
data1.dtypes
Out[341]:
Month                int64
Year                 int64
Per_Capita_Cons    float64
Real_Price         float64
dtype: object
In [342]:
data2.dtypes
Out[342]:
Quarter                int64
Year                   int64
Per_Capita_Income    float64
Date                  object
dtype: object
  • Dealing with missing values
  • A function to count missing values per column
In [343]:
def missing_count(data):
    missing = pd.DataFrame(data.isnull().sum(),columns=['Total_missing'])
    return missing
In [344]:
missing_count(data1)
Out[344]:
Total_missing
Month 0
Year 0
Per_Capita_Cons 9
Real_Price 0
In [345]:
missing_count(data2)
Out[345]:
Total_missing
Quarter 0
Year 0
Per_Capita_Income 0
Date 0
In [346]:
data1[data1["Per_Capita_Cons"].isnull()]
Out[346]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 NaN 2.673106
21 10 1996 NaN 2.969037
38 3 1998 NaN 2.767408
49 2 1999 NaN 3.179427
63 4 2000 NaN 3.152969
68 9 2000 NaN 3.134388
76 5 2001 NaN 3.164449
84 1 2002 NaN 3.141355
88 5 2002 NaN 3.158666
  • Possible Solutions
    • Drop all rows containing NaN values (not always the best option)
In [347]:
cleaned = data1.dropna()
In [348]:
missing_count(cleaned)
Out[348]:
Total_missing
Month 0
Year 0
Per_Capita_Cons 0
Real_Price 0

NOTE: You have several options for dropping. You can drop if

* the whole row is **`NaN`**
* the whole column is **`NaN`** (pass `axis=1`)
* or by a threshold (`thresh=n` keeps rows with at least n non-NaN values)
In [349]:
from numpy import nan as NA
example = DataFrame([[2.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,4.,4.]])
In [350]:
example
Out[350]:
0 1 2
0 2.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 4.0 4.0
In [351]:
example.dropna()
Out[351]:
0 1 2
0 2.0 6.5 3.0
In [352]:
example.dropna(how="all")
Out[352]:
0 1 2
0 2.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 4.0 4.0
In [353]:
example.dropna(thresh=2)
Out[353]:
0 1 2
0 2.0 6.5 3.0
3 NaN 4.0 4.0

Method 2

  • Fill NaN values with the last observed value (forward fill) or with the column mean
In [354]:
last_item_filled=data1.fillna(method='ffill')
In [355]:
mean_filled=data1.fillna(data1['Per_Capita_Cons'].mean())
In [356]:
data1[data1["Per_Capita_Cons"].isnull()]
Out[356]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 NaN 2.673106
21 10 1996 NaN 2.969037
38 3 1998 NaN 2.767408
49 2 1999 NaN 3.179427
63 4 2000 NaN 3.152969
68 9 2000 NaN 3.134388
76 5 2001 NaN 3.164449
84 1 2002 NaN 3.141355
88 5 2002 NaN 3.158666
In [357]:
last_item_filled.loc[data1[data1["Per_Capita_Cons"].isnull()].index, :]
Out[357]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 0.236881 2.673106
21 10 1996 0.241302 2.969037
38 3 1998 0.251118 2.767408
49 2 1999 0.287569 3.179427
63 4 2000 0.263581 3.152969
68 9 2000 0.249545 3.134388
76 5 2001 0.253853 3.164449
84 1 2002 0.259314 3.141355
88 5 2002 0.246272 3.158666
In [358]:
mean_filled.loc[data1[data1["Per_Capita_Cons"].isnull()].index, :]
Out[358]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 0.251914 2.673106
21 10 1996 0.251914 2.969037
38 3 1998 0.251914 2.767408
49 2 1999 0.251914 3.179427
63 4 2000 0.251914 3.152969
68 9 2000 0.251914 3.134388
76 5 2001 0.251914 3.164449
84 1 2002 0.251914 3.141355
88 5 2002 0.251914 3.158666

Method 3

  • Interpolation (not applicable to every variable). Interpolation builds a function through fixed data points that can then be evaluated anywhere in the domain defined by those points, using e.g. linear or cubic-spline interpolation.
In [359]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
In [360]:
x = np.arange(0, 2*np.pi+np.pi/4, 2*np.pi/8)
y = 0.8*np.power(x,5)- 3*np.power(x,2)
tck = interpolate.splrep(x, y, s=0)
xnew = np.arange(0, 2*np.pi, np.pi/50)
ynew = interpolate.splev(xnew, tck, der=0)

plt.figure()
plt.plot(x, y, 'x', xnew, ynew,
         xnew, 0.8*np.power(xnew,5) - 3*np.power(xnew,2), x, y, 'b')
plt.legend(['Data', 'Cubic spline', 'True', 'Linear'])
plt.axis([-0.05, 3.05, -4.05, 10])
plt.title('Cubic-spline interpolation')
plt.show()

Linear and Cubic Spline for Orange data

In [361]:
linear_interpolated= data1['Per_Capita_Cons'].interpolate(method='linear')
spline_interpolated= data1['Per_Capita_Cons'].interpolate(method='spline',order=3)
In [362]:
xi = np.arange(len(data1))  # positional x-axis for plotting
plt.figure()
plt.plot(xi, data1["Per_Capita_Cons"], xi, mean_filled["Per_Capita_Cons"], 'g',
         xi, linear_interpolated, 'b', xi, spline_interpolated, 'r')
plt.legend(["Real",'mean_filled','Linear',"Spline"])
plt.show()
In [363]:
data1['Per_Capita_Cons'] = spline_interpolated

Notice that the Income data and the Consumption data have different time frequencies.

  • We want to match the frequencies of Income and Consumption
In [365]:
data2.head()
Out[365]:
Quarter Year Per_Capita_Income Date
1995-03-31 1 1995 13409.87 3/31/1995
1995-06-30 2 1995 13239.51 6/30/1995
1995-09-30 3 1995 13342.37 9/30/1995
1995-12-31 4 1995 13406.83 12/31/1995
1996-03-31 1 1996 13344.17 3/31/1996
In [366]:
data2.tail()
Out[366]:
Quarter Year Per_Capita_Income Date
2001-09-30 3 2001 14949.23 9/30/2001
2001-12-31 4 2001 14552.38 12/31/2001
2002-03-31 1 2002 14921.27 3/31/2002
2002-06-30 2 2002 15034.68 6/30/2002
2002-09-30 3 2002 15153.60 9/30/2002
  • To interpolate from quarterly frequency to monthly frequency, we first need a proper datetime index
In [367]:
period = 7*12 + 9 - 2
index = pd.date_range('3/1/1995', periods=period, freq='M')
In [368]:
upsampled = data2["Per_Capita_Income"].resample('M')
interpolated = upsampled.interpolate(method='spline', order=3)
print(interpolated.head(10))
interpolated.plot()
plt.show()
1995-03-31    13409.870000
1995-04-30    13301.960115
1995-05-31    13248.374849
1995-06-30    13239.510000
1995-07-31    13261.310071
1995-08-31    13300.635785
1995-09-30    13342.370000
1995-10-31    13377.273067
1995-11-30    13399.240400
1995-12-31    13406.830000
Freq: M, Name: Per_Capita_Income, dtype: float64
In [369]:
interpolated = upsampled.interpolate(method='linear')  # 'order' applies only to spline-type methods
print(interpolated.head(10))
interpolated.plot()
plt.show()
1995-03-31    13409.870000
1995-04-30    13353.083333
1995-05-31    13296.296667
1995-06-30    13239.510000
1995-07-31    13273.796667
1995-08-31    13308.083333
1995-09-30    13342.370000
1995-10-31    13363.856667
1995-11-30    13385.343333
1995-12-31    13406.830000
Freq: M, Name: Per_Capita_Income, dtype: float64
In [370]:
interpolated=pd.DataFrame(interpolated)
interpolated.head()
Out[370]:
Per_Capita_Income
1995-03-31 13409.870000
1995-04-30 13353.083333
1995-05-31 13296.296667
1995-06-30 13239.510000
1995-07-31 13273.796667

Now we want to merge the interpolated income data with the orange consumption data

In [371]:
data1.head()
Out[371]:
Month Year Per_Capita_Cons Real_Price
0 1 1995 0.280531 2.512980
1 2 1995 0.245859 2.541516
2 3 1995 0.270097 2.580534
3 4 1995 0.246311 2.647361
4 5 1995 0.248911 2.681170
In [372]:
period = 7*12 + 9
data1index = pd.date_range('1/1/1995', periods=period, freq='M')
In [373]:
data1.set_index(data1index,inplace = True)
In [374]:
data1.head()
Out[374]:
Month Year Per_Capita_Cons Real_Price
1995-01-31 1 1995 0.280531 2.512980
1995-02-28 2 1995 0.245859 2.541516
1995-03-31 3 1995 0.270097 2.580534
1995-04-30 4 1995 0.246311 2.647361
1995-05-31 5 1995 0.248911 2.681170

Merge by index

In [414]:
Merged =DataFrame(pd.merge(data1,interpolated,left_index=True,right_index=True))
In [415]:
Merged.head()
Out[415]:
Month Year Per_Capita_Cons Real_Price Per_Capita_Income
1995-03-31 3 1995 0.270097 2.580534 13409.870000
1995-04-30 4 1995 0.246311 2.647361 13353.083333
1995-05-31 5 1995 0.248911 2.681170 13296.296667
1995-06-30 6 1995 0.236881 2.662763 13239.510000
1995-07-31 7 1995 0.252691 2.673106 13273.796667
In [428]:
Merged = Merged.drop(["Year", "Month "], axis=1)  # note: "Month " carries a trailing space from the CSV header
In [429]:
Merged.head()
Out[429]:
Per_Capita_Cons Real_Price Per_Capita_Income
1995-03-31 0.270097 2.580534 13409.870000
1995-04-30 0.246311 2.647361 13353.083333
1995-05-31 0.248911 2.681170 13296.296667
1995-06-30 0.236881 2.662763 13239.510000
1995-07-31 0.252691 2.673106 13273.796667
In [412]:
Merged.columns
Out[412]:
Index([u'Month ', u'Year', u'Per_Capita_Cons', u'Real_Price',
       u'Per_Capita_Income'],
      dtype='object')

Other examples of Merging

In [379]:
df1 = pd.DataFrame({'A': np.random.uniform(size=4),
                     'B': np.random.uniform(size=4),
                     'C': np.random.uniform(size=4),
                     'D': np.random.uniform(size=4)},
                     index=[0, 1, 2, 3])


df2 = pd.DataFrame({'A': np.random.normal(size=4),
                        'B': np.random.normal(size=4),
                        'C': np.random.normal(size=4),
                        'D': np.random.normal(size=4)},
                         index=[4, 5, 6, 7])


df3 = pd.DataFrame({'A':np.random.uniform(size=4),
                     'B': np.random.uniform(size=4),
                     'C': np.random.uniform(size=4),
                     'D': np.random.uniform(size=4)},
                     index=[8, 9, 10, 11])
In [382]:
df1
Out[382]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
In [383]:
df2
Out[383]:
A B C D
4 0.145048 -0.161068 0.356442 2.032841
5 0.528435 -0.612351 -0.166635 -0.803647
6 1.519046 0.843732 -0.230930 1.226998
7 2.545946 0.605684 2.961073 0.321538
In [384]:
df3
Out[384]:
A B C D
8 0.304821 0.558125 0.959540 0.918874
9 0.711238 0.652642 0.436962 0.390953
10 0.310293 0.856860 0.352815 0.175165
11 0.656995 0.743975 0.045016 0.322504
In [385]:
frames = [df1, df2, df3]

result = pd.concat(frames)
In [386]:
result
Out[386]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
4 0.145048 -0.161068 0.356442 2.032841
5 0.528435 -0.612351 -0.166635 -0.803647
6 1.519046 0.843732 -0.230930 1.226998
7 2.545946 0.605684 2.961073 0.321538
8 0.304821 0.558125 0.959540 0.918874
9 0.711238 0.652642 0.436962 0.390953
10 0.310293 0.856860 0.352815 0.175165
11 0.656995 0.743975 0.045016 0.322504
In [387]:
df4 = pd.DataFrame({'B': np.random.uniform(size=4),
                 'D':np.random.normal(size=4),
                 'F': np.random.normal(size=4)},
                  index=[2, 3, 6, 7])
In [390]:
df4
Out[390]:
B D F
2 0.208523 0.729613 -0.850613
3 0.849497 -2.118052 0.457051
6 0.159290 0.675009 -1.333192
7 0.917792 -0.643039 0.201721
In [392]:
df1
Out[392]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
In [388]:
result = pd.concat([df1, df4], axis=1)
In [389]:
result
Out[389]:
A B C D B D F
0 0.373849 0.392643 0.236637 0.473750 NaN NaN NaN
1 0.841594 0.470335 0.788696 0.507633 NaN NaN NaN
2 0.684732 0.023441 0.592588 0.088391 0.208523 0.729613 -0.850613
3 0.932812 0.201915 0.231552 0.035624 0.849497 -2.118052 0.457051
6 NaN NaN NaN NaN 0.159290 0.675009 -1.333192
7 NaN NaN NaN NaN 0.917792 -0.643039 0.201721
In [393]:
result = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
In [394]:
result
Out[394]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
4 0.145048 -0.161068 0.356442 2.032841
5 0.528435 -0.612351 -0.166635 -0.803647
6 1.519046 0.843732 -0.230930 1.226998
7 2.545946 0.605684 2.961073 0.321538
  • Merging based on key
In [396]:
left = pd.DataFrame({'key': ['A', 'G', 'E', 'C'],
                    'A': np.random.normal(size=4),
                     'B': np.random.normal(size=4)})

right = pd.DataFrame({'key': ['A', 'G', 'E', 'C'],
                      'C': np.random.normal(size=4),
                      'D': np.random.normal(size=4)})

result = pd.merge(left, right, on='key')
In [400]:
left
Out[400]:
A B key
0 0.312712 0.154416 A
1 -0.004686 0.757326 G
2 0.140112 -1.105813 E
3 0.585586 0.056549 C
In [401]:
right
Out[401]:
C D key
0 -0.006953 -0.206162 A
1 -1.506147 -0.227391 G
2 0.746571 -0.101790 E
3 1.295349 0.545622 C
In [402]:
result
Out[402]:
A B key C D
0 0.312712 0.154416 A -0.006953 -0.206162
1 -0.004686 0.757326 G -1.506147 -0.227391
2 0.140112 -1.105813 E 0.746571 -0.101790
3 0.585586 0.056549 C 1.295349 0.545622
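Here the keys match one-to-one, so every row survives. When the key columns only partially overlap, the `how` parameter of `pd.merge` controls which rows are kept. A small sketch with made-up keys (not part of the lab data):

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'y': [4, 5, 6]})

# Inner join (the default): only keys present in both frames
inner = pd.merge(left, right, on='key')

# Outer join: all keys from both frames, missing cells become NaN
outer = pd.merge(left, right, on='key', how='outer')

print(inner['key'].tolist())  # ['B', 'C']
print(len(outer))             # 4 rows: A, B, C, D
```

`how='left'` and `how='right'` similarly keep all keys from one side only.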

Descriptive Statistics

In [430]:
Merged.describe()
Out[430]:
       Per_Capita_Cons  Real_Price  Per_Capita_Income
count        91.000000   91.000000          91.000000
mean          0.251604    2.985879       14101.232747
std           0.014222    0.189700         577.904771
min           0.220848    2.580534       13239.510000
25%           0.241476    2.818134       13477.115000
50%           0.249624    3.066989       14195.660000
75%           0.258852    3.149477       14624.890000
max           0.292521    3.217425       15153.600000

Correlation and Covariance Matrix

In [440]:
returns= Merged.pct_change()
In [441]:
returns
Out[441]:
Per_Capita_Cons Real_Price Per_Capita_Income
1995-03-31 NaN NaN NaN
1995-04-30 -0.088065 0.025897 -0.004235
1995-05-31 0.010556 0.012771 -0.004253
1995-06-30 -0.048331 -0.006865 -0.004271
1995-07-31 0.066744 0.003884 0.002590
1995-08-31 -0.014426 -0.004534 0.002583
1995-09-30 -0.015499 -0.004294 0.002576
1995-10-31 0.047503 -0.006622 0.001610
1995-11-30 -0.018705 0.007299 0.001608
1995-12-31 0.060453 0.016259 0.001605
1996-01-31 0.035085 0.011480 -0.001558
1996-02-29 -0.117824 0.004200 -0.001560
1996-03-31 0.059415 0.008746 -0.001563
1996-04-30 -0.082940 0.027413 -0.001351
1996-05-31 0.002075 0.020915 -0.001353
1996-06-30 -0.070485 0.022661 -0.001355
1996-07-31 0.019683 0.017862 0.004707
1996-08-31 0.033136 -0.013083 0.004685
1996-09-30 0.037158 -0.006135 0.004663
1996-10-31 0.038412 0.004250 -0.000182
1996-11-30 0.010664 0.002774 -0.000182
1996-12-31 0.064247 -0.000399 -0.000182
1997-01-31 0.039709 -0.010075 0.000226
1997-02-28 -0.143779 0.008699 0.000226
1997-03-31 0.105674 -0.014108 0.000226
1997-04-30 -0.065007 -0.009998 0.002295
1997-05-31 -0.001197 -0.009287 0.002289
1997-06-30 -0.061609 -0.010089 0.002284
1997-07-31 0.036933 -0.019236 0.002238
1997-08-31 0.042711 -0.003209 0.002233
... ... ... ...
2000-04-30 -0.029687 0.007654 0.002001
2000-05-31 -0.031671 0.000706 0.001997
2000-06-30 -0.069189 0.005529 0.001993
2000-07-31 0.046668 -0.010958 0.002333
2000-08-31 0.034259 -0.000796 0.002328
2000-09-30 0.021191 -0.000315 0.002322
2000-10-31 0.016964 0.000844 -0.000680
2000-11-30 -0.009110 0.004206 -0.000680
2000-12-31 0.080434 -0.004056 -0.000681
2001-01-31 0.013404 -0.007513 -0.000936
2001-02-28 -0.112594 0.002642 -0.000937
2001-03-31 0.095206 -0.000015 -0.000938
2001-04-30 -0.071041 0.003043 -0.001255
2001-05-31 -0.008714 0.010503 -0.001256
2001-06-30 -0.074579 0.009372 -0.001258
2001-07-31 0.032898 -0.004374 0.008465
2001-08-31 0.015860 -0.008411 0.008394
2001-09-30 -0.025869 -0.001996 0.008324
2001-10-31 0.048402 -0.008253 -0.008849
2001-11-30 -0.027782 0.008198 -0.008928
2001-12-31 0.068820 -0.007545 -0.009008
2002-01-31 -0.051912 0.005888 0.008450
2002-02-28 -0.055962 0.007676 0.008379
2002-03-31 0.111183 -0.005288 0.008309
2002-04-30 -0.045084 -0.008578 0.002534
2002-05-31 -0.018114 0.011836 0.002527
2002-06-30 -0.043096 0.010366 0.002521
2002-07-31 0.036618 -0.014565 0.002637
2002-08-31 0.009868 0.000324 0.002630
2002-09-30 -0.011510 -0.017024 0.002623

91 rows × 3 columns

In [452]:
plt.figure()
plt.plot(returns["Real_Price"].cumsum(),"b")
plt.xlabel("Time")
plt.ylabel("Percentage change")
plt.title("Cumulative Percentage Change")
plt.show()
In [450]:
Merged.corr()
Out[450]:
                   Per_Capita_Cons  Real_Price  Per_Capita_Income
Per_Capita_Cons           1.000000   -0.154828          -0.047909
Real_Price               -0.154828    1.000000           0.806477
Per_Capita_Income        -0.047909    0.806477           1.000000
In [444]:
Merged.cov()
Out[444]:
                   Per_Capita_Cons  Real_Price  Per_Capita_Income
Per_Capita_Cons           0.000202   -0.000418          -0.393761
Real_Price               -0.000418    0.035986          88.412743
Per_Capita_Income        -0.393761   88.412743      333973.924309
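The two matrices are linked: each correlation is the covariance scaled by the product of the two standard deviations, corr(i, j) = cov(i, j) / (σᵢ σⱼ). A quick numerical check on toy random data (the Merged frame itself is not reproduced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=['a', 'b'])

# Rescale the covariance matrix by the outer product of standard deviations
corr_from_cov = df.cov() / np.outer(df.std(), df.std())

# The result matches pandas' own correlation matrix
assert np.allclose(corr_from_cov, df.corr())
```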
In [460]:
Merged["Per_Capita_Cons"].hist(bins=20)
plt.show()
In [466]:
Merged["Per_Capita_Cons"].hist(bins=20, density=True)
Merged["Per_Capita_Cons"].plot(kind="kde")
plt.show()
In [469]:
comp1=np.random.normal(Merged["Per_Capita_Cons"].mean(),Merged["Per_Capita_Cons"].std(),size=200)
values = Series(comp1)
In [473]:
Merged["Per_Capita_Cons"].hist(bins=20, density=True)
values.plot(kind="kde")
plt.show()
In [481]:
pd.plotting.scatter_matrix(Merged, diagonal="kde", alpha=0.5)
plt.show()