Introduction to Statistical Learning Using Python

Table of Contents

  • LAB1: Introduction to Python
    • Basic Elements of Python
    • Working with Lists
    • IF Statement
    • FOR and WHILE Loops
    • Functions
  • LAB2: Dealing With Data
    • Importing and Manipulating Data
    • Dealing with Missing Values
    • Summary Statistics
    • Understanding Data
    • Data Interpolation
    • Visualizing Data
  • LAB3: Real World Data
    • Logistic Regression
    • Linear and Quadratic Discriminant Analysis
    • Classification Trees, Random Forest

Getting Started with Pandas

pandas is a library providing high-level data structures designed to make data analysis fast and easy in Python.
  • Two main data structures:
    • Series
    • DataFrame

Series

  • A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index
In [2]:
from pandas import Series, DataFrame
import pandas as pd
import os 
In [4]:
obj = Series ([4,7,-5,3])
In [5]:
print(obj)
0    4
1    7
2   -5
3    3
dtype: int64
In [6]:
obj.values
Out[6]:
array([ 4,  7, -5,  3], dtype=int64)
In [7]:
obj.index
Out[7]:
RangeIndex(start=0, stop=4, step=1)
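The default index above is a `RangeIndex`, but a Series can also carry custom labels. A small sketch (not part of the lab code) illustrating label-based lookup:

```python
from pandas import Series

# A Series with string labels instead of the default RangeIndex
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(obj2['b'])         # label-based lookup -> 7
print(obj2[['c', 'a']])  # select several values by a list of labels
```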

Selecting Values by Index

In [8]:
obj[1]
Out[8]:
7
In [10]:
obj[[1,3]]
Out[10]:
1    7
3    3
dtype: int64

Operations with Series

In [11]:
obj[obj>0]
Out[11]:
0    4
1    7
3    3
dtype: int64
In [14]:
obj[obj>3]
Out[14]:
0    4
1    7
dtype: int64
In [15]:
obj*2
Out[15]:
0     8
1    14
2   -10
3     6
dtype: int64
In [19]:
4 in obj.values
Out[19]:
True
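Note that `.values` matters here: `in` applied directly to a Series checks the index labels, not the data. A quick check (toy example, not from the lab):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])  # default index labels are 0..3

print(3 in obj)         # True: 3 is an index label
print(7 in obj)         # False: 7 is a value, not an index label
print(7 in obj.values)  # True: membership test on the underlying array
```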

DataFrame

  • A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, etc.)
In [56]:
data= {
       'Student': ["Muhammad Abdullah","Kayli Abernathy","Jacqueline Alvarez",
                    "Daniel Chavez","Aramayis Dallakyan"],
       'UIN' : [925024924,925024924,925024924,915140123,925124214],
       'Graduation_Year' : [2017,2016,2018,2020,2018]
}
frame = DataFrame(data)
In [57]:
frame
Out[57]:
   Graduation_Year             Student        UIN
0             2017   Muhammad Abdullah  925024924
1             2016     Kayli Abernathy  925024924
2             2018  Jacqueline Alvarez  925024924
3             2020       Daniel Chavez  915140123
4             2018  Aramayis Dallakyan  925124214

Separating First and Last Names

In [61]:
frame['Last_name'] = frame['Student'].str.split().str[-1]
frame['First_name'] = frame['Student'].str.split().str[0]
In [62]:
frame
Out[62]:
   Graduation_Year             Student        UIN  Last_name  First_name
0             2017   Muhammad Abdullah  925024924   Abdullah    Muhammad
1             2016     Kayli Abernathy  925024924  Abernathy       Kayli
2             2018  Jacqueline Alvarez  925024924    Alvarez  Jacqueline
3             2020       Daniel Chavez  915140123     Chavez      Daniel
4             2018  Aramayis Dallakyan  925124214  Dallakyan    Aramayis
In [72]:
frame.drop(["Student"],axis=1,inplace= True)
In [75]:
frame["First_name"]
Out[75]:
0      Muhammad
1         Kayli
2    Jacqueline
3        Daniel
4      Aramayis
Name: First_name, dtype: object
In [77]:
frame[frame["First_name"]=="Aramayis"]
Out[77]:
Graduation_Year UIN Last_name First_name
4 2018 925124214 Dallakyan Aramayis

Importing Data

  • Set the working directory
In [335]:
os.chdir(r"C:\Users\dallakyan1988\Documents\Python_Presentation")  # raw string so the backslashes are not treated as escapes
  • Import data using the read_csv() function

read_csv(filepath, sep=', ', header='infer', names=None, index_col=None, usecols=None,skiprows=None, nrows=None, na_values=None)
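The lab's CSV files are not bundled with this document, so here is a self-contained sketch using `io.StringIO` as a stand-in file (the column names are made up) showing how the `sep` and `na_values` parameters behave:

```python
import io
import pandas as pd

# A tiny in-memory CSV; the Value field in row 2 is a single space
csv_text = "Month,Year,Value\n1,1995,0.28\n2,1995, \n3,1995,0.27\n"

# na_values=" " tells read_csv to treat a bare space as missing,
# exactly like na_values=' ' in the Orange_data.csv call below
df = pd.read_csv(io.StringIO(csv_text), sep=",", na_values=" ")
print(df["Value"].isnull().sum())  # the blank cell becomes NaN -> 1
```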

In [336]:
data1 = pd.read_csv("Orange_data.csv",sep = ",",na_values=' ')
data2 = pd.read_csv("Income.csv",sep = ",")
In [337]:
data2.set_index(pd.DatetimeIndex(data2['Date']),inplace = True)
In [338]:
data1.head()
Out[338]:
   Month  Year  Per_Capita_Cons  Real_Price
0      1  1995         0.280531    2.512980
1      2  1995         0.245859    2.541516
2      3  1995         0.270097    2.580534
3      4  1995         0.246311    2.647361
4      5  1995         0.248911    2.681170
In [339]:
data2.head()
Out[339]:
            Quarter  Year  Per_Capita_Income        Date
1995-03-31        1  1995           13409.87   3/31/1995
1995-06-30        2  1995           13239.51   6/30/1995
1995-09-30        3  1995           13342.37   9/30/1995
1995-12-31        4  1995           13406.83  12/31/1995
1996-03-31        1  1996           13344.17   3/31/1996
In [340]:
data2.index.name = None  # clear the index name inherited from the 'Date' column
  • It is always a good idea to check the data types of the variables
In [341]:
data1.dtypes
Out[341]:
Month                int64
Year                 int64
Per_Capita_Cons    float64
Real_Price         float64
dtype: object
In [342]:
data2.dtypes
Out[342]:
Quarter                int64
Year                   int64
Per_Capita_Income    float64
Date                  object
dtype: object
  • Dealing with missing values
  • A function to count missing values per column
In [343]:
def missing_count(data):
    missing = pd.DataFrame(data.isnull().sum(),columns=['Total_missing'])
    return missing
In [344]:
missing_count(data1)
Out[344]:
Total_missing
Month 0
Year 0
Per_Capita_Cons 9
Real_Price 0
In [345]:
missing_count(data2)
Out[345]:
Total_missing
Quarter 0
Year 0
Per_Capita_Income 0
Date 0
In [346]:
data1[data1["Per_Capita_Cons"].isnull()]
Out[346]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 NaN 2.673106
21 10 1996 NaN 2.969037
38 3 1998 NaN 2.767408
49 2 1999 NaN 3.179427
63 4 2000 NaN 3.152969
68 9 2000 NaN 3.134388
76 5 2001 NaN 3.164449
84 1 2002 NaN 3.141355
88 5 2002 NaN 3.158666
  • Possible Solutions
    • Drop all rows containing NaN values (not always the best option)
In [347]:
cleaned = data1.dropna()
In [348]:
missing_count(cleaned)
Out[348]:
Total_missing
Month 0
Year 0
Per_Capita_Cons 0
Real_Price 0

NOTE: You have several options for dropping. You can drop if

* the whole row is **`NaN`**
* the whole column is **`NaN`** (pass `axis=1`)
* or by a threshold (`thresh=n` keeps rows with at least n non-NaN values)
In [349]:
from numpy import nan as NA
example = DataFrame([[2.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,4.,4.]])
In [350]:
example
Out[350]:
0 1 2
0 2.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 4.0 4.0
In [351]:
example.dropna()
Out[351]:
0 1 2
0 2.0 6.5 3.0
In [352]:
example.dropna(how="all")
Out[352]:
0 1 2
0 2.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 4.0 4.0
In [353]:
example.dropna(thresh=2)
Out[353]:
0 1 2
0 2.0 6.5 3.0
3 NaN 4.0 4.0

Method 2

  • Fill NaN values with the last observed value (forward fill) or with the column mean
In [354]:
last_item_filled=data1.fillna(method='ffill')
In [355]:
mean_filled=data1.fillna(data1['Per_Capita_Cons'].mean())
In [356]:
data1[data1["Per_Capita_Cons"].isnull()]
Out[356]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 NaN 2.673106
21 10 1996 NaN 2.969037
38 3 1998 NaN 2.767408
49 2 1999 NaN 3.179427
63 4 2000 NaN 3.152969
68 9 2000 NaN 3.134388
76 5 2001 NaN 3.164449
84 1 2002 NaN 3.141355
88 5 2002 NaN 3.158666
In [357]:
last_item_filled.loc[data1[data1["Per_Capita_Cons"].isnull()].index, :]
Out[357]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 0.236881 2.673106
21 10 1996 0.241302 2.969037
38 3 1998 0.251118 2.767408
49 2 1999 0.287569 3.179427
63 4 2000 0.263581 3.152969
68 9 2000 0.249545 3.134388
76 5 2001 0.253853 3.164449
84 1 2002 0.259314 3.141355
88 5 2002 0.246272 3.158666
In [358]:
mean_filled.loc[data1[data1["Per_Capita_Cons"].isnull()].index, :]
Out[358]:
Month Year Per_Capita_Cons Real_Price
6 7 1995 0.251914 2.673106
21 10 1996 0.251914 2.969037
38 3 1998 0.251914 2.767408
49 2 1999 0.251914 3.179427
63 4 2000 0.251914 3.152969
68 9 2000 0.251914 3.134388
76 5 2001 0.251914 3.164449
84 1 2002 0.251914 3.141355
88 5 2002 0.251914 3.158666

Method 3

  • Interpolation (not applicable to every variable). Interpolation builds a function through fixed data points that can then be evaluated anywhere in the domain defined by those points, using e.g. linear or cubic-spline interpolation.
In [359]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
In [360]:
x = np.arange(0, 2*np.pi+np.pi/4, 2*np.pi/8)
y = 0.8*np.power(x,5)- 3*np.power(x,2)
tck = interpolate.splrep(x, y, s=0)
xnew = np.arange(0, 2*np.pi, np.pi/50)
ynew = interpolate.splev(xnew, tck, der=0)

plt.figure()
plt.plot(x, y, 'x', xnew, ynew,
         xnew, 0.8*np.power(xnew,5) - 3*np.power(xnew,2), x, y, 'b')
plt.legend(['Data', 'Cubic spline', 'True', 'Linear'])
plt.axis([-0.05, 3.05, -4.05, 10])
plt.title('Cubic-spline interpolation')
plt.show()

Linear and Cubic Spline for Orange data

In [361]:
linear_interpolated= data1['Per_Capita_Cons'].interpolate(method='linear')
spline_interpolated= data1['Per_Capita_Cons'].interpolate(method='spline',order=3)
In [362]:
xi = np.arange(len(data1))  # positional x-axis for plotting
plt.figure()
plt.plot(xi, data1["Per_Capita_Cons"], xi, mean_filled["Per_Capita_Cons"], 'g',
         xi, linear_interpolated, 'b', xi, spline_interpolated, 'r')
plt.legend(["Real",'mean_filled','Linear',"Spline"])
plt.show()
In [363]:
data1['Per_Capita_Cons'] = spline_interpolated

Notice that the Income data and the Consumption data have different time frequencies.

  • We want to match the frequencies of Income and Consumption
In [365]:
data2.head()
Out[365]:
Quarter Year Per_Capita_Income Date
1995-03-31 1 1995 13409.87 3/31/1995
1995-06-30 2 1995 13239.51 6/30/1995
1995-09-30 3 1995 13342.37 9/30/1995
1995-12-31 4 1995 13406.83 12/31/1995
1996-03-31 1 1996 13344.17 3/31/1996
In [366]:
data2.tail()
Out[366]:
Quarter Year Per_Capita_Income Date
2001-09-30 3 2001 14949.23 9/30/2001
2001-12-31 4 2001 14552.38 12/31/2001
2002-03-31 1 2002 14921.27 3/31/2002
2002-06-30 2 2002 15034.68 6/30/2002
2002-09-30 3 2002 15153.60 9/30/2002
  • To interpolate from quarterly frequency to monthly frequency, we first need a proper datetime index
In [367]:
period = 7*12 + 9 - 2
index = pd.date_range('3/1/1995', periods=period, freq='M')
In [368]:
upsampled = data2["Per_Capita_Income"].resample('M')
interpolated = upsampled.interpolate(method='spline', order=3)
print(interpolated.head(10))
interpolated.plot()
plt.show()
1995-03-31    13409.870000
1995-04-30    13301.960115
1995-05-31    13248.374849
1995-06-30    13239.510000
1995-07-31    13261.310071
1995-08-31    13300.635785
1995-09-30    13342.370000
1995-10-31    13377.273067
1995-11-30    13399.240400
1995-12-31    13406.830000
Freq: M, Name: Per_Capita_Income, dtype: float64
In [369]:
interpolated = upsampled.interpolate(method='linear')  # 'order' applies only to spline-type methods
print(interpolated.head(10))
interpolated.plot()
plt.show()
1995-03-31    13409.870000
1995-04-30    13353.083333
1995-05-31    13296.296667
1995-06-30    13239.510000
1995-07-31    13273.796667
1995-08-31    13308.083333
1995-09-30    13342.370000
1995-10-31    13363.856667
1995-11-30    13385.343333
1995-12-31    13406.830000
Freq: M, Name: Per_Capita_Income, dtype: float64
In [370]:
interpolated=pd.DataFrame(interpolated)
interpolated.head()
Out[370]:
Per_Capita_Income
1995-03-31 13409.870000
1995-04-30 13353.083333
1995-05-31 13296.296667
1995-06-30 13239.510000
1995-07-31 13273.796667

Now we want to merge the interpolated income data with the orange consumption data

In [371]:
data1.head()
Out[371]:
Month Year Per_Capita_Cons Real_Price
0 1 1995 0.280531 2.512980
1 2 1995 0.245859 2.541516
2 3 1995 0.270097 2.580534
3 4 1995 0.246311 2.647361
4 5 1995 0.248911 2.681170
In [372]:
period = 7*12 + 9
data1index = pd.date_range('1/1/1995', periods=period, freq='M')
In [373]:
data1.set_index(data1index,inplace = True)
In [374]:
data1.head()
Out[374]:
Month Year Per_Capita_Cons Real_Price
1995-01-31 1 1995 0.280531 2.512980
1995-02-28 2 1995 0.245859 2.541516
1995-03-31 3 1995 0.270097 2.580534
1995-04-30 4 1995 0.246311 2.647361
1995-05-31 5 1995 0.248911 2.681170

Merge by index

In [414]:
Merged =DataFrame(pd.merge(data1,interpolated,left_index=True,right_index=True))
In [415]:
Merged.head()
Out[415]:
Month Year Per_Capita_Cons Real_Price Per_Capita_Income
1995-03-31 3 1995 0.270097 2.580534 13409.870000
1995-04-30 4 1995 0.246311 2.647361 13353.083333
1995-05-31 5 1995 0.248911 2.681170 13296.296667
1995-06-30 6 1995 0.236881 2.662763 13239.510000
1995-07-31 7 1995 0.252691 2.673106 13273.796667
In [428]:
Merged = Merged.drop(["Year", "Month "], axis=1)  # note: "Month " carries a trailing space from the CSV header
In [429]:
Merged.head()
Out[429]:
Per_Capita_Cons Real_Price Per_Capita_Income
1995-03-31 0.270097 2.580534 13409.870000
1995-04-30 0.246311 2.647361 13353.083333
1995-05-31 0.248911 2.681170 13296.296667
1995-06-30 0.236881 2.662763 13239.510000
1995-07-31 0.252691 2.673106 13273.796667
In [412]:
Merged.columns
Out[412]:
Index([u'Month ', u'Year', u'Per_Capita_Cons', u'Real_Price',
       u'Per_Capita_Income'],
      dtype='object')

Other examples of Merging

In [379]:
df1 = pd.DataFrame({'A': np.random.uniform(size=4),
                     'B': np.random.uniform(size=4),
                     'C': np.random.uniform(size=4),
                     'D': np.random.uniform(size=4)},
                     index=[0, 1, 2, 3])


df2 = pd.DataFrame({'A': np.random.normal(size=4),
                        'B': np.random.normal(size=4),
                        'C': np.random.normal(size=4),
                        'D': np.random.normal(size=4)},
                         index=[4, 5, 6, 7])


df3 = pd.DataFrame({'A':np.random.uniform(size=4),
                     'B': np.random.uniform(size=4),
                     'C': np.random.uniform(size=4),
                     'D': np.random.uniform(size=4)},
                     index=[8, 9, 10, 11])
In [382]:
df1
Out[382]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
In [383]:
df2
Out[383]:
A B C D
4 0.145048 -0.161068 0.356442 2.032841
5 0.528435 -0.612351 -0.166635 -0.803647
6 1.519046 0.843732 -0.230930 1.226998
7 2.545946 0.605684 2.961073 0.321538
In [384]:
df3
Out[384]:
A B C D
8 0.304821 0.558125 0.959540 0.918874
9 0.711238 0.652642 0.436962 0.390953
10 0.310293 0.856860 0.352815 0.175165
11 0.656995 0.743975 0.045016 0.322504
In [385]:
frames = [df1, df2, df3]

result = pd.concat(frames)
In [386]:
result
Out[386]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
4 0.145048 -0.161068 0.356442 2.032841
5 0.528435 -0.612351 -0.166635 -0.803647
6 1.519046 0.843732 -0.230930 1.226998
7 2.545946 0.605684 2.961073 0.321538
8 0.304821 0.558125 0.959540 0.918874
9 0.711238 0.652642 0.436962 0.390953
10 0.310293 0.856860 0.352815 0.175165
11 0.656995 0.743975 0.045016 0.322504
In [387]:
df4 = pd.DataFrame({'B': np.random.uniform(size=4),
                 'D':np.random.normal(size=4),
                 'F': np.random.normal(size=4)},
                  index=[2, 3, 6, 7])
In [390]:
df4
Out[390]:
B D F
2 0.208523 0.729613 -0.850613
3 0.849497 -2.118052 0.457051
6 0.159290 0.675009 -1.333192
7 0.917792 -0.643039 0.201721
In [392]:
df1
Out[392]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
In [388]:
result = pd.concat([df1, df4], axis=1)
In [389]:
result
Out[389]:
A B C D B D F
0 0.373849 0.392643 0.236637 0.473750 NaN NaN NaN
1 0.841594 0.470335 0.788696 0.507633 NaN NaN NaN
2 0.684732 0.023441 0.592588 0.088391 0.208523 0.729613 -0.850613
3 0.932812 0.201915 0.231552 0.035624 0.849497 -2.118052 0.457051
6 NaN NaN NaN NaN 0.159290 0.675009 -1.333192
7 NaN NaN NaN NaN 0.917792 -0.643039 0.201721
In [393]:
result = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
In [394]:
result
Out[394]:
A B C D
0 0.373849 0.392643 0.236637 0.473750
1 0.841594 0.470335 0.788696 0.507633
2 0.684732 0.023441 0.592588 0.088391
3 0.932812 0.201915 0.231552 0.035624
4 0.145048 -0.161068 0.356442 2.032841
5 0.528435 -0.612351 -0.166635 -0.803647
6 1.519046 0.843732 -0.230930 1.226998
7 2.545946 0.605684 2.961073 0.321538
  • Merging based on key
In [396]:
left = pd.DataFrame({'key': ['A', 'G', 'E', 'C'],
                    'A': np.random.normal(size=4),
                     'B': np.random.normal(size=4)})

right = pd.DataFrame({'key': ['A', 'G', 'E', 'C'],
                      'C': np.random.normal(size=4),
                      'D': np.random.normal(size=4)})

result = pd.merge(left, right, on='key')
In [400]:
left
Out[400]:
A B key
0 0.312712 0.154416 A
1 -0.004686 0.757326 G
2 0.140112 -1.105813 E
3 0.585586 0.056549 C
In [401]:
right
Out[401]:
C D key
0 -0.006953 -0.206162 A
1 -1.506147 -0.227391 G
2 0.746571 -0.101790 E
3 1.295349 0.545622 C
In [402]:
result
Out[402]:
A B key C D
0 0.312712 0.154416 A -0.006953 -0.206162
1 -0.004686 0.757326 G -1.506147 -0.227391
2 0.140112 -1.105813 E 0.746571 -0.101790
3 0.585586 0.056549 C 1.295349 0.545622
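Here the keys match one-to-one, so every row survives. When the key columns only partially overlap, the `how` parameter of `pd.merge` controls which rows are kept. A small sketch with made-up keys (not part of the lab data):

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'y': [4, 5, 6]})

# Inner join (the default): only keys present in both frames
inner = pd.merge(left, right, on='key')

# Outer join: all keys from both frames, missing cells become NaN
outer = pd.merge(left, right, on='key', how='outer')

print(inner['key'].tolist())  # ['B', 'C']
print(len(outer))             # 4 rows: A, B, C, D
```

`how='left'` and `how='right'` similarly keep all keys from one side only.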

Descriptive Statistics

In [430]:
Merged.describe()
Out[430]:
       Per_Capita_Cons  Real_Price  Per_Capita_Income
count        91.000000   91.000000          91.000000
mean          0.251604    2.985879       14101.232747
std           0.014222    0.189700         577.904771
min           0.220848    2.580534       13239.510000
25%           0.241476    2.818134       13477.115000
50%           0.249624    3.066989       14195.660000
75%           0.258852    3.149477       14624.890000
max           0.292521    3.217425       15153.600000

Correlation and Covariance Matrix

In [440]:
returns= Merged.pct_change()
In [441]:
returns
Out[441]:
Per_Capita_Cons Real_Price Per_Capita_Income
1995-03-31 NaN NaN NaN
1995-04-30 -0.088065 0.025897 -0.004235
1995-05-31 0.010556 0.012771 -0.004253
1995-06-30 -0.048331 -0.006865 -0.004271
1995-07-31 0.066744 0.003884 0.002590
1995-08-31 -0.014426 -0.004534 0.002583
1995-09-30 -0.015499 -0.004294 0.002576
1995-10-31 0.047503 -0.006622 0.001610
1995-11-30 -0.018705 0.007299 0.001608
1995-12-31 0.060453 0.016259 0.001605
1996-01-31 0.035085 0.011480 -0.001558
1996-02-29 -0.117824 0.004200 -0.001560
1996-03-31 0.059415 0.008746 -0.001563
1996-04-30 -0.082940 0.027413 -0.001351
1996-05-31 0.002075 0.020915 -0.001353
1996-06-30 -0.070485 0.022661 -0.001355
1996-07-31 0.019683 0.017862 0.004707
1996-08-31 0.033136 -0.013083 0.004685
1996-09-30 0.037158 -0.006135 0.004663
1996-10-31 0.038412 0.004250 -0.000182
1996-11-30 0.010664 0.002774 -0.000182
1996-12-31 0.064247 -0.000399 -0.000182
1997-01-31 0.039709 -0.010075 0.000226
1997-02-28 -0.143779 0.008699 0.000226
1997-03-31 0.105674 -0.014108 0.000226
1997-04-30 -0.065007 -0.009998 0.002295
1997-05-31 -0.001197 -0.009287 0.002289
1997-06-30 -0.061609 -0.010089 0.002284
1997-07-31 0.036933 -0.019236 0.002238
1997-08-31 0.042711 -0.003209 0.002233
... ... ... ...
2000-04-30 -0.029687 0.007654 0.002001
2000-05-31 -0.031671 0.000706 0.001997
2000-06-30 -0.069189 0.005529 0.001993
2000-07-31 0.046668 -0.010958 0.002333
2000-08-31 0.034259 -0.000796 0.002328
2000-09-30 0.021191 -0.000315 0.002322
2000-10-31 0.016964 0.000844 -0.000680
2000-11-30 -0.009110 0.004206 -0.000680
2000-12-31 0.080434 -0.004056 -0.000681
2001-01-31 0.013404 -0.007513 -0.000936
2001-02-28 -0.112594 0.002642 -0.000937
2001-03-31 0.095206 -0.000015 -0.000938
2001-04-30 -0.071041 0.003043 -0.001255
2001-05-31 -0.008714 0.010503 -0.001256
2001-06-30 -0.074579 0.009372 -0.001258
2001-07-31 0.032898 -0.004374 0.008465
2001-08-31 0.015860 -0.008411 0.008394
2001-09-30 -0.025869 -0.001996 0.008324
2001-10-31 0.048402 -0.008253 -0.008849
2001-11-30 -0.027782 0.008198 -0.008928
2001-12-31 0.068820 -0.007545 -0.009008
2002-01-31 -0.051912 0.005888 0.008450
2002-02-28 -0.055962 0.007676 0.008379
2002-03-31 0.111183 -0.005288 0.008309
2002-04-30 -0.045084 -0.008578 0.002534
2002-05-31 -0.018114 0.011836 0.002527
2002-06-30 -0.043096 0.010366 0.002521
2002-07-31 0.036618 -0.014565 0.002637
2002-08-31 0.009868 0.000324 0.002630
2002-09-30 -0.011510 -0.017024 0.002623

91 rows × 3 columns

In [452]:
plt.figure()
plt.plot(returns["Real_Price"].cumsum(),"b")
plt.xlabel("Time")
plt.ylabel("Percentage change")
plt.title("Cumulative Percentage Change")
plt.show()
In [450]:
Merged.corr()
Out[450]:
                   Per_Capita_Cons  Real_Price  Per_Capita_Income
Per_Capita_Cons           1.000000   -0.154828          -0.047909
Real_Price               -0.154828    1.000000           0.806477
Per_Capita_Income        -0.047909    0.806477           1.000000
In [444]:
Merged.cov()
Out[444]:
                   Per_Capita_Cons  Real_Price  Per_Capita_Income
Per_Capita_Cons           0.000202   -0.000418          -0.393761
Real_Price               -0.000418    0.035986          88.412743
Per_Capita_Income        -0.393761   88.412743      333973.924309
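The two matrices are linked: each correlation is the covariance scaled by the product of the two standard deviations, corr(i, j) = cov(i, j) / (σᵢ σⱼ). A quick numerical check on toy random data (the Merged frame itself is not reproduced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=['a', 'b'])

# Rescale the covariance matrix by the outer product of standard deviations
corr_from_cov = df.cov() / np.outer(df.std(), df.std())

# The result matches pandas' own correlation matrix
assert np.allclose(corr_from_cov, df.corr())
```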
In [460]:
Merged["Per_Capita_Cons"].hist(bins=20)
plt.show()
In [466]:
Merged["Per_Capita_Cons"].hist(bins=20, density=True)
Merged["Per_Capita_Cons"].plot(kind="kde")
plt.show()
In [469]:
comp1=np.random.normal(Merged["Per_Capita_Cons"].mean(),Merged["Per_Capita_Cons"].std(),size=200)
values = Series(comp1)
In [473]:
Merged["Per_Capita_Cons"].hist(bins=20, density=True)
values.plot(kind="kde")
plt.show()
In [481]:
pd.plotting.scatter_matrix(Merged, diagonal="kde", alpha=0.5)
plt.show()