Python Worksheet
Description
ALL FILES WILL BE SENT ONCE QUESTION IS COMPLETED
This homework covers several regression topics, and will give you practice with the numpy and sklearn libraries in Python. It has both a coding and a writeup component.
Goals
In this homework you will:
- Build linear regression models to serve as predictors from input data
- Parse input data into feature matrices and target variables
- Use cross validation to find the best regularization parameter for a dataset
Background
Before attempting the homework, please review the notes on linear regression. In addition to what is covered there, the following background may be useful:
CSV Processing in Python
Like .txt
, .csv
(comma-separated values) is a useful file format for storing data. In a CSV file, each line is a data record, and different fields of the record are separated by commas, making them two-dimensional data tables (i.e., records by fields). Typically, the first row and first column are headings for the fields and records.
Python’s pandas module helps manage two-dimensional data tables. We can read a CSV as follows:
import pandas as pd
data = pd.read_csv('data.csv')
To see a small snippet of the data, including the headers, we can write data.head()
. Once we know which columns we want to use as features (say ‘A’, ‘B’, ‘D’) and which to use as a target variable (say ‘C’), we can build our feature matrix and target vector by referencing the header:
X = data[['A', 'B', 'D']]
y = data[['C']]
More details on Pandas can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Matrix Algebra in Python
Python offers computationally efficient functions for linear algebra operations through the numpy library. Suppose A_nested_list
is a list of m lists, each having n numerical items. We can convert this to a numpy 2D array by calling A = np.array(A_nested_list)
.
Numpy will treat A
as an m × n matrix. If we want to transpose A, we can write:
import numpy as np
AT = np.transpose(A)
or
AT = A.T
if B is another m × n matrix, we can perform the matrix operation A^(T)B (where A^(T) denotes the tranpose of A) by writing:
ATB = np.matmul(np.transpose(A), B)
or
ATB = A.T @ B
Note that this is different from the *
symbol which will do an elementwise multiplication rather than matrix multiplication.
Also, note that if n = 1, i.e., A and B are both vectors with m elements, this operation takes the dot product between the vectors.
If A is a square n × n matrix, we can find its inverse (if it exists) with the following:
Ainv = np.linalg.inv(A)
Other useful matrix operations can be found here: https://docs.scipy.org/doc/numpy/reference/routines.linalg.html
Linear Regression in Python
Python offers several standard machine learning models with optimized implementations in the sklearn library. Suppose we have a feature matrix X
and a target variable vector y
. To train a standard linear regression model, we can write:
from sklearn.linear_model import LinearRegression
model_lin = linear_model.LinearRegression(fit_intercept = True)
model_lin.fit(X, y)
Then, if we have a feature matrix Xn
of new samples, we can predict the target variables (if we know the model is performing well) by applying the trained model:
yn = model_lin.predict(Xn)
And we can view the parameters of the model by writing:
model_lin.get_params()
There are also a few different versions of regularized linear regression models in sklearn. One of the most common is ridge regression, which has a single regularization parameter λ. To train with λ = 0.2 (named alpha
in sklearn), for instance, we can write:
from sklearn.linear_model import Ridge
model_ridge = linear_model.Ridge(alpha = 0.2, fit_intercept = True)
model_ridge.fit(X, y)
More regression models in Python can be found here: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
Instructions
Setting up your repository
Click the link on Piazza to set up your repository for Homework 7, then clone it. Aside from this readme, the repository should contain the following files:
plotfit.py
, a starter file with functions, instructions, and a skeleton that you will fill out in Problem 1.poly.txt
, a data file for Problem 1 where each row is a datapoint in the format: x y, with x being the explanatory and y being the target variable.regularize-cv.py
, a starter file with functions, instructions, and a skeleton that you will fill out in Problem 2.AAPL.csv
, a data file for Problem 2. It contains historical stock data for Apple Inc. (AAPL) from 12-12-1980 to 02-27-2022. The goal is to forecast next day’s closing stock price.GOOG.csv
, historical stock data for Google over past five years. You would need this for completing the part 8 of Problem 2.problem1_writeup_sample.pdf
andproblem2_writeup_sample.pdf
are there to give you an idea as to what’s expected from your solution PDFs, i.e,problem1_writeup.pdf
andproblem2_writeup.pdf
. Only PDF file format will be accepted. There will be a penalty if your submission includes any other file format.
Problem 1: Polynomial regression
A common misconception is that linear regression can only be used to fit a linear relationship. We can fit more complicated functions of the explanatory variables by defining new features that are functions of the existing features. A common class of models is the polynomial, with a d-th degree polynomial being of the form
with the d + 1
parameters β = (a_d, ..., a_1, b)^T
. So d = 1
corresponds to a line, d = 2
to a quadratic, d = 3
to a cubic, and so forth.
In this problem, you will build a series of functions that fit polynomials of different degrees to a dataset. You will then use this to determine the best fit to a dataset by comparing the models from different degrees visually against a scatterplot of the data, and make a prediction for an unseen sample.
More specifically:
(IMPORTANT NOTE: Your solutions need to be written in a writeup for problem 1 and 2. Only PDF file format will be accepted. There will be a penalty if your submission includes any other file format.)
- Complete the functions in
polyfit.py
, which accepts as input a dataset to be fit and polynomial degrees to be tried, and outputs a list of fitted models. The specifications for themain
,feature matrix
, andleast_squares
functions are contained as comments in the skeleton code. The key steps are parsing the input data, creating the feature matrix, and solving the least squares equations. - Use your completed
polyfit.py
to find fitted polynomial coefficients ford = 1,2,3,4,5
on thepoly.txt
dataset. Write out the resulting estimated functions for each d. - Use the
scatter
andplot
functions in thematplotlib.pyplot
module to visualize the dataset and these fitted models on a single graph (i.e., for each x, plot ). Make sure to vary colors and include a legend so that each curve can be distinguished. What degree polynomial does the relationship seem to follow? Explain. - If we measured a new datapoint x = 2, what would be the predicted value of y (based on the polynomial identified as the best fit in Question 3)?
Note that in this problem, you are not permitted to use the sklearn library. You must use matrix operations in numpy
to solve the least squares equations.
Once you have completed polyfit.py
, if you run the test case provided, it should output:
[array([ -2.11108454, 28.50662487, 112.31481224]), array([-1.51249835e-02, 1.75412364e+00, -1.08212257e+00, -2.55843975e-01, 1.00914532e+02])]
Note: if your output looks slightly different (missing
array(...)
or margnially different decimals) that’s ok. Your output should at least have the same shape as this expected output, and your decimals should agree to at least 3 decimal places.
Problem 2: Regularized regression
Regularization techniques like ridge regression introduce an extra model parameter, namely, the regularization parameter λ. To determine the best value of λ for a given dataset, we often employ cross validation, where we compare the error of the trained model with different values of λ on a test set, and choose the one yielding lowest error.
In this problem, you will complete the starter code in regularize-cv.py
that employs cross validation in selecting the best combination of model parameters β and regularization parameter λ for a predictor on a given dataset. We use the AAPL.csv
dataset (https://finance.yahoo.com/quote/AAPL/history?p=AAPL) here. It has Apple’s stock history from 12-11-1980 to 02-27-2022 containing 10391 rows and seven columns corresponding to each day’s stock price attributes (Date, Open, High, Low, Close, Adj Close, and Volume). We’re interested in forecasting the next day’s Closing (Close
) Price.
Note that we don’t need the date column for forecasting the stock prices.
The prediction
values are generated by shifting the values in Close
column by 1 because we’re interested in forecasting the next day’s closing price.
We have already provided the code for preparing the input and output data as shown below.
#step 1 : read csv
df = pd.read_csv('AAPL.csv')
#step 2: identify the column(s) we want to remove
remove_features = ['Date']
#step 3: create extra column for prediction by shifting
#rows of `Close` columns by one to obtain next day's closing price
df['Prediction'] = pd.Series(np.append(df['Close'][1:].to_numpy(), [0]))
# step 4: drop the last row because it would have invalid value after the shift.
df.drop(df.tail(1).index, inplace=True)
# step 5: remove the columns identified in step 2
df.drop(remove_features, axis=1, inplace=True)
# step 6: create X by dropping the `Prediction` column
X = np.array(df.drop(['Prediction'], axis=1))
# step 7: Store `Prediction` column in Y array
y = np.array(df['Prediction'])
Another important thing to note is that stock market data is a time series data. Therefore, we do not want to shuffle while splitting the data into training and testing sets. We can create the training and testing sets without shuffling using the following call
[X_train, X_test, y_train, y_test] = train_test_split(X, y, test_size=0.2, shuffle=False)
From the input data, you will train a ridge regression model on these six input attributes, i.e., excluding Date
for different values of λ, find the best, and use the result to forecast the next day’s closing price given a set of input features describing it.
More specifically:
(IMPORTANT NOTE:Your solutions need to be written in a writeup for problem 1 and 2. Only PDF file format will be accepted. There will be a penalty if your submission includes any other file format.)
- Complete the function
normalize_train
that takes the training setX_train
as input and returns a normalized feature matrix along with arrays of the means and standard deviations for each column. (The means and standard deviations for each column are needed for properly normalizing the testing data.) The six columns of this matrix,X = [x_1 x_2 ... x_6]
, must each be normalized to have a mean of 0 and a standard deviation of 1. Recall that this can be accomplished, for each column, by calculating the column’s mean and standard deviation and then subtracting that mean from each element in the column and dividing the result by the column’s standard deviation. - Complete the function
normalize_test
that takes the testing setX_test
and the training set means and standard deviations and returns a normalized feature matrix. Each column should subtract the mean of the corresponding column fromX_train
and divide by the standard deviation of the corresponding column fromX_train
. For example, each element in the first column ofX_test
should subtract the mean of the first column ofX_train
and divide by the standard deviation of the first column ofX_train
. - Define the range of λ to test in
get_lambda_range
as[1E−1.00, 1E−0.94, 1E−0.88, ..., 1E2.88, 1E2.94, 1E3.00]
. This type of logarithmic scale is common for regularization. You should use the numpy function np.logspace to define this array (Hint: Use 51 points as num). See the documentation on np.logspace here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.logspace.html. - Complete the function
train_model
to fit a ridge regression model with regularization parameter λ = l on a training datasetX_train
,y_train
. You may use theRidge
class insklearn
to do this. Note that the partition of the training and testing set has already been done for you in themain
function. - Complete the function
error
to calculate the mean squared error of the model on a testing datasetX_test
,y_test
. - Complete the code in main for plotting the mean squared error as a function of λ, and for finding the
model
andmse
corresponding to the bestlmbda
(‘lambda’ is a reserved keyword in Python, therefore we need to name it in a different way). Make sure to include a title and axes labels with your plot. Use the Ridge regression modeltrain_model
. - Using the coefficients (and intercept)
β = (a_1, a_2, ..., a_6, b)^T
from the returnedmodel_best
, write out the equation of your fitted model for a sample x. - Now let’s put our model to test. Load
GOOG.csv
(contains historical stock data for Google over the past five years), similar toAAPL.csv
as shown before. Let’s call the input features and output array asX
andy
. Note that we don’t need to split this data. NormalizeX
similar toX_test
and cally_hat = model.predict(X)
It’s important to normalize theX
. Ploty
andy_hat
on the same plot and make sure to vary colors and include a legend so that each curve can be distinguished.
Once you have completed regularize-cv.py
, if you set lmbda = [1, 3000]
, your output message should be:
Best lambda tested is 1, which yields an MSE of 6.0277...
Note on precision
For Problem 1 and 2 your answer is expected to be correct up to three decimal points.
What to Submit
For each problem, you must submit (i) your completed version of the starter code, and (ii) a writeup as a separate PDF document named problem1_writeup.pdf
and problem2_writeup.pdf
respectively.
Submitting your Code
Push your completed polyfit.py
, problem1_writeup.pdf
, regularize-cv.py
, and problem2_writeup.pdf
to your repository before the deadline.
Have a similar assignment? "Place an order for your assignment and have exceptional work written by our team of experts, guaranteeing you A results."