Python Worksheet

Description

ALL FILES WILL BE SENT ONCE QUESTION IS COMPLETED

This homework covers several regression topics, and will give you practice with the numpy and sklearn libraries in Python. It has both a coding and a writeup component.

Goals

In this homework you will:

Build linear regression models to serve as predictors from input data
Parse input data into feature matrices and target variables
Use cross validation to find the best regularization parameter for a dataset

Background

Before attempting the homework, please review the notes on linear regression. In addition to what is covered there, the following background may be useful:

CSV Processing in Python

Like .txt, .csv (comma-separated values) is a useful file format for storing data. In a CSV file, each line is a data record, and different fields of the record are separated by commas, making them two-dimensional data tables (i.e., records by fields). Typically, the first row and first column are headings for the fields and records.

Python’s pandas module helps manage two-dimensional data tables. We can read a CSV as follows:

import pandas as pd
data = pd.read_csv('data.csv')

To see a small snippet of the data, including the headers, we can write data.head(). Once we know which columns we want to use as features (say ‘A’, ‘B’, ‘D’) and which to use as a target variable (say ‘C’), we can build our feature matrix and target vector by referencing the header:

X = data[['A', 'B', 'D']]
y = data[['C']]

More details on Pandas can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Matrix Algebra in Python

Python offers computationally efficient functions for linear algebra operations through the numpy library. Suppose A_nested_list is a list of m lists, each having n numerical items. We can convert this to a numpy 2D array by calling A = np.array(A_nested_list).
Numpy will treat A as an m × n matrix. If we want to transpose A, we can write:

import numpy as np
AT = np.transpose(A)

or 

AT = A.T

if B is another m × n matrix, we can perform the matrix operation A^(T)B (where A^(T) denotes the tranpose of A) by writing:

ATB = np.matmul(np.transpose(A), B)

or 

ATB = A.T @ B

Note that this is different from the * symbol which will do an elementwise multiplication rather than matrix multiplication.
Also, note that if n = 1, i.e., A and B are both vectors with m elements, this operation takes the dot product between the vectors.

If A is a square n × n matrix, we can find its inverse (if it exists) with the following:

Ainv = np.linalg.inv(A)

Other useful matrix operations can be found here: https://docs.scipy.org/doc/numpy/reference/routines.linalg.html

Linear Regression in Python

Python offers several standard machine learning models with optimized implementations in the sklearn library. Suppose we have a feature matrix X and a target variable vector y. To train a standard linear regression model, we can write:

from sklearn.linear_model import LinearRegression
model_lin = linear_model.LinearRegression(fit_intercept = True)
model_lin.fit(X, y)

Then, if we have a feature matrix Xn of new samples, we can predict the target variables (if we know the model is performing well) by applying the trained model:

yn = model_lin.predict(Xn)

And we can view the parameters of the model by writing:

model_lin.get_params()

There are also a few different versions of regularized linear regression models in sklearn. One of the most common is ridge regression, which has a single regularization parameter λ. To train with λ = 0.2 (named alpha in sklearn), for instance, we can write:

from sklearn.linear_model import Ridge
model_ridge = linear_model.Ridge(alpha = 0.2, fit_intercept = True)
model_ridge.fit(X, y)

More regression models in Python can be found here: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Instructions

Setting up your repository

Click the link on Piazza to set up your repository for Homework 7, then clone it. Aside from this readme, the repository should contain the following files:

plotfit.py, a starter file with functions, instructions, and a skeleton that you will fill out in Problem 1.
poly.txt, a data file for Problem 1 where each row is a datapoint in the format: x y, with x being the explanatory and y being the target variable.
regularize-cv.py, a starter file with functions, instructions, and a skeleton that you will fill out in Problem 2.
AAPL.csv, a data file for Problem 2. It contains historical stock data for Apple Inc. (AAPL) from 12-12-1980 to 02-27-2022. The goal is to forecast next day’s closing stock price.
GOOG.csv, historical stock data for Google over past five years. You would need this for completing the part 8 of Problem 2.
problem1_writeup_sample.pdf and problem2_writeup_sample.pdf are there to give you an idea as to what’s expected from your solution PDFs, i.e, problem1_writeup.pdf and problem2_writeup.pdf. Only PDF file format will be accepted. There will be a penalty if your submission includes any other file format.

Problem 1: Polynomial regression

A common misconception is that linear regression can only be used to fit a linear relationship. We can fit more complicated functions of the explanatory variables by defining new features that are functions of the existing features. A common class of models is the polynomial, with a d-th degree polynomial being of the form

with the d + 1 parameters β = (a_d, ..., a_1, b)^T. So d = 1 corresponds to a line, d = 2 to a quadratic, d = 3 to a cubic, and so forth.

In this problem, you will build a series of functions that fit polynomials of different degrees to a dataset. You will then use this to determine the best fit to a dataset by comparing the models from different degrees visually against a scatterplot of the data, and make a prediction for an unseen sample.

More specifically:

(IMPORTANT NOTE: Your solutions need to be written in a writeup for problem 1 and 2. Only PDF file format will be accepted. There will be a penalty if your submission includes any other file format.)

Complete the functions in polyfit.py, which accepts as input a dataset to be fit and polynomial degrees to be tried, and outputs a list of fitted models. The specifications for the main, feature matrix, and least_squares functions are contained as comments in the skeleton code. The key steps are parsing the input data, creating the feature matrix, and solving the least squares equations.
Use your completed polyfit.py to find fitted polynomial coefficients for d = 1,2,3,4,5 on the poly.txt dataset. Write out the resulting estimated functions for each d.
Use the scatter and plot functions in the matplotlib.pyplot module to visualize the dataset and these fitted models on a single graph (i.e., for each x, plot ). Make sure to vary colors and include a legend so that each curve can be distinguished. What degree polynomial does the relationship seem to follow? Explain.
If we measured a new datapoint x = 2, what would be the predicted value of y (based on the polynomial identified as the best fit in Question 3)?

Note that in this problem, you are not permitted to use the sklearn library. You must use matrix operations in numpy to solve the least squares equations.

Once you have completed polyfit.py, if you run the test case provided, it should output:

[array([ -2.11108454,  28.50662487, 112.31481224]), array([-1.51249835e-02,  1.75412364e+00, -1.08212257e+00, -2.55843975e-01, 1.00914532e+02])]

Note: if your output looks slightly different (missing array(...) or margnially different decimals) that’s ok. Your output should at least have the same shape as this expected output, and your decimals should agree to at least 3 decimal places.

Problem 2: Regularized regression

Regularization techniques like ridge regression introduce an extra model parameter, namely, the regularization parameter λ. To determine the best value of λ for a given dataset, we often employ cross validation, where we compare the error of the trained model with different values of λ on a test set, and choose the one yielding lowest error.

In this problem, you will complete the starter code in regularize-cv.py that employs cross validation in selecting the best combination of model parameters β and regularization parameter λ for a predictor on a given dataset. We use the AAPL.csv dataset (https://finance.yahoo.com/quote/AAPL/history?p=AAPL) here. It has Apple’s stock history from 12-11-1980 to 02-27-2022 containing 10391 rows and seven columns corresponding to each day’s stock price attributes (Date, Open, High, Low, Close, Adj Close, and Volume). We’re interested in forecasting the next day’s Closing (Close) Price.

Note that we don’t need the date column for forecasting the stock prices.

The prediction values are generated by shifting the values in Close column by 1 because we’re interested in forecasting the next day’s closing price.
We have already provided the code for preparing the input and output data as shown below.

#step 1 : read csv
df = pd.read_csv('AAPL.csv')

#step 2: identify the column(s) we want to remove
remove_features = ['Date']

#step 3: create extra column for prediction by shifting
#rows of `Close` columns by one to obtain next day's closing price 
df['Prediction'] = pd.Series(np.append(df['Close'][1:].to_numpy(), [0]))

# step 4: drop the last row because it would have invalid value after the shift.
df.drop(df.tail(1).index, inplace=True)

# step 5: remove the columns identified in step 2
df.drop(remove_features, axis=1, inplace=True)

# step 6: create X by dropping the `Prediction` column
X = np.array(df.drop(['Prediction'], axis=1))

# step 7: Store `Prediction` column in Y array
y = np.array(df['Prediction'])

Another important thing to note is that stock market data is a time series data. Therefore, we do not want to shuffle while splitting the data into training and testing sets. We can create the training and testing sets without shuffling using the following call

[X_train, X_test, y_train, y_test] = train_test_split(X, y, test_size=0.2, shuffle=False)

From the input data, you will train a ridge regression model on these six input attributes, i.e., excluding Date for different values of λ, find the best, and use the result to forecast the next day’s closing price given a set of input features describing it.

More specifically:

(IMPORTANT NOTE:Your solutions need to be written in a writeup for problem 1 and 2. Only PDF file format will be accepted. There will be a penalty if your submission includes any other file format.)

Complete the function normalize_train that takes the training set X_train as input and returns a normalized feature matrix along with arrays of the means and standard deviations for each column. (The means and standard deviations for each column are needed for properly normalizing the testing data.) The six columns of this matrix, X = [x_1 x_2 ... x_6], must each be normalized to have a mean of 0 and a standard deviation of 1. Recall that this can be accomplished, for each column, by calculating the column’s mean and standard deviation and then subtracting that mean from each element in the column and dividing the result by the column’s standard deviation.
Complete the function normalize_test that takes the testing set X_test and the training set means and standard deviations and returns a normalized feature matrix. Each column should subtract the mean of the corresponding column from X_train and divide by the standard deviation of the corresponding column from X_train. For example, each element in the first column of X_test should subtract the mean of the first column of X_train and divide by the standard deviation of the first column of X_train.
Define the range of λ to test in get_lambda_range as [1E−1.00, 1E−0.94, 1E−0.88, ..., 1E2.88, 1E2.94, 1E3.00]. This type of logarithmic scale is common for regularization. You should use the numpy function np.logspace to define this array (Hint: Use 51 points as num). See the documentation on np.logspace here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.logspace.html.
Complete the function train_model to fit a ridge regression model with regularization parameter λ = l on a training dataset X_train, y_train. You may use the Ridge class in sklearn to do this. Note that the partition of the training and testing set has already been done for you in the main function.
Complete the function error to calculate the mean squared error of the model on a testing dataset X_test, y_test.
Complete the code in main for plotting the mean squared error as a function of λ, and for finding the model and mse corresponding to the best lmbda(‘lambda’ is a reserved keyword in Python, therefore we need to name it in a different way). Make sure to include a title and axes labels with your plot. Use the Ridge regression model train_model.
Using the coefficients (and intercept) β = (a_1, a_2, ..., a_6, b)^T from the returned model_best, write out the equation of your fitted model for a sample x.
Now let’s put our model to test. Load GOOG.csv (contains historical stock data for Google over the past five years), similar to AAPL.csv as shown before. Let’s call the input features and output array as X and y. Note that we don’t need to split this data. Normalize X similar to X_test and call y_hat = model.predict(X) It’s important to normalize the X. Plot y and y_hat on the same plot and make sure to vary colors and include a legend so that each curve can be distinguished.

Once you have completed regularize-cv.py, if you set lmbda = [1, 3000], your output message should be:

Best lambda tested is 1, which yields an MSE of 6.0277...

Note on precision

For Problem 1 and 2 your answer is expected to be correct up to three decimal points.

What to Submit

For each problem, you must submit (i) your completed version of the starter code, and (ii) a writeup as a separate PDF document named problem1_writeup.pdf and problem2_writeup.pdf respectively.

Submitting your Code

Push your completed polyfit.py, problem1_writeup.pdf, regularize-cv.py, and problem2_writeup.pdf to your repository before the deadline.



^{Have a similar assignment? "Place an order for your assignment and have exceptional work written by our team of experts, guaranteeing you A results."}