MNIST Data Set and The Pima Data Set Machine Learning Questions

Description

For the MNIST data set and the Pima data set, you must create the following ML models and complete Parts 3, 4, and 5 (and Part 2, if needed) separately for each model. Part 1 can be common to all models. The models are as follows: (1) a k-NN classifier written from scratch that uses the training data to build a classifier, together with an evaluation and report of the classifier's performance; (2) an SVM model; (3) a Decision Tree classifier; (4) a Random Forest classifier. Do NOT use machine learning packages for the k-NN portion of this task. You are only permitted to use existing tools for simple linear algebra. For the other models, you can use scikit-learn or any other ML package.
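As an illustration of what "from scratch, using only linear algebra tools" could look like for k-NN, here is a minimal brute-force sketch in NumPy. It is not a reference solution: the function name `knn_predict`, the toy arrays, and the tie-breaking rule (smallest label wins) are assumptions made for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels for X_test by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )
    # Indices of the k smallest distances for every test row
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for row in y_train[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])  # assumption: ties go to the smallest label
    return np.array(preds)

# Toy data (hypothetical, not MNIST or Pima): two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.2, 0.1], [5.5, 5.5]])

knn_preds = knn_predict(X_train, y_train, X_test, k=3)
print(knn_preds)
```

The brute-force approach compares every test point with every training point, so its cost grows with the product of the two set sizes and the number of features. That is worth keeping in mind when timing runs on the full MNIST data.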

1. (10 points) Perform the analysis of the data and create some visualizations (for images, a few examples from each category; for other data, perhaps some scatter plots or histograms that show a big picture of the data).

2. (10 points-kNN, 5 points-SVM, 5 points-Decision Tree, 5 points-Random Forest Classifier) Describe any data pre-processing or feature engineering that you did. Also discuss the train/test split ratio and the hyper-parameters that you tuned.

3. (5 points-kNN, 5 points-SVM, 5 points-Decision Tree, 5 points-Random Forest Classifier) Show the accuracy of your algorithm by using the classification metrics discussed in class, and justify your reason for using each metric. Your metrics should remain the same for all your models, but you can use different metrics for different data sets. Sample metrics are as follows:
- In the case of the Pima data set, show accuracy with tables showing false positives, false negatives, true positives, and true negatives.
- In the case of the MNIST digits, show the complete confusion matrix. Choose a single digit to measure accuracy and show how that number varies as a function of K.

4. (5 points) Describe the run-time of your algorithms and also share the actual “wall-clock” time that it took to compute your results.

5. (10 points) Describe the impact of an imbalanced data set, the presence of outliers, and missing values on each of the ML algorithms you used. Discuss whether these factors played any role in your analysis.
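For the metric and timing parts above, the bookkeeping could be sketched as follows. This is only an illustration under stated assumptions: the helper names (`confusion_matrix`, `binary_counts`, `per_class_recall`, `timed`) and the toy labels are hypothetical, class 1 is treated as the positive class, and per-class recall is just one possible reading of "accuracy for a single digit".

```python
import time
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def binary_counts(cm):
    """TN/FP/FN/TP from a 2x2 matrix, treating class 1 as positive (an assumption)."""
    return {"tn": int(cm[0, 0]), "fp": int(cm[0, 1]),
            "fn": int(cm[1, 0]), "tp": int(cm[1, 1])}

def per_class_recall(cm, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    return cm[label, label] / cm[label].sum()

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy labels (hypothetical): 0 = negative, 1 = positive
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm, seconds = timed(confusion_matrix, y_true, y_pred, 2)
counts = binary_counts(cm)
recall_1 = per_class_recall(cm, 1)

print(cm)
print(counts)
print(f"recall(class 1) = {recall_1:.3f}, elapsed = {seconds:.6f} s")
```

The same `confusion_matrix` helper works for the ten MNIST classes with `n_classes=10`. Repeating the prediction step for several values of K and recording `per_class_recall` for one chosen digit gives the "function of K" curve, and wrapping each model's fit and predict calls in `timed` gives the wall-clock figures.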

Please use the Portuguese data set (student-por.csv) in the provided link for this assignment. This data set contains 649 instances and 30 features. Create the following ML models: (1) a linear regression model from scratch, (2) an SVM model, (3) a Random Forest model, and (4) a Decision Tree model. Use each of them on this data set to predict the value of the final variable G3, the final grade for each student. Part 1 can be the same for all the models, but Parts 2 and 3 need to be done separately for each model. Do NOT use machine learning packages for the Linear Regression portion of the assignment. You are only permitted to use existing tools for simple linear algebra. For the other three models (SVM, Decision Tree, and Random Forest), you can use ML packages and libraries.
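One way the from-scratch linear regression could be set up with nothing beyond linear algebra is ordinary least squares. The sketch below assumes that `np.linalg.lstsq` counts as a permitted "simple linear algebra" tool. The function names and toy numbers are hypothetical and are not taken from student-por.csv.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares: returns weights [intercept, w_1, ..., w_d]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first weight acts as the intercept
    Xb = np.column_stack([np.ones(len(X)), X])
    # Solve min ||Xb w - y||^2; lstsq also copes with rank-deficient Xb
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(X, w):
    """Apply the fitted weights to new rows."""
    X = np.asarray(X, dtype=float)
    return w[0] + X @ w[1:]

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2
X_toy = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
y_toy = np.array([1, 3, 4, 6, 8], dtype=float)

w = fit_linear_regression(X_toy, y_toy)
y_hat = predict(X_toy, w)
print(np.round(w, 6))
```

If `lstsq` is considered too high-level for the assignment, the same weights can be obtained by solving the normal equations or by gradient descent. The least-squares solver is used here because it stays stable when features are collinear.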

1. (10 points) Some of the variables in this data set are categorical and some of them are numeric. How can we encode the categorical variables for the linear regression process? Please describe your approach to encoding categorical values and apply it to the data set in your code.

2. (10 points-Linear Reg., 5 points-SVM, 5 points-Decision Tree, 5 points-Random Forest) Experiment by using different groups of features during training. What features work well in predicting a student’s final score? What features work poorly? Why might you use or not use certain features? Calculate mean squared error scores for your ML models using at least two different groups of features, and compare the performance of the feature groups with each other.

3. (5 points-Linear Reg., 5 points-SVM, 5 points-Decision Tree, 5 points-Random Forest) Perform linear regression using all available features. Use mean squared error or any other metric to report the ability of your model to fit the data, and justify your choice. How does this approach compare to the groups of features you selected?
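The encoding and feature-group questions above could be explored along the following lines. This is a hedged sketch on synthetic stand-in data, not the student data set: the column names (`job`, `hours`, `noise_feature`), the helper names, and the 150/50 hold-out split are all assumptions. One-hot encoding with the first category dropped is only one possible encoding choice.

```python
import numpy as np

def one_hot(values, drop_first=True):
    """One-hot encode a 1-D array of category labels.

    Dropping the first category avoids perfect collinearity with the
    intercept column when the result is used in linear regression.
    """
    values = np.asarray(values)
    categories = np.unique(values)  # sorted unique labels
    encoded = (values[:, None] == categories[None, :]).astype(float)
    if drop_first:
        return encoded[:, 1:], categories[1:]
    return encoded, categories

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def holdout_mse(X, y, n_train):
    """Fit least squares on the first n_train rows, report MSE on the rest."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Xb[:n_train], y[:n_train], rcond=None)
    return mse(y[n_train:], Xb[n_train:] @ w)

# Small deterministic encoding example
small_encoded, small_columns = one_hot(["teacher", "health", "other", "health"])

# Synthetic stand-in data (NOT the student data set)
rng = np.random.default_rng(0)
n = 200
job = rng.choice(["health", "other", "teacher"], size=n)
hours = rng.normal(size=n)
noise_feature = rng.normal(size=n)  # unrelated to the target by construction
y = 10.0 + 3.0 * hours + 2.0 * (job == "teacher") + 0.1 * rng.normal(size=n)

job_encoded, job_columns = one_hot(job)
group_a = np.column_stack([hours, job_encoded])  # informative features
group_b = noise_feature[:, None]                 # uninformative feature

scores = {
    "group_a": holdout_mse(group_a, y, n_train=150),
    "group_b": holdout_mse(group_b, y, n_train=150),
}
print(scores)
```

Because the synthetic target is built from `hours` and `job`, the first group should score a much lower held-out MSE than the second. On the real data the comparison is the interesting part, and the same `holdout_mse` call can be repeated with all features to answer the last question.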

Data sets: The project will explore three data sets: the famous MNIST data set of pictures of handwritten numbers, a data set that explores the prevalence of diabetes in a Native American tribe named the Pima, and a data set that examines student achievement in secondary education in two Portuguese schools. You can access the data sets here:
1. https://www.kaggle.com/c/digit-recognizer/data
2. https://www.kaggle.com/uciml/pima-indians-diabetes…
3. https://archive.ics.uci.edu/ml/machine-learning-da…


