GCCCD Rapidminer Prediction Modeling Process Predictive Analytics
ASSIGNMENT: Modifying a RapidMiner Prediction Modeling Process / Exploring & Interpreting Predictive Analytics
Answer the following questions based on your learning and understanding from this module’s videos and supporting Excel documents.
IMPORTANT:
This assignment assumes you are STARTING from a completed RapidMiner process for the Meal Kit case. That is, you’ve already completed the tutorial videos successfully, and you now have a RapidMiner process file that you can further modify to answer the following question prompts. The completed process file you should have made is also available here (download it):
THE SOLUTION YOU ARE STARTING FROM WILL BE ATTACHED
PART 1:
Let’s imagine it has been 6 months since our MealKit company first built a prediction model for its “eco friendly” packaging. The company wants to re-run the model with NEWER data, since it is concerned that the data from 6 months ago has become a little stale and not reflective of their current customers.
The MealKit company has provided two new data files to use:
- New400_TrainMe_Data.csv – This is 400 NEW Customers who we marketed the “eco-friendly” packaging to, and we observed whether they did/didn’t subscribe
- New1000_PredictMe_Data.csv – This is 1,000 NEW Customers who we’d like to PREDICT whether or not they’re likely to subscribe to our “eco-friendly” packaging option.
TASK: Using the same completed process file as was built in the Meal Kit tutorial videos, update both the Training & Test *.csv files with the new ones provided here.
Q1.A. – Run the process file. Look at the “Lift Chart (Simple)” Results. According to this result, if we select the first 40% of customers that we are MOST confident will not subscribe, what % of ALL the non-subscribers will we have identified?
Q1.B. – Now look at the ModelSimulator Results. For someone who is a “Warm” lead, in Segment “2”, has Platform usage of 50, and an NPS_Score of 8, what is the % confidence we have that they WILL SUBSCRIBE and what % confidence do we have that they WON’T SUBSCRIBE?
Q1.C. – Now look at the Predictions made for Customer #3020 and #3030. What is the Confidence % for each that they WON’T SUBSCRIBE?
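To make the lift chart reading in Q1.A concrete, here is a minimal Python sketch of the underlying calculation: rank customers by the model's confidence that they will NOT subscribe, take the top 40%, and see what share of all actual non-subscribers falls in that group. The toy data, the "yes"/"no" labels, and the function name `cumulative_gain` are made up for illustration; this is not RapidMiner's own code and the numbers are not from the Meal Kit files.

```python
def cumulative_gain(rows, fraction, target="no"):
    """Share of all `target` cases captured in the top `fraction` of rows.

    rows: list of (confidence_for_target, actual_label) pairs.
    """
    # Rank customers from most to least confident for the target class.
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    cutoff = round(fraction * len(ranked))
    total_targets = sum(1 for _, label in ranked if label == target)
    if total_targets == 0:
        raise ValueError("no rows with the target label")
    captured = sum(1 for _, label in ranked[:cutoff] if label == target)
    return captured / total_targets


# Hypothetical toy data: (confidence of "no", actual outcome) for 10 customers.
toy = [
    (0.95, "no"), (0.90, "no"), (0.85, "yes"), (0.80, "no"), (0.70, "no"),
    (0.60, "yes"), (0.50, "yes"), (0.40, "no"), (0.30, "yes"), (0.20, "yes"),
]

gain_at_40 = cumulative_gain(toy, 0.40)
print(f"{gain_at_40:.0%} of all non-subscribers are in the top 40%")
```

In this toy example the top 4 of 10 customers contain 3 of the 5 actual non-subscribers, so the sketch reports 60%. The lift chart in RapidMiner presents the same kind of figure for your real data.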
PART 2:
RapidMiner can build all kinds of prediction models, not just Logistic Regression. Let’s swap out our Logistic Regression operator and try a new prediction model type: “Random Forest”. We haven’t used Decision Trees yet, so this type of prediction model is unfamiliar to us. Nonetheless, we can still use it to make predictions!
To do this easily, simply:
- RIGHT-CLICK on the “Logistic Regression” Operator.
- Select “Replace Operator” from the menu.
- Then, move through the folders to find: Modeling → Predictive → Trees → Random Forest
- The default settings should mostly be fine (check the screenshot below to verify), but you will need to set the Random Seed to “2005” to guarantee your results will match mine.
- IMAGE: Parameter Settings for the Random Forest Operator (screenshot)
- Number of Trees = 100
- criterion = gain_ratio
- maximal depth = 10
- DO NOT check “apply pruning,” “apply prepruning,” or “random splits”
- CHECK “guess subset ratio”
- voting strategy = confidence vote
- CHECK “use local random seed” and set the local random seed = 2005
Aside from the local random seed, all the other parameters for Random Forest should be, by default, the same as you see here.
TASK: With the Random Forest Operator swapped in and parameters set, re-run your process file. Let’s inspect the results.
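On why the seed setting matters: a random forest relies on random choices (for example, which rows each tree is trained on), so two runs only match if they make the same random draws. The short Python sketch below illustrates that idea with a seeded row resampling. It is an analogy only, assuming bootstrap-style row sampling; the function name `bootstrap_rows` is hypothetical and this is not how RapidMiner is implemented internally.

```python
import random


def bootstrap_rows(n_rows, seed):
    """Draw n_rows row indices with replacement, using a local seeded generator."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    return [rng.randrange(n_rows) for _ in range(n_rows)]


first_run = bootstrap_rows(400, seed=2005)
second_run = bootstrap_rows(400, seed=2005)
other_seed = bootstrap_rows(400, seed=1)

# Same seed: the two runs pick exactly the same rows.
print(first_run == second_run)
# Different seed: the rows picked will, in practice, differ.
print(first_run == other_seed)
```

The first comparison prints True, which is the reason everyone using seed 2005 on the same data and settings should see matching results.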
- Q2.A. – Run the process file. Look at the “Lift Chart (Simple)” Results. According to this result, if we select the first 40% of customers that we are MOST confident will not subscribe, what % of ALL the non-subscribers will we have identified?
- Q2.B. – Now look at the ModelSimulator Results. For someone who is a “Warm” lead, in Segment “2”, has Platform usage of 50, and an NPS_Score of 8, what is the % confidence we have that they WILL SUBSCRIBE and what % confidence do we have that they WON’T SUBSCRIBE?
- Q2.C. – Now look at the Predictions made for Customer #3020 and #3030. What is the Confidence % for each that they WON’T SUBSCRIBE?
- QUESTION Q2.D. [critical thinking] Notice that the prompts I asked you in questions A., B., and C. for Q1 and Q2 were the same. Were your ANSWERS the same? If you used the same DATA to make your predictions, what could possibly account for any differences? Answer using no more than 2-3 sentences.
PART 3: [“guardrails” off!]
The MealKit company has given us two NEW data files. These files contain all the same information we have previously used, but there are some additional columns of information as well:
- New2000_EvenMoreInfo_TrainMe.csv – This is 2,000 NEW Customers who we marketed the “eco-friendly” packaging to, and we observed whether they did/didn’t subscribe. Plus new information about the customers we haven’t had access to previously.
- New500_EvenMoreInfo_PredictMe.csv – This is 500 NEW Customers who we’d like to PREDICT whether or not they’re likely to subscribe to our “eco-friendly” packaging option. Plus new information about the customers we haven’t had access to previously.
You should inspect these new files (you can open them in Excel or read them into RapidMiner and check the summary statistics). There are some new columns of information, explained below:
- Twitter: Whether or not the customer is known to have an active Twitter account.
- Instagram: Whether or not the customer is known to have an active Instagram account.
- Acquisition: The way the customer reported they first heard about our company (PodcastPromo, FriendReferral, DirectMail, None, and Other)
- Refunds: Whether this customer has ever requested a refund for a meal kit they were displeased with (None, OneRefund, MoreThanOneRefund)
Tip: Since we have more columns of information, you’ll have to update the “data set meta data information” for both of the READ CSV operators! You need to do this yourself; I don’t have any screenshots or step-by-step instructions this time!
TASK: Using any prediction model of your choice (Logistic Regression, Random Forest, or any other model RapidMiner will allow you to use, if you’re feeling it!), build a prediction model. However, management says they want to keep the model (relatively) simple, so you aren’t allowed to use any more than FOUR of the available predictors in the dataset. To make sure your prediction model uses no more than four of the available predictors, you’ll need to figure out how to use the “Select Attributes” Operator in RapidMiner (we didn’t cover this in the videos!) in order to reduce the number of predictors that are used in the prediction model. Part of the challenge of this Task is to practice learning how to use Operators you are unfamiliar with. Fortunately, the Help files in RapidMiner are very good, and some very modest Googling will provide you with tips and examples of how to use this Operator.
Once you have set up your RapidMiner process file, you can explore / test as many different potential models as you wish, but you must settle on one final model. Once you do so, report the following:
- Describe the final prediction model and predictors you decided to use. Briefly explain your rationale for how you chose your final prediction model and the particular predictors you selected to use.
- Using available information, describe how well your model performs at predicting the outcome of interest.
- Using available information, describe a few types of customers you think should be targeted by the company based on your results.
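If it helps to see what the “no more than FOUR predictors” rule means outside of RapidMiner, the Python sketch below shows the same idea: keep an ID column and the outcome column, keep at most four predictor columns, and drop everything else. The column names (CustomerID, Lead, Segment, Platform, Subscribed), the sample rows, and the function `select_attributes` are assumptions made up for illustration; check the real headers in your CSV files. The four predictors chosen here are arbitrary and are not a recommendation.

```python
import csv
import io

# Hypothetical sample in the spirit of the new files; real headers may differ.
SAMPLE_CSV = """CustomerID,Lead,Segment,Platform,NPS_Score,Twitter,Instagram,Acquisition,Refunds,Subscribed
3001,Warm,2,50,8,Yes,No,PodcastPromo,None,yes
3002,Cold,1,12,5,No,No,DirectMail,OneRefund,no
"""


def select_attributes(rows, predictors, label, id_column, max_predictors=4):
    """Keep only the ID, the chosen predictors, and the label in each row."""
    if len(predictors) > max_predictors:
        raise ValueError(f"at most {max_predictors} predictors are allowed")
    keep = [id_column, *predictors, label]
    missing = [col for col in keep if rows and col not in rows[0]]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [{col: row[col] for col in keep} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
reduced = select_attributes(
    rows,
    predictors=["Lead", "NPS_Score", "Acquisition", "Refunds"],  # arbitrary choice
    label="Subscribed",
    id_column="CustomerID",
)
print(list(reduced[0].keys()))
```

Asking for five predictors raises an error in this sketch, which mirrors the constraint management has set. In RapidMiner itself, the “Select Attributes” Operator is where you make the equivalent choice.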