Renaming columns with dplyr in R, and an old-fashioned regression model

2023-03-30 21:19:00

This example also comes from Machine Learning, a Columbia engineering course taught by a principal investigator at IBM.

In 2017, the run-down little cinema near the school re-screened Emma Watson's Regression (titled Retrospective in Chinese; the movie is clichéd and mediocre), and the model we use this time is also called regression...

Background: the goal is to study the correlation between prostate-specific antigen (PSA) levels and several clinical measures in men who are about to undergo radical prostatectomy (yes, this data is seriously hardcore).

The dataset comes from Stamey et al. (1989) at Stanford University.

data: https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data

description: https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.info.txt

The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). (Note: svi is a binary variable and gleason is an ordered categorical variable.)

Now let's extract the data in columns 2 to 9. The first column is just the observation ID and carries no information, and the 10th column is the "training set" flag. We then need to combine the data into proper training and test sets.
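The split described above can be sketched roughly as follows. This is only an illustration: the data frame is a toy stand-in with made-up values, the `id` column name is an assumption (the real file leaves that header blank), and the columns are picked by name rather than by position to avoid depending on exactly how the file was read in.

```r
# Toy stand-in for prostate.data. The values are made up; only the column
# names (lcavol ... pgg45, lpsa, train) follow the real file.
set.seed(1)
n <- 10
prostate <- data.frame(
  id      = seq_len(n),                          # observation ID (assumed name)
  lcavol  = rnorm(n),
  lweight = rnorm(n),
  age     = sample(40:80, n, replace = TRUE),
  lbph    = rnorm(n),
  svi     = rbinom(n, 1, 0.3),                   # binary variable
  lcp     = rnorm(n),
  gleason = sample(6:9, n, replace = TRUE),
  pgg45   = sample(seq(0, 100, 10), n, replace = TRUE),
  lpsa    = rnorm(n),                            # response
  train   = rep(c(TRUE, FALSE), length.out = n)  # training-set flag
)

predictors <- c("lcavol", "lweight", "age", "lbph",
                "svi", "lcp", "gleason", "pgg45")

# Keep the eight predictors plus the response; drop the ID and the flag.
keep <- c(predictors, "lpsa")

# Rows flagged as training go to one set, the rest to the other.
train_set <- prostate[prostate$train, keep]
test_set  <- prostate[!prostate$train, keep]

dim(train_set)
dim(test_set)
```

With the real file, the toy data frame would be replaced by the result of reading the data URL above, for example with `read.table(..., header = TRUE)`.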

Be careful: renaming with the rename function in R's dplyr package is error-prone here. We need to give column 9 of the combined data a name so that it can serve as our dependent variable. When the combined data is converted to a data frame, column 9 is automatically given the name "V9", which is quite convenient because rename then has an existing name to refer to.

If we skip as.data.frame, column 9 has no name (NULL) and dplyr's rename function fails with an error. What follows is ordinary, old-fashioned regression training, which will not be covered here.
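The renaming pitfall can be reproduced with a small sketch. It assumes the combined data is a plain numeric matrix without column names (the values here are made up), and it uses `lpsa`, the response name from the dataset, as the new name for column 9; the post itself does not say which name was chosen.

```r
library(dplyr)

# Toy matrix standing in for the combined data: 8 predictors plus the
# response in column 9. Values are made up; there are no column names.
set.seed(1)
combined <- matrix(rnorm(27), nrow = 3, ncol = 9)
is.null(colnames(combined))   # column 9 (like all the others) has no name

# Converting to a data frame assigns the default names V1 ... V9.
combined_df <- as.data.frame(combined)
names(combined_df)

# rename() takes new_name = old_name, so V9 becomes lpsa.
renamed <- rename(combined_df, lpsa = V9)
names(renamed)

# Without as.data.frame(), rename() is handed a plain matrix and fails.
# tryCatch() captures the error message so the script keeps running.
failure <- tryCatch(
  rename(combined, lpsa = V9),
  error = function(e) conditionMessage(e)
)
failure
```

An alternative that avoids the detour through the default names is to set the names directly, for example with `colnames(combined)[9] <- "lpsa"` before converting, but that is a different route from the one described here.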