Use Rattle to Help You Learn R

A beginner’s guide

Dick Brown
Towards Data Science


From @thebossbaby on Giphy.com

No, not that kind of rattle, although it may seem like you need one sometimes.

The “rattle” package provides a GUI for R functionality that goes beyond what RStudio offers. Here’s what you get when you run rattle() from within R.

Rattle screenshot by author

Rattle is useful for many things. Want to run a quick-and-dirty model? It’s great. Want to see what your data look like? It’s good for that, too. But probably the most useful feature for novices trying to learn R is the “Log” tab. Every time you tell Rattle to do something, the Log tab records the code it used to do it. So you can use Rattle to do something in R, then look at the Log tab to see how it did it, and learn to do the same thing yourself.

Installation

I’m assuming at this point that you already have R installed, and maybe RStudio. You don’t need the latter for this, but it doesn’t hurt. To install Rattle, install the rattle package, which pulls in most of its dependencies. The RGtk2 package is the exception: you’ll need to install it separately right away. Then load the library and run Rattle.

install.packages("rattle")
install.packages("RGtk2")
library(rattle)
rattle()

The first time you run Rattle, it will ask you if you want to install GTK+ (yes, again). Select OK. Similar pop-ups will appear as you try to use certain features within Rattle, telling you that you’ll need to install the relevant package(s) before you can proceed.
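If you’d rather not reinstall packages you already have, you can check what’s missing first. This is only a minimal sketch: the helper name missing_packages is my own and is not part of Rattle, and the install line is commented out so that nothing gets downloaded by accident.

```r
# Hypothetical helper (not part of rattle): return the names in `pkgs`
# that are not found among the installed packages.
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

to_install <- missing_packages(c("rattle", "RGtk2"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```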

(I’ve heard that some people have issues with GTK+ on Macs. I use a PC, so I’ve never had a problem, but YMMV.)

Using Rattle

Before you can do anything, you’ll need to load in some data. Rattle comes with several built-in datasets. For this article, we’ll use the weather dataset. This is Rattle’s default dataset, which it loads if you don’t tell it to load something else. To use it, simply hit the “Execute” button (make sure you’re still on the Data tab). Rattle will notice that you haven’t specified a dataset, so it will ask you if you want to load the weather dataset. Say “Yes”.

Other options for datasets include loading from a file, loading from an ODBC database, loading a dataset from R that you’ve already been working on (my favorite option), etc.

By the way, selecting the Execute button is something that you’ll need to get used to in Rattle. Every time you do something, you’ll need to hit that button to tell Rattle to actually do it. Weird things happen when you don’t.

After loading the dataset, you’ll see a summary of the variables present in the dataset, along with some setup options. Here’s a partial screenshot of the variables in the weather dataset.

Screenshot by author

Note that for each variable, you will see that Rattle has inferred the data type, what the variable is used for (input, target, etc.), and a summary of the number of unique values of the variable, along with the number of missing values.

The data type field often miscategorizes categoric variables as numeric. This happens because many datasets use numbers to represent the categories. For instance, an income variable may have 10 categories, ranked 1–10. If you want to change the type from numeric to categoric (or to make other changes), you’ll do that on the Transform tab. If I get enough interest, I’ll cover that in another article.
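In plain R, the usual way to make that kind of change is with factor(). Here’s a small sketch using made-up income codes; it shows the base R approach and isn’t necessarily the code that Rattle’s Transform tab writes to the log.

```r
# Made-up example: income brackets coded 1 to 10, stored as plain numbers.
income <- c(3, 7, 1, 10, 7, 3)
class(income)              # "numeric"

# Treat the codes as ordered categories rather than quantities.
income_cat <- factor(income, levels = 1:10, ordered = TRUE)
class(income_cat)          # "ordered" "factor"
nlevels(income_cat)        # 10
table(income_cat)
```

Listing all ten levels explicitly keeps the brackets that happen not to appear in the data.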

The identification of the target variable can also be hit-or-miss. Scan through the data to make sure everything is categorized correctly. If not, select the correct radio button. If you want to change a bunch of variables at the same time, select them (using ctrl-select or shift-select), then click on the red or green lights representing input or ignore.

Above that, you can decide whether you want to partition the data (train/validate/test) and select the seed you want to use. I’m fond of the default 42. After all, it is the answer to the ultimate question of Life, the Universe, and Everything. (If you don’t know what I’m talking about, DON’T PANIC! Just go read The Hitchhiker’s Guide to the Galaxy, by Douglas Adams. And remember, always keep your towel with you.)

Let’s go ahead and partition the data. Use the default 70/15/15. Then click Execute.

Before we get into the log tab too much, I should point out that the format can be a bit convoluted. Some of what Rattle does is done for Rattle’s own purposes. I don’t know why it does some of the things it does. The key is to look at the code that Rattle uses, try to understand what it does, and modify it for your own purposes. The R help() command will be your friend here.

Now that we’ve read in and prepared our dataset, let’s take a look at the Log tab to see what Rattle’s done for us. Skip down a bit and you’ll see:

crv$seed <- 42

You’ll see “crv$” and “crs$” throughout the log. Rattle keeps its own settings and working objects in two R environments named crv and crs, and the prefix simply says which one a variable lives in. In general, if you want to copy code, it’s less confusing if you leave that part out. I will be, from here on out.

There are two commands related to the seed. The first creates a variable to use as the seed while the second (a bit down the screen) actually sets the seed. This way, if you want to change the seed value, you only need to modify the code in one spot.

seed <- 42
set.seed(seed)
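To see both ideas in action, here’s a small sketch. The environment name cfg is made up for the illustration; it just stands in for the kind of container that crv and crs are.

```r
# An environment used as a container for settings.
# The name `cfg` is hypothetical, chosen only for this example.
cfg <- new.env()
cfg$seed <- 42

set.seed(cfg$seed)
first_draw <- sample(100, 5)

set.seed(cfg$seed)          # resetting the seed replays the same draws
second_draw <- sample(100, 5)

identical(first_draw, second_draw)  # TRUE
```

Setting the same seed before each draw gives the same “random” numbers, which is what makes a partition reproducible.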

Between these two commands, you’ll see how Rattle loads the dataset.

fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")

As you can see, the code is overly complex for our purposes. When loading embedded datasets in R, it’s easier to just bring it up with the data() function:

data("weather")

This will create the dataset with the name “weather”. The Rattle method is a good internal procedure, allowing it to use the variable name “dataset” to refer to whatever dataset it’s working with. (This is a similar concept to setting the seed, just above.)

Next, we can see the code that Rattle uses to partition the dataset.

nobs <- nrow(dataset)
train <- sample(nobs, 0.7 * nobs) # the training set
nobs %>%
  seq_len() %>%
  setdiff(train) %>%
  sample(0.15 * nobs) ->
  validate # the validation set
nobs %>%
  seq_len() %>%
  setdiff(train) %>%
  setdiff(validate) ->
  test # the testing set

Again, this is somewhat unconventional. The normal method would put the new variable first, as in:

validate <- nobs %>%
  seq_len() %>%
  setdiff(train) %>%
  sample(0.15 * nobs)

By the way, if you don’t understand this code yet, the %>% is a “pipe” operator (found in the magrittr and dplyr packages), and it uses the part preceding the pipe as the first argument of the function following the pipe. So this is equivalent to:

validate <- sample(setdiff(seq_len(nobs), train), 0.15 * nobs)

Using pipes tends to make your code easier to read. There are a lot of parentheses in R that need to be placed correctly. This is not meant to be a tutorial on pipes, though.

What this does is draw a random sample of row numbers, 15% of nobs in size, from the integers 1 to nobs (seq_len(nobs)) that are not already part of train (setdiff). These can then be used as row indices into a data frame, as in:

weather[train,]
weather[validate,]
weather[test,]

The “test” indices are created the same way, but this time, setdiff() is used twice: once to remove the training indices and again to remove the validation indices.
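If you’d like to convince yourself that the piped and nested forms are the same thing, and that the three partitions don’t overlap, here’s a self-contained sketch. A few assumptions: I’m using the built-in mtcars data as a stand-in for weather, base R’s native pipe |> (R 4.1 or later) in place of %>% so that no extra package is needed, and floor() to make the rounding of the partition sizes explicit.

```r
dataset <- mtcars                  # stand-in for the weather data
nobs    <- nrow(dataset)

set.seed(42)
train <- sample(nobs, floor(0.70 * nobs))

# Nested form of the validation draw: read from the inside out.
set.seed(1)
nested <- sample(setdiff(seq_len(nobs), train), floor(0.15 * nobs))

# Piped form: read from top to bottom.
set.seed(1)
validate <- nobs |>
  seq_len() |>
  setdiff(train) |>
  sample(floor(0.15 * nobs))

identical(nested, validate)        # TRUE: same call, written two ways

test <- setdiff(seq_len(nobs), c(train, validate))

# Every row lands in exactly one partition.
length(intersect(train, validate))                   # 0
setequal(c(train, validate, test), seq_len(nobs))    # TRUE
lengths(list(train = train, validate = validate, test = test))
```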

Much of the rest of the log file at this point is really only for Rattle’s internal use — defining the input vs. target variables, and numeric vs. categoric variables.

Modeling

Let’s try some modeling. First, we’ll make a decision tree. Go to the Model tab, select Tree, and hit Execute. Rattle will build a decision tree with a standard set of parameters, producing a set of rules that predict whether it will rain tomorrow, based on the dataset.

Decision tree model in Rattle (screenshot by author)

There’s a lot of information here, but I’ll have to leave that discussion for another article. The key parts of the decision tree are just above the center. These are the rules that the decision tree created.

  1. If the cloud cover at 3 p.m. is less than 6.5, then it won’t rain.
  2. Otherwise, it still probably won’t rain, but check the barometric pressure.
  3. If the pressure at 3 p.m. is at least 1016 mbar, it won’t rain.
  4. Otherwise, it will probably rain, but check the wind direction at 3 p.m.
  5. If the wind direction is ESE, SSE, W, WNW, or WSW, then it won’t rain.
  6. Otherwise, it will rain.

The main advantage of decision trees is that the rules that they create are easy to understand.

To see how this was coded, we can go over to the Log tab and scroll to the bottom. Here we see the following code (I’ve changed the name of the decision tree variable from “rpart” to “tree” for clarity):

library(rpart) # provides the code to create the decision tree
tree <- rpart(RainTomorrow ~ .,
              data = dataset[train, c(input, target)],
              method = "class",
              parms = list(split = "information"),
              control = rpart.control(usesurrogate = 0, maxsurrogate = 0),
              model = TRUE)
print(tree)
printcp(tree)

Now that we’ve seen the code, we can go back to R (or RStudio) and find out what all these parameters mean.

?rpart
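Besides reading the help page, you can ask R directly what the default values are. This short sketch uses formals(), a base R function that returns a function’s arguments and their defaults; the rpart package is usually installed alongside R.

```r
library(rpart)

# Default values of the tuning parameters, read from the function definition.
defaults <- formals(rpart.control)
defaults$minsplit    # 20
defaults$cp          # 0.01
defaults$maxdepth    # 30

# In an interactive session, this opens the full help page:
# help("rpart.control")
```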

You can also play with the parameters on the Rattle Model tab to see how it affects the code. For instance, after I chose different values for each of the four default numeric parameters, Rattle generates the following code:

tree <- rpart(RainTomorrow ~ .,
              data=dataset[train, c(input, target)],
              method="class",
              parms=list(split="information"),
              control=rpart.control(minsplit=15,
                                    minbucket=6,
                                    maxdepth=20,
                                    cp=0.001000,
                                    usesurrogate=0,
                                    maxsurrogate=0),
              model=TRUE)
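You can run the same kind of experiment outside Rattle. The sketch below is an illustration only: it uses the kyphosis dataset that comes with rpart rather than the weather data, and the variable names are my own. It fits one tree with the defaults and one with the four numeric parameters changed, then compares how many nodes each tree ends up with.

```r
library(rpart)

# kyphosis is a small example dataset included with rpart.
tree_default <- rpart(Kyphosis ~ Age + Number + Start,
                      data = kyphosis, method = "class")

tree_custom <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, method = "class",
                     control = rpart.control(minsplit = 15,
                                             minbucket = 6,
                                             maxdepth = 20,
                                             cp = 0.001))

# Each row of the `frame` component describes one node of the tree.
nrow(tree_default$frame)
nrow(tree_custom$frame)
```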

We can also get the rules for the decision tree along with a graphical representation of the tree by clicking on the two buttons on the right. The rules will show up in the Rattle screen (in a better format than shown earlier). All plots that Rattle generates, however, will go to the RStudio Plots window (if you’re using RStudio) or to an R graphics device if not.

Going back to the Log tab, we can find the code used to create the plot.

fancyRpartPlot(tree, main = "Decision Tree weather.csv $ RainTomorrow")

Decision tree created by Rattle (screenshot by author)

Now we know how to create a graphical decision tree, complete with plot title. The text at the bottom is the default subtitle that fancyRpartPlot creates, but this is editable, too. Check the documentation for the fancyRpartPlot() function to see some of the options. Feel free to play with the options to see what they do.

Model Evaluation

Finally, we’ll evaluate the model we just created. Head over to the Evaluate tab:

Rattle Evaluate tab (screenshot by author)

First, let’s generate an error matrix for the model. Error Matrix is already selected, so hit Execute.

Error matrix for Weather decision tree model (screenshot by author)

This model predicts “no rain” fairly well, but “rain”, not so well. This isn’t too surprising, since rain is a pretty rare event in this dataset. If we’re more concerned with predicting when it does rain than when it doesn’t, we can adjust the loss matrix parameter back on the Model tab to favor the true positives over the true negatives. But that’s a story for another day.

For now, go back to the Log tab to see how the error matrix was created.

pr <- predict(tree, newdata = dataset[validate, c(input, target)],
              type = "class")
# Generate error matrix with counts
rattle::errorMatrix(dataset[validate, c(input, target)]$RainTomorrow,
                    pr, count = TRUE)
# Generate error matrix with proportions
(per <- rattle::errorMatrix(dataset[validate, c(input, target)]$RainTomorrow,
                            pr))
# Calculate and display the overall error percentage
cat(100 - sum(diag(per), na.rm = TRUE))
# Calculate and display the average class error percentage
cat(mean(per[,"Error"], na.rm = TRUE))
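If you want to see what an error matrix boils down to without rattle::errorMatrix(), base R’s table() gets you most of the way there. The following is a sketch under my own assumptions: it uses rpart’s kyphosis data and a simple 70/30 split instead of the weather data, and it won’t reproduce the extra “Error” column that Rattle’s function adds.

```r
library(rpart)

set.seed(42)
idx <- sample(nrow(kyphosis), floor(0.7 * nrow(kyphosis)))
fit <- rpart(Kyphosis ~ ., data = kyphosis[idx, ], method = "class")

# Actual and predicted classes for the held-out rows.
actual    <- kyphosis[-idx, "Kyphosis"]
predicted <- predict(fit, newdata = kyphosis[-idx, ], type = "class")

# Error matrix with counts, then with percentages of all held-out rows.
counts <- table(Actual = actual, Predicted = predicted)
counts
round(100 * prop.table(counts), 1)

# Overall error: the percentage of held-out rows that were misclassified.
overall_error <- 100 * mean(predicted != actual)
overall_error
```

The overall error is just everything off the diagonal of the matrix, expressed as a percentage.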

Next, let’s make a ROC curve. On the Evaluate tab, select the ROC radio button, then hit Execute. Again, the resulting plot won’t show up in Rattle, so go back to R or RStudio to see it. The AUC (Area Under the Curve) value allows us to assign a “grade” to our model. A good rule of thumb is to read the AUC like a grade in school: better than 0.90 (90%) is an A, 0.80 to 0.90 is a B, etc.

ROC curve for Weather decision tree (screenshot by author)

And go back to the Log tab one more time to see the code that generated this plot. I will leave this to you, dear reader, to decipher. Remember, the help() command is your friend.

# ROC Curve: requires the ROCR package.
library(ROCR)
# ROC Curve: requires the ggplot2 package.
library(ggplot2, quietly=TRUE)
# Generate an ROC Curve for the rpart model on weather.csv [validate].
pr <- predict(tree, newdata=dataset[validate,
                                    c(input, target)])[,2]
# Remove observations with missing target.
no.miss <- na.omit(dataset[validate, c(input,
                                       target)]$RainTomorrow)
miss.list <- attr(no.miss, "na.action")
attributes(no.miss) <- NULL
if (length(miss.list))
{
  pred <- prediction(pr[-miss.list], no.miss)
} else
{
  pred <- prediction(pr, no.miss)
}
pe <- performance(pred, "tpr", "fpr")
au <- performance(pred, "auc")@y.values[[1]]
pd <- data.frame(fpr=unlist(pe@x.values), tpr=unlist(pe@y.values))
p <- ggplot(pd, aes(x=fpr, y=tpr))
p <- p + geom_line(colour="red")
p <- p + xlab("False Positive Rate") + ylab("True Positive Rate")
p <- p + ggtitle("ROC Curve Decision Tree weather.csv [validate] RainTomorrow")
p <- p + theme(plot.title=element_text(size=10))
p <- p + geom_line(data=data.frame(), aes(x=c(0,1), y=c(0,1)),
                   colour="grey")
p <- p + annotate("text", x=0.50, y=0.00, hjust=0, vjust=0, size=5,
                  label=paste("AUC =", round(au, 2)))
print(p)
# Calculate the area under the curve for the plot.
# Remove observations with missing target.
no.miss <- na.omit(dataset[validate, c(input,
                                       target)]$RainTomorrow)
miss.list <- attr(no.miss, "na.action")
attributes(no.miss) <- NULL
if (length(miss.list))
{
  pred <- prediction(pr[-miss.list], no.miss)
} else
{
  pred <- prediction(pr, no.miss)
}
performance(pred, "auc")
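If you don’t have ROCR handy, you can still sanity-check an AUC value in base R. The sketch below swaps in a different technique from the one in the log: the rank-sum (Mann-Whitney) identity, which gives the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The helper name auc_rank is my own, and I’m again using rpart’s kyphosis data as a stand-in for the weather data.

```r
library(rpart)

set.seed(42)
idx <- sample(nrow(kyphosis), floor(0.7 * nrow(kyphosis)))
fit <- rpart(Kyphosis ~ ., data = kyphosis[idx, ], method = "class")

# Predicted probability of the second class ("present"), as in the log.
score  <- predict(fit, newdata = kyphosis[-idx, ])[, 2]
is_pos <- kyphosis[-idx, "Kyphosis"] == "present"

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# Tied scores get average ranks, so ties count as half a win.
auc_rank <- function(score, is_pos) {
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(rank(score)[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(score, is_pos)
```

A model that ranks every positive above every negative scores 1, and one that can’t tell them apart scores about 0.5.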

Conclusion

As I noted early on, much of the code in the Rattle log is designed more for internal use than for users to copy for their own use. Because of that, you’ll likely want to modify the code for readability, if nothing else. Despite that, with sufficient use of R help, you should be able to use Rattle’s log to help you in writing your own code.
