Our goal in this assignment is to predict the interest rate of government debt as accurately as possible.
The California DebtWatch contains the following information:
The principal amounts, sale dates, interest rates, terms, purposes, ratings, costs of issuance, financing team participants, issuance documents, and annual reporting (if applicable), among 67 other data points required under California Government Code section 8855, of the various types of debt issued by all state and local government agencies in California.
Download the data in CSV format and load it into R. Randomly select 20% of the rows and save them into a test set to use later to evaluate the performance of your model. Which column(s) represent interest rate?
set.seed(280)
debt = read.csv("C:/Users/kanni/Downloads/CDA_ALL_Raw.csv")
dim(debt)
## [1] 67043 68
debt20 = debt[sample(nrow(debt), round(0.2*67043)), ]
dim(debt20)
## [1] 13409 68
debt80 = debt[sample(nrow(debt), round(0.8*67043)), ]
dim(debt80)
## [1] 53634 68
grep("interest.rate", colnames(debt80), ignore.case = TRUE, value = TRUE)
## [1] "TIC.Interest.Rate" "NIC.Interest.Rate"
Define one or more new columns from existing text columns in the data set. For example, you could add a logical column indicating whether the term “lease” appears in some column. Why do you think this new column will help you improve the accuracy of your model?
rate = rep(0, times = nrow(debt80))
rate[!is.na(debt80$TIC.Interest.Rate) & is.na(debt80$NIC.Interest.Rate)] = debt80$TIC.Interest.Rate
## Warning in rate[!is.na(debt80$TIC.Interest.Rate) &
## is.na(debt80$NIC.Interest.Rate)] = debt80$TIC.Interest.Rate: number of items to
## replace is not a multiple of replacement length
rate[is.na(debt80$TIC.Interest.Rate) & !is.na(debt80$NIC.Interest.Rate)] = debt80$NIC.Interest.Rate
## Warning in rate[is.na(debt80$TIC.Interest.Rate) & !
## is.na(debt80$NIC.Interest.Rate)] = debt80$NIC.Interest.Rate: number of items to
## replace is not a multiple of replacement length
rate[is.na(debt80$TIC.Interest.Rate) & is.na(debt80$NIC.Interest.Rate)] = NA
debt80$rate = rate
summary(debt80$TIC.Interest.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.656 4.022 4.192 5.950 71.468 26409
summary(debt80$NIC.Interest.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.868 4.271 4.393 6.200 63.037 16837
summary(debt80$rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.384 4.486 63.037 17763
Use the remaining 80% of the data (the training set) to come up with two different models to predict interest rate. You’re welcome to use any external machine learning libraries you like, or you can stick with the lm
and rpart
from class. Note that you can come up with different models by using different subsets of columns. For example, a model with 3 input columns differs from a model with 60 input columns. Briefly describe the two models you ended up with.
library(rpart)
debt80$by.Year = as.POSIXct(debt80$Sale.Date, format = "%m/%d/%Y %H:%M:%OS %p")
fit1 = rpart(rate ~ Principal.Amount, data = debt80)
fit2 = lm(rate ~ by.Year, data = debt80)
length(debt80$Principal.Amount)
## [1] 53634
length(debt80$rate)
## [1] 53634
length(debt80$by.Year)
## [1] 53634
plot(debt80$Principal.Amount, debt80$rate)
#lines(debt80$Principal.Amount,predict(fit1), col = "blue",lwd = 2)
plot(debt80$by.Year, debt80$rate)
#lines(debt80$by.Year,predict(fit2), col = "green",lwd = 2)
Evaluate both of your models on the 20% of the data you reserved for the test set by looking at the average absolute difference between the interest rate predicted by the model and the actual interest rate. Do the models do a reasonable job of predicting interest rate? Find the rows where the predicted interest rate is farthest from the true interest rate. Why might the model have done a poor job on these rows?