---
title: "Null models and randomization"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(knitr)
library(tidyverse)
library(ggpubr)
library(gam)
library(caret)

conflicted::conflict_prefer("filter", "dplyr")

options(ggplot2.continuous.colour = "viridis",
        ggplot2.continuous.fill = "viridis")

theme_set(theme_minimal() +
            theme(text = element_text(size = 20))) 
```


## Make your randomization _reproducible_

Run this code several times:

```{r eval}
set.seed(20180313)
snacks <- c("pretzels", "cookies", "popcorn", "bamba", "carrots")
sample(snacks, 3)
```

What happens when you run `set.seed()` each time? What happens when you only run the last line?

## Distributions

How would you use the density functions to create the plot shown in the slides, which has mean = 2 and SD = 1?

```{r}
# YOUR CODE HERE
```

Hint: Set the x-values to `seq(0, 5, by = 0.1)`

## Null models

```{r message=FALSE}
library(tidyverse)
pokemon <- read_csv("data/pokemon.csv")
```

We can run a simple t-test:

```{r}
pokemon <- pokemon %>%
  mutate(color2 = if_else(color == "Grey", "Grey", "Not_grey"))

t.test(defense ~ color2, pokemon)
```

But now let's do it with a null model instead!

```{r}
n_grey <- sum(pokemon$color2 == "Grey")
pooled_obs <- pokemon$defense
t <- vector(length = 10000)
```

This is our null model - for each iteration, we randomly sample our `pooled_obs` to create simulated `grey` and `not_grey` groups, then calculate a t-score. After running this process for `t` iterations, we get a null distribution of t-scores.

```{r}
for (i in 1:length(t)){
  # sample WITHOUT replacement
  grey_idx <- sample(length(pooled_obs), n_grey, 
                     replace = FALSE)
  
  # create simulated grey & not_grey groups
  grey <- pooled_obs[grey_idx] 
  not_grey <- pooled_obs[-grey_idx]
  
  # calculate t-score
  t[i] <- t.test(grey, not_grey)$statistic
}
```

Now that we have a null distribution, we can calculate the **observed t-score** and **p-value**:

```{r}
poke_sum <- pokemon %>%
   group_by(color2) %>%
   summarise(
     mean = mean(defense),
     se = sd(defense) / sqrt(n())
   )
poke_sum

obs <- (poke_sum$mean[2] - poke_sum$mean[1])/poke_sum$se[2]
obs

p.val <- sum(abs(t) >= abs(obs)) / length(t)
p.val
```

Now let's visualize this. We'll plot the histogram of the null distribution, and mark our observed value with a red line: 

```{r}
tibble(scores = t, obs = obs) %>%
  ggplot() +
  geom_density(aes(scores)) + 
  geom_vline(xintercept = obs, color = "red")
```

## Weighted sampling in null models

```{r}
grey_prop <- pokemon %>% 
   filter(color2 == "Grey") %>%
   count(type_1) %>% 
   mutate(prop = n/sum(n))

grey_prop
```

Let's add the proportions back into our data frame.

```{r}
pokemon <- left_join(pokemon, grey_prop, by = "type_1") %>% 
  mutate(prop = replace_na(prop, 0))
```

...and then we can calculate our null distribution again, using weighted probabilities.  We'll store the null distributions in a new variable, `t2`. 

```{r}
n_grey <- sum(pokemon$color2 == "Grey")
pooled_obs <- pokemon$defense
t2 <- vector(length = 10000) 
prop <- pokemon$prop 

for (i in 1:length(t2)){
  grey_idx <- sample(length(pooled_obs), n_grey, 
                     replace = FALSE, 
                     prob = prop) 
  
  grey <- pooled_obs[grey_idx] 
  not_grey <- pooled_obs[-grey_idx]
  
  t2[i] <- t.test(grey, not_grey)$statistic
}
```

We can plot this new distribution like we did before:

```{r}
tibble(scores = t2, obs = obs) %>%
  ggplot() +
  geom_density(aes(scores)) + 
  geom_vline(xintercept = obs, color = "red")
```

And we can also compare it side-by-side:

```{r}
# YOUR CODE HERE
```

## Cross-validation

Which model best predicts the relationship between _attack_ and _special defense_ scores of Pokemon?

```{r}
# DO SOME DATA VISUALIZATION/EXPLORATION HERE
```


```{r}
model1 <- lm(sp_def ~ attack, data = pokemon)
pokemon %>% 
    ggplot(aes(attack, sp_def)) +
    geom_point() +
    geom_smooth(method = "lm")
```

```{r}
model1.5 <- lm(sp_def ~ I(attack^2), data = pokemon)
pokemon %>% 
    ggplot(aes(attack, sp_def)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ I(x^2))
```

```{r}
model2 <- gam(sp_def ~ lo(attack), data = pokemon)
pokemon %>% 
    ggplot(aes(attack, sp_def)) +
    geom_point() +
    geom_smooth(method = "loess")
```


```{r message=FALSE, warning=FALSE}
train_control <- trainControl(method="cv", number=10)

model1 <- train(sp_def ~ attack,
                data = pokemon,
                trControl = train_control,
                method = "lm")

model1.5 <- train(sp_def ~ I(attack^2),
                  data = pokemon,
                  trControl = train_control,
                  method = "lm")

model2 <- train(sp_def ~ attack,
                data = pokemon,
                trControl = train_control,
                method = "gamLoess")
```

_train_ is the function to actually run the cross-validation process, using the control parameters we set up with _trainControl_.

_method_ determines the model we use - there are many!