This homework assignment is designed to reinforce your understanding of the tidymodels framework and its application in bioinformatics, note that it isn’t a formal assessment and so isn’t required, is just here to support your learning. Please complete all questions and code challenges. Remember to document your code and explain your reasoning where appropriate.
Setup
First, load the necessary libraries and dataset. We’ll be using a simulated gene expression dataset.
Code
library(tidymodels)library(ggplot2)library(dplyr)# Set seed for reproducibilityset.seed(42)# Create a simulated gene expression datasetn_samples <-1000n_genes <-50gene_expression_data <-tibble(sample_id =1:n_samples,condition =factor(sample(c("Control", "Treatment"), n_samples, replace =TRUE)),)gene_expression =matrix(rnorm(n_samples * n_genes), nrow = n_samples)colnames(gene_expression) <-paste0("gene_", 1:n_genes)gene_expression_data <-cbind(gene_expression_data, gene_expression)
Questions and Challenges
1. Data Exploration
Explore the gene_expression_data dataset.
How many samples and genes are in the dataset?
Create a boxplot comparing the expression levels of the first 5 genes between the Control and Treatment conditions.
Code
# Your code here
2. Data Preprocessing with recipes
Create a recipe that does the following: - Uses condition as the outcome variable - Normalizes all gene expression variables - Performs PCA on the normalized gene expression data, keeping enough principal components to explain 90% of the variance
Code
# Your code here
3. Data Splitting
Split the data into training (80%) and testing (20%) sets. Use stratified sampling based on the condition variable.
Code
# Your code here
4. Model Specification
Specify a random forest model using the rand_forest() function from parsnip. Set it up for a classification task and use the “ranger” engine. Choose at least two hyperparameters to tune.
Code
# Your code here
5. Create a Workflow
Combine your recipe and model specification into a workflow.
Code
# Your code here
6. Model Tuning
Set up a grid for tuning your chosen hyperparameters. Use tune_grid() to perform 5-fold cross-validation on the training data. Visualize the results of your tuning process.
Code
# Your code here
7. Final Model and Evaluation
Select the best model from your tuning results, finalize the workflow, and fit it to the entire training set. Then, use this final model to make predictions on the test set. Calculate and visualize at least two performance metrics of your choice.
Code
# Your code here
8. Interpretation
Which genes appear to be most important in distinguishing between Control and Treatment conditions? (Hint: Look into the vip package for variable importance)
How well does your model perform?
Suggest at least two ways you might improve this analysis pipeline.
Source Code
---title: "Modelling in R - Practice"author: "Gabriel Mateus Bernardo Harrington"format: html: code-fold: true code-tools: trueexecute: echo: true warning: false---## IntroductionThis homework assignment is designed to reinforce your understanding of the tidymodels framework and its application in bioinformatics, note that it isn't a formal assessment and so isn't required, is just here to support your learning. Please complete all questions and code challenges. Remember to document your code and explain your reasoning where appropriate.## SetupFirst, load the necessary libraries and dataset. We'll be using a simulated gene expression dataset.```{r}#| label: setup#| message: falselibrary(tidymodels)library(ggplot2)library(dplyr)# Set seed for reproducibilityset.seed(42)# Create a simulated gene expression datasetn_samples <-1000n_genes <-50gene_expression_data <-tibble(sample_id =1:n_samples,condition =factor(sample(c("Control", "Treatment"), n_samples, replace =TRUE)),)gene_expression =matrix(rnorm(n_samples * n_genes), nrow = n_samples)colnames(gene_expression) <-paste0("gene_", 1:n_genes)gene_expression_data <-cbind(gene_expression_data, gene_expression)```## Questions and Challenges### 1. Data ExplorationExplore the `gene_expression_data` dataset. a) How many samples and genes are in the dataset?b) Create a boxplot comparing the expression levels of the first 5 genes between the Control and Treatment conditions.```{r}# Your code here```### 2. Data Preprocessing with recipesCreate a recipe that does the following:- Uses `condition` as the outcome variable- Normalizes all gene expression variables- Performs PCA on the normalized gene expression data, keeping enough principal components to explain 90% of the variance```{r}# Your code here```### 3. Data SplittingSplit the data into training (80%) and testing (20%) sets. Use stratified sampling based on the `condition` variable.```{r}# Your code here```### 4. Model SpecificationSpecify a random forest model using the `rand_forest()` function from parsnip. Set it up for a classification task and use the "ranger" engine. Choose at least two hyperparameters to tune.```{r}# Your code here```### 5. Create a WorkflowCombine your recipe and model specification into a workflow.```{r}# Your code here```### 6. Model TuningSet up a grid for tuning your chosen hyperparameters. Use `tune_grid()` to perform 5-fold cross-validation on the training data. Visualize the results of your tuning process.```{r}# Your code here```### 7. Final Model and EvaluationSelect the best model from your tuning results, finalize the workflow, and fit it to the entire training set. Then, use this final model to make predictions on the test set. Calculate and visualize at least two performance metrics of your choice.```{r}# Your code here```### 8. Interpretationa) Which genes appear to be most important in distinguishing between Control and Treatment conditions? (Hint: Look into the `vip` package for variable importance)b) How well does your model perform?c) Suggest at least two ways you might improve this analysis pipeline.