CST383 - Module 5

What did I learn in the fifth week of CST383?

This week covered machine learning topics, with the core idea being classification using k-Nearest Neighbors (kNN). The theme was understanding the preprocessing and evaluation steps that kNN needs: handling missing data, scaling features, and assessing performance.
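To make those steps concrete, here is a minimal sketch in plain Python. It assumes mean imputation for missing values, z-score scaling, Euclidean distance, and a small made-up dataset; none of these choices come from the course material, and the function names are hypothetical. The preprocessing statistics are learned from the training rows only and then reused on the test rows.

```python
import math
from collections import Counter

def fit_preprocessor(rows):
    """Learn per-column mean and standard deviation from training rows only."""
    means, stds = [], []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        var = sum((v - mean) ** 2 for v in observed) / len(observed)
        means.append(mean)
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for a constant column
    return means, stds

def transform(rows, means, stds):
    """Fill missing values with the column mean, then z-score each column."""
    return [
        [((means[j] if v is None else v) - means[j]) / stds[j] for j, v in enumerate(r)]
        for r in rows
    ]

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(X_train, y_train), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up data: [height_cm, income_dollars]; None marks a missing value.
train_rows = [[150, 30000], [155, None], [160, 32000],
              [180, 90000], [185, 95000], [None, 88000]]
train_labels = ["a", "a", "a", "b", "b", "b"]
test_rows = [[158, 31000], [182, None]]

means, stds = fit_preprocessor(train_rows)
X_train = transform(train_rows, means, stds)
X_test = transform(test_rows, means, stds)  # reuse the training statistics
predictions = [knn_predict(X_train, train_labels, x, k=3) for x in X_test]
print(predictions)
```

Scaling matters here because income is measured in tens of thousands while height is in the hundreds, so without it the distance would be dominated almost entirely by income.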

The most surprising aspect this week was how cross-validation solves the problem of wasting training data on a separate validation set. The training data is split into k folds, and across k iterations each fold takes one turn as the validation set while the model trains on the remaining folds. Averaging the k validation accuracies then gives an estimate of how the model will perform on the test set or on future data. This was both really interesting and somewhat difficult to understand, because you have to see why rotating the validation fold and taking the mean across the k iterations gives a fair estimate of test accuracy.
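The fold rotation can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the data is made up, a 1-nearest-neighbor classifier stands in for whatever model is being evaluated, and the folds are built by taking every k-th row rather than shuffling.

```python
import math

def one_nn_predict(X_train, y_train, x):
    """Label of the single nearest training point (a stand-in for any classifier)."""
    i = min(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    return y_train[i]

def k_fold_accuracy(X, y, k):
    """Return the validation accuracy of each of the k rotating folds."""
    n = len(X)
    scores = []
    for fold in range(k):
        # Rows fold, fold + k, fold + 2k, ... form this iteration's validation set.
        val_idx = set(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in val_idx]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        correct = sum(one_nn_predict(X_tr, y_tr, X[i]) == y[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return scores

# Made-up training data: one feature, two classes, one awkward point at 2.0.
X = [[1.0], [1.2], [0.9], [1.1], [3.0], [3.2], [2.9], [3.1], [1.3], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

fold_scores = k_fold_accuracy(X, y, k=5)
cv_estimate = sum(fold_scores) / len(fold_scores)  # mean across the k iterations
print(fold_scores, cv_estimate)
```

Every row is used for validation exactly once and for training in the other k - 1 iterations, which is why no data has to be permanently set aside.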

A concept I am still unsure about is the trade-off between precision and recall. I understand that precision measures how often positive predictions are correct, while recall measures how many of the actual positives are caught, but I don't yet see a clear rule for deciding which one takes priority in a real-world scenario. How would a data scientist decide whether to prioritize recall or precision?
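One way to see the trade-off is to compute both metrics at two different decision thresholds. The sketch below uses made-up labels and scores, and the helper names are hypothetical. It does not settle which metric to prefer; a common rule of thumb is that the choice depends on whether a false positive or a false negative is more costly in the application.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def predict(scores, threshold):
    """Label a case positive when its score reaches the threshold."""
    return [int(s >= threshold) for s in scores]

# Made-up data: 4 actual positives, 6 actual negatives, with classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.35, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]

strict = precision_recall(y_true, predict(scores, 0.7))   # few, confident positives
lenient = precision_recall(y_true, predict(scores, 0.3))  # many positives
print("strict :", strict)
print("lenient:", lenient)
```

With the strict threshold every positive prediction is correct but half of the actual positives are missed; with the lenient threshold all actual positives are caught at the cost of several false alarms.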

Some ideas and questions I had when going over the material: the F1 score seems like a quick summary, but it may hide important details about precision and recall individually. MCC seemed like the more thorough measure of classifier quality, as it uses all four cells of the confusion matrix, including true negatives. Finally, how sensitive is kNN accuracy to the choice of distance metric?
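The difference between F1 and MCC shows up on imbalanced data. The sketch below uses a made-up confusion matrix (95 actual positives, 5 actual negatives) and the standard formulas for both scores; the function names are my own.

```python
import math

def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall; it never looks at TN."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, using all four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up confusion matrix: the classifier finds most positives
# but gets 4 of the 5 negatives wrong.
tp, fn, fp, tn = 90, 5, 4, 1

f1 = f1_score(tp, fp, fn)
m = mcc(tp, fp, fn, tn)
print(round(f1, 3), round(m, 3))
```

Here F1 is above 0.95 while MCC is below 0.2, because F1 ignores how badly the negative class is handled and MCC does not.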


