CST383 - Module 3

What did I learn in the third week of CST383?

This week introduced visualizing a single continuous variable and covered three main plot types: density plots, histograms, and box plots. Each serves a different purpose when analyzing a distribution: density plots show the overall shape, histograms show the frequency within bins, and box plots summarize the data through quartiles and outliers.

The concepts that took the most time to understand were bin selection and histograms. For bin selection, I had to understand how the bin width and the start/stop values translate into the arguments I pass to range(). Because range() excludes its stop value, I had to set the stop above the last edge I wanted, by either one or ten, in order to get the exact values. Histograms, on the other hand, weren't too difficult, since they show the same distribution as a density plot in a different form. However, I made the mistake of using plt.hist instead of plot.hist, which led me to spend a long time on a problem that could have been solved in 20 minutes or less.
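To make the stop value issue concrete, here is a minimal sketch of how I understand it, assuming bins that run from 0 to 100 with a width of 10. The helper name bin_edges and the numbers are made up for illustration and are not taken from the lab.

```python
def bin_edges(start, stop, width):
    """Return bin edges from start that reach at least stop.

    range() excludes its stop argument, so the stop is pushed out by one
    bin width to keep the final edge in the list.
    """
    return list(range(start, stop + width, width))


# Naive call: the last edge is 90, so the 90-100 bin is missing.
naive = list(range(0, 100, 10))
print(naive)

# Going above the stop by one or by a full bin width gives the same edges,
# because both stops land after 100 and before the next possible edge, 110.
print(list(range(0, 101, 10)))
print(bin_edges(0, 100, 10))
```

A list of edges like this is the kind of value that matplotlib's hist accepts for its bins parameter. As far as I can tell, the pandas plot.hist method takes it the same way.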

As for what I struggled with, I overlooked an instruction pointing to a plotting basics PDF, which left me frustrated that I couldn't find any information on tick labels. While reviewing the next lab assignment the following day, I found the PDF in the course materials.

Some questions I had while going over the labs were whether there is a systematic way to automatically determine the best bin width, instead of manually entering values based on what I can see, and whether the bandwidth could be optimized in a similar way. Finally, how are skewed distributions handled in real data science workflows, and is a log transformation always the standard approach?
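On the bin width question, one rule of thumb I am aware of is the Freedman-Diaconis rule, which sets the width to 2 * IQR / n^(1/3). The sketch below is only an illustration of that rule, not a claim that it gives the best width. The helper name fd_bin_width and the sample values are made up.

```python
import statistics


def fd_bin_width(values):
    """Freedman-Diaconis rule of thumb: width = 2 * IQR / n ** (1/3)."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return 2 * iqr / len(values) ** (1 / 3)


# Made-up sample with a long right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 35]
width = fd_bin_width(sample)
print(f"suggested bin width: {width:.2f}")
```

NumPy offers the same rule through numpy.histogram_bin_edges(data, bins="fd"), so in practice the width would not need to be computed by hand.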
