CST383 - Module 2

What did I learn in the second week of CST383?

This week introduced Pandas, a Python library built on top of NumPy and designed for data science. Pandas attaches an index to the data, so elements can be accessed by an assigned label rather than only by position. Its two main structures are the Series, a labeled 1D array, and the DataFrame, a labeled 2D table.
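To make the label-versus-position idea concrete, here is a minimal sketch. The city names and numbers are made up for illustration and are not from the course material.

```python
import pandas as pd

# Hypothetical data, invented for this example
cities = ["Monterey", "Salinas", "Marina"]

# Series: a 1D array with a label attached to each element
temps = pd.Series([61.5, 72.0, 58.3], index=cities)

by_label = temps["Salinas"]    # access by the assigned label
by_position = temps.iloc[1]    # access the same element by position

# DataFrame: a 2D table with row labels and column labels
df = pd.DataFrame(
    {"temp": [61.5, 72.0, 58.3], "humidity": [80, 55, 85]},
    index=cities,
)

row = df.loc["Marina"]               # one row, returned as a Series
cell = df.loc["Marina", "humidity"]  # one value, by row and column label
```

Both by_label and by_position point at the same element; the label is just a second way of reaching it.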

The concepts that took me the most time to understand were aggregation and grouping. Methods such as .aggregate(), .value_counts(), and .groupby() required me to combine them with what I learned in week 1, such as boolean masks and fancy indexing, and to fit everything into a single line. That was an adjustment, as I'm used to solving problems with loops instead of one-liners.
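As a rough comparison of the two styles, the sketch below computes the same per-group total with a loop and with a one-liner. The table and its column names (dept, units, grade) are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical course table, invented for this example
df = pd.DataFrame({
    "dept": ["CS", "CS", "Math", "Math", "Bio"],
    "units": [4, 3, 4, 2, 3],
    "grade": [3.7, 3.0, 3.3, 2.7, 4.0],
})

# Loop version: total units per department, counting only courses with 3+ units
totals_loop = {}
for dept, units in zip(df["dept"], df["units"]):
    if units >= 3:
        totals_loop[dept] = totals_loop.get(dept, 0) + units

# One-liner version: boolean mask, then groupby, then sum
totals = df[df["units"] >= 3].groupby("dept")["units"].sum()

# value_counts() counts how many rows each department has
counts = df["dept"].value_counts()

# aggregate() can apply several summary functions per group at once
summary = df.groupby("dept")["grade"].aggregate(["mean", "max"])
```

The one-liner reads left to right as "filter, group, summarize", which is the same order as the steps inside the loop.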

Outside of the course material, the most impactful thing this week was Google Colab, as I didn't know Google offered an IDE through a web-based interface. I knew of at least two other web-based IDEs, from GitHub and a third-party source, but Colab really outperformed those for me in its features. Being able to take code I would normally write in a .py file, get the same results in a .ipynb notebook, and keep the output after running everything is honestly impressive.

In the end, after going through this week, one idea that came to mind was whether .groupby() or .aggregate() could be chained with other operations, such as filtering data. This raises two more questions: what other Pandas methods are commonly used in data science workflows, and how does Pandas handle large datasets within corporations?
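On the chaining question, a small experiment suggests the answer is yes. This is only a sketch on a made-up table (the dept and grade columns are hypothetical), showing three places where a filter can sit in the chain.

```python
import pandas as pd

# Hypothetical data, invented for this example
df = pd.DataFrame({
    "dept": ["CS", "CS", "Math", "Math", "Bio"],
    "grade": [3.7, 3.0, 3.3, 2.7, 4.0],
})

# 1. Filter rows first, then group and aggregate
high_means = df[df["grade"] >= 3.0].groupby("dept")["grade"].mean()

# 2. Group first, then keep only the groups that have more than one row
multi = df.groupby("dept").filter(lambda g: len(g) > 1)

# 3. Aggregate first, then filter on the aggregated values
means = df.groupby("dept")["grade"].mean()
above = means[means > 3.2]
```

Each step returns a Series or DataFrame, which is why the next method or mask can be applied directly to the result.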
