If you work as a data analyst or engineer in the aerospace industry, chances are that you will be working with a lot of tabular data, for example:
navigation data
airport, runway and obstacle data
traffic data
flight data recorder data
If you are using Python to work with this data, it is worth investing a bit of time to incorporate the pandas library into your work setup. Pandas is a huge library packed with functionality for working with tabular data. I chose the word investing because:
pandas isn't part of Python's standard library, so you'll have to spend some effort setting up a virtual environment and installing it, and
the syntax is a bit different from pure Python, so you may have to spend some time getting used to it
The upfront cost of learning pandas can have a big payoff in the long run as overall, your code may end up being shorter, cleaner, simpler and faster than if you had tried to write it all with pure Python. To clarify:
shorter because you may be able to write one line of pandas to do what would take 5-10 lines of pure Python
faster because pandas can perform vectorized operations on entire columns at once, rather than looping through rows one by one in Python
Pandas can be slightly perplexing for beginners, because there can be multiple valid ways to do a certain thing, and the syntax has some differences from pure Python. I would recommend making a cheat sheet of just a few key commands and sticking with them until it starts to feel more natural. My cheat sheet looks something like this:
the two main data structures are:
the dataframe, which is like a table
the series, which is like a single-column table
In other words: a dataframe is a set of series (columns) stacked side by side
There are many ways to make a dataframe; the most common ways I use are:
from a list of dictionaries
from a list of lists
from a .csv
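Here is a sketch of those three approaches. The column names and the .csv file name are made up for illustration:

```python
import pandas as pd

# From a list of dictionaries: the keys become the column names
df_from_dicts = pd.DataFrame([
    {"icao": "EGLL", "runway": "09L", "length_m": 3902},
    {"icao": "EHAM", "runway": "18R", "length_m": 3800},
])

# From a list of lists: the column names are supplied separately
df_from_lists = pd.DataFrame(
    [["EGLL", "09L", 3902], ["EHAM", "18R", 3800]],
    columns=["icao", "runway", "length_m"],
)

# From a .csv file (hypothetical path), and writing one back out:
# df_from_csv = pd.read_csv("runways.csv")
# df_from_dicts.to_csv("runways_out.csv", index=False)
```

Both in-memory approaches produce the same dataframe here; which one is more convenient usually depends on the shape of the data you are starting from.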
I use the df_ prefix for naming my dataframes
I use .head(x) and .tail(x) to quickly inspect the first or last x rows of my dataframe
I frequently use the read_csv() and to_csv() functions to read from and write to .csv files
I often use Python's type() function to double check whether I'm working with a series or a dataframe
I use df.shape to check the number of rows and columns of my dataframe
I use Python's len() to check how many rows are in my dataframe
I use .dtypes to check the data types of the contents of each column in my dataframe. The data types in pandas come from NumPy, so they may look and behave slightly differently from the data types you are used to in pure Python (str, int and float)
I use .value_counts() to quickly count the distribution of values in a given column
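A quick tour of those inspection tools on a small made-up dataframe:

```python
import pandas as pd

df_flights = pd.DataFrame({
    "callsign": ["BAW117", "KLM601", "BAW117", "DLH400"],
    "altitude_ft": [36000, 38000, 36000, 40000],
})

print(df_flights.head(2))                     # first 2 rows
print(type(df_flights))                       # DataFrame
print(type(df_flights["callsign"]))           # a single column is a Series
print(df_flights.shape)                       # (4, 2): (rows, columns)
print(len(df_flights))                        # 4 rows
print(df_flights.dtypes)                      # NumPy dtypes, e.g. int64, object
print(df_flights["callsign"].value_counts())  # BAW117 appears twice
```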
Row selection: I like using .iloc to select rows by number. Rows can be selected individually or in slices, just like slicing a list in pure Python.
Column selection: I like using strings or a list of strings to select columns by column name(s)
Conditional selection: just be aware that there are a bunch of different ways to do conditional selections
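A sketch of all three kinds of selection on a made-up obstacle table (the boolean-mask style shown for conditional selection is just one of the several valid ways):

```python
import pandas as pd

df_obstacles = pd.DataFrame({
    "name": ["mast", "crane", "building", "antenna"],
    "height_ft": [250, 310, 180, 420],
    "lit": [True, False, True, True],
})

# Row selection by position with .iloc (slices work like pure Python)
first_row = df_obstacles.iloc[0]      # a single row, returned as a Series
first_two = df_obstacles.iloc[0:2]    # rows 0 and 1, returned as a DataFrame

# Column selection by name
heights = df_obstacles["height_ft"]            # one name -> Series
subset = df_obstacles[["name", "height_ft"]]   # list of names -> DataFrame

# Conditional selection with a boolean mask
tall = df_obstacles[df_obstacles["height_ft"] > 300]
```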
Duplicates: there are a bunch of really useful functions for identifying, selecting or dropping duplicate rows. Duplicate identification can be based on the entire row, or just one column of the row, or multiple columns of the row. It's up to you. Pay close attention to the keep argument (options are 'first', 'last' and False); it can be very confusing the first time you use it.
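A small sketch of how the keep argument behaves ('first' is the default):

```python
import pandas as pd

df = pd.DataFrame({
    "callsign": ["BAW117", "BAW117", "KLM601"],
    "altitude_ft": [36000, 36000, 38000],
})

# keep="first": everything after the first occurrence is flagged as a duplicate
print(df.duplicated(keep="first"))  # False, True, False

# keep=False: ALL rows that have a duplicate anywhere are flagged
print(df.duplicated(keep=False))    # True, True, False

# Dropping duplicates based on just one column, keeping the last occurrence
deduped = df.drop_duplicates(subset=["callsign"], keep="last")
```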
Merging and Joining: these are used for combining tables and usually require careful attention. Like conditional selection, there are a few different ways to do things.
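As a minimal merge sketch, here is a left join attaching airport metadata to flight rows on a shared key (the tables and column names are made up):

```python
import pandas as pd

df_flights = pd.DataFrame({
    "callsign": ["BAW117", "KLM601"],
    "dest": ["EGLL", "EHAM"],
})
df_airports = pd.DataFrame({
    "icao": ["EGLL", "EHAM"],
    "name": ["Heathrow", "Schiphol"],
})

# how="left": keep every flight, attach airport info where the key matches
df_joined = df_flights.merge(
    df_airports, left_on="dest", right_on="icao", how="left"
)
```

The how argument ("left", "right", "inner", "outer") deserves the same careful attention as keep does for duplicates: it determines which rows survive when keys don't match on both sides.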
inplace: when performing operations on a dataframe, you can usually choose whether you want pandas to return a new dataframe, or perform the operation directly on the existing dataframe. The inplace argument controls this choice (note that with inplace=True the operation returns None).
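For example, sorting can either return a new dataframe or modify the existing one, depending on inplace:

```python
import pandas as pd

df = pd.DataFrame({"altitude_ft": [38000, 36000, 40000]})

# Default (inplace=False): a new sorted dataframe is returned; df is untouched
df_sorted = df.sort_values("altitude_ft")

# inplace=True: df itself is modified, and the call returns None
df.sort_values("altitude_ft", inplace=True)
```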
reset_index is useful after sorting or making conditional selections, to reset the row index to start from 0
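For example, after a conditional selection the surviving rows keep their old index labels; reset_index(drop=True) renumbers them from 0 and discards the old index:

```python
import pandas as pd

df = pd.DataFrame({"height_ft": [250, 310, 180, 420]})

tall = df[df["height_ft"] > 300]    # surviving rows keep index labels 1 and 3
tall = tall.reset_index(drop=True)  # index is renumbered to 0 and 1
```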
To learn pandas from scratch, I would recommend reading the first half of the book Python for Data Analysis by Wes McKinney. The author is the creator of the pandas project so you will be learning from the best!