
Working with missing data
In this section, we will discuss missing, NaN, or null values in Pandas data structures. It is very common to end up with missing data in an object. One operation that creates missing data is reindexing:
>>> import numpy as np
>>> import pandas as pd
>>> df8 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
>>> df8
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df9 = df8.reindex(columns=['a', 'b', 'c', 'd'])
>>> df9
   a   b   c   d
0  0   1   2 NaN
1  3   4   5 NaN
2  6   7   8 NaN
3  9  10  11 NaN
>>> df10 = df8.reindex([3, 2, 'a', 0])
>>> df10    # NaN forces the integer columns to float64
     a     b     c
3  9.0  10.0  11.0
2  6.0   7.0   8.0
a  NaN   NaN   NaN
0  0.0   1.0   2.0
To detect missing values, we can use the isnull() or notnull() functions, which work on a Series object as well as on a DataFrame object:
>>> df10.isnull()
       a      b      c
3  False  False  False
2  False  False  False
a   True   True   True
0  False  False  False
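Because isnull() returns a Boolean object of the same shape as its input, it combines naturally with aggregation and indexing. The following is a minimal sketch that rebuilds df10 so it runs on its own; the names missing_per_column and complete_rows are illustrative and do not come from the text.

```python
import numpy as np
import pandas as pd

df8 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df10 = df8.reindex([3, 2, 'a', 0])

# True counts as 1, so summing the mask counts the missing values per column
missing_per_column = df10.isnull().sum()

# Keep only the rows in which every column holds a value
complete_rows = df10[df10.notnull().all(axis=1)]

print(missing_per_column)
print(complete_rows)
```

Here the mask is used only for counting and filtering; the original df10 is left unchanged.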
On a Series, we can drop all null data and their index values by using the dropna function:
>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'},
...                index=['002', '001', '024', '065'])
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object
>>> s4.dropna()    # drop all null values from the Series object
002    Mary
001     Nam
dtype: object
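For a Series, dropna gives the same result as indexing with the notnull() mask, which can be handy when the mask is needed for something else as well. A short sketch, reusing the s4 data from above (the names dropped and masked are illustrative):

```python
import pandas as pd

s4 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'},
               index=['002', '001', '024', '065'])

dropped = s4.dropna()          # remove null entries directly
masked = s4[s4.notnull()]      # select the non-null entries with a Boolean mask

print(dropped.equals(masked))
```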
With a DataFrame object, it is a little more complex than with a Series. We can specify whether rows or columns should be dropped (through the axis parameter), and whether all entries must be null or a single null value is enough (through the how parameter). By default, the function will drop any row containing a missing value:
>>> df9.dropna()    # every row has a NaN in column d, so all rows are dropped
Empty DataFrame
Columns: [a, b, c, d]
Index: []
>>> df9.dropna(axis=1)    # drop columns that contain a missing value instead
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
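The difference between requiring a single null value and requiring all entries to be null can be sketched with the how and thresh parameters of dropna. The small DataFrame below is hypothetical and was chosen only so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is entirely null,
# row 2 has one null value, row 3 has two null values
df = pd.DataFrame({
    'a': [0.0, np.nan, 6.0, np.nan],
    'b': [1.0, np.nan, np.nan, np.nan],
    'c': [2.0, np.nan, 8.0, 11.0],
})

any_null = df.dropna()               # default how='any': drop rows with at least one null
all_null = df.dropna(how='all')      # drop only rows in which every entry is null
at_least_two = df.dropna(thresh=2)   # keep rows with at least two non-null values

print(list(any_null.index))
print(list(all_null.index))
print(list(at_least_two.index))
```

The same parameters apply to columns when axis=1 is passed.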
Another way to handle missing values is to use the parameters of the functions introduced in the previous section, such as the fill_value parameter of reindex. In our experience, assigning a fixed value to missing entries when we create data objects makes those objects cleaner in later processing steps. For example, consider the following:
>>> df11 = df8.reindex([3, 2, 'a', 0], fill_value=0)
>>> df11
   a   b   c
3  9  10  11
2  6   7   8
a  0   0   0
0  0   1   2
We can also use the fillna function to replace missing values with a custom value:
>>> df9.fillna(-1)    # column d holds floats, so the fill value appears as -1.0
   a   b   c    d
0  0   1   2 -1.0
1  3   4   5 -1.0
2  6   7   8 -1.0
3  9  10  11 -1.0
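A single fill value is not always appropriate for every column. As a minimal sketch, fillna also accepts a dictionary that maps column names to fill values, and ffill carries the last valid value forward. The DataFrame and its column names (price, label) are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    'price': [10.0, np.nan, 12.0],
    'label': ['x', None, 'z'],
})

# Use a different replacement value for each column
filled = df.fillna({'price': 0.0, 'label': 'unknown'})

# Propagate the previous valid value forward instead of using a constant
carried = df['price'].ffill()

print(filled)
print(carried)
```

Both calls return new objects, so df itself still contains its missing values afterward.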