[Image: Python: Data Analytics and Visualization]
Working with date and time objects
Python supports date and time handling in the datetime and time modules from the standard library:
>>> import datetime
>>> datetime.datetime(2000, 1, 1)
datetime.datetime(2000, 1, 1, 0, 0)
Sometimes, dates are given or expected as strings, so a conversion from or to strings is necessary, which is realized by two functions: strptime and strftime, respectively:
>>> datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")
datetime.datetime(2000, 1, 1, 0, 0)
>>> datetime.datetime(2000, 1, 1, 0, 0).strftime("%Y%m%d")
'20000101'
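The two conversions are inverses of each other as long as the format string matches. The following is a minimal sketch using only the standard library; the variable names are ours and chosen for illustration:

```python
import datetime

# Parse a slash-separated date string into a datetime object.
parsed = datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")

# Format it as a compact string, then parse that string again.
compact = parsed.strftime("%Y%m%d")
reparsed = datetime.datetime.strptime(compact, "%Y%m%d")

# A string that does not match the format raises ValueError.
try:
    datetime.datetime.strptime("2000-01-01", "%Y/%m/%d")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(compact, reparsed == parsed, mismatch_detected)
```

The ValueError on a mismatch is the reason why remembering exact format specifiers becomes tedious with heterogeneous data.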
Real-world data usually comes in all kinds of shapes, and it would be great if we did not need to remember the exact date format specifiers for parsing. Thankfully, Pandas abstracts away a lot of the friction when dealing with strings representing dates or times. One of these helper functions is to_datetime:
>>> import pandas as pd
>>> import numpy as np
>>> pd.to_datetime("4th of July")
Timestamp('2015-07-04 00:00:00')
>>> pd.to_datetime("13.01.2000")
Timestamp('2000-01-13 00:00:00')
>>> pd.to_datetime("7/8/2000")
Timestamp('2000-07-08 00:00:00')
The last can refer to August 7th or July 8th, depending on the region. To disambiguate this case, to_datetime can be passed a keyword argument dayfirst:
>>> pd.to_datetime("7/8/2000", dayfirst=True)
Timestamp('2000-08-07 00:00:00')
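When the layout of the input is known in advance, an explicit format string is another way to remove the ambiguity. The sketch below compares the three readings of the same string; the format argument of to_datetime takes the same specifiers as strptime, and the variable names are ours:

```python
import pandas as pd

ambiguous = "7/8/2000"

# Default parsing reads the first number as the month.
month_first = pd.to_datetime(ambiguous)

# dayfirst=True reads the first number as the day.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format leaves nothing to guess.
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first, day_first, explicit)
```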
Timestamp objects can be seen as Pandas' version of datetime objects and indeed, the Timestamp class is a subclass of datetime:
>>> issubclass(pd.Timestamp, datetime.datetime)
True
This means they can be used interchangeably in many cases:
>>> ts = pd.to_datetime(946684800000000000)
>>> ts.year, ts.month, ts.day, ts.weekday()
(2000, 1, 1, 5)
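The large integer above is a count of nanoseconds since the Unix epoch. As a small sketch of the interchangeability, the following compares a Timestamp with a plain datetime, adds a standard library timedelta to it, and builds the same point in time from seconds via the unit argument; the variable names are ours:

```python
import datetime
import pandas as pd

ts = pd.Timestamp("2000-01-01")

# A Timestamp compares equal to the corresponding datetime.
same_as_datetime = ts == datetime.datetime(2000, 1, 1)

# Standard library timedeltas can be added; the result is again a Timestamp.
next_day = ts + datetime.timedelta(days=1)

# The same instant expressed in seconds since the epoch.
from_seconds = pd.to_datetime(946684800, unit="s")

# Convert back to a plain datetime object when a library insists on one.
as_datetime = ts.to_pydatetime()

print(same_as_datetime, next_day, from_seconds, type(as_datetime))
```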
Timestamp objects are an important part of the time series capabilities of Pandas, since timestamps are the building blocks of DatetimeIndex objects:
>>> index = [pd.Timestamp("2000-01-01"),
...          pd.Timestamp("2000-01-02"),
...          pd.Timestamp("2000-01-03")]
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts
2000-01-01    0.731897
2000-01-02    0.761540
2000-01-03   -1.316866
dtype: float64
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'],
              dtype='datetime64[ns]', freq=None, tz=None)
There are a few things to note here: we create a list of timestamp objects and pass it to the Series constructor as the index. This list of timestamps gets converted into a DatetimeIndex on the fly. If we had passed only the date strings, we would not get a DatetimeIndex, just an Index:
>>> ts = pd.Series(np.random.randn(len(index)), index=[
...     "2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts.index
Index([u'2000-01-01', u'2000-01-02', u'2000-01-03'], dtype='object')
However, the to_datetime function is flexible enough to be of help if all we have is a list of date strings:
>>> index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'],
              dtype='datetime64[ns]', freq=None, tz=None)
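The difference between the two kinds of index can be checked programmatically. The sketch below, with illustrative names and fixed values instead of random ones, also shows that a string index on an existing series can be converted after the fact:

```python
import pandas as pd

dates = ["2000-01-01", "2000-01-02", "2000-01-03"]

# Plain strings give a generic Index.
plain = pd.Series([1.0, 2.0, 3.0], index=dates)

# Converting the strings first gives a DatetimeIndex.
converted = pd.Series([1.0, 2.0, 3.0], index=pd.to_datetime(dates))

# An existing string index can be replaced by its converted version.
fixed = plain.copy()
fixed.index = pd.to_datetime(fixed.index)

print(type(plain.index).__name__, type(converted.index).__name__,
      type(fixed.index).__name__)
```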
Another thing to note is that while we have a DatetimeIndex, the freq and tz attributes are both None. We will learn about the utility of both attributes later in this chapter.
With to_datetime we are able to convert a variety of strings and even lists of strings into Timestamp or DatetimeIndex objects. Sometimes we are not explicitly given all the information about a series and we have to generate sequences of timestamps at fixed intervals ourselves.
Pandas offers another great utility function for this task: date_range.
The date_range function helps to generate a fixed-frequency datetime index between start and end dates. It is also possible to specify either the start or end date and the number of timestamps to generate.
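The three ways of calling date_range described above can be sketched as follows. When no frequency is given, the default is one calendar day; the variable names are ours:

```python
import pandas as pd

# Start and end given: one entry per calendar day, both ends included.
between = pd.date_range(start="2000-01-01", end="2000-01-05")

# Start plus a number of periods.
forward = pd.date_range(start="2000-01-01", periods=3)

# End plus a number of periods: the range is counted back from the end.
backward = pd.date_range(end="2000-01-05", periods=3)

print(between)
print(forward)
print(backward)
```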
The frequency can be specified by the freq parameter, which supports a number of offsets. You can use typical time intervals like hours, minutes, and seconds:
>>> pd.date_range(start="2000-01-01", periods=3, freq='H')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00',
               '2000-01-01 02:00:00'],
              dtype='datetime64[ns]', freq='H', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='T')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
               '2000-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='S')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:00:01',
               '2000-01-01 00:00:02'],
              dtype='datetime64[ns]', freq='S', tz=None)
The freq parameter accepts a multitude of options. Pandas has been used successfully in finance and economics, not least because it is really simple to work with business dates as well. As an example, to get an index with the first three business days of the millennium, the B offset alias can be used:
>>> pd.date_range(start="2000-01-01", periods=3, freq='B')
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'],
              dtype='datetime64[ns]', freq='B', tz=None)
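Since January 1, 2000 was a Saturday, the first business day is January 3. A quick way to confirm that a business day index contains no weekend dates is to look at the dayofweek attribute, which counts Monday as 0. The sketch below also uses bdate_range, a convenience function whose default frequency is B:

```python
import pandas as pd

bdays = pd.date_range(start="2000-01-01", periods=3, freq="B")

# Saturday and Sunday are 5 and 6, so every business day is below 5.
all_weekdays = bool((bdays.dayofweek < 5).all())

# bdate_range produces the same index without an explicit freq.
same = pd.bdate_range(start="2000-01-01", periods=3)

print(all_weekdays, bdays.equals(same))
```

Note that the B alias only skips weekends; public holidays are not taken into account.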
The following table shows the available offset aliases, which can also be looked up in the Pandas documentation on time series under http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:
[Table: offset aliases]
Moreover, the offset aliases can be used in combination as well. Here, we are generating a datetime index with five elements, each one day, one hour, one minute and ten seconds apart:
>>> pd.date_range(start="2000-01-01", periods=5, freq='1D1h1min10s')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-02 01:01:10',
               '2000-01-03 02:02:20', '2000-01-04 03:03:30',
               '2000-01-05 04:04:40'],
              dtype='datetime64[ns]', freq='90070S', tz=None)
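The resulting frequency of 90070 seconds is simply the sum of the parts: 86400 + 3600 + 60 + 10. As a sketch, the same step can be expressed as a Timedelta object, which date_range also accepts for freq; the variable names are ours:

```python
import pandas as pd

# One day, one hour, one minute and ten seconds as a single step.
step = pd.Timedelta(days=1, hours=1, minutes=1, seconds=10)
seconds = step.total_seconds()

# A Timedelta can be passed as the frequency instead of an alias string.
index = pd.date_range(start="2000-01-01", periods=5, freq=step)

print(seconds)
print(index)
```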
If we want to index data every 12 hours of our business time, which by default starts at 9 AM and ends at 5 PM, we would simply prefix the BH alias with the multiple:
>>> pd.date_range(start="2000-01-01", periods=5, freq='12BH')
DatetimeIndex(['2000-01-03 09:00:00', '2000-01-04 13:00:00',
               '2000-01-06 09:00:00', '2000-01-07 13:00:00',
               '2000-01-11 09:00:00'],
              dtype='datetime64[ns]', freq='12BH', tz=None)
A custom definition of what a business hour means is also possible:
>>> from pandas.tseries.offsets import BusinessHour
>>> # Assumption: the 07:00 and 19:00 boundaries are inferred from the output further down
>>> bh = BusinessHour(start="07:00", end="19:00")
We can use this custom business hour to build indexes as well:
>>> pd.date_range(start="2000-01-01", periods=5, freq=12 * bh)
DatetimeIndex(['2000-01-03 07:00:00', '2000-01-03 19:00:00',
               '2000-01-04 07:00:00', '2000-01-04 19:00:00',
               '2000-01-05 07:00:00', '2000-01-05 19:00:00',
               '2000-01-06 07:00:00'],
              dtype='datetime64[ns]', freq='12BH', tz=None)
Some frequencies allow us to specify an anchoring suffix, which allows us to express intervals, such as every Friday or every second Tuesday of the month:
>>> pd.date_range(start="2000-01-01", periods=5, freq='W-FRI')
DatetimeIndex(['2000-01-07', '2000-01-14', '2000-01-21', '2000-01-28',
               '2000-02-04'],
              dtype='datetime64[ns]', freq='W-FRI', tz=None)
>>> pd.date_range(start="2000-01-01", periods=5, freq='WOM-2TUE')
DatetimeIndex(['2000-01-11', '2000-02-08', '2000-03-14', '2000-04-11',
               '2000-05-09'],
              dtype='datetime64[ns]', freq='WOM-2TUE', tz=None)
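To convince ourselves that the anchored frequencies do what their names suggest, we can inspect the weekday and the day of the month of each entry. This is a small sketch with names of our own choosing; dayofweek counts Monday as 0, so Tuesday is 1 and Friday is 4:

```python
import pandas as pd

fridays = pd.date_range(start="2000-01-01", periods=5, freq="W-FRI")
second_tuesdays = pd.date_range(start="2000-01-01", periods=5, freq="WOM-2TUE")

all_fridays = bool((fridays.dayofweek == 4).all())
all_tuesdays = bool((second_tuesdays.dayofweek == 1).all())

# The second occurrence of any weekday falls on day 8 through 14 of a month.
in_second_week = bool(
    ((second_tuesdays.day >= 8) & (second_tuesdays.day <= 14)).all()
)

print(all_fridays, all_tuesdays, in_second_week)
```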
Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:
>>> s = pd.date_range(start="2000-01-01", periods=10, freq='BAS-JAN')
>>> t = pd.date_range(start="2000-01-01", periods=10, freq='A-FEB')
>>> s.union(t)
DatetimeIndex(['2000-01-03', '2000-02-29', '2001-01-01', '2001-02-28',
               '2002-01-01', '2002-02-28', '2003-01-01', '2003-02-28',
               '2004-01-01', '2004-02-29', '2005-01-03', '2005-02-28',
               '2006-01-02', '2006-02-28', '2007-01-01', '2007-02-28',
               '2008-01-01', '2008-02-29', '2009-01-01', '2009-02-28'],
              dtype='datetime64[ns]', freq=None, tz=None)
We see that 2000, 2005, and 2006 did not start on a weekday and that 2000, 2004, and 2008 were the leap years.
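Rather than reading these facts off the printed index, we can extract them in code. The sketch below builds the two indexes from offset objects instead of alias strings, which is our substitution: BYearBegin and YearEnd from pd.offsets are assumed here to correspond to the two aliases used above, and the variable names are illustrative:

```python
import pandas as pd

# First business day of each year, for ten years.
first_bdays = pd.date_range(start="2000-01-01", periods=10,
                            freq=pd.offsets.BYearBegin(month=1))

# Last day of February of each year, for ten years.
feb_ends = pd.date_range(start="2000-01-01", periods=10,
                         freq=pd.offsets.YearEnd(month=2))

combined = first_bdays.union(feb_ends)

# Years whose first business day is not January 1.
late_start_years = [d.year for d in first_bdays if d.day != 1]

# Years in which February has 29 days.
leap_years = [d.year for d in feb_ends if d.is_leap_year]

print(len(combined), late_start_years, leap_years)
```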
We have seen two powerful functions so far, to_datetime and date_range. Now we want to dive into time series by first showing how you can create and plot time series data with only a few lines. In the rest of this section, we will show various ways to access and slice time series data.
It is easy to get started with time series data in Pandas. A random walk can be created and plotted in a few lines:
>>> index = pd.date_range(start='2000-01-01', periods=200, freq='B')
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> walk = ts.cumsum()
>>> walk.plot()
A possible output of this plot is shown in the following figure:
[Figure: plot of the random walk]
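Because the steps are random, every run produces a different walk. If a reproducible series is preferred, a seeded NumPy generator can be used instead of np.random.randn. This is a sketch under that assumption; the seed value and the variable names are arbitrary choices of ours:

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same steps on every run.
rng = np.random.default_rng(seed=42)

index = pd.date_range(start="2000-01-01", periods=200, freq="B")
steps = pd.Series(rng.standard_normal(len(index)), index=index)

# The walk is the running sum of the individual steps.
walk = steps.cumsum()

print(walk.head())
```

Calling walk.plot() on the result draws the walk as before, provided matplotlib is installed.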
Just as with usual series objects, you can select parts and slice the index:
>>> ts.head()
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
2000-01-06    1.157041
2000-01-07   -0.427284
Freq: B, dtype: float64
>>> ts[0]
1.4641415817112928
>>> ts[1:3]
2000-01-04    0.103077
2000-01-05    0.762656
We can use date strings as keys, even though our series has a DatetimeIndex:
>>> ts['2000-01-03']
1.4641415817112928
Even though the DatetimeIndex is made of timestamp objects, we can use datetime objects as keys as well:
>>> ts[datetime.datetime(2000, 1, 3)]
1.4641415817112928
Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:
>>> ts['2000-01-03':'2000-01-05']
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.datetime(2000, 1, 5)]
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.date(2000, 1, 5)]
2000-01-03    1.464142
2000-01-04    0.103077
2000-01-05    0.762656
Freq: B, dtype: float64
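One detail worth spelling out is that slicing by date labels includes the end point, whereas slicing by integer position excludes it, just like a list. The sketch below makes the distinction explicit with the loc and iloc accessors and uses fixed values so the result is easy to follow; the names are ours:

```python
import pandas as pd

index = pd.date_range(start="2000-01-03", periods=5, freq="B")
ts = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index)

# Label-based slice: both end points are part of the result.
by_label = ts.loc["2000-01-04":"2000-01-06"]

# Position-based slice: stops before position 3.
by_position = ts.iloc[1:3]

# Position-based access to a single element.
first_value = ts.iloc[0]

print(by_label)
print(by_position)
print(first_value)
```

Using loc and iloc avoids any guessing about whether a key is meant as a label or as a position.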
It is even possible to use partial strings to select groups of entries. If we are only interested in February, we could simply write:
>>> ts['2000-02']
2000-02-01    0.277544
2000-02-02   -0.844352
2000-02-03   -1.900688
2000-02-04   -0.120010
2000-02-07   -0.465916
2000-02-08   -0.575722
2000-02-09    0.426153
2000-02-10    0.720124
2000-02-11    0.213050
2000-02-14   -0.604096
2000-02-15   -1.275345
2000-02-16   -0.708486
2000-02-17   -0.262574
2000-02-18    1.898234
2000-02-21    0.772746
2000-02-22    1.142317
2000-02-23   -1.461767
2000-02-24   -2.746059
2000-02-25   -0.608201
2000-02-28    0.513832
2000-02-29   -0.132000
To see all entries from March through May, inclusive:
>>> ts['2000-03':'2000-05']
2000-03-01    0.528070
2000-03-02    0.200661
...
2000-05-30    1.206963
2000-05-31    0.230351
Freq: B, dtype: float64
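Partial strings work at any resolution the index supports, so a month, a range of months, or a whole year can be selected the same way. A small sketch with deterministic values follows; we use loc here, and the variable names are our own:

```python
import pandas as pd

index = pd.date_range(start="2000-01-03", periods=200, freq="B")
ts = pd.Series(range(len(index)), index=index)

# All business days in February 2000.
february = ts.loc["2000-02"]

# March through May, with the whole of May included.
spring = ts.loc["2000-03":"2000-05"]

# Every entry that falls in the year 2000.
whole_year = ts.loc["2000"]

print(len(february), spring.index[0], spring.index[-1], len(whole_year))
```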
Time series can be shifted forward or backward in time. The index stays in place, the values move:
>>> small_ts = ts['2000-02-01':'2000-02-05']
>>> small_ts
2000-02-01    0.277544
2000-02-02   -0.844352
2000-02-03   -1.900688
2000-02-04   -0.120010
Freq: B, dtype: float64
>>> small_ts.shift(2)
2000-02-01         NaN
2000-02-02         NaN
2000-02-03    0.277544
2000-02-04   -0.844352
Freq: B, dtype: float64
To shift backwards in time, we simply use negative values:
>>> small_ts.shift(-2)
2000-02-01   -1.900688
2000-02-02   -0.120010
2000-02-03         NaN
2000-02-04         NaN
Freq: B, dtype: float64
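A typical use of shift is to compare each value with its predecessor. The sketch below computes the change from one business day to the next and, as a contrast, passes the freq argument to shift, which moves the index by the given offset and leaves the values untouched. The values and names are made up for illustration:

```python
import pandas as pd

index = pd.date_range(start="2000-02-01", periods=4, freq="B")
small = pd.Series([1.0, 2.0, 4.0, 8.0], index=index)

# Values move one step forward; the first slot becomes NaN.
lagged = small.shift(1)

# Change between consecutive entries, equivalent to small.diff().
change = small - lagged

# With freq, the index is shifted by one business day instead.
moved = small.shift(1, freq="B")

print(change)
print(moved)
```

In the second variant no NaN values appear, because no value loses its label; the labels themselves move, and Friday, February 4 rolls over to Monday, February 7.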