Working with date and time objects

Python supports date and time handling through the datetime and time modules of the standard library:

>>> import datetime
>>> datetime.datetime(2000, 1, 1)
datetime.datetime(2000, 1, 1, 0, 0)

Dates are often given or expected as strings, so conversion in both directions is needed. The strptime function parses a string into a datetime object, and the strftime method formats a datetime object as a string:

>>> datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")
datetime.datetime(2000, 1, 1, 0, 0)
>>> datetime.datetime(2000, 1, 1, 0, 0).strftime("%Y%m%d")
'20000101'
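As a minimal sketch of the round trip between strings and datetime objects, the following uses only the standard library. The variable names are illustrative, and the second output format is an arbitrary choice:

```python
import datetime

# Parse a string into a datetime object (string -> object).
parsed = datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")

# Format the datetime object back into strings (object -> string).
compact = parsed.strftime("%Y%m%d")
readable = parsed.strftime("%Y-%m-%d %H:%M")

# A string that does not match the given format raises ValueError.
try:
    datetime.datetime.strptime("2000-01-01", "%Y/%m/%d")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(parsed, compact, readable, mismatch_rejected)
```

The failing call shows why exact format specifiers can be a burden: the separator alone is enough to break parsing.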

Real-world data comes in all kinds of shapes, and it would be great if we did not need to remember the exact date format specifiers for parsing. Thankfully, Pandas removes much of the friction of dealing with strings that represent dates or times. One of its helper functions is to_datetime:

>>> import pandas as pd
>>> import numpy as np
>>> pd.to_datetime("4th of July")
Timestamp('2015-07-04 00:00:00')
>>> pd.to_datetime("13.01.2000")
Timestamp('2000-01-13 00:00:00')
>>> pd.to_datetime("7/8/2000")
Timestamp('2000-07-08 00:00:00')

The last example can refer to August 7th or July 8th, depending on the region. To resolve the ambiguity, to_datetime accepts the keyword argument dayfirst:

>>> pd.to_datetime("7/8/2000", dayfirst=True)
Timestamp('2000-08-07 00:00:00')
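The two readings of an ambiguous date can be compared side by side. The sketch below also passes an explicit format string, which is one alternative when the layout of the input is known in advance; the variable names are illustrative:

```python
import pandas as pd

ambiguous = "7/8/2000"

# Default interpretation: month first.
month_first = pd.to_datetime(ambiguous)

# Day-first interpretation, as used in many regions.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format leaves nothing to guess.
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first, day_first, explicit)
```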

Timestamp objects can be seen as Pandas' version of datetime objects and indeed, the Timestamp class is a subclass of datetime:

>>> issubclass(pd.Timestamp, datetime.datetime)
True

This means that the two can be used interchangeably in many cases:

>>> ts = pd.to_datetime(946684800000000000)
>>> ts.year, ts.month, ts.day, ts.weekday()
(2000, 1, 1, 5)
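To illustrate what interchangeable means in practice, here is a small sketch in which a function written for plain datetime objects receives a Timestamp. The helper days_until is hypothetical and exists only for this example:

```python
import datetime
import pandas as pd


def days_until(target, today):
    """Hypothetical helper written with plain datetime objects in mind."""
    return (target - today).days


stamp = pd.Timestamp("2000-01-01")
plain = datetime.datetime(1999, 12, 25)

# A Timestamp can be passed where a datetime is expected.
gap = days_until(stamp, plain)

# Comparison across the two types works as well.
same_instant = stamp == datetime.datetime(2000, 1, 1)

# If a pure datetime object is required, convert explicitly.
back_to_plain = stamp.to_pydatetime()

print(gap, same_instant, type(back_to_plain))
```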

Timestamp objects are an important part of the time series capabilities of Pandas, since timestamps are the building blocks of DatetimeIndex objects:

>>> index = [pd.Timestamp("2000-01-01"),
 pd.Timestamp("2000-01-02"),
 pd.Timestamp("2000-01-03")]
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts
2000-01-01 0.731897
2000-01-02 0.761540
2000-01-03 -1.316866
dtype: float64
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'],
dtype='datetime64[ns]', freq=None, tz=None)

There are a few things to note here. We create a list of Timestamp objects and pass it to the Series constructor as the index. This list of timestamps is converted into a DatetimeIndex on the fly. If we had passed only the date strings, we would not get a DatetimeIndex, just a plain Index:

>>> ts = pd.Series(np.random.randn(len(index)), index=[
 "2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts.index
Index([u'2000-01-01', u'2000-01-02', u'2000-01-03'], dtype='object')

However, the to_datetime function is flexible enough to help when all we have is a list of date strings:

>>> index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"])
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> ts.index
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq=None, tz=None)
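The three ways of building an index discussed so far can be checked in one place. This sketch uses fixed values instead of random numbers so that the result is reproducible; the variable names are illustrative:

```python
import pandas as pd

date_strings = ["2000-01-01", "2000-01-02", "2000-01-03"]
values = [1.0, 2.0, 3.0]

# Plain strings stay a generic index.
plain = pd.Series(values, index=date_strings)

# Converting the strings first yields a DatetimeIndex.
converted = pd.Series(values, index=pd.to_datetime(date_strings))

# A list of Timestamp objects is converted on the fly.
from_stamps = pd.Series(values, index=[pd.Timestamp(s) for s in date_strings])

is_datetime_index = [
    isinstance(series.index, pd.DatetimeIndex)
    for series in (plain, converted, from_stamps)
]
print(is_datetime_index)
```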

Another thing to note is that while we have a DatetimeIndex, the freq and tz attributes are both None. We will learn about the utility of both attributes later in this chapter.

With to_datetime we are able to convert a variety of strings, and even lists of strings, into Timestamp or DatetimeIndex objects. Sometimes we are not explicitly given all the information about a series and have to generate sequences of timestamps at fixed intervals ourselves.

Pandas offers another great utility function for this task: date_range.

The date_range function helps to generate a fixed frequency datetime index between start and end dates. It is also possible to specify either the start or end date and the number of timestamps to generate.
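The different ways of calling date_range can be sketched as follows. The daily frequency and the dates are arbitrary choices for illustration:

```python
import pandas as pd

# Both boundaries given.
by_bounds = pd.date_range(start="2000-01-01", end="2000-01-05", freq="D")

# Start date plus a number of timestamps.
from_start = pd.date_range(start="2000-01-01", periods=5, freq="D")

# End date plus a number of timestamps.
to_end = pd.date_range(end="2000-01-05", periods=5, freq="D")

print(by_bounds)
```

All three calls describe the same five days, so which form to use depends only on the information at hand.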

The frequency can be specified by the freq parameter, which supports a number of offsets. You can use typical time intervals like hours, minutes, and seconds:

>>> pd.date_range(start="2000-01-01", periods=3, freq='H')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00', '2000-01-01 02:00:00'], dtype='datetime64[ns]', freq='H', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='T')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00', '2000-01-01 00:02:00'], dtype='datetime64[ns]', freq='T', tz=None)
>>> pd.date_range(start="2000-01-01", periods=3, freq='S')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:00:01', '2000-01-01 00:00:02'], dtype='datetime64[ns]', freq='S', tz=None)

The freq parameter accepts a multitude of options. Pandas has been used successfully in finance and economics, not least because it makes working with business dates simple. As an example, to get an index with the first three business days of the millennium, the B offset alias can be used:

>>> pd.date_range(start="2000-01-01", periods=3, freq='B')
DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='B', tz=None)

The following table shows the available offset aliases, which can also be looked up in the Pandas documentation on time series under http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:

Moreover, the offset aliases can be used in combination. Here, we generate a datetime index with five elements, each one day, one hour, one minute, and ten seconds apart:

>>> pd.date_range(start="2000-01-01", periods=5, freq='1D1h1min10s')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-02 01:01:10', '2000-01-03 02:02:20', '2000-01-04 03:03:30', '2000-01-05 04:04:40'], dtype='datetime64[ns]', freq='90070S', tz=None)
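The freq value of 90070 seconds shown in the output is simply the sum of the combined parts. As a sketch, the same spacing can be expressed with a Timedelta object, which date_range also accepts as a frequency; the variable names are illustrative:

```python
import pandas as pd

# One day, one hour, one minute, and ten seconds as a single step.
step = pd.Timedelta(days=1, hours=1, minutes=1, seconds=10)

index = pd.date_range(start="2000-01-01", periods=5, freq=step)

# Differences between neighbouring timestamps.
gaps = index[1:] - index[:-1]

print(step.total_seconds())
print(index[-1])
```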

If we want to index data every 12 hours of business time, which by default starts at 9 AM and ends at 5 PM, we simply prefix the BH alias with the number of hours:

>>> pd.date_range(start="2000-01-01", periods=5, freq='12BH')
DatetimeIndex(['2000-01-03 09:00:00', '2000-01-04 13:00:00', '2000-01-06 09:00:00', '2000-01-07 13:00:00', '2000-01-11 09:00:00'], dtype='datetime64[ns]', freq='12BH', tz=None)
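The jump from Monday 9 AM to Tuesday 1 PM follows from the eight-hour default business day: twelve business hours are eight hours on Monday plus four on Tuesday. A minimal sketch of this arithmetic, using the BusinessHour offset object directly instead of the string alias:

```python
import pandas as pd

# Monday, 3 January 2000, at the default opening time.
monday_open = pd.Timestamp("2000-01-03 09:00")

# Twelve business hours: 8 hours on Monday, 4 hours on Tuesday.
after_twelve = monday_open + pd.offsets.BusinessHour(12)

# Half an hour before closing, one business hour spills into the next day.
late_start = pd.Timestamp("2000-01-03 16:30") + pd.offsets.BusinessHour()

print(after_twelve, late_start)
```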

A custom definition of what a business hour means is also possible:

>>> from pandas.tseries.offsets import BusinessHour
>>> bh = BusinessHour(start='07:00', end='19:00')

We can use this custom business hour to build indexes as well:

>>> pd.date_range(start="2000-01-01", periods=5, freq=12 * bh)
DatetimeIndex(['2000-01-03 07:00:00', '2000-01-03 19:00:00', '2000-01-04 07:00:00', '2000-01-04 19:00:00', '2000-01-05 07:00:00', '2000-01-05 19:00:00', '2000-01-06 07:00:00'], dtype='datetime64[ns]', freq='12BH', tz=None)

Some frequencies allow us to specify an anchoring suffix, which allows us to express intervals, such as every Friday or every second Tuesday of the month:

>>> pd.date_range(start="2000-01-01", periods=5, freq='W-FRI')
DatetimeIndex(['2000-01-07', '2000-01-14', '2000-01-21', '2000-01-28', '2000-02-04'], dtype='datetime64[ns]', freq='W-FRI', tz=None)
>>> pd.date_range(start="2000-01-01", periods=5, freq='WOM-2TUE')
DatetimeIndex(['2000-01-11', '2000-02-08', '2000-03-14', '2000-04-11', '2000-05-09'], dtype='datetime64[ns]', freq='WOM-2TUE', tz=None)
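A quick way to convince ourselves that anchored frequencies behave as described is to inspect the weekday of each generated date. In this sketch, weekday() returns 0 for Monday, so Friday is 4 and Tuesday is 1; the second Tuesday of a month always falls between the 8th and the 14th:

```python
import pandas as pd

fridays = pd.date_range(start="2000-01-01", periods=5, freq="W-FRI")
second_tuesdays = pd.date_range(start="2000-01-01", periods=5, freq="WOM-2TUE")

# Every date in the weekly index is a Friday.
all_fridays = all(day.weekday() == 4 for day in fridays)

# Every date in the monthly index is a Tuesday in the second week.
all_second_tuesdays = all(
    day.weekday() == 1 and 8 <= day.day <= 14 for day in second_tuesdays
)

print(all_fridays, all_second_tuesdays)
```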

Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:

>>> s = pd.date_range(start="2000-01-01", periods=10, freq='BAS-JAN')
>>> t = pd.date_range(start="2000-01-01", periods=10, freq='A-FEB')
>>> s.union(t)
DatetimeIndex(['2000-01-03', '2000-02-29', '2001-01-01', '2001-02-28', '2002-01-01', '2002-02-28', '2003-01-01', '2003-02-28','2004-01-01', '2004-02-29', '2005-01-03', '2005-02-28', '2006-01-02', '2006-02-28', '2007-01-01', '2007-02-28','2008-01-01', '2008-02-29', '2009-01-01', '2009-02-28'], dtype='datetime64[ns]', freq=None, tz=None)

We see that 2000, 2005, and 2006 did not start on a weekday and that 2000, 2004, and 2008 were leap years.
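These calendar facts can be cross-checked without Pandas. The following sketch uses only the standard library and the same decade as the example; the variable names are illustrative:

```python
import calendar
import datetime

years = range(2000, 2010)

# weekday() returns 5 for Saturday and 6 for Sunday.
weekend_starts = [
    year for year in years if datetime.date(year, 1, 1).weekday() >= 5
]

leap_years = [year for year in years if calendar.isleap(year)]

print(weekend_starts)  # [2000, 2005, 2006]
print(leap_years)      # [2000, 2004, 2008]
```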

We have seen two powerful functions so far, to_datetime and date_range. Now we want to dive into time series by first showing how you can create and plot time series data with only a few lines. In the rest of this section, we will show various ways to access and slice time series data.

It is easy to get started with time series data in Pandas. A random walk can be created and plotted in a few lines:

>>> index = pd.date_range(start='2000-01-01', periods=200, freq='B')
>>> ts = pd.Series(np.random.randn(len(index)), index=index)
>>> walk = ts.cumsum()
>>> walk.plot()

A possible output of this plot is shown in the following figure:
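The exact shape of the walk changes on every run because the steps are random. For a reproducible walk, one option is a seeded generator. The sketch below assumes NumPy's default_rng and an arbitrary seed of 0, neither of which is part of the example above:

```python
import numpy as np
import pandas as pd

# A seeded generator makes the random steps repeatable.
rng = np.random.default_rng(seed=0)

index = pd.date_range(start="2000-01-01", periods=200, freq="B")
steps = pd.Series(rng.standard_normal(len(index)), index=index)

# The walk is the running total of the individual steps.
walk = steps.cumsum()

# walk.plot() would draw the figure; it requires matplotlib.
print(walk.tail())
```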

Just as with usual series objects, you can select parts and slice the index:

>>> ts.head()
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
2000-01-06 1.157041
2000-01-07 -0.427284
Freq: B, dtype: float64
>>> ts[0]
1.4641415817112928
>>> ts[1:3]
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64

We can use date strings as keys, even though our series has a DatetimeIndex:

>>> ts['2000-01-03']
1.4641415817112928

Even though the DatetimeIndex is made of timestamp objects, we can use datetime objects as keys as well:

>>> ts[datetime.datetime(2000, 1, 3)]
1.4641415817112928

Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:

>>> ts['2000-01-03':'2000-01-05']
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.datetime(2000, 1, 5)]
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64
>>> ts['2000-01-03':datetime.date(2000, 1, 5)]
2000-01-03 1.464142
2000-01-04 0.103077
2000-01-05 0.762656
Freq: B, dtype: float64
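The access patterns above can be reproduced with fixed values, which makes the results easy to verify. This sketch uses the explicit .loc and .iloc accessors, a stylistic choice made here to keep label-based and position-based access apart:

```python
import datetime
import pandas as pd

index = pd.date_range(start="2000-01-03", periods=5, freq="B")
ts = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index)

# The same entry, reached by string, by datetime, and by position.
by_string = ts.loc["2000-01-04"]
by_datetime = ts.loc[datetime.datetime(2000, 1, 4)]
by_position = ts.iloc[1]

# Label-based slices include both endpoints and may mix key types.
window = ts.loc["2000-01-03":datetime.datetime(2000, 1, 5)]

print(by_string, by_datetime, by_position)
print(window)
```

Note that the label slice returns three entries, whereas the positional slice ts[1:3] shown earlier excludes its end position.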

It is even possible to use partial strings to select groups of entries. If we are only interested in February, we could simply write:

>>> ts['2000-02']
2000-02-01 0.277544
2000-02-02 -0.844352
2000-02-03 -1.900688
2000-02-04 -0.120010
2000-02-07 -0.465916
2000-02-08 -0.575722
2000-02-09 0.426153
2000-02-10 0.720124
2000-02-11 0.213050
2000-02-14 -0.604096
2000-02-15 -1.275345
2000-02-16 -0.708486
2000-02-17 -0.262574
2000-02-18 1.898234
2000-02-21 0.772746
2000-02-22 1.142317
2000-02-23 -1.461767
2000-02-24 -2.746059
2000-02-25 -0.608201
2000-02-28 0.513832
2000-02-29 -0.132000

To see all entries from March through May, inclusive:

>>> ts['2000-03':'2000-05']
2000-03-01 0.528070
2000-03-02 0.200661
 ...
2000-05-30 1.206963
2000-05-31 0.230351
Freq: B, dtype: float64
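Partial string selection can be checked by counting the business days it returns. The sketch below fills the series with a simple counter instead of random numbers, since only the index matters here:

```python
import pandas as pd

index = pd.date_range(start="2000-01-01", periods=200, freq="B")
ts = pd.Series(range(len(index)), index=index)

# A year-month string selects every entry in that month.
february = ts.loc["2000-02"]

# A slice of partial strings includes the whole end month.
spring = ts.loc["2000-03":"2000-05"]

print(len(february), len(spring))
print(spring.index[0], spring.index[-1])
```

February 2000 has 21 business days, and March through May together have 66, which matches the lengths of the two selections.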

Time series can be shifted forward or backward in time. The index stays in place while the values move:

>>> small_ts = ts['2000-02-01':'2000-02-05']
>>> small_ts
2000-02-01 0.277544
2000-02-02 -0.844352
2000-02-03 -1.900688
2000-02-04 -0.120010
Freq: B, dtype: float64
>>> small_ts.shift(2)
2000-02-01 NaN
2000-02-02 NaN
2000-02-03 0.277544
2000-02-04 -0.844352
Freq: B, dtype: float64

To shift backwards in time, we simply use negative values:

>>> small_ts.shift(-2)
2000-02-01 -1.900688
2000-02-02 -0.120010
2000-02-03 NaN
2000-02-04 NaN
Freq: B, dtype: float64
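With fixed values, the effect of shifting is easy to follow. In this sketch the index is unchanged after each shift, and the positions that have no source value are filled with NaN:

```python
import pandas as pd

index = pd.date_range(start="2000-02-01", periods=4, freq="B")
small_ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Positive values move the data forward in time.
forward = small_ts.shift(2)

# Negative values move the data backward in time.
backward = small_ts.shift(-2)

print(forward)
print(backward)
```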