.apply() need not apply
Here’s an interesting puzzle I encountered.
I’m working with a dataset that has two features of interest: naive datetimes (naive meaning without any timezone information) and timezones. What I’d like is a new feature with the same date and time information as the datetime feature, just converted to UTC.
Seems simple enough, right? Well… no. Especially not for a very, very big dataset.
The Data
Here’s a simplified version of the dataset I’m working with:
data.head()
| | datetime_naive | timezone |
|---|---|---|
| 0 | 2022-10-08 19:13:00 | America/New_York |
| 1 | 2022-11-04 08:00:00 | America/Los_Angeles |
| 2 | 2023-04-05 09:00:00 | America/New_York |
| 3 | 2022-11-25 14:40:00 | America/Denver |
| 4 | 2023-03-11 09:00:00 | America/New_York |
data.dtypes
datetime_naive datetime64[ns]
timezone object
dtype: object
As promised, there’s `datetime_naive`, which contains our date/time information, and `timezone`, which contains the timezone string.
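If you want to follow along, a comparable toy dataframe can be cooked up with something like this (the values are entirely made up; only the column names and dtypes match the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 1,000 naive datetimes paired
# with IANA timezone strings. Only the shape and dtypes matter here.
rng = np.random.default_rng(42)
start = pd.Timestamp('2022-10-01')
data = pd.DataFrame({
    'datetime_naive': start + pd.to_timedelta(rng.integers(0, 365 * 24 * 60, size=1_000), unit='min'),
    'timezone': rng.choice(['America/New_York', 'America/Los_Angeles', 'America/Denver'], size=1_000),
})
```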
Now, if the `datetime_naive` feature all belonged to the same timezone (say, “America/New_York”), converting to UTC after first localizing to East Coast time is trivial:
tz = 'America/New_York'
data['datetime_nyc'] = data['datetime_naive'].dt.tz_localize(tz)
data['datetime_utc'] = data['datetime_nyc'].dt.tz_convert('UTC')
data.head()
| | datetime_naive | timezone | datetime_nyc | datetime_utc |
|---|---|---|---|---|
| 0 | 2022-10-08 19:13:00 | America/New_York | 2022-10-08 19:13:00-04:00 | 2022-10-08 23:13:00+00:00 |
| 1 | 2022-11-04 08:00:00 | America/Los_Angeles | 2022-11-04 08:00:00-04:00 | 2022-11-04 12:00:00+00:00 |
| 2 | 2023-04-05 09:00:00 | America/New_York | 2023-04-05 09:00:00-04:00 | 2023-04-05 13:00:00+00:00 |
| 3 | 2022-11-25 14:40:00 | America/Denver | 2022-11-25 14:40:00-05:00 | 2022-11-25 19:40:00+00:00 |
| 4 | 2023-03-11 09:00:00 | America/New_York | 2023-03-11 09:00:00-05:00 | 2023-03-11 14:00:00+00:00 |
data.dtypes
datetime_naive datetime64[ns]
timezone object
datetime_nyc datetime64[ns, America/New_York]
datetime_utc datetime64[ns, UTC]
dtype: object
First I localized `datetime_naive` to East Coast time (`datetime_nyc`), and then I converted that to UTC (`datetime_utc`). You’ll notice that `datetime_naive` and `datetime_nyc` are almost identical, except the localized version includes the UTC offset information, which above is either -04:00 or -05:00 depending on DST. The dtypes differ, as well, with `datetime_nyc` typed as a localized datetime64 object.
Then, I converted the localized `datetime_nyc` to UTC, so you’ll notice that the UTC offset for `datetime_utc` (as expected) is now +00:00. Its dtype is localized to UTC, as well.
So, that’s easy enough. But our problem is not this. Our problem is complicated by the fact that each datetime value in `datetime_naive` corresponds to its own timezone, so calling `.dt.tz_localize()` with a single timezone will not work, since we aren’t localizing the entire feature to one zone.
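To make that concrete, the tempting one-liner below won’t fly: as far as I know, `.dt.tz_localize()` accepts a single timezone (a string or tzinfo) for the whole Series, not a per-row column of timezones.

```python
# This does NOT work: .dt.tz_localize() expects one timezone for the entire
# Series, so handing it the per-row `timezone` column raises an error.
data['datetime_naive'].dt.tz_localize(data['timezone'])
```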
Solution 1: .apply() yourself!
My first thought was to convert each `datetime_naive` record individually using `.apply()` in tandem with the pytz library. Something like this:
import pytz
def convert_to_utc(dt, tz):
    # Localize the naive datetime to its timezone, then convert that to UTC
    tz = pytz.timezone(tz)
    localized = tz.localize(dt)
    utc = localized.astimezone(pytz.utc)
    return utc
data['datetime_utc'] = data.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
data.head()
| | datetime_naive | timezone | datetime_utc |
|---|---|---|---|
| 0 | 2022-10-08 19:13:00 | America/New_York | 2022-10-08 23:13:00+00:00 |
| 1 | 2022-11-04 08:00:00 | America/Los_Angeles | 2022-11-04 15:00:00+00:00 |
| 2 | 2023-04-05 09:00:00 | America/New_York | 2023-04-05 13:00:00+00:00 |
| 3 | 2022-11-25 14:40:00 | America/Denver | 2022-11-25 21:40:00+00:00 |
| 4 | 2023-03-11 09:00:00 | America/New_York | 2023-03-11 14:00:00+00:00 |
data.dtypes
datetime_naive datetime64[ns]
timezone object
datetime_utc datetime64[ns, UTC]
dtype: object
Using `.apply()`, we call `convert_to_utc()` for each row of the dataframe, passing the paired `datetime_naive` and `timezone` values as arguments. `convert_to_utc()` then localizes the `datetime_naive` value according to the `timezone` before converting the localized value to UTC.
Now, this works. Sort of. This dataset has 1,000 records; the one I was working with when I encountered this problem had millions. I’m not sure how long the operation took on that one because I was too impatient to let it finish, so let’s just say it took a while. `.apply()` is not the most efficient route.
How to proceed? I really scratched my head about this for a while. At first I thought that maybe I could chunk the dataset and perform the same `.apply()` operation on one (smaller) chunk at a time. After all, it works well enough for 1,000 records, doesn’t it? My thinking was that maybe I was bumping up against a memory issue. I implemented this solution, but I was still too impatient to find out how long it took to process the entire dataset.
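For what it’s worth, that chunked attempt looked roughly like this (a sketch; the chunk size is a made-up number and `convert_to_utc()` is the function from above):

```python
import pandas as pd

# Run .apply() over one slice of rows at a time, then stitch the pieces
# back together on the original index.
chunk_size = 100_000
pieces = []
for start in range(0, len(data), chunk_size):
    chunk = data.iloc[start:start + chunk_size]
    pieces.append(
        chunk.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
    )
data['datetime_utc'] = pd.concat(pieces)
```

Of course, this does the same amount of per-row work overall, so it only helps if memory pressure is the real bottleneck.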
Solution 2: Vector? Check, sir
Knowing how fast it was to simply localize an entire datetime feature to a given timezone and then convert that, I kept on thinking that there must be a vectorized solution to this. Maybe if I process all the datetimes for a given timezone at once?
That ended up looking something like this:
def vectorized_convert_to_utc(df, dt_col, tz_col):
    # Get the list of unique timezones in the data
    tzs = df[tz_col].unique()
    for tz in tzs:
        # Create a boolean mask for the current timezone
        is_current_tz = (df[tz_col] == tz)
        # Localize all datetime records matching `tz` and convert to UTC
        df.loc[is_current_tz, 'datetime_utc'] = (
            df.loc[is_current_tz, dt_col].dt.tz_localize(tz).dt.tz_convert('UTC')
        )
    return df['datetime_utc']
data['datetime_utc'] = vectorized_convert_to_utc(data, 'datetime_naive', 'timezone')
data.head()
| | datetime_naive | timezone | datetime_utc |
|---|---|---|---|
| 0 | 2022-10-08 19:13:00 | America/New_York | 2022-10-08 23:13:00+00:00 |
| 1 | 2022-11-04 08:00:00 | America/Los_Angeles | 2022-11-04 15:00:00+00:00 |
| 2 | 2023-04-05 09:00:00 | America/New_York | 2023-04-05 13:00:00+00:00 |
| 3 | 2022-11-25 14:40:00 | America/Denver | 2022-11-25 21:40:00+00:00 |
| 4 | 2023-03-11 09:00:00 | America/New_York | 2023-03-11 14:00:00+00:00 |
data.dtypes
datetime_naive datetime64[ns]
timezone object
datetime_utc datetime64[ns, UTC]
dtype: object
The result is the same here, but with my other dataset of millions of records (instead of 1,000) the difference was night and day, since the computation actually finished.
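As an aside, essentially the same trick can be written with a groupby instead of an explicit mask loop. I didn’t benchmark this version; it’s just a sketch, and `datetime_utc_alt` is a throwaway column name:

```python
import pandas as pd

# Localize each timezone's rows in one vectorized shot, then glue the
# UTC results back together on the original row index.
parts = [
    grp.dt.tz_localize(tz).dt.tz_convert('UTC')
    for tz, grp in data.groupby('timezone')['datetime_naive']
]
data['datetime_utc_alt'] = pd.concat(parts)
```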
The results?
Let’s speed test side by side:
# `.apply()`
%timeit data.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
136 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Vectorized
%timeit vectorized_convert_to_utc(data, 'datetime_naive', 'timezone')
2.32 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hot damn! That right there is why the vectorized solution crunched through millions of records no problem, while `.apply()` left me hanging. It’s roughly 60 times faster, which checks out: even the vectorized version took maybe 5 seconds to execute on my machine, which means the `.apply()` route would’ve taken around 5 minutes.
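And just to be sure the speedup isn’t coming from computing something different, a sanity check along these lines (element-wise comparison, since the two results may come back with different dtypes) can confirm that both routes produce the same UTC values:

```python
# Both approaches should yield identical UTC timestamps for every row.
apply_result = data.apply(lambda row: convert_to_utc(row['datetime_naive'], row['timezone']), axis=1)
vectorized_result = vectorized_convert_to_utc(data, 'datetime_naive', 'timezone')
assert (apply_result == vectorized_result).all()
```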