In-class Exercise 2

Author

Victoria Grace ANN

Published

January 15, 2024

Modified

January 22, 2024

Data preparation

Packages that will be used:

  • arrow, to read and write Parquet files (the format the data is stored in)

  • lubridate, to work with time-related data more easily

  • tidyverse

  • tmap

  • sf

Code
pacman::p_load(arrow, lubridate, tidyverse, tmap, sf) 

Importing Grab Posisi Dataset

Code
df <- read_parquet("data/GrabPosisi/part-00000-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet")
Code
head(df)
# A tibble: 6 × 9
  trj_id driving_mode osname  pingtimestamp rawlat rawlng speed bearing accuracy
  <chr>  <chr>        <chr>           <int>  <dbl>  <dbl> <dbl>   <int>    <dbl>
1 70014  car          android    1554943236   1.34   104.  18.9     248      3.9
2 73573  car          android    1555582623   1.32   104.  17.7      44      4  
3 75567  car          android    1555141026   1.33   104.  14.0      34      3.9
4 1410   car          android    1555731693   1.26   104.  13.0     181      4  
5 4354   car          android    1555584497   1.28   104.  14.8      93      3.9
6 32630  car          android    1555395258   1.30   104.  23.2      73      3.9
  • One trajectory id, trj_id, represents one Grab ride.

  • The same trj_id appears on multiple rows because location pings are recorded repeatedly over the course of each ride.

Code
df$pingtimestamp <- as_datetime(df$pingtimestamp) ## convert epoch seconds to date-time; df$ overwrites the column in df

Check updated df

Code
head(df)
# A tibble: 6 × 9
  trj_id driving_mode osname  pingtimestamp       rawlat rawlng speed bearing
  <chr>  <chr>        <chr>   <dttm>               <dbl>  <dbl> <dbl>   <int>
1 70014  car          android 2019-04-11 00:40:36   1.34   104.  18.9     248
2 73573  car          android 2019-04-18 10:17:03   1.32   104.  17.7      44
3 75567  car          android 2019-04-13 07:37:06   1.33   104.  14.0      34
4 1410   car          android 2019-04-20 03:41:33   1.26   104.  13.0     181
5 4354   car          android 2019-04-18 10:48:17   1.28   104.  14.8      93
6 32630  car          android 2019-04-16 06:14:18   1.30   104.  23.2      73
# ℹ 1 more variable: accuracy <dbl>
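  • pingtimestamp is now a date-time (<dttm>) column instead of an integer count of seconds since the Unix epoch, so it is human-readable and works with lubridate functions.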
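
One detail worth checking is the time zone. By default as_datetime() interprets epoch seconds as UTC, so any hour extracted later is a UTC hour. The sketch below uses the first pingtimestamp value from the head(df) output above and assumes, based on the coordinates (latitude about 1.3, longitude about 104), that the rides took place in Singapore; the "Asia/Singapore" time zone is that assumption, not something stated in the dataset.

```r
library(lubridate)

# First pingtimestamp in the head(df) output above (Unix epoch seconds)
ts <- 1554943236

utc_time <- as_datetime(ts)                         # default: interpreted as UTC
sg_time  <- as_datetime(ts, tz = "Asia/Singapore")  # assumption: rides are in Singapore

utc_time        # 2019-04-11 00:40:36 UTC
sg_time         # 2019-04-11 08:40:36 +08
hour(utc_time)  # 0
hour(sg_time)   # 8
```

Both values represent the same instant; only the clock time shown, and therefore the extracted hour, differs. If local hours matter for the analysis, passing tz when converting the column is one option.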

Extracting trip starting locations

Code
# Using lubridate
origin_df <- df %>%
  group_by(trj_id) %>%
  arrange(pingtimestamp) %>% 
  filter(row_number()==1) %>% 
  mutate(weekday = wday(pingtimestamp, 
                        label=TRUE,
                        abbr=TRUE), 
         start_hr = factor(hour(pingtimestamp)), 
         day = factor(mday(pingtimestamp)))
Code
# head(origin_df)
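  • arrange() sorts the timestamps in ascending order.
  • After sorting, the first row of each trajectory contains the trip origin's coordinates, which filter(row_number()==1) keeps.
  • wday() extracts the day of the week from the timestamp; hour() and mday() extract the hour and the day of the month.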
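
To see the "first ping per trajectory" logic on something small, here is a minimal sketch on a made-up tibble. The toy data and its values are hypothetical, and slice_min() is used as an alternative to arrange() plus filter(row_number()==1); it is not the method used in the exercise code above, but it should select the same rows.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: two trajectories, pings deliberately out of order
toy <- tibble(
  trj_id        = c("a", "a", "b", "b"),
  pingtimestamp = as_datetime(c(200, 100, 400, 300)),
  rawlat        = c(1.31, 1.30, 1.29, 1.28)
)

# Keep the earliest ping of each trajectory, then drop the grouping
toy_origin <- toy %>%
  group_by(trj_id) %>%
  slice_min(pingtimestamp, n = 1, with_ties = FALSE) %>%
  ungroup()

toy_origin  # one row per trj_id: rawlat 1.30 for "a", 1.28 for "b"
```

Swapping slice_min() for slice_max() would pick the last ping instead, which corresponds to the trip destination. The ungroup() call is optional, but it avoids carrying the grouping into later steps.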

Extracting trip ending locations

Code
dest_df <- df %>%
  group_by(trj_id) %>%
  arrange(desc(pingtimestamp)) %>% 
  filter(row_number()==1) %>% 
  mutate(weekday = wday(pingtimestamp, 
                        label=TRUE,
                        abbr=TRUE), 
         end_hr = factor(hour(pingtimestamp)), 
         day = factor(mday(pingtimestamp)))
Code
# head(dest_df)

Save the datasets with the new variables

The original dataset takes up a lot of space, so the much smaller origin and destination data frames are saved in rds format for later use.

Code
write_rds(origin_df,"data/rds/origin_df.rds")
write_rds(dest_df,"data/rds/dest_df.rds")
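  • The rds format keeps object classes intact, such as the factor and date-time columns created above.
  • Saving the data in rds format also lets the updated data frames be reused later without repeating the preparation steps.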
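
The point about object classes can be checked with a small round trip. This is only a sketch: the toy data frame is hypothetical and a temporary file is used instead of the data/rds/ folder.

```r
library(readr)

# Hypothetical toy data frame with a factor and a date-time column
toy <- data.frame(
  start_hr      = factor(c("0", "8")),
  pingtimestamp = as.POSIXct(c(100, 200), origin = "1970-01-01", tz = "UTC")
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
restored <- read_rds(path)

class(restored$start_hr)                      # still a factor
inherits(restored$pingtimestamp, "POSIXct")   # still a date-time
```

A text format such as CSV would not store these classes, so the factor and date-time columns would have to be rebuilt after import.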

Data importing

In future sessions, the files can be read back as follows:

Code
origin_df <- read_rds("data/rds/origin_df.rds")
dest_df <- read_rds("data/rds/dest_df.rds")

Homework: Hands-on Ex 3 and Data Preparation for Take-home Ex 1