4 Importing Data

4.1 Prerequisites

In this chapter, we focus on the use of the readr package that forms a part of the tidyverse package. readr provides a fast and easy way to parse a file because it uses a sophisticated parser that guesses the data type for each column along with the flexibility to specify what to parse. You will also need the here package (to easily provide relative file paths) and the lubridate package (to work with dates and times).

library(tidyverse)
library(lubridate)
library(here)

In this chapter we will also be using the following datasets. If the following code returns FALSE, you should refer to the section Finding your file in the Preface on where you can download the datasets.

file.exists(here("data", "iris.csv"))
file.exists(here("data", "building.csv"))

4.2 Finding your file

Before importing data, you need to first let R know where to find the file by providing the file path. You can provide file paths relative to the top-level directory of the current R project with the here() function.

here()
## [1] "/home/runner/work/r4bes/r4bes"

To reference a file named iris.csv located in the data folder that is located in the top-level directory of your project:

here("data", "iris.csv")
## [1] "/home/runner/work/r4bes/r4bes/data/iris.csv"

4.3 Parsing a csv file

We will focus on the read_csv() function because building data are typically stored in csv format. You can find out more about other file formats in readr’s documentation.

The easiest way to parse a file is by providing the file path. In most cases, it will just work as expected where readr correctly guesses the column specification (that gets printed to the console) and you get a tibble as specified. readr_example() is a function that makes it easy to access example files to demonstrate {readr}'s capabilities.

read_csv(here("data", "iris.csv"))
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
....

4.4 Building data

Data from buildings are often messy without a standardized format. Therefore, it is not uncommon if readr does not guess correctly. Parsing date/times is particularly important when working with time-series building data.

For example, with the building.csv dataset, you can see that there are problems parsing the timestamp column. Specifically, column timestamp was parsed as a character vector when it actually contains a date/time. The skip argument is particularly useful since building data often contains meta-data that you want to exclude. A character vector can also be supplied to specify the column names.


read_csv(
    here("data", "building.csv"),
    col_names = c(
        "datetime",
        "power"
    ),
    skip = 2
)
## Rows: 1324 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): datetime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,324 × 2
##    datetime       power
##    <chr>          <dbl>
##  1 1/9/19 0:00 2210185.
##  2 1/9/19 0:30 2210196.
##  3 1/9/19 1:00 2210208.
##  4 1/9/19 1:30 2210219 
##  5 1/9/19 2:00 2210231 
##  6 1/9/19 2:30 2210242.
##  7 1/9/19 3:00 2210254.
....

You can correct this by taking advantage of the lubridate package that was covered in Chapter 3. To recap, you can transform a character vector into a datetime object by specifying the order of the year y, month m, day d, hour h, minute m and second s with the date (y, m, and d) and the time (h, m, and s) separated by an underscore _.

bldg <- read_csv(
    here("data", "building.csv"),
    col_names = c(
        "datetime",
        "power"
    ),
    skip = 2
)
## Rows: 1324 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): datetime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# transform the datetime character vector into a datetime object using {lubridate}
bldg$datetime <- dmy_hm(bldg$datetime)

3 Dates and Times

5 Regular expressions