4 Importing Data
4.1 Prerequisites
In this chapter, we focus on the use of the readr package that forms a part of the tidyverse package. readr provides a fast and easy way to parse a file because it uses a sophisticated parser that guesses the data type for each column along with the flexibility to specify what to parse. You will also need the here package (to easily provide relative file paths) and the lubridate package (to work with dates and times).
In this chapter we will also be using the following datasets. If the following
code returns FALSE
, you should refer to the section Finding your file in the
Preface on where you can download the datasets.
file.exists(here("data", "iris.csv"))
file.exists(here("data", "building.csv"))
4.2 Finding your file
Before importing data, you need to first let R know where to find the file by
providing the file path. You can provide file paths relative to the top-level
directory of the current R project with the here()
function.
here()
## [1] "/home/runner/work/r4bes/r4bes"
To reference a file named iris.csv
located in the data
folder that is
located in the top-level directory of your project:
here("data", "iris.csv")
## [1] "/home/runner/work/r4bes/r4bes/data/iris.csv"
4.3 Parsing a csv file
We will focus on the read_csv()
function because building data are typically
stored in csv
format. You can find out more about other file formats in
readr’s documentation.
The easiest way to parse a file is by providing the file path. In most cases, it
will just work as expected where readr correctly guesses the column
specification (that gets printed to the console) and you get a tibble as
specified.
readr_example()
is
a function that makes it easy to access example files to demonstrate {readr}'
s
capabilities.
read_csv(here("data", "iris.csv"))
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
....
4.4 Building data
Data from buildings are often messy without a standardized format. Therefore, it is not uncommon if readr does not guess correctly. Parsing date/times is particularly important when working with time-series building data.
For example, with the building.csv
dataset, you can see that there are
problems parsing the timestamp
column. Specifically, column timestamp
was
parsed as a character vector when it actually contains a date/time. The skip
argument is particularly useful since building data often contains meta-data
that you want to exclude. A character vector can also be supplied to specify the
column names.
read_csv(
here("data", "building.csv"),
col_names = c(
"datetime",
"power"
),
skip = 2
)
## Rows: 1324 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): datetime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1,324 × 2
## datetime power
## <chr> <dbl>
## 1 1/9/19 0:00 2210185.
## 2 1/9/19 0:30 2210196.
## 3 1/9/19 1:00 2210208.
## 4 1/9/19 1:30 2210219
## 5 1/9/19 2:00 2210231
## 6 1/9/19 2:30 2210242.
## 7 1/9/19 3:00 2210254.
....
You can correct this by taking advantage of the lubridate package that was
covered in Chapter 3. To recap, you can transform a character
vector into a datetime object by specifying the order of the year y
, month
m
, day d
, hour h
, minute m
and second s
with the date (y
, m
, and
d
) and the time (h
, m
, and s
) separated by an underscore _
.
bldg <- read_csv(
here("data", "building.csv"),
col_names = c(
"datetime",
"power"
),
skip = 2
)
## Rows: 1324 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): datetime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# transform the datetime character vector into a datetime object using {lubridate}
bldg$datetime <- dmy_hm(bldg$datetime)