The vroom package contains one main function, vroom(), which is used to read all types of delimited files. A delimited file is any file in which the data is separated (delimited) by one or more characters.

The most common type of delimited file is CSV (Comma Separated Values); these files typically have a .csv suffix.

library(vroom)

This vignette covers the following topics:

  • The basics of reading files, including
    • single files
    • multiple files
    • compressed files
    • remote files
  • Skipping particular columns
  • Specifying column types, for additional safety and when the automatic guessing fails
  • Writing regular and compressed files

Reading files

To read a CSV or other type of delimited file with vroom, pass the file to vroom(). The delimiter will be guessed automatically if it is a common delimiter. If the guessing fails, or you are using a less common delimiter, specify it with the delim parameter (e.g. delim = ",").

We have included an example CSV file in the vroom package for use in examples and tests. Access it with vroom_example("mtcars.csv").

# See where the example file is stored on your machine
file <- vroom_example("mtcars.csv")
file
#> [1] "/home/travis/R/Library/vroom/extdata/mtcars.csv"

# Read the file; by default vroom guesses the delimiter automatically.
vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# You can also specify it explicitly, which is (slightly) faster, and safer if
# you know how the file is delimited.
vroom(file, delim = ",")
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 32 x 12
#>   model     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda …  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

Reading multiple files

If you are reading a set of files which all have the same columns, you can pass the filenames directly to vroom() and it will combine them into one result.

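First we will create some files to read by splitting the mtcars dataset by number of cylinders (it is OK if you don’t currently understand this code).

# Move the row names into a regular column called "model"
mt <- tibble::rownames_to_column(mtcars, "model")
# Write one file per cylinder count: .x is the subset, .y is its cylinder value.
# Note that the files are tab-delimited ("\t") despite the .csv suffix.
purrr::iwalk(
  split(mt, mt$cyl),
  ~ vroom_write(.x, glue::glue("mtcars_{.y}.csv"), "\t")
)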

We can then efficiently read them into one table by passing the filenames directly to vroom.

Often the filename or the directory where the files are stored contains information. In this case, the id parameter adds an extra column to the result holding the path to each file; for example, id = "path" names that column path.
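As a minimal sketch of both steps, the block below recreates the per-cylinder files in a temporary directory so that it runs on its own, then reads them in a single call, first without and then with id. The temporary directory, the base R list.files() call used to collect the paths, and the explicit delim = "\t" are choices made for this illustration, not requirements of vroom.

```r
library(vroom)

# Recreate the per-cylinder files in a temporary directory (an assumption
# made so the sketch is self-contained; any directory works).
dir <- tempfile("vroom-multi-")
dir.create(dir)
mt <- data.frame(model = rownames(mtcars), mtcars, stringsAsFactors = FALSE)
for (cyl in unique(mt$cyl)) {
  vroom_write(
    mt[mt$cyl == cyl, ],
    file.path(dir, paste0("mtcars_", cyl, ".csv")),
    delim = "\t"
  )
}

# Collect the paths of the files just written.
files <- list.files(dir, pattern = "^mtcars_.*\\.csv$", full.names = TRUE)

# Passing the whole vector of paths returns one combined table.
# The files were written tab-delimited, so the delimiter is named explicitly.
combined <- vroom(files, delim = "\t")

# id = "path" adds a column named path recording the source file of each row.
with_id <- vroom(files, delim = "\t", id = "path")
unique(basename(with_id$path))
```

Because the cylinder count is part of each filename, the path column keeps that information available even if the cyl column were dropped.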

Reading remote files

vroom can also read files from the internet if you pass the URL of the file to vroom().

It can even read gzipped files from the internet (although currently not the other compressed formats).
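The calls look the same as for local files. In the sketch below both URLs are hypothetical placeholders on example.com, not real data sets, so the reads themselves are left commented out; substitute the location of a file you actually host.

```r
library(vroom)

# Hypothetical locations, for illustration only; replace with real URLs.
csv_url <- "https://example.com/data/mtcars.csv"
gz_url <- "https://example.com/data/mtcars.csv.gz"

# Commented out because the placeholder URLs do not resolve to real files.
# remote <- vroom(csv_url)
# remote_gz <- vroom(gz_url)
```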

Column selection

vroom provides the same interface for column selection and renaming as dplyr::select(), which makes selections flexible and readable. Selections are passed through the col_select argument and can be:

  • A character vector of column names
  • A numeric vector of column indexes, e.g. c(1, 2, 5)
  • Selection helpers, such as starts_with()
  • Expressions that rename columns
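Each form of selection can be sketched against the bundled mtcars.csv example file. The particular columns chosen here are arbitrary, and the selections are passed through vroom()'s col_select argument.

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# A character vector of column names
by_name <- vroom(file, col_select = c("model", "cyl", "gear"))

# A numeric vector of column indexes (columns 1, 3 and 11 of the file)
by_index <- vroom(file, col_select = c(1, 3, 11))

# A selection helper: every column whose name starts with "d"
by_helper <- vroom(file, col_select = starts_with("d"))

# Rename model to car and keep all remaining columns
renamed <- vroom(file, col_select = c(car = model, everything()))

names(renamed)
```

Columns that are not selected do not appear in the result, so a narrow selection is a simple way to skip data you do not need.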

Reading fixed width files

A fixed width file can be a very compact representation of numeric data. Unfortunately, it’s also painful because you need to describe the length of every field. vroom aims to make it as easy as possible by providing a number of different ways to describe the field structure. Use vroom_fwf() in conjunction with one of the following helper functions to read the file.

  • fwf_empty() - Guess based on the position of empty columns.
  • fwf_cols() - Use user-provided named pairs of positions.
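A short sketch of both helpers follows. It assumes the sample file fwf-sample.txt that ships with vroom; the column names and the positions passed to fwf_cols() are illustrative choices for that file.

```r
library(vroom)

# Sample fixed width file assumed to be bundled with vroom.
fwf_sample <- vroom_example("fwf-sample.txt")
writeLines(readLines(fwf_sample))

# 1. Guess the field boundaries from columns that are empty in every row.
guessed <- vroom_fwf(
  fwf_sample,
  fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn"))
)

# 2. Give named pairs of start and end positions; only those fields are read.
by_position <- vroom_fwf(
  fwf_sample,
  fwf_cols(name = c(1, 20), ssn = c(30, 42))
)

names(by_position)
```

Guessing is convenient for exploration, while explicit positions are the safer choice when the layout is documented.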

Column types

vroom guesses the data types of columns as they are read; however, sometimes it is necessary to change the type of one or more columns.

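The available specifications are (with single-letter abbreviations in quotes):

  • col_logical() 'l'
  • col_integer() 'i'
  • col_double() 'd'
  • col_character() 'c'
  • col_factor() 'f'
  • col_date() 'D'
  • col_time() 't'
  • col_datetime() 'T'
  • col_number() 'n'
  • col_skip() '_' or '-'
  • col_guess() '?'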

You can tell vroom what column types to use with the col_types argument in a number of ways.

If you only need to override a single column, the most concise way is to use a named vector.
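For instance, the following sketch reads the bundled example file and forces only the hp column to integer; the choice of hp is arbitrary, and every other column is still guessed.

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# Override only hp; "i" is the abbreviation for integer.
cars <- vroom(file, col_types = c(hp = "i"))

class(cars$hp)
```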

However, you can also use the col_*() functions in a list.

This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.
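A sketch of the list form, again using the bundled example file. The factor levels "3", "4" and "5" are supplied here on the assumption that they cover every value of gear in that file.

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# Override one column with a col_*() function in a list.
cars_int <- vroom(file, col_types = list(hp = col_integer()))

# col_factor() carries extra information: the full set of levels.
cars_fct <- vroom(
  file,
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)

levels(cars_fct$gear)
```

Declaring the levels up front fixes their order and keeps it consistent across files, even when a given file does not contain every level.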

Writing delimited files

Use vroom_write() to write delimited files. The default delimiter is tab.

vroom_write(mtcars, "mtcars.tsv")

Writing CSV delimited files

Use delim = "," to write CSV files.

vroom_write(mtcars, "mtcars.csv", delim = ",")

Writing compressed files

Output is compressed automatically with gzip, bzip2 or xz if the filename ends in .gz, .bz2 or .xz.

vroom_write(mtcars, "mtcars.tsv.gz")

vroom_write(mtcars, "mtcars.tsv.bz2")

vroom_write(mtcars, "mtcars.tsv.xz")
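To check that a compressed file round-trips, you can write to a temporary path and read it back. This is a minimal sketch: the temporary path is arbitrary, and it relies on vroom() reading gzip-compressed files directly, as the list of topics at the top of this vignette indicates.

```r
library(vroom)

# Write to a temporary path ending in .gz so the output is gzip-compressed.
path <- tempfile(fileext = ".tsv.gz")
vroom_write(mtcars, path)

# Read it back; naming the delimiter avoids relying on guessing.
round_trip <- vroom(path, delim = "\t")

dim(round_trip)
```

Note that vroom_write() does not write row names, so the car models are not part of this file; move them into a column first if you need them.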

It is also possible to use other compressors by using pipe() to create a pipe connection, for example pigz (a parallel gzip implementation), lbzip2 (a parallel bzip2 implementation) or pixz (a parallel xz implementation). The parallel versions can be considerably faster for large output files.

vroom_write(mtcars, pipe("pigz > mtcars.tsv.gz"))

Further reading

vignette("benchmarks") discusses the performance of vroom, how it compares to alternatives, and how it achieves its results.