vroom is a new approach to reading delimited and fixed width data into R.

It stems from the observation that, when parsing files, reading the data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

Therefore you can obtain very rapid input by first performing a fast indexing step and then using the Altrep framework, available in R versions 3.5+, to access the values in a lazy / delayed fashion.

How it works

The initial reading of the file simply records the locations of each individual record; the actual values are not read into R. Altrep vectors are created for each column in the data, which hold a pointer to the index and the memory-mapped file. When these vectors are indexed, the value is read from the memory mapping.

This means initial reading is extremely fast; in the real-world dataset below it takes ~1/4 the time of the multi-threaded data.table::fread(). Sampling operations are likewise extremely fast, as only the data actually included in the sample is read. This means things like the tibble print method, calling head(), tail(), x[sample(nrow(x), 100), ], etc. have very low overhead. Filtering can also be fast: only the columns included in the filter selection have to be fully read, and only the data in the filtered rows needs to be read from the remaining columns. Grouped aggregations likewise only need to read the grouping variables and the variables being aggregated.
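
For instance, a minimal sketch of these cheap row-access operations (the file name here is hypothetical):

```r
library(vroom)

x <- vroom("trips.tsv")    # hypothetical file; only indexed at this point

head(x)                    # reads just the first few rows
tail(x)                    # reads just the last few rows
x[sample(nrow(x), 100), ]  # reads only the 100 sampled rows
```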

Once a particular vector is fully materialized, the speed of all subsequent operations should be identical to that of a normal R vector.

This approach also potentially allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once, it can be efficiently queried and subset.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading, the benchmarks also perform some manipulation of the data afterwards, to try to provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others, they are excluded from the plots but retained in the tables.

Taxi Trip Dataset

This real-world dataset comes from the NYC Taxi and Limousine Commission's 2013 Freedom of Information Law (FOIL) Taxi Trip Data, originally posted at http://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table, trip_fare_1.csv, is 1.55G in size.
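
A minimal sketch of reading it, assuming the file has been downloaded into the working directory; the column summary below is what glimpse() prints:

```r
library(vroom)

# Index the file; values are parsed lazily on access
trips <- vroom("trip_fare_1.csv")
tibble::glimpse(trips)
```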

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

The code used to run the taxi benchmarks is in bench/taxi-benchmark.R.

The benchmarks labeled vroom_base use vroom to read the file and base functions to manipulate it. vroom_dplyr uses vroom to read the file and dplyr functions to manipulate; data.table uses fread() to read the file and data.table functions to manipulate; and readr uses readr to read the file and dplyr to manipulate. By default vroom only uses Altrep for character vectors; these runs are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types, and vroom(altrep: none) disables Altrep entirely.
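
A hedged sketch of how these Altrep settings might be selected. The argument has changed name across vroom versions (altrep_opts in the 1.0.x series benchmarked here, altrep in later releases), so treat the exact interface as an assumption and check ?vroom for your installed version:

```r
library(vroom)

# "normal": the default, Altrep vectors for character columns only
x_normal <- vroom("trip_fare_1.csv")

# "full": Altrep vectors for all supported column types
# (argument name/values are an assumption; see ?vroom)
x_full <- vroom("trip_fare_1.csv", altrep = TRUE)

# "none": disable Altrep entirely
x_none <- vroom("trip_fare_1.csv", altrep = FALSE)
```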

The following operations are performed; a hedged code sketch of them follows the list.

  • The data is read
  • print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
  • head()
  • tail()
  • Sampling 100 random rows
  • Filtering for “UNK” payment type, which matches 6,434 rows (0.0435% of the total).
  • Aggregation of mean fare amount per payment type.
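
A hedged sketch of these steps in the dplyr flavour; bench/taxi-benchmark.R is the authoritative version:

```r
library(vroom)
library(dplyr)

x <- vroom("trip_fare_1.csv")

print(x)                                 # tibble print method
head(x)
tail(x)
sampled <- sample_n(x, 100)              # 100 random rows
unk <- filter(x, payment_type == "UNK")  # "UNK" payment type rows
fares <- x %>%
  group_by(payment_type) %>%
  summarise(mean_fare = mean(fare_amount))
```
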
| package | manip | altrep | read | print | head | tail | sample | filter | aggregate | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| read.delim | base | NA | 1m 14.5s | 5ms | 1ms | 1ms | 1ms | 312ms | 959ms | 1m 15.7s |
| readr | dplyr | NA | 34.8s | 93ms | 1ms | 1ms | 1ms | 185ms | 736ms | 35.8s |
| vroom | base | none | 17.8s | 95ms | 1ms | 1ms | 1ms | 1.1s | 1.4s | 20.4s |
| vroom | dplyr | none | 18.3s | 92ms | 1ms | 1ms | 1ms | 898ms | 747ms | 20.1s |
| data.table | data.table | NA | 15.2s | 11ms | 1ms | 1ms | 1ms | 94ms | 222ms | 15.5s |
| vroom | base | normal | 2.3s | 86ms | 1ms | 1ms | 1ms | 1.4s | 8.1s | 11.8s |
| vroom | base | full | 1.3s | 121ms | 1ms | 1ms | 1ms | 1.3s | 8s | 10.8s |
| vroom | dplyr | full | 1.2s | 160ms | 1ms | 1ms | 1.1s | 1.4s | 2.1s | 6s |
| vroom | dplyr | normal | 2.4s | 88ms | 1ms | 1ms | 1ms | 1.3s | 2.1s | 5.8s |

(N.B. the Rcpp code used in the dplyr implementation fully materializes all the Altrep numeric vectors when filter() or sample_n() is called, which is why those cases have additional overhead when using full Altrep.)

All numeric data

The code used to run the all numeric benchmarks is in bench/all_numeric-benchmark.R.

All numeric data is really a worst-case scenario for vroom: the index takes about as much memory as the parsed data. Also, because parsing doubles can be done quickly in parallel and text representations of doubles are at most ~25 characters, there isn't a great deal of saving from delayed parsing.

For these reasons (and because the data.table implementation is very fast), vroom is a bit slower than fread for pure numeric data. However, vroom is multi-threaded, so it remains quicker than readr and read.delim.

| package | manip | altrep | read | print | head | tail | sample | filter | aggregate | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| read.delim | base | NA | 1m 58.5s | 7ms | 1ms | 1ms | 1.5s | 4.8s | 36ms | 2m 4.8s |
| readr | dplyr | NA | 12.6s | 103ms | 1ms | 1ms | 3ms | 12ms | 32ms | 12.8s |
| vroom | base | normal | 1.2s | 109ms | 1ms | 1ms | 3ms | 68ms | 53ms | 1.4s |
| vroom | base | none | 1.2s | 105ms | 1ms | 1ms | 3ms | 63ms | 52ms | 1.4s |
| vroom | dplyr | full | 269ms | 188ms | 1ms | 1ms | 841ms | 12ms | 38ms | 1.3s |
| vroom | dplyr | none | 1.2s | 106ms | 1ms | 1ms | 3ms | 12ms | 32ms | 1.3s |
| vroom | dplyr | normal | 1.1s | 105ms | 1ms | 1ms | 3ms | 12ms | 32ms | 1.3s |
| vroom | base | full | 370ms | 156ms | 1ms | 1ms | 3ms | 54ms | 272ms | 854ms |
| data.table | data.table | NA | 623ms | 12ms | 1ms | 1ms | 3ms | 6ms | 54ms | 696ms |

All character data

The code used to run the all character benchmarks is in bench/all_character-benchmark.R.

All character data is a best-case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

| package | manip | altrep | read | print | head | tail | sample | filter | aggregate | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| read.delim | base | NA | 1m 39.9s | 7ms | 1ms | 1ms | 1ms | 24ms | 401ms | 1m 40.3s |
| readr | dplyr | NA | 1m 2.9s | 102ms | 1ms | 1ms | 4ms | 15ms | 404ms | 1m 3.4s |
| vroom | base | none | 51.6s | 105ms | 1ms | 1ms | 3ms | 25ms | 2.3s | 54s |
| vroom | dplyr | none | 48.9s | 102ms | 1ms | 1ms | 4ms | 15ms | 316ms | 49.4s |
| data.table | data.table | NA | 38s | 13ms | 1ms | 1ms | 3ms | 10ms | 244ms | 38.3s |
| vroom | base | normal | 333ms | 176ms | 1ms | 1ms | 3ms | 198ms | 2.1s | 2.8s |
| vroom | base | full | 358ms | 136ms | 1ms | 1ms | 3ms | 181ms | 2.1s | 2.8s |
| vroom | dplyr | normal | 273ms | 176ms | 1ms | 1ms | 4ms | 197ms | 1.4s | 2.1s |
| vroom | dplyr | full | 272ms | 173ms | 1ms | 1ms | 4ms | 191ms | 1.4s | 2s |

Reading multiple delimited files

The code used to run the taxi multiple file benchmarks is at bench/taxi_multiple-benchmark.R.

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 12 columns, for a total file size of 18.4G.
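
vroom() accepts a vector of paths and reads them as a single table, which is how a sketch of this might look (the file names here are assumptions based on the dataset's naming):

```r
library(vroom)

# Read all 12 monthly files into one tibble; the optional `id` argument
# adds a column recording which file each row came from
files <- sprintf("trip_fare_%d.csv", 1:12)
trips <- vroom(files, id = "path")
```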

| package | manip | altrep | read | print | head | tail | sample | filter | aggregate | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| read.delim | base | NA | 25m 56s | 9ms | 1ms | 1ms | 1ms | 22.6s | 7s | 26m 25.6s |
| readr | dplyr | NA | 7m 31.8s | 91ms | 1ms | 1ms | 1ms | 11.8s | 9.3s | 7m 53s |
| data.table | data.table | NA | 4m 20.1s | 5ms | 1ms | 1ms | 1ms | 1s | 3.8s | 4m 25s |
| vroom | dplyr | none | 3m 38.2s | 91ms | 1ms | 1ms | 1ms | 2.2s | 9.3s | 3m 49.8s |
| vroom | dplyr | full | 12.5s | 129ms | 1ms | 1ms | 10.2s | 18.8s | 24.8s | 1m 6.4s |

Reading fixed width files

United States Census 5-Percent Public Use Microdata Sample files

This fixed width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state, and the state of California was used in this benchmark.

The data totals 2,342,340 rows and 113 columns with a total file size of 677M.
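
A hedged sketch of reading it with vroom_fwf(). The file name is an assumption, and fwf_empty() merely guesses column boundaries from the first rows of data; a real analysis would build positions from the Census record layout with fwf_positions() or fwf_widths():

```r
library(vroom)

# Guess column boundaries from the data (an approximation; the documented
# record layout is the authoritative source for PUMS column positions)
spec <- fwf_empty("all_California.txt")
pums <- vroom_fwf("all_California.txt", col_positions = spec)
```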

Census data benchmarks

The code used to run the census data benchmarks is at bench/fwf-benchmark.R.

| package | manip | altrep | read | print | head | tail | sample | filter | aggregate | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| read.delim | base | NA | 16m 25.2s | 15ms | 2ms | 1ms | 3ms | 451ms | 81ms | 16m 25.7s |
| readr | dplyr | NA | 27.5s | 260ms | 1ms | 1ms | 7ms | 100ms | 126ms | 27.9s |
| vroom | dplyr | none | 14.5s | 131ms | 1ms | 1ms | 7ms | 177ms | 534ms | 15.3s |
| vroom | dplyr | normal | 7.3s | 123ms | 1ms | 1ms | 7ms | 476ms | 438ms | 8.3s |
| vroom | dplyr | full | 232ms | 218ms | 1ms | 1ms | 2.7s | 450ms | 417ms | 4s |

Writing delimited files

The code used to run the taxi writing benchmarks is at bench/taxi_writing-benchmark.R.

The benchmarks write out the taxi trip dataset in several different ways; in addition to the three described below, the results table includes xz and zstandard compression runs.

  • An uncompressed file
  • A gzip compressed file using gzfile() (readr and vroom do this automatically for files ending in .gz)
  • A multithreaded gzip compressed file using a pipe() connection to pigz

Note the current CRAN version of data.table (1.12.2) does not support writing directly to gzip-compressed files, though the development version (1.12.3) does.
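
A hedged sketch of the three write paths with vroom (the object x and output file names are placeholders):

```r
library(vroom)

# Uncompressed (tab-delimited by default)
vroom_write(x, "trips.tsv")

# gzip: compression is applied automatically because the path ends in .gz
vroom_write(x, "trips.tsv.gz")

# Multithreaded gzip by streaming through an external pigz process
vroom_write(x, pipe("pigz > trips.tsv.gz"))
```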

| method | write.delim | readr | data.table | vroom |
| --- | --- | --- | --- | --- |
| xz | 14m 43.5s | 13m 12.9s | NA | 10m 3.9s |
| gzip | 3m 23.9s | 2m 2.9s | NA | 1m 17.1s |
| multithreaded gzip | 1m 41.4s | 51.8s | NA | 8.1s |
| zstandard | 1m 40.2s | 50.2s | NA | 12.9s |
| uncompressed | 1m 41.5s | 51.2s | 1.2s | 1.7s |

Session and package information

| package | version | date | source |
| --- | --- | --- | --- |
| base | 3.6.0 | 2019-05-06 | local |
| data.table | 1.12.2 | 2019-04-07 | CRAN (R 3.6.0) |
| dplyr | 0.8.1 | 2019-05-14 | CRAN (R 3.6.0) |
| readr | 1.3.1 | 2018-12-21 | CRAN (R 3.6.0) |
| vroom | 1.0.2 | 2019-06-27 | local |