As memory and computational capacities increase, so does the amount of data we deal with daily. The volume of stored navigation data grows every year, and processing such data sets can be truly cumbersome, especially for market researchers who are new to the field and have to extract insight on a person's digital behavior from a jungle of millions of URLs. These data sets can be handled quite easily with statistical software like R, provided one has enough RAM to load them.
R is an open-source scripting language and statistics environment that has become very popular, especially in this field. Its popularity is due not only to being open source, but mostly to the large community of developers that makes R a very rich and constantly evolving piece of software. For these reasons, R has been adopted by several companies, Netquest among them.
However, despite all these advantages, there is another side to the coin. R is an interpreted language; therefore, simple instructions that would run in no time in a compiled program, such as a loop with a conditional statement, may take forever if the data set is sufficiently large. This fact has encouraged the community to develop new packages that tackle the computational delay and keep data scientists from spending countless hours staring at the screen while they wait for R to finish running.
The main goal of this blog post is to offer a viable solution to this problem: the Rcpp package [1]. This package allows you to integrate C++ code into your R scripts, which can drastically speed up bottleneck functions. To show how remarkable the difference can be in terms of computational cost, I built an example. The test data set consisted of 130,000,000 observations, and the two pieces of code used were semantically equivalent, although one was written in plain R and the other in C++ via Rcpp.
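If you want to follow along, Rcpp is available from CRAN and can be installed and loaded like any other package:

```r
# Rcpp is on CRAN; install it once, then load it in each session.
install.packages("Rcpp")
library(Rcpp)
```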
The data set was stored in a data table [2] denoted dt, which included a variable, incDiff, whose values were positive and incremental. Our task was to split the data into chunks according to a certain threshold value: we walk through the vector, and as soon as the difference between the first element of the current chunk and the element we are visiting exceeds the threshold, we flag that element as the beginning of the next chunk. Once finished, we obtain a vector of flags indicating where to cut the data into consecutive chunks, so that the difference between the elements within a chunk never exceeds the threshold. This may sound like a very particular request; however, if we consider that the data table contained an original date variable, and that the vector we want to split is the accumulated difference between dates, then the chunks become values falling within a period of time. Suddenly, the task sounds much more useful.
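The original data set is not reproduced here, but a minimal sketch of a comparable table could look like the following. The names dt and incDiff come from this post; everything else (the size, the seed, the distribution of the increments) is an assumption for illustration only:

```r
# A minimal sketch of a comparable data set; the original post does not show
# this step, so the construction below is an assumption for illustration.
library(data.table)

set.seed(42)
n <- 1e6  # far smaller than the 130,000,000 observations used in the post
dt <- data.table(
  # incDiff: positive, incremental values, e.g. an accumulated difference
  # between consecutive dates expressed in seconds
  incDiff = cumsum(runif(n, min = 0, max = 5))
)
```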
Method A: R scripting alone
```r
fragments <- function(incrVec, splitThreshold) {
  chunks <- rep(0, length(incrVec))
  chunks[1] <- 1                # the first element always starts a chunk
  curIncValue <- incrVec[1]
  for (i in 2:length(incrVec)) {
    # flag the element that starts a new chunk and reset the reference value
    if (incrVec[i] > curIncValue + splitThreshold) {
      chunks[i] <- 1
      curIncValue <- incrVec[i]
    }
  }
  return(chunks)
}

threshold <- 100
system.time(dt[, chunks_data := fragments(incDiff, threshold)])
```
Method B: Including Rcpp
cppFunction("NumericVector fragments_Rcpp(NumericVector incrVec, int splitThreshold){ int lenVec= incrVec.size(); NumericVector chunks(lenVec); chunks [0] = 1; int curIncValue= incrVec [0]; for (int i = 1; i < lenVec; i++){ if ( incrVec [i] > curIncValue + splitThreshold ){ chunks [i]=1; curIncValue= incrVec [i]; } else chunks [i]=0; } return chunks; }") threshold <- 100 system.time( dt[, chunks_data := fragments_Rcpp(incDiff, threshold)]) |
Execution times show that even though the code was extremely simple, the timings already differed enormously: the Rcpp version ran almost 30 times faster. Larger pieces of code may show different proportions; nevertheless, the gain in computational cost is undeniable.
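A single system.time() call can be noisy, so for more reliable numbers one could average over repeated runs, for instance with the microbenchmark package (a suggestion on my part, not something used for the figures above):

```r
# Averaging timings over several runs with the microbenchmark package
# (not used in the original benchmark) gives a more stable comparison.
library(microbenchmark)

microbenchmark(
  R    = fragments(dt$incDiff, threshold),
  Rcpp = fragments_Rcpp(dt$incDiff, threshold),
  times = 10
)
```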
A side conclusion one may draw from this post is that whenever we come across a problem in R, it is worth searching for it online. The R community is so large that someone else has most likely faced the same problem and already developed a solution, which will be open source! Hence, I strongly recommend keeping up to date and having a look from time to time at the latest releases on CRAN (the Comprehensive R Archive Network [3]).
[1] https://cran.r-project.org/web/packages/Rcpp/index.html
[2] https://cran.r-project.org/web/packages/data.table/index.html
[3] https://cran.r-project.org/