R computing not so fast
I have CSV data in the format price,volume:
price,volume
329.237000000000,0.011000000000
329.500000000000,1.989000000000
328.006000000000,0.032000000000
328.447000000000,0.010100000000
328.448000000000,0.201455000000
327.839000000000,0.011188600000
328.006000000000,0.064333000000
327.930000000000,0.020800000000
328.006000000000,0.064333000000
327.918000000000,0.011139500000
327.869000000000,0.011090600000
328.127000000000,0.033460100000
....
Moreover, there are 16 million rows.
What I want is to group the prices and amounts into volume-based OHLCV ticks (bars), with 100,000 USD of traded value per tick. It took 200 seconds to group just 16,000 rows, so it's far too slow...
I'm using a while loop and have no idea how to get rid of it.
The output should look like this:
   open    high    low     close   volume (usd)
1  329.237 329.500 329.237 329.500 100.00000
2  329.500 329.500 329.500 329.500 100.00000
3  329.500 329.500 329.500 329.500 100.00000
4  329.500 329.500 329.500 329.500 100.00000
5  329.500 329.500 329.500 329.500 100.00000
6  329.500 329.500 329.500 329.500 100.00000
7  328.006 328.448 328.006 328.448 100.00000
8  328.448 328.127 328.448 328.127 100.00000
9  328.127 327.695 328.127 327.695 100.00000
10 327.695 327.695 327.695 327.695 100.00000
11 327.695 327.695 327.695 327.695 100.00000
Code:
library(data.table)

# choose file
# dti <- fread(file.choose())
dti <- fread("test2.csv")
names(dti)[1] <- "price"
names(dti)[2] <- "volume"

# row count
irows <- nrow(dti)

# total volume in BTC and USD
vol_btc <- sum(dti$volume)
vol_usd <- sum(dti$price * dti$volume)

# number of bars, 100000 USD each
vol_range <- 100000
bc <- ceiling(vol_usd / vol_range)

dto <- data.table(open   = numeric(bc),
                  high   = numeric(bc),
                  low    = numeric(bc),
                  close  = numeric(bc),
                  volume = numeric(bc))

i <- 1
j <- 1
while (i <= irows) {
  pri  <- dti$price[i]
  vol  <- dti$volume[i]
  volu <- pri * vol

  if (dto$open[j] == 0) {
    # new OHLCV bar
    dto$open[j] <- pri
    dto$high[j] <- pri
    dto$low[j]  <- pri
  } else {
    if (dto$high[j] < pri) dto$high[j] <- pri
    if (dto$low[j]  > pri) dto$low[j]  <- pri
  }
  dto$close[j] <- pri

  volc <- dto$volume[j] + volu - vol_range
  if (volc < 0) {
    dto$volume[j] <- dto$volume[j] + volu
  } else {
    dto$volume[j] <- vol_range
    j <- j + 1
    if (volc > 0) {
      dto$open[j]  <- pri
      dto$high[j]  <- pri
      dto$low[j]   <- pri
      dto$close[j] <- pri
      if (volc > vol_range) {
        dto$volume[j] <- vol_range
        k <- floor(volc / vol_range)
        if (k > 0) {
          dto[(j + 1):(j + k - 1)] <- dto[j]
          volc <- volc - vol_range * k
          j <- j + k
        }
      }
      dto$volume[j] <- volc
    }
  }
  i <- i + 1
}
This isn't quite right, but maybe it gives an indication of how to make this type of operation faster. Here's the data:
url <- "http://pastebin.com/raw.php?i=hsgacr2l"
dfi <- read.csv(url)
I calculate the product of price and volume, and the cumulative sum of that product. The calculation is vectorized and fast.
pv <- with(dfi, price * volume)
cpv <- cumsum(pv)
vol_range <- 100000
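As a minimal illustration (my own made-up numbers and a tiny bar size, not the real data), the cumulative sum is what tells you where the running USD value crosses another multiple of vol_range, i.e. roughly where a new bar should start:

# Toy example with invented prices/volumes and a small bar size,
# only to show how the cumulative USD value relates to bar boundaries
toy <- data.frame(price  = c(100, 101,  99, 102),
                  volume = c(  2,   1,   3,   1))
toy_pv    <- toy$price * toy$volume   # USD value of each trade
toy_cpv   <- cumsum(toy_pv)           # running USD total
toy_range <- 250                      # pretend each bar holds 250 USD
floor(toy_cpv / toy_range)            # number of full bars completed after each trade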
My strategy is to figure out how to group the data in a relatively efficient way. I did this by creating a logical vector that is TRUE whenever a new group starts (I think the actual calculation below is wrong, and there are edge cases where it fails; the strategy needs to be re-thought, but the notion is to minimize non-vectorized data modification).
grp <- logical(nrow(dfi))
i <- 1
repeat {
    grp[i] <- TRUE
    ## find the first index evaluating to 'TRUE'
    i <- which.max(cpv - (cpv[i] - pv[i]) > vol_range)
    ## prevent failures when, e.g., any(diff(cpv) > vol_range)
    if (i > 1L && grp[i] == TRUE)
        i <- i + 1L
    if (i == 1L)   # no TRUE values, so FALSE is the max, and element 1 is the first FALSE
        break
}
cumsum(grp) divides the data into the first, second, ... groups, and I add this to the data frame:
dfi$group <- cumsum(grp)
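As a sanity check (my addition, not part of the original answer), you can look at how much USD value lands in each group; most groups should come out near vol_range, and the outliers point at the edge cases mentioned above:

# USD value accumulated in each group; values far from vol_range
# indicate edge cases (e.g. a single trade larger than one bar)
grp_usd <- tapply(pv, dfi$group, sum)
summary(grp_usd)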
For the output, the basic strategy is to split price (etc.) by group and apply a function to each group. There are a number of ways to do this; tapply is not particularly efficient (data.table excels at these types of calculations, but doesn't provide a particular benefit at this point), but it is sufficient for data at this scale.
dfo <- with(dfi, {
    data.frame(
        open   = tapply(price, group, function(x) x[1]),
        high   = tapply(price, group, max),
        low    = tapply(price, group, min),
        close  = tapply(price, group, function(x) x[length(x)]),
        volume = tapply(volume, group, sum),
        pv     = tapply(price * volume, group, sum))
})
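Since data.table is mentioned above, here is a sketch of what the same per-group summary might look like in data.table syntax (untested, and it assumes dfi already has the group column added in the previous step):

# Possible data.table version of the same aggregation
library(data.table)
dtab <- as.data.table(dfi)
dto2 <- dtab[, .(open   = price[1],
                 high   = max(price),
                 low    = min(price),
                 close  = price[.N],
                 volume = sum(volume),
                 pv     = sum(price * volume)),
             by = group]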
This takes a fraction of a second on the 10,000-row sample data.
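If you want to check the timing on your own machine, one simple way is to wrap the aggregation in system.time(); this is just a sketch, and the numbers will depend on hardware and data size:

# Time the per-group aggregation (requires dfi$group from above)
system.time(
    dfo <- with(dfi, data.frame(
        open   = tapply(price, group, function(x) x[1]),
        high   = tapply(price, group, max),
        low    = tapply(price, group, min),
        close  = tapply(price, group, function(x) x[length(x)]),
        volume = tapply(volume, group, sum),
        pv     = tapply(price * volume, group, sum)))
)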