R computing not so fast
I have CSV data in the format price,volume:
price,volume
329.237000000000,0.011000000000
329.500000000000,1.989000000000
328.006000000000,0.032000000000
328.447000000000,0.010100000000
328.448000000000,0.201455000000
327.839000000000,0.011188600000
328.006000000000,0.064333000000
327.930000000000,0.020800000000
328.006000000000,0.064333000000
327.918000000000,0.011139500000
327.869000000000,0.011090600000
328.127000000000,0.033460100000
....
Moreover, there are 16 million rows.
What I want is to group the prices and amounts into volume-based OHLCV ticks (bars), with 100,000 USD of traded value per tick. It took 200 seconds to group just 16,000 rows, so it's far too slow...
I'm using a while loop and have no idea how to get rid of it.
The output should look like this:
   open    high    low     close   volume (usd)
1  329.237 329.500 329.237 329.500 100.00000
2  329.500 329.500 329.500 329.500 100.00000
3  329.500 329.500 329.500 329.500 100.00000
4  329.500 329.500 329.500 329.500 100.00000
5  329.500 329.500 329.500 329.500 100.00000
6  329.500 329.500 329.500 329.500 100.00000
7  328.006 328.448 328.006 328.448 100.00000
8  328.448 328.127 328.448 328.127 100.00000
9  328.127 327.695 328.127 327.695 100.00000
10 327.695 327.695 327.695 327.695 100.00000
11 327.695 327.695 327.695 327.695 100.00000
Code:
library(data.table)

# choose file
# dti <- fread(file.choose())
dti <- fread("test2.csv")
names(dti)[1] <- "price"
names(dti)[2] <- "volume"

# row count
irows <- nrow(dti)

# total volume in BTC and USD
vol_btc <- sum(dti$volume)
vol_usd <- sum(dti$price * dti$volume)

# number of bars, 100000 USD each
vol_range <- 100000
bc <- ceiling(vol_usd / vol_range)

dto <- data.table(open   = numeric(bc),
                  high   = numeric(bc),
                  low    = numeric(bc),
                  close  = numeric(bc),
                  volume = numeric(bc))

i <- 1
j <- 1
while (i <= irows) {
  pri  <- dti$price[i]
  vol  <- dti$volume[i]
  volu <- pri * vol

  if (dto$open[j] == 0) {
    # new OHLCV bar
    dto$open[j] <- pri
    dto$high[j] <- pri
    dto$low[j]  <- pri
  } else {
    if (dto$high[j] < pri) dto$high[j] <- pri
    if (dto$low[j]  > pri) dto$low[j]  <- pri
  }
  dto$close[j] <- pri

  volc <- dto$volume[j] + volu - vol_range
  if (volc < 0) {
    dto$volume[j] <- dto$volume[j] + volu
  } else {
    dto$volume[j] <- vol_range
    j <- j + 1
    if (volc > 0) {
      dto$open[j]  <- pri
      dto$high[j]  <- pri
      dto$low[j]   <- pri
      dto$close[j] <- pri
      if (volc > vol_range) {
        dto$volume[j] <- vol_range
        k <- floor(volc / vol_range)
        if (k > 0) {
          dto[(j + 1):(j + k - 1)] <- dto[j]
          volc <- volc - vol_range * k
          j <- j + k
        }
      }
      dto$volume[j] <- volc
    }
  }
  i <- i + 1
}
This isn't quite right, but maybe it gives an indication of how to make this type of operation faster. Here's the data:
url <- "http://pastebin.com/raw.php?i=hsgacr2l"
dfi <- read.csv(url)
I calculate the product of price and volume, and the cumulative sum of that product. The calculation is vectorized and fast.
pv <- with(dfi, price * volume)
cpv <- cumsum(pv)
vol_range <- 100000
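As a minimal illustration (my own made-up numbers and a tiny bar size, not the real data), the cumulative sum is what tells you where the running USD value crosses another multiple of vol_range, i.e. roughly where a new bar should start:

# Toy example with invented prices/volumes and a small bar size,
# only to show how the cumulative USD value relates to bar boundaries
toy <- data.frame(price  = c(100, 101,  99, 102),
                  volume = c(  2,   1,   3,   1))
toy_pv    <- toy$price * toy$volume   # USD value of each trade
toy_cpv   <- cumsum(toy_pv)           # running USD total
toy_range <- 250                      # pretend each bar holds 250 USD
floor(toy_cpv / toy_range)            # number of full bars completed after each trade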
My strategy is to figure out how to group the data in a relatively efficient way. I did this by creating a logical vector that is TRUE whenever a new group starts (I think the actual calculation below is wrong, and there are edge cases where it fails; the strategy needs to be re-thought, but the notion is to minimize non-vectorized data modification).
grp <- logical(nrow(dfi))
i <- 1
repeat {
    grp[i] <- TRUE
    ## find the first index evaluating to 'TRUE'
    i <- which.max(cpv - (cpv[i] - pv[i]) > vol_range)
    ## prevent failures when, e.g., any(diff(cpv) > vol_range)
    if (i > 1L && grp[i] == TRUE)
        i <- i + 1L
    if (i == 1L)   # no TRUE values, so FALSE is the max, and element 1 is the first FALSE
        break
}
cumsum(grp) divides the data into the first, second, ... groups, and I add this to the data frame:
dfi$group <- cumsum(grp)
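As a sanity check (my addition, not part of the original answer), you can look at how much USD value lands in each group; most groups should come out near vol_range, and the outliers point at the edge cases mentioned above:

# USD value accumulated in each group; values far from vol_range
# indicate edge cases (e.g. a single trade larger than one bar)
grp_usd <- tapply(pv, dfi$group, sum)
summary(grp_usd)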
For the output, the basic strategy is to split price (etc.) by group and apply a function to each group. There are a number of ways to do this; tapply is not particularly efficient (data.table excels at these types of calculations, but doesn't provide a particular benefit at this point), but it is sufficient for data at this scale.
dfo <- with(dfi, {
    data.frame(
        open   = tapply(price, group, function(x) x[1]),
        high   = tapply(price, group, max),
        low    = tapply(price, group, min),
        close  = tapply(price, group, function(x) x[length(x)]),
        volume = tapply(volume, group, sum),
        pv     = tapply(price * volume, group, sum))
})
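Since data.table is mentioned above, here is a sketch of what the same per-group summary might look like in data.table syntax (untested, and it assumes dfi already has the group column added in the previous step):

# Possible data.table version of the same aggregation
library(data.table)
dtab <- as.data.table(dfi)
dto2 <- dtab[, .(open   = price[1],
                 high   = max(price),
                 low    = min(price),
                 close  = price[.N],
                 volume = sum(volume),
                 pv     = sum(price * volume)),
             by = group]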
This takes a fraction of a second on the 10,000-row sample data.
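If you want to check the timing on your own machine, one simple way is to wrap the aggregation in system.time(); this is just a sketch, and the numbers will depend on hardware and data size:

# Time the per-group aggregation (requires dfi$group from above)
system.time(
    dfo <- with(dfi, data.frame(
        open   = tapply(price, group, function(x) x[1]),
        high   = tapply(price, group, max),
        low    = tapply(price, group, min),
        close  = tapply(price, group, function(x) x[length(x)]),
        volume = tapply(volume, group, sum),
        pv     = tapply(price * volume, group, sum)))
)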