r - How to efficiently merge these data.tables
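I want to create a data.table that lets me check for missing data. Missing data in this case does not mean there will be an NA, but that the entire row was left out. So I need to be able to see at which of the time dependent column values data is missing for which level from another column. It is also important to know whether there are a lot of missing values together or whether they are spread across the dataset.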
So I have a 6.000.000x5 data.table (call it tablea) containing the time dependent variable, an id for the level, and the value n which I would like to add to my final table.
I have another table (tableb) which is 207x2. It couples the ids for the factor to the columns in tablec.
tablec is 1.500.000x207. Each of the 207 columns corresponds to an id according to tableb, and the rows correspond to the time dependent variable in tablea.
These tables are large, and although I recently acquired extra RAM (now totalling 8GB), my computer keeps swapping tablec out. For each write it has to be called back, and it gets swapped out again afterwards. This swapping is what consumes all the time: about 1.6 seconds per row of tablea, and since tablea has 6.000.000 rows the operation would take more than 100 days running non stop.
Currently I am using a for-loop to loop over the rows of tablea. Doing no operation, this for-loop loops almost instantly. I made a one-line command that looks up the correct column and row number for tablec in tablea and tableb and writes the value from tablea to tablec. I broke this one-liner up to do a system.time analysis, and each step takes about 0 seconds except writing to the big tablec. This showed that writing the value to the table is the most time consuming step, and looking at my memory use I can see a huge chunk appearing whenever a write happens, which disappears as soon as it is finished.
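library(data.table)

# Dummy data: 200 observations, 100 ids, 50 time steps
tablea <- data.table("id"=round(runif(200, 1, 100)), "timecounter"=round(runif(200, 1, 50)), "n"=round(rnorm(200, 1, 0.5)))
# tableb maps each realid to a column number (id) in tablec
tableb <- data.table("id"=c(1:100), "realid"=c(100:1))
tsm <- matrix(0, ncol=nrow(tableb), nrow=50)
tablec <- as.data.table(tsm)
rm(tsm)

for (row in 1:nrow(tablea)) {
  tableccol <- tableb[realid==tablea[row,id],id]  # column of tablec for this id
  tablecrow <- (tablea[row,timecounter])          # row of tablec for this time step
  val <- tablea[row,n]
  tablec[tablecrow,tableccol] <- val              # this write is the slow step
}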
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of @arun I took some time to develop dummy data to test on. It is now included in the code given above. I did not include wanted results because the dummy data is random and the routine does work. It's the speed that is the problem.
I'm not entirely sure about the results, but give it a shot with the dplyr/tidyr packages, for they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)

tablec <- tablec %>% gather(tablec_id, value, 1:207)
This turns tablec from a 1,500,000x207 table into a long format 310,500,000x2 table with 'tablec_id' and 'value' columns.
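One thing to watch: in the question the row position of tablec stands for the time dependent variable, and a plain gather does not keep it. A minimal sketch on a tiny stand-in table, assuming the row number is the time counter (the names `wide`, `long` and the column `timecounter` here are only for illustration):

```r
library(dplyr)
library(tidyr)

# Tiny stand-in for tablec: 3 time steps (rows) x 2 ids (columns)
wide <- data.frame(V1 = c(0, 1, 0), V2 = c(2, 0, 0))

long <- wide %>%
  mutate(timecounter = row_number()) %>%    # keep the row position as a column
  gather(tablec_id, value, -timecounter)    # gather everything except timecounter

dim(long)  # 6 rows, 3 columns: timecounter, tablec_id, value
```

With the row position stored as a column, the long table can later be joined on both the id and the time counter.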
tabled <- tablea %>%
  left_join(tableb, c("levelid" = "tableb_id")) %>%
  left_join(tablec, c("tableb_value" = "tablec_id"))
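Since the goal in the question is to find rows that were left out entirely, it may be simpler to skip the wide table and list the missing combinations directly. The following is only a sketch using data.table instead of dplyr, on dummy data shaped like the question's (the names `full` and `missing_rows` are made up for this example): build every time step / id combination with CJ() and anti-join the observed rows.

```r
library(data.table)

set.seed(1)
# Dummy data in the shape of the question's tablea and tableb
tablea <- data.table(id          = sample(1:100, 200, replace = TRUE),
                     timecounter = sample(1:50, 200, replace = TRUE),
                     n           = round(rnorm(200, 1, 0.5)))
tableb <- data.table(id = 1:100, realid = 100:1)

# Every time step / id combination that should exist
full <- CJ(timecounter = 1:50, id = tableb$realid)

# Anti-join: keep only the combinations that never occur in tablea
missing_rows <- full[!tablea, on = c("timecounter", "id")]

# Number of missing time steps per id, to see where the gaps cluster
missing_rows[, .N, by = id]
```

Because `missing_rows` only holds the gaps, its size depends on how much is missing rather than on the full 1.500.000x207 grid being materialised as a wide table.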
These are a couple of packages I've been using of late, and they seem very efficient, but the data.table package is made specifically for managing large tables, so there could be useful functions there. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.
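On the data.table side, the loop in the question can probably be avoided altogether. As far as I know, `tablec[i, j] <- val` copies the table on every write, which would explain the memory spike (a 1.500.000x207 numeric table is about 310 million doubles, roughly 2.5 GB). A minimal sketch, assuming the same dummy data layout as in the question: look up all column numbers at once with match() and fill a plain matrix with a single vectorised assignment (the names `col_idx` and `m` are made up for this example).

```r
library(data.table)

set.seed(1)
tablea <- data.table(id          = round(runif(200, 1, 100)),
                     timecounter = round(runif(200, 1, 50)),
                     n           = round(rnorm(200, 1, 0.5)))
tableb <- data.table(id = 1:100, realid = 100:1)

# Column of tablec for every row of tablea, looked up in one go
col_idx <- tableb$id[match(tablea$id, tableb$realid)]

# One vectorised write: each row of the index matrix is a (row, column) pair
m <- matrix(0, nrow = 50, ncol = nrow(tableb))
m[cbind(tablea$timecounter, col_idx)] <- tablea$n

tablec <- as.data.table(m)
```

If the same (time step, id) pair occurs more than once, the last value wins, which matches what the loop does. If the writes have to stay inside a loop, data.table's `set()` or `:=` update by reference and avoid the per-write copy.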