r - Max efficiency in removing duplicated rows in data frame

May 15, 2012

I have a large data frame: more than 6 million rows and 28 variables of mixed types (numeric, factor, character). I need to remove duplicated rows. However, the only way to identify actual duplicates is to run the check on a large character variable (approximately 1,000 to 2,000 characters in each observation). I use the standard duplicated() function, but I am not sure it is the most time-efficient solution.

Is there a function or package that does this job efficiently? Thank you in advance for your suggestions.

df <- structure(list(
  city = c("new york", "new york", "new york", "brussels", "london", "arlington"),
  prodcategory = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "4", class = "factor"),
  date = structure(c(16351, 16352, 16351, 16353, 16354, 16355), class = "Date"),
  userid = c("abcd", "xyzz", "abcd", "abcd", "sdfg", "wedgd"),
  review = c(
    "in opinion 1 of best pastrami or corned beef sandwiches places in ny (an more). way each sandwich feed whole family days... establishment situated close theatre district , time square. delight see turkey sandwich arrive. wow massive , delicious. ..the celebrity photos awesome ..highly recommend place true taste treat",
    "this not usual half-red-lobster place. full experience of super top quality sea food amazingly convenient price basic sandwiches fine cuisine each plate joy.",
    "in opinion 1 of best pastrami or corned beef sandwiches places in ny (an more). way each sandwich feed whole family days... establishment situated close theatre district , time square. delight see turkey sandwich arrive. wow massive , delicious. ..the celebrity photos awesome ..highly recommend place true taste treat",
    "each time go brussels stop typical brasserie located in historical heart of brussels downtown @ walking distance every interesting place. food great , menu rich , diversified service sharp , fast , pricing reasonable. dont miss typical chocolate cake. should write dont miss... included rich list of belgian beers",
    "that call great uk pub food --simple tasty not fat/heavy/greasy (... ok not healthy though) presented service efficient , overall atmosphere deserves stop",
    "are fan of house of cards ? have not missed amazing bbq place frank underwood loves go. looks rocklands right you. different atmosphere same kind of yummy meat"
  )),
  .Names = c("city", "prodcategory", "date", "userid", "review"),
  row.names = c(NA, -6L), class = "data.frame")

Try data.table:

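library(data.table)
# Convert df to a data.table by reference and sort it by the review column
setkey(setDT(df), review)
# Keep the first row for each distinct review; by= is given explicitly because
# recent data.table versions compare all columns by default, not just the key
res <- unique(df, by = "review")
dim(res)
#[1] 5 5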
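For comparison, here is a minimal base R sketch of the same idea: since the question says the long character column alone identifies a duplicate, duplicated() can be run on that one vector instead of on the whole data frame. The toy data frame below is a hypothetical stand-in for the real data (only the column names userid and review come from the sample above), and no claim is made about how it would time against data.table on 6 million rows.

```r
# Hypothetical toy data; the real data frame has 28 columns and ~6 million rows.
df <- data.frame(
  userid = c("abcd", "xyzz", "abcd", "sdfg"),
  review = c("great pastrami", "fine sea food", "great pastrami", "tasty pub food"),
  stringsAsFactors = FALSE
)

# duplicated() on the single character vector flags every repeat of a review
# after its first occurrence; negating it keeps only first occurrences.
res <- df[!duplicated(df$review), ]

nrow(res)
# [1] 3
```

Checking one column keeps the comparison limited to the variable that actually defines a duplicate, which is the same restriction the by = "review" argument expresses in data.table.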
