others - r data.table 在GROUP BY条件中,最后一行的二进制值

我有这样的数据:


library(data.table)


id <- c("1232","1232","1232","4211","4211","4211")


conversion <- c(0,0,0,1,1,1)


DT <- data.table(id, conversion)



id date conversion


1232 2018-01-01 0


1232 2018-01-03 0


1232 2018-01-04 0


4211 2018-04-01 1


4211 2018-04-04 1


4211 2018-04-06 1



我想为每个组创建一个二进制值,该值仅基于id行的最后一行,只有当组的转换为1时,二进制文件才会1.


id date conversion lastconv


1232 2018-01-01 0 0


1232 2018-01-03 0 0 


1232 2018-01-04 0 0


4211 2018-04-01 1 0


4211 2018-04-04 1 0


4211 2018-04-06 1 1



我尝试使用data.table中"多重"参数的几个例子,但是,只返回了错误。


DT[unique(id), lastconv := 1, mult ="last"]



时间:

对每个组的最后一行进行筛选,并将lastconv设置为conversion


DT[DT[, .I[.N], by=id]$V1, lastconv := conversion]



然后用0替换NA


DT[is.na(lastconv), lastconv := 0L]



结果


DT


# id conversion lastconv


#1: 1232 0 0


#2: 1232 0 0


#3: 1232 0 0


#4: 4211 1 0


#5: 4211 1 0


#6: 4211 1 1



如果安装data.table v1.12.3,我们还可以使用新的函数setnafill来替换第二步中的NA


DT[DT[, .I[.N], by=id]$V1, lastconv := conversion]


setnafill(DT, cols ="lastconv", fill = 0L)



修改要在每个组的最后一行上加入的OP的代码:


DT[, v := 0]


DT[.(DT[conversion == 1, unique(id)]), on=.(id), mult="last", v := 1]



 id conversion v


1: 1232 0 0


2: 1232 0 0


3: 1232 0 0


4: 4211 1 0


5: 4211 1 0


6: 4211 1 1



这只是因为它根据所需条件选择要编辑的id。

对于每个id,检查行号是否为组中的最后一行数,如果'转换'为1,则将逻辑结果转换为整数。


DT[ , lastconv := as.integer(.I == .I[.N] & conversion == 1), by = id]



你试过类似下面这样的?


library(tidyverse)



final_conversion_dat <- DT %>% 


 group_by(id) %>% 


 mutate(date = as.Date(date),


 final_conversion = ifelse(date == max(date, na.rm = T) & conversion == 1, 1, 0))



时间供参考:


library(data.table)


#data.table 1.12.3 IN DEVELOPMENT built 2019-05-12 17:04:48 UTC; root using 4 threads (see ?getDTthreads). Latest news: r-datatable.com


set.seed(0L)


nid <- 3e6L


DT <- data.table(id=rep(1L:nid, each=3L))[,


 conversion := sample(c(0L,1L), 1L, replace=TRUE), by=.(id)]


DT0 <- copy(DT)


DT1 <- copy(DT)


DT2 <- copy(DT)


DT3 <- copy(DT)



mtd0 <- function() {


 DT0[DT0[, .I[.N], by=id]$V1, lastconv := conversion]


 DT0[is.na(lastconv), lastconv := 0L]


}



mtd1 <- function() {


 DT1[DT1[, .I[.N], by=id]$V1, lastconv := conversion]


 setnafill(DT1, cols ="lastconv", fill = 0L)


}



mtd2 <- function() {


 DT2[, v := 0]


 DT2[.(DT2[conversion == 1, unique(id)]), on=.(id), mult="last", v := 1]


}



mtd3 <- function() {


 DT3[ , lastconv := as.integer(.I == .I[.N] & conversion == 1), by = id]


}



library(microbenchmark)


microbenchmark(mtd0(), mtd1(), mtd2(), mtd3(), times=1L)



计时:


Unit: milliseconds


 expr min lq mean median uq max neval cld


 mtd0() 1363.1783 1416.1867 1468.9256 1469.1952 1521.7992 1574.4033 3 b 


 mtd1() 1349.5333 1365.4653 1378.9350 1381.3974 1393.6358 1405.8743 3 b 


 mtd2() 511.5615 515.4728 552.9133 519.3841 573.5892 627.7944 3 a 


 mtd3() 3966.8867 4009.1128 4048.9607 4051.3389 4089.9977 4128.6564 3 c



...