Side-by-side by Design: Pharma Data Handling with Merge, Join, Match, and Hash in R


November 04, 2025

Slide overview


Centered on the SAS language, I want to energize the community of analysis specialists and programmers.


Text of each slide
1.

2025-11-04 ver1.1 Side-by-side by Design: Pharma Data Handling with Merge, Join, Match, and Hash in R Yutaka Morioka Yuki Nakagawa EPS corporation [email protected]

2.

Yutaka Morioka Title: SAS Guru Organization: EPS Corporation A SAS programmer and clinical data scientist based in Japan. I actively disseminate SAS programming techniques from introductory to advanced levels and have presented at international and regional conferences including SAS Global Forum, CDISC Interchange Japan, Phuse Japan SDE, and PharmaSUG SDE. As an organizer of the SAS User’s Group in Japan, I also focus on fostering collaboration between SAS and R programmers, aiming to create synergy across both communities. Yuki Nakagawa Title: SAS Programmer / Biostatistician Organization: EPS Corporation Since joining the Statistical Analysis Department of EPS Corporation in 2019, Yuki Nakagawa has been engaged in clinical trial services, from SDTM to statistical analyses, as a SAS programmer and biostatistician. Copyright©EPS All rights reserved. 2

3.

Disclaimer and Disclosures • The opinions expressed are those of the authors and do not necessarily reflect the official views or positions of their organizations. • This presentation introduces various approaches to data joining, but it does not intend to determine or recommend a single “best” method. • The authors declare no conflicts of interest related to any of the methods, packages, or tools mentioned in this presentation. • In this presentation, we limit the discussion to merging two tables due to time constraints. The picture can change when merging three or more tables, but that will be for another occasion. Copyright©EPS All rights reserved. 3

4.

Introduction Copyright©EPS All rights reserved. 4

5.

Introduction In clinical trial data handling, data merging is arguably the most frequently used operation. Especially in the derivation of SDTM and ADaM datasets, programmers often perform left joins using uniquely identifying keys such as USUBJID, SUBJID, VISIT, or TESTCD. However, few programmers take the time to re-examine what “joining” really means. There are multiple ways to perform a join, each based on a different conceptual foundation. Moreover, the behavior of merge in SAS differs significantly from that in R, which often leads to confusion for programmers switching from SAS to R. Since misunderstanding how joins work can easily lead to unintended data errors, this presentation aims to take a deep and careful look at data joins from both a conceptual and practical perspective. Copyright©EPS All rights reserved. 5

6.

Base.R merge function Copyright©EPS All rights reserved. 6

7.

TEST data - Base.R merge

# DataFrame 11
df11 <- data.frame(
  id = c(1, 2, 3),
  value_A = c("A1", "A2", "A3"))
# DataFrame 12
df12 <- data.frame(
  id = c(2, 3, 4),
  value_B = c("B2", "B3", "B4"))

Copyright©EPS All rights reserved. 7

8.

TEST data (SAS dataset) - Base.R merge

data wk11;
  do id = 1, 2, 3;
    value_A = choosec(id, "A1", "A2", "A3");
    output;
  end;
run;
data wk12;
  do id = 2, 3, 4;
    value_B = choosec(id-1, "B2", "B3", "B4");
    output;
  end;
run;

Copyright©EPS All rights reserved. 8

9.

Inner Join - Base.R merge inner_joined <- merge(df11, df12, by = "id") Copyright©EPS All rights reserved. 9

10.

Full Outer Join - Base.R merge full_outer_joined <- merge(df11, df12, by = "id", all = TRUE) Copyright©EPS All rights reserved. 10

11.

Left Outer Join - Base.R merge left_joined <- merge(df11, df12, by = "id", all.x = TRUE) Copyright©EPS All rights reserved. 11

12.

Cross Join - Base.R merge cross_joined <- merge(df11, df12, by = character()) Copyright©EPS All rights reserved. 12
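For reference, the expected shape of these four base merge results for the df11 and df12 defined above:

# inner_joined:      2 rows (id 2, 3)       value_A and value_B both filled
# full_outer_joined: 4 rows (id 1, 2, 3, 4) NA where one side has no matching key
# left_joined:       3 rows (id 1, 2, 3)    value_B is NA for id 1
# cross_joined:      9 rows (3 x 3)         every df11 row paired with every df12 row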

13.

Base.R merge merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), incomparables = NULL, ...) Copyright©EPS All rights reserved. 13

14.

SAS Merge Copyright©EPS All rights reserved. 14

15.

SAS merge – Inner Join

proc sort data=wk11; by id; run;
proc sort data=wk12; by id; run;
data inner_join;
  merge wk11(in=in11) wk12(in=in22);
  by id;
  if in11 and in22;
run;

Copyright©EPS All rights reserved. 15

16.

SAS merge – Full Outer Join

proc sort data=wk11; by id; run;
proc sort data=wk12; by id; run;
data outer_join;
  merge wk11 wk12;
  by id;
run;

Copyright©EPS All rights reserved. 16

17.

SAS merge – Left Outer Join

proc sort data=wk11; by id; run;
proc sort data=wk12; by id; run;
data left_outer_join;
  merge wk11(in=in11) wk12(in=in22);
  by id;
  if in11;
run;

Copyright©EPS All rights reserved. 17

18.

SAS – Cross Join

data cross_join;
  set wk11;
  do i=1 to wk12obs;
    set wk12 nobs=wk12obs point=i;
    output;
  end;
run;

Copyright©EPS All rights reserved. 18

19.

Base.R merge vs. SAS merge

# R
df21 <- data.frame(id = c(1, 2, 3), score = c(90, 80, 70))
df22 <- data.frame(id = c(2, 3, 4), score = c(85, 75, 65))
merged <- merge(df21, df22, by = "id")

/* SAS */
data wk21;
  do id = 1, 2, 3;
    score = choosen(id, 90, 80, 70);
    output;
  end;
run;
data wk22;
  do id = 2, 3, 4;
    score = choosen(id-1, 85, 75, 65);
    output;
  end;
run;
proc sort data=wk21; by id; run;
proc sort data=wk22; by id; run;
data sas_merged;
  merge wk21(in=in11) wk22(in=in22);
  by id;
  if in11 and in22;
run;

Copyright©EPS All rights reserved. 19
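The expected results for this data: base R keeps both same-named columns with suffixes, while SAS keeps a single score column in which the later dataset overwrites the earlier one.

# merged (R):   id, score.x, score.y   -> 2: 80, 85 / 3: 70, 75   (both columns kept, suffixed .x / .y)
# sas_merged:   id, score              -> 2: 85     / 3: 75       (wk22's value overwrites wk21's)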

20.

Concepts of Internal Processing in SAS and R

(diagram: R merge function / SQL join vs. SAS sequential processing)

In SAS, “sequential processing” refers to the concept of implicitly looping through all records by reading one record, processing it, and then outputting one record, repeating this sequence for every record.

Copyright©EPS All rights reserved. 20

21.

Copyright©EPS All rights reserved. 21

22.

Copyright©EPS All rights reserved. 22

23.

dplyr https://dplyr.tidyverse.org/ Copyright©EPS All rights reserved. 23

24.

Test data – dplyr

library(dplyr)
df31 <- tibble(
  id = c(1, 2, 3),
  value_A = c("A1", "A2", "A3")
)
df32 <- tibble(
  id = c(2, 3, 4),
  value_B = c("B2", "B3", "B4")
)

Copyright©EPS All rights reserved. 24

25.

Join – dplyr

inner_join <- inner_join(df31, df32, by = "id")
full_join <- full_join(df31, df32, by = "id")
left_join <- left_join(df31, df32, by = "id")
cross_join <- cross_join(df31, df32)

# Cross Join (before dplyr v1.1.0)
cross_join2 <- df31 %>%
  mutate(dummy = 1) %>%
  inner_join(df32 %>% mutate(dummy = 1), by = "dummy") %>%
  select(-dummy)

Copyright©EPS All rights reserved. 25

26.
dplyr::full_join ≠ SAS merge

library(dplyr)
wk1 <- tibble(id = 1:3, val = c("A", "B", "C"))
wk2 <- tibble(id = 2:4, val = c("D", "E", "F"))
out1 <- full_join(wk1, wk2, by = "id")

data sas_out1;
  merge wk1 wk2;
  by id;
run;

out2 <- rows_upsert(wk1, wk2, by = "id")
In this case, `rows_upsert` behaves similarly to a SAS merge rather than a full join.

Copyright©EPS All rights reserved.

26
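For reference, the expected results for this data:

# out1 (full_join):   id, val.x, val.y -> 1 A NA / 2 B D / 3 C E / 4 NA F
# sas_out1 (merge):   id, val          -> 1 A / 2 D / 3 E / 4 F   (wk2 overwrites val where keys match)
# out2 (rows_upsert): id, val          -> 1 A / 2 D / 3 E / 4 F   (matching the SAS merge result here)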

27.

sassy https://sassy.r-sassy.org/index.html Copyright©EPS All rights reserved. 27

28.
sassy – merge
wk1 <- tibble(id = 1:3, val = c("A", "B", "C"))
wk3 <- tibble(id = 2:4, val = c("D", "E", "F"), val2 = c("D", "E", "F"))
out3 <- rows_upsert(wk1, wk3, by = "id")

library(sassy)
out4 <- datastep(
merge(wk1, wk3, by = "id", type = "full"),
{}
)

Even when using the sassy package to replicate SAS behavior in R, it does not
fully replicate the merge statement.
Copyright©EPS All rights reserved.

28

29.

“Perfect replication” is difficult because the SAS merge ... by ...; relies on a unique execution model: • PDV (Program Data Vector): values from the other dataset are retained and carried forward. • Sequential matching within BY groups: duplicates are matched in order, not as a SQL-style Cartesian join. • Overwriting of same-named variables: variables from later datasets overwrite earlier ones, while at the same time special flags like IN=, FIRST., and LAST. are available. Copyright©EPS All rights reserved. 29
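A minimal sketch, assuming the wk1/wk2 tibbles from slide 26 and covering only the simple one-to-one case (duplicate BY groups and FIRST./LAST. logic are not addressed): the SAS overwrite rule can be approximated by coalescing the right-hand value over the left-hand one after a full join.

library(dplyr)
# wk1 / wk2 as defined on slide 26 (id 1:3 and 2:4, one val column each)
sas_like <- full_join(wk1, wk2, by = "id", suffix = c(".x", ".y")) %>%
  mutate(val = coalesce(val.y, val.x)) %>%   # the later dataset overwrites where a match exists
  select(id, val) %>%
  arrange(id)
# id 1:4 with val A, D, E, F -- the same as the SAS merge output for this data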

30.

SQL Copyright©EPS All rights reserved. 30

31.

SQL – Join

/* Inner join */
select coalesce(wk11.id, wk12.id) as id, value_A, value_B
from wk11 inner join wk12
on wk11.id = wk12.id;

/* Full outer join */
select coalesce(wk11.id, wk12.id) as id, value_A, value_B
from wk11 full outer join wk12
on wk11.id = wk12.id;

/* Left outer join */
select coalesce(wk11.id, wk12.id) as id, value_A, value_B
from wk11 left outer join wk12
on wk11.id = wk12.id;

/* Cross join */
select coalesce(wk11.id, wk12.id) as id, value_A, value_B
from wk11 cross join wk12;

Copyright©EPS All rights reserved. 31

32.
Various Ways to Implement SQL in R
library(sqldf)
sqldf <- sqldf("SELECT a.id, a.value_A, b.value_B
FROM df11 a
INNER JOIN df12 b
ON a.id = b.id")

1. sqldf package: the most well-known method. Allows direct handling of data frames with SQL. Uses SQLite internally for processing.

library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "df11", df11)
dbWriteTable(con, "df12", df12)
dbi1 <- dbGetQuery(con, "SELECT a.id, a.value_A, b.value_B
FROM df11 a
INNER JOIN df12 b
ON a.id = b.id")

2. DBI + dplyr/dbplyr: combining R's database connection API (DBI) with dbplyr, an extension of dplyr, lets you convert dplyr code into SQL for execution on databases. Furthermore, DBI::dbGetQuery() allows you to write raw SQL directly (a minimal dbplyr sketch follows below).
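A minimal dbplyr sketch, assuming the SQLite connection con and the df11/df12 tables registered in the DBI example above (the object names tbl11, tbl12, lazy_join are illustrative): dplyr verbs applied to tbl() references are translated into SQL and run in the database.

library(dplyr)
library(dbplyr)
tbl11 <- tbl(con, "df11")
tbl12 <- tbl(con, "df12")
lazy_join <- inner_join(tbl11, tbl12, by = "id")   # not executed yet (lazy)
show_query(lazy_join)                              # inspect the generated SQL
dbi2 <- collect(lazy_join)                         # run the query and return a tibble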

library(duckdb)
con <- dbConnect(duckdb())
dbWriteTable(con, "df11", df11)
dbWriteTable(con, "df12", df12)
duckdb <- dbGetQuery(con, "SELECT a.id, a.value_A, b.value_B
FROM df11 a
INNER JOIN df12 b
ON a.id = b.id")

3. duckdb package: use DuckDB, an ultra-fast embedded DB engine, in R. Run SQL queries directly on data frames. It is becoming a staple for large-scale data processing.

Copyright©EPS All rights reserved.

32

33.

Base.R match Copyright©EPS All rights reserved. 33

34.

Base.R match

The match function compares two vectors and returns the positions (indices) of the first matches.

x <- c("B", "C", "A", "D")
table <- c("A", "B", "C")
match(x, table)

Copyright©EPS All rights reserved. 34
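The returned positions for this example, and the nomatch argument that controls what unmatched elements return:

match(x, table)               # 2 3 1 NA  ("D" is not in table)
match(x, table, nomatch = 0)  # 2 3 1 0
x %in% table                  # TRUE TRUE TRUE FALSE (equivalent to match(x, table, nomatch = 0) > 0)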

35.

Base.R match

df41 <- data.frame(
  ID = c(101, 102, 103, 104),
  Name = c("Alice", "Bob", "Carol", "Dave")
)
df42 <- data.frame(
  ID = c(103, 101, 105),
  Score = c(88, 85, 90)
)
match <- df41
match$Score <- df42$Score[ match(df41$ID, df42$ID) ]
# Score becomes 85, NA, 88, NA (IDs 102 and 104 have no match in df42)

In short, match() is a “position-return” style of join: a simplified version of an SQL join (limited to many-to-one left joins), useful for fetching values in a lookup-like manner.

Copyright©EPS All rights reserved. 35

36.

hash https://cran.r-project.org/web/packages/hash/index.html Copyright©EPS All rights reserved. 36

37.
hash
library(hash)
# Creating a Hash
h <- hash(keys = c("a", "b", "c"), values = c(1, 2, 3))

# Reference to a Value
h[["a"]] # 1
h[["b"]] # 2
# Add new elements
h[["d"]] <- 4
h[["z"]]


# Existence check with has.key()
has.key("a", h) # TRUE
has.key("x", h) # FALSE
# Safe lookup: return NA if not exist
k <- "x"
if (has.key(k, h)) {
print(h[[k]])
} else {
print(NA)
}
Copyright©EPS All rights reserved.

37

38.

hash

# Lookup Table (Country Code, Country Name)
lookup <- data.frame(
  code = c("JP", "US", "FR"),
  name = c("Japan", "United States", "France")
)
# Target Dataset
df <- data.frame(
  subject = 1:5,
  code = c("JP", "US", "CN", "FR", "JP")
)

Copyright©EPS All rights reserved. 38

39.

hash

library(hash)
# Create Hash
h <- hash(keys = lookup$code, values = lookup$name)
df$name <- vapply(df$code,
                  function(k) if (has.key(k, h)) h[[k]] else NA_character_,
                  FUN.VALUE = character(1))

Copyright©EPS All rights reserved. 39
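The expected result of this lookup: codes present in the lookup table are mapped to country names, while "CN" (not in the table) becomes NA.

# df$name: "Japan" "United States" NA "France" "Japan"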

40.

hash Summary of the R hash package • Fundamentally, it is a dictionary structure of “one key → one value.” • While the value can be a vector or a list, it is not natural to define multiple variables as keys or to manage multiple variables together as data. • Therefore, unlike the SAS hash object, it does not support anything like the defineKey / defineData methods with multiple variables. • Intuitively, its behavior is closer to a PROC FORMAT lookup table in SAS (mapping a code to a label). Copyright©EPS All rights reserved. 40

41.

data.table https://github.com/Rdatatable/data.table?tab=readme-ov-file Copyright©EPS All rights reserved. 41

42.

data.table: Key Features of data.table in R

High Performance
• Extremely fast operations even with tens of millions of rows.
• Optimized algorithms for joins and aggregation.
• Reference-based operations minimize memory consumption.

Concise Syntax ([i, j, by])
Core structure: DT[i, j, by]
• i: row filtering (similar to SQL WHERE)
• j: column operations or aggregations (similar to SQL SELECT)
• by: grouping (similar to SQL GROUP BY)
Enables SQL-like data manipulation in a very compact form.

Flexible and Fast Joins
• Key-based joins with on= are highly efficient.
• Supports equi-joins, non-equi joins, rolling joins, and multi-key joins.
• Can reproduce INNER, LEFT, and FULL OUTER JOIN behavior.

Copyright©EPS All rights reserved. 42

43.

data.table

library(data.table)
dt51 <- data.table(id = c(1,2,3), val1 = c("A","B","C"))
dt52 <- data.table(id = c(2,3,4), val2 = c(10,20,30))

Copyright©EPS All rights reserved. 43

44.

Inner & Left Outer Join - data.table

# Inner Join
dt_out1 <- dt51[dt52, on = .(id), nomatch = 0]
# Left Outer Join (dt51 on the left side as the reference)
dt_out2 <- dt52[dt51, on = .(id)]

Copyright©EPS All rights reserved. 44
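For the test data above, the expected results are (note that in X[i, on=], the i table plays the role of the left table):

# dt_out1 (inner):      id 2, 3      with val1 and val2 filled
# dt_out2 (left outer): id 1, 2, 3   all dt51 rows kept; val2 is NA for id 1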

45.

Full Outer & Cross Join - data.table

# Full Outer Join
dt_out3 <- merge(dt51, dt52, by = "id", all = TRUE)
data.table::merge.data.table(dt51, dt52, by = "id", all = TRUE)

# Cross Join (use copy() so that := does not add the key column to dt51/dt52 by reference)
dt51_ <- copy(dt51)[, key := 1]
dt52_ <- copy(dt52)[, key := 1]
dt_out4 <- dt51_[dt52_, on = .(key), allow.cartesian = TRUE][, key := NULL]

Copyright©EPS All rights reserved. 45

46.

Rolling Join - data.table

# Rolling Join
dt_query <- data.table(time = c(2,6,9))
dt_time <- data.table(time = c(1,5,10), value = c("a","b","c"))
# Join the most recent past value
dt_out5 <- dt_time[dt_query, on = .(time), roll = TRUE]

Copyright©EPS All rights reserved. 46
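The expected rolled result for the query times above (roll = TRUE carries the most recent past value forward):

# dt_out5: time 2 -> "a" (rolled from 1), time 6 -> "b" (rolled from 5), time 9 -> "b" (rolled from 5)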

47.

Rolling Join - data.table

A rolling join is an ordered join that fills in matches using the closest previous (or next) key value when an exact match is not found. For each value in the query table, it finds the closest matching key in the source table that does not exceed it (when roll = TRUE). When there are duplicate key values in data.table's rolling join, the row that is matched can be controlled with the mult argument: mult = "first" uses the first matching row, while mult = "last" uses the last one.

Option: Description
roll = TRUE: carry the most recent past value forward (LOCF)
roll = Inf: carry the most recent past value forward (LOCF)
roll = -Inf: carry the next future value backward (NOCB)
roll = "nearest": use whichever value is closest (past or future)

Copyright©EPS All rights reserved. 47

48.
Non-equi Join - data.table
# Non-equi Join (point-in-interval join)
dt_range <- data.table(start = c(1,5), end = c(3,10), label = c("low","high"))
dt_point <- data.table(x = 1:10)
# Combine elements where x is within the range start <= x <= end
dt_out6 <- dt_range[dt_point, on=.(start <= x, end >= x), nomatch=0]

Copyright©EPS All rights reserved.

48

49.

Non-equi Join - data.table

# Non-equi Join
# Get the nearest previous or next value (without rolling)
X <- data.table(time = c(1, 5, 10), value = c("a", "b", "c"))
Y <- data.table(time = c(2, 6, 9))
setkey(X, time)
# Nearest previous (X's time strictly less than Y's time); mult = "last" keeps the closest one
prev <- X[Y, on = .(time < time), mult = "last"]
# prev: time 2, 6, 9 with value "a", "b", "b"

Copyright©EPS All rights reserved. 49

50.
Non-equi Join - data.table
# Non-equi Join
# Matching within a +/- tolerance band
# Use non-equi conditions to find all matches inside [u - tol, u + tol],
# then pick the closest one (dist = abs(t - u)).
X <- data.table(id = 1, t = c(1, 5, 9, 12), val = c("A", "B", "C", "D"))
Y <- data.table(id = 1, u = c(2, 6, 11), tol = c(2, 1, 3))
setkey(X, id, t)
# Precompute join columns
Y[, lower := u - tol]
Y[, upper := u + tol]
# Match within the tolerance band
# (x.t refers to X's original t; in a non-equi join the bare join column t would
#  return the boundary values from Y instead)
hits <- X[Y, on = .(id, t >= lower, t <= upper), nomatch = 0,
          .(id, u, tol, t = x.t, val, dist = abs(x.t - u))]
# Keep only the closest match per (id, u)
nearest <- hits[order(id, u, dist)][, .SD[1], by = .(id, u)]

Copyright©EPS All rights reserved.

50

51.

data.table vs. dplyr

Aspect | dplyr::inner_join | data.table [i, on=]
Conceptual Model | SQL-like (declarative) | Hash / index-based (imperative)
Implementation Algorithm | Hash-based (implemented in C++), not sorted | Hash join or key-based binary search
Execution Environment | R + C++ (vctrs, dplyr) | C (datatable.c)
Key Features | Readability, strict type consistency, tidyverse integration | High speed, low memory usage, supports non-equi joins
Database Integration | Translated to SQL via dbplyr | In-memory only (no direct SQL translation)

Copyright©EPS All rights reserved. 51

52.

PharmaForest (SAS) https://pharmaforest.github.io/ Copyright©EPS All rights reserved. 52

53.

Clinical Trial Data Handling In clinical trial data programming, particularly in the creation of CDISC-compliant SDTM and ADaM datasets, left joins account for the vast majority of data merge operations. Moreover, in ADaM dataset derivations, it is extremely common to create flag variables (e.g., “Y/N”) indicating whether a given key variable such as USUBJID exists in another dataset. Copyright©EPS All rights reserved. 53

54.

PharmaForest https://github.com/PharmaForest Copyright©EPS All rights reserved. 54

55.

%kvlookup - sashash Package

data dm;
  length SUBJID SEX $20.;
  SUBJID="A001"; SEX="MALE";   AGE=14; output;
  SUBJID="A002"; SEX="FEMALE"; AGE=13; output;
  SUBJID="B001"; SEX="MALE";   AGE=13; output;
run;

data wk1;
  length SUBJID $20.;
  SUBJID="A001"; output;
  SUBJID="A002"; output;
  SUBJID="A003"; output;
  SUBJID="B001"; output;
run;

data out1;
  set wk1;
  %kvlookup(master=dm, key=SUBJID, var=SEX AGE);
run;

Copyright©EPS All rights reserved. 55

56.
Kvlookup_dt [R]
kvlookup_dt <- function(x,
                        master,
                        key,              # character vector: key column names
                        var = NULL,       # character vector: variable names to retrieve (NULL = all columns except keys)
                        wh = NULL,        # character scalar or expression: filter applied to master (e.g., quote(Age > 12) or "Age > 12")
                        warn = FALSE,     # whether to issue a warning when lookup keys are not found
                        overwrite = TRUE  # whether to overwrite existing columns in x if names overlap
){
  stopifnot(requireNamespace("data.table", quietly = TRUE))
  DTx <- data.table::as.data.table(x)
  DTm <- data.table::as.data.table(master)

  # --- Add a matching flag to master (used to check hits after join) ---
  hit_col <- "__kv_hit__"
  while (hit_col %in% names(DTm)) hit_col <- paste0(hit_col, "_")
  DTm[, (hit_col) := TRUE]

  # --- Apply WHERE filter (wh) to master ---
  if (!is.null(wh) && nzchar(as.character(wh))) {
    # Allow both character and expression inputs for evaluation
    if (is.character(wh)) {
      wh_expr <- parse(text = wh)[[1]]
    } else {
      wh_expr <- wh
    }
    DTm <- DTm[eval(wh_expr)]
  }

  # --- Determine columns to keep (key + var) ---
  if (is.null(var)) {
    # If var is not specified, use all columns in master except keys (and the temporary flag)
    var <- setdiff(names(DTm), c(key, hit_col))
  }
  keep_cols <- unique(c(key, var, hit_col))  # keep the flag so it survives into the join result
  miss_cols <- setdiff(keep_cols, names(DTm))
  if (length(miss_cols)) {
    stop(sprintf("The following columns do not exist in master: %s",
                 paste(miss_cols, collapse = ", ")))
  }
  DTm <- DTm[, ..keep_cols]

  # --- Set key for faster join ---
  data.table::setkeyv(DTm, key)

  # --- Perform join (X[i] form; keep i = x to preserve its order and row count) ---
  # master[x, on=key] keeps the number and order of i (=x) rows,
  # with X (=master) columns first
  RES <- DTm[DTx, on = key]

  # --- Extract target columns (var) and merge them into x ---
  # Handle name conflicts
  dup_in_x <- intersect(var, names(DTx))
  if (length(dup_in_x) && !overwrite) {
    stop(sprintf("The following columns already exist in x (overwrite=FALSE): %s",
                 paste(dup_in_x, collapse = ", ")))
  }
  # Add new columns or overwrite existing ones (similar to SAS DATA step behavior)
  for (vn in var) {
    DTx[, (vn) := RES[[vn]]]
  }

  # --- Issue warnings (warn=TRUE when keys are non-NA but not found) ---
  if (isTRUE(warn)) {
    # "matched" = whether the master-side flag is TRUE (vectorized; NA means no hit)
    matched <- RES[[hit_col]]
    matched[is.na(matched)] <- FALSE

    # "key_any_present" = at least one key is non-missing
    key_df <- RES[, ..key]
    key_any_present <- apply(!is.na(as.data.frame(key_df)), 1, any)
    not_found <- (!matched) & key_any_present
    if (any(not_found)) {
      # Log the missing key values
      msg_keys <- apply(as.data.frame(key_df[not_found]), 1, function(rw) {
        paste(sprintf("%s=%s", names(key_df), ifelse(is.na(rw), "NA", as.character(rw))),
              collapse = ", ")
      })
      warning(sprintf("kvlookup_dt: %d key(s) not found in master. Examples:\n%s",
                      sum(not_found),
                      paste(utils::head(msg_keys, 5), collapse = "\n")))
    }
  }

  # --- Remove the temporary working column ---
  RES[, (hit_col) := NULL]

  # The result is a data.table where the var columns are added to x
  return(DTx[])
}

Using data.table, I reproduced the behavior of the %kvlookup macro from SAS's PharmaForest sashash package, one of the most commonly used data-handling techniques in clinical trial programming, with equivalent functionality in R.
Copyright©EPS All rights reserved.

56

57.

kvlookup_dt [R]

dm <- data.table(SUBJID = c("A001","A002","B001"),
                 AGE = c(14,13,13),
                 SEX = c("MALE","FEMALE","FEMALE"))
wk1 <- data.table(SUBJID = c("A001","A002","A003","B001"))
out <- kvlookup_dt(wk1, dm, key = "SUBJID", var = c("AGE","SEX"))

Copyright©EPS All rights reserved. 57
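The expected result, given the kvlookup_dt function above: out keeps wk1's four rows and their order, with AGE and SEX looked up from dm and NA where the key is not found.

# out:
#   SUBJID   AGE   SEX
#   A001      14   MALE
#   A002      13   FEMALE
#   A003      NA   <NA>     (A003 is not in dm)
#   B001      13   FEMALE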

58.
Keycheck_dt [R]
keycheck_dt <- function(x,
                        master,
                        key,                     # character vector: key column names
                        wh = NULL,               # character or expression: filter applied to master (e.g., "AGE >= 15")
                        fl = "exist_fl",         # name of output flag column
                        cat = c("YN","NUM","Y")  # "YN"=Y/N, "NUM"=1/0, "Y"=Y/""
){
  stopifnot(requireNamespace("data.table", quietly = TRUE))
  cat <- match.arg(cat)
  DTx <- data.table::as.data.table(x)
  DTm <- data.table::as.data.table(master)

  # --- apply WHERE filter (wh) on master ---
  if (!is.null(wh) && nzchar(as.character(wh))) {
    wh_expr <- if (is.character(wh)) parse(text = wh)[[1]] else wh
    DTm <- DTm[eval(wh_expr)]
  }

  # --- keep only key columns; unique keys are enough for an existence check ---
  miss_cols <- setdiff(key, names(DTm))
  if (length(miss_cols)) {
    stop(sprintf("Columns not found in master: %s", paste(miss_cols, collapse = ", ")))
  }
  DTm <- unique(DTm[, ..key])

  # --- prepare master with a hit flag for the join result ---
  hit_col <- "__key_hit__"
  while (hit_col %in% names(DTm)) hit_col <- paste0(hit_col, "_")
  DTm[, (hit_col) := TRUE]

  # --- set keys for fast join ---
  data.table::setkeyv(DTm, key)

  # --- left-join-like existence check against x (preserve x's order/rows) ---
  RES <- DTm[DTx, on = key]

  # FIX: vectorized match flag
  matched <- RES[[hit_col]]        # logical vector: TRUE or NA
  matched[is.na(matched)] <- FALSE # NA -> FALSE

  # --- build the requested flag column on DTx ---
  if (cat == "YN") {
    DTx[, (fl) := ifelse(matched, "Y", "N")]
  } else if (cat == "NUM") {
    DTx[, (fl) := as.integer(matched)]   # 1 if exists, 0 otherwise
  } else if (cat == "Y") {
    DTx[, (fl) := ifelse(matched, "Y", "")]
  }

  invisible(DTx[])
}

It was also quite straightforward to reproduce the behavior of SAS's %keycheck() macro using data.table in R.

ae <- data.table(USUBJID = c("A001","A001","A003"),
                 AETERM = c("AE 1", "AE 2", "AE 1"))
dm <- data.table(USUBJID = c("A001","A002","A003","A004"))
out1 <- keycheck_dt(dm, ae, key="USUBJID", fl="AEFL", cat="YN")
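The expected flag for this example: subjects with at least one AE record are flagged Y, the others N.

# out1 (dm with AEFL added):
#   USUBJID  AEFL
#   A001     Y
#   A002     N
#   A003     Y
#   A004     N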

Copyright©EPS All rights reserved.

58

59.

Pharmaverse https://pharmaverse.org/ Copyright©EPS All rights reserved. 59

60.

Representative join-related functions in the {admiral} package Copyright©EPS All rights reserved. 60

61.
Representative join-related functions
in the {admiral} package
derive_vars_merged()
library(admiral)
adae <- tibble(USUBJID = c("A001","A001","A003"),
AETERM = c("AE 1", "AE 2", "AE 1"))

adsl <- tibble(
USUBJID = c("A001","A002","A003","A004"),
TRTSDT = as.IDate(c("2024-01-10","2024-01-12","2024-01-15","2024-01-18"))
)

adae_1 <- derive_vars_merged(
dataset = adae,
dataset_add = adsl,
by_vars = exprs(USUBJID),
new_vars = exprs(TRTSDT)
)
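The expected result: TRTSDT is brought onto each adae record by USUBJID (a left-join-style merge, so adae keeps its three rows).

# adae_1:
#   USUBJID  AETERM  TRTSDT
#   A001     AE 1    2024-01-10
#   A001     AE 2    2024-01-10
#   A003     AE 1    2024-01-15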

Copyright©EPS All rights reserved.

61

62.
Representative join-related functions in the {admiral} package
derive_var_merged_exist_flag()
adsl_1 <- derive_var_merged_exist_flag(
dataset = adsl,
dataset_add = adae,
by_vars = exprs(USUBJID),
condition = TRUE,
new_var = AEFL,
true_value = "Y",
false_value = "N" ,
missing_value= "N"
)
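The expected result: subjects with any record in adae are flagged "Y"; the rest receive "N" via missing_value.

# adsl_1:
#   USUBJID  TRTSDT      AEFL
#   A001     2024-01-10  Y
#   A002     2024-01-12  N
#   A003     2024-01-15  Y
#   A004     2024-01-18  N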

Copyright©EPS All rights reserved.

62

63.

PharmaForest {adamski} package In the current PharmaForest adamski package project, we are gradually working on reproducing the functionality of pharmaverse's admiral in SAS, inspired by the concepts and design of admiral. Copyright©EPS All rights reserved. 63

64.

sdtm.oak Copyright©EPS All rights reserved. 64

65.

Final Comparison and Summary Copyright©EPS All rights reserved. 65

66.

Summary Notes: Common but Subtle Differences in Join Behavior Across R Frameworks

Aspect | base::merge | dplyr::left_join | data.table X[i, on=] | hash package
Non-equi Join | Not supported | Not supported | Supported (on = .(a <= b, ...)) | Not applicable
Rolling / Nearest Join | Not supported | Not supported | Supported (roll = TRUE / mult) | Not applicable
Column Name Conflicts | Managed via suffixes | Appends .x / .y | Explicit reference with i. (overwrite using :=) | Not applicable
Many-to-Many Handling | Expands all combinations | Expands all combinations | Expands all combinations (requires unique() / aggregation if undesired) | Assumes one-to-one mapping
Performance on Large Data | △ Slow | △–○ Moderate (can leverage databases via dbplyr) | ◎ Excellent (in-memory optimized) | ○ Good (limited by single-key design)
Default Sorting | Yes (sort = TRUE) | No (preserves left table order) | Preserves order of i (right operand) | Not applicable

Copyright©EPS All rights reserved. 66

67.

Summary Notes • base::merge() is simple and reliable but can reorder rows and is memory-heavy. • dplyr::xxxx_join() emphasizes readability and consistency, suitable for tidy workflows and moderate datasets. • data.table joins are the most efficient and flexible, supporting non-equi and rolling logic, ideal for clinical data (e.g., SDTM/ADaM period merges). • The hash package provides ultra-fast one-to-one lookups, best for code mapping or format-like replacements. • In CDISC workflows, it is also recommended to use the built-in join functions provided by {admiral} or {sdtm.oak}, as they help standardize and simplify data merge management across ADaM and SDTM processes. Copyright©EPS All rights reserved. 67

68.

I’m still learning R, so there might be a few mistakes in my slides. I’ll post any updates or corrections on GitHub or LinkedIn later. If you have any questions or comments, please feel free to contact me there. [Mail] [email protected] [GitHub] https://github.com/Morioka-Yutaka https://github.com/Morioka-Yutaka/RinPharma/tree/main/2025 [LinkedIn] Yutaka Morioka (森岡 裕) Copyright©EPS All rights reserved. 68

69.

The kanji shown above is a motto of a professional shogi (Japanese chess) player, meaning “Always try a new move.” I’ve long been passionate about pioneering new techniques in SAS, and now I’m ready to do the same with R. Copyright©EPS All rights reserved. 69