286 Views
December 06, 24
スライド概要
#Japan.R 2024 のLT資料の英語翻訳バージョン
https://japanr.connpass.com/event/334154/
Data Scientist
Try using dbtplyr 2024/12/07 Japan.R 2024 (#JapanR) Takuto Kotsubo (@airspace_nobo) It has been automatically translated. Sorry if there are any inaccurate expressions.
Self Introduction Have been working as a data scientist for five years now ❏ Recently working in Analytics Engineering team to improve data modeling ❏ Participating in today's event as a staff member of the venue Hobbies ❏ Watching comedy, Futsal
Memories Past presentations at Tokyo. ❏ Use R from python @Tokyo. ❏ How to predict the wind direction of tomorrow @Tokyo. This is the second time since Tokyo.R #71 that I have given a presentation in front of Hadley wickham.
To the main topic. ❏ Introducing a package that incorporates dplyr notation into a modern data pipeline tool called dbt, which I have recently started using in my work. I think there are many people who have never seen dbt before, so I hope you can understand what it can do!
What we want to tell you today ❏ dbt SQL can be written in a dplyr-like way ❏ Combined with Dbt , preprocessing written in R may be replaced with SQL? ❏ dplyr is still amazing.There are endless possibilities.
Presentation Flow ❏ Introduction to data modeling ❏ Introduction to dbt ❏ What can dbt solve? ❏ What can dbt solve when writing SQL? ❏ What can be solved with macro ❏ What can be solved with dbtplyr ❏ Summary
What is Data Modeling? Data modeling is the process of designing and building logical structures to organize data and make it easier to analyze ❏ Improve the speed and reduce the effort of analysis and querying. Centralize management of intermediate processes Example of a not-so-good situation ❏ All users are individually performing pre-processing->totalization->visualization, etc. on information similar to raw data.
What is Data Modeling? Improved condition ❏ Pre-processing common to all users is centralized ❏ No need to follow secret sauce-like information when querying Use here raw data Preprocessi ng/Transfor mation Clean Tables
What is dbt? Code that combines SQL and Jinja, a template language used in Python ❏ Easy to implement data modeling and full documentation and visualization of the process ❏ There is a function called "macro", which allows for processing that would be difficult to perform using SQL alone.
What is dbt? ❏ ❏ When you create the query shown on the left as stg_orders.sql and run dbt, the appropriate DB and table are referenced from the configuration. From the query, you can visualize the dependency of the process "stg_orders was created by referencing raw_orders". select id as order_id, user_id as customer_id, order_date, status from {{ source('main', 'raw_orders') }} select id as order_id, user_id as customer_id, order_date, status from "tutorial"."main"."raw_orders"
What is dbt? Preprocessing and joins of other tables in the same PROJECT can make a nice relationship chart. ❏ ❏ Convert raw_payments to stg_payments in the same way Create a table named orders by joining stg_orders and stg_payments
What can dbt solve? With dbt, you can easily implement data modeling based on SQL. ❏ On the other hand, dbt has a feature called "macro" for processing that is difficult to do using SQL alone. ❏ dbtplyr is a package that allows you to use dplyr-like notation in SQL.
Concerns when writing SQL
❏ So it is necessary to write repeatedly when processing multiple
columns in a regular manner, since there is no function.
❏ Repetition also increases the possibility of typos (e.g., mistakenly
writing "num" as "nu").
select
avg(num_a) as avg_num_a,
avg(num_b) as avg_num_b,
avg(num_c) as avg_nu_c,
from
data
data |>
summarise(across(starts_with('num'),
~mean(.x, na.rm=TRUE),
.names = "avg_{.col}"))
What can be solved with macro ❏ When you want to aggregate the data horizontally for each payment_method, you need to write it repeatedly in SQL. with order_payments as ( select order_id, sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount, sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount, sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount, sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount, sum(amount) as total_amount
What can be solved with macro ❏ dbt's macro can be used to generalize for loop and processing {% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} with order_payments as ( select order_id, {% for payment_method in payment_methods -%} sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount, {% endfor -%} sum(amount) as total_amount
What can be solved with macro ❏ macro as a function and can be used in other queries (e.g. macro/util.sql) {% macro get_payment_methods() %} ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] {% endmacro %} {% macro calculate_method_amounts(payment_methods, amount_column='amount', payment_method_column='payment_method') %} {% for payment_method in payment_methods -%} sum(case when {{ payment_method_column }} = '{{ payment_method }}' then {{ amount_column }} else 0 end) as {{ payment_method }}_amount, {% endfor -%} {% endmacro %}
What dbtplyr can solve ❏ dbtplyr is a package designed to implement dplyr-like notation in dbt macro. ❏ A package developed by Emily Riederer , whose X profile also conveys her love of R "Three R's in my last name, but it's not enough #rstats for me!"
What dbtplyr can solve ❏ Use across to summarize recurring totals {% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set cols_num = dbtplyr.starts_with('num_', cols) %} select customer_id, {{ dbtplyr.across(cols_num, "sum({{var}}) as {{var}}_tot") }} from {{ ref('customers') }} group by 1 select customer_id, sum(num_total_orders) as num_total_orders_tot, sum(num_total_amount) as num_total_amount_tot from "tutorial"."main"."customers" group by 1
What dbtplyr can solve ❏ Use across to summarize repetitive conversions {% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set cols_date = dbtplyr.starts_with('date_', cols) %} select customer_id, {{ dbtplyr.across(cols_date, "cast({{var}} as timestamp) as {{var}}")}} from {{ ref('customers') }} select customer_id, cast(date_first_order as date) as date_first_order, cast(date_most_recent_order as date) as date_most_recent_order from "tutorial"."main"."customers"
What dbtplyr can solve
❏ Can be used in combination with for loop for where condition
{% set cols_date =
dbtplyr.starts_with('date_', cols) %}
...
where
{% for col in cols_date -%}
{{col}} >= cast('2018-02-01' as date)
or
{% endfor %}
FALSE
...
where
date_first_order >=
cast('2018-02-01' as date) or
date_most_recent_order >=
cast('2018-02-01' as date) or
FALSE
SUMMARY ❏ The use of Data Engineering (dbt) × Data Science (dplyr) was introduced. ❏ The process used in SQL is simple and convenient, so dplyr may become popular in the dbt world. ❏ I will write a bit more about it in dbt Advent Calendar 2024 including the dbt installation process!
bibliography dbtplyr: Bringing Column-Name Contracts from R to dbt posit::conf(2023) What, exactly, is dbt? Apply a function (or functions) across multiple columns
Appendix: Source Code ❏ https://github.com/takuto0831/japanr_dbtplyr
appendix: targets package ❏ Rが生産性を高める 〜データ分析ワークフロー効率化の実践 and others have discussed the data pipeline tool called targets. ❏ Data pipeline + visualization with ggplot and other outputs can be incorporated into the pipeline.