104 Views
September 25, 23
スライド概要
y-tech-ai論文読み会 Active Clean Nonaka
何をよんできたか? • Active Clean; Interactive Data Cleaning for Statistical Modeling • 学習データの質を向上させることで、モデルの精度を上げるアプローチ • なんでよんできたのか? • 先日発表した、Con dent Learningと同じような問題意識 • 教師データの間違いに加えて、特徴量の間違いを念頭に入れている。 fi • ベンチマークのデータセット + 実務上のデータでの検証
n Wang , Eugene Wu , Michael J. Franklin, Ken Goldberg † †† y, Simon Fraser University, Columbia University goldberg}@berkeley.edu [email protected] [email protected] † †† Data Cleaningの問題 • (a)すべてが汚い(dirty)なデータの場合 eaning some data, exdata based on the recess in the context of ingly popular form of ch allows for progreseling problems while Clean supports an imodels (e.g., linear reg those records likely an on five real-world DB, and Dollars For results show that our ccuracy by up-to 2.5x Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. The dirty examples are labeled 1-5 and the cleaned examples are labeled 1’-5’. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly issues.
n Wang , Eugene Wu , Michael J. Franklin, Ken Goldberg
†
††
y,
Simon Fraser University,
Columbia University
goldberg}@berkeley.edu [email protected] [email protected]
†
††
Data Cleaningの問題
• (b)一部のデータを正しいデータに修正した場合 -> Less Accurate!
eaning some data, exdata based on the recess in the context of
ingly popular form of
ch allows for progreseling problems while
Clean supports an imodels (e.g., linear reg those records likely
an on five real-world
DB, and Dollars For
results show that our
ccuracy by up-to 2.5x
Figure 1: (a) Systematic corruption in one variable can lead to
a shifted model. The dirty examples are labeled 1-5 and the
cleaned examples are labeled 1’-5’. (b) Mixed dirty and clean
data results in a less accurate model than no cleaning. (c) Small
samples of only clean data can result in similarly issues.
n Wang , Eugene Wu , Michael J. Franklin, Ken Goldberg † †† y, Simon Fraser University, Columbia University goldberg}@berkeley.edu [email protected] [email protected] † †† Data Cleaningの問題 • (c)それでは、汚いデータを取り除いたら? -> 同様の課題(less accurate) eaning some data, exdata based on the recess in the context of ingly popular form of ch allows for progreseling problems while Clean supports an imodels (e.g., linear reg those records likely an on five real-world DB, and Dollars For results show that our ccuracy by up-to 2.5x Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. The dirty examples are labeled 1-5 and the cleaned examples are labeled 1’-5’. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly issues.
Statistical Modeling ng † , Eugene Wu †† , Michael J. Franklin, Ken Goldberg Active Cleanのやりたいこと mon Fraser University, †† Columbia University g}@berkeley.edu [email protected] [email protected] ome data, exed on the rehe context of pular form of s for progresoblems while pports an im.g., linear reecords likely ve real-world d Dollars For show that our by up-to 2.5x a fixed cleanreturns more earning. • Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. The dirty examples are labeled 1-5 and the cleaned examples are labeled 1’-5’. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly issues. pre-processing step, and instead, repeatedly alternate between cleaning and analysis. It is common to use the preliminary analysis on dirty data as a guide to help identify potential errors and design repairs [14,20,22]. Unfortunately, for statistical models, iteratively cleaning some data and re-training on a partially clean dataset can lead to misleading results in even the simplest models. Consider a linear regression model on systematically translated data (Figure 1a). If one only cleans two of the data points, the intermediate result reveals a misleading trend (Figure 1b). This is a consequence of the well-known Simpson’s paradox where aggregates over different populations of data can result in spurious relationships [32]. Similarly, statistical models face more dramatic sampling effects than the traditional 1D sum, count, avg SQL aggregates (Figure 1c). The challenges with Simpson’s paradox and sampling are problematic because recent advances in SQL data cleaning, such as Sample-and-Clean [33] and Progressive Data Cleaning [5,28,36], actually advocate cleaning subsets of data to avoid the potentially expensive cleaning costs. Clearly, such data clean- • 汚そうなところをアドホックに直しても、正しいモデルが得られる保証なし e several imn, recommenn a survey of for advanced ure [1]. This mia, and there ncy of model an important r inconsistent g values, and leaning dirty important to rations in the • データを逐次的に修正して、再学習することで、大域最適なモデルを得るフレ ームワークを提案。(convex-lossなモデルに限り、つまり、NNは含まれず)
フレームワークの概要 • Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data but preserves convergence guarantees. Solid boxes are essential components for correct-
焦点を当てる2つの課題 • (i) Update Problem: 修正されたデータが与えられたあと、どのようにモデルを学習するか? • (ii) Prioritization Problem: できるだけ早く最適なモデルを得るために、どのデータを修正対象にするか? probability select a sam which each sampling di optimum ca • Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data but preserves convergence 4. UPD This sect
Update Problem • (i) Update Problem: 修正されたデータが与えられたあと、どのようにモデル を学習するか? • 問題の大前提: • 学習の目的関数が凸関数である • SVM, 線形回帰, ロジスティック回帰
clean data, which is not available, and ActiveClean will have to approximate the gradient from a sample of newly cleaned data. To derive a sample-based update rule, the most important property is that sums commute with derivatives and gradients. The convex loss class of models are sums of losses, so given the current 目的関数=凸関数のとき、幾何的に考察すると以下のupdate方法を得る。 best model ✓, the true gradient g ⇤ (✓) is: Update Problem • 4.4 T stud prec of d com that N 1 X (c) (c) ⇤ the g (✓) = r (✓) = r (xi , yi , ✓) N i=1 mor dien ActiveClean needs to estimate g ⇤ (✓) from a sample S, which is tiall すべての学習データがクリーンであるときの目的関数の勾配 drawn from the dirty data Rdirty . Therefore, the sum has two comestim ponentsThe which are the gradient from the already clean data g and C algorithm is as follows: O gS a gradient estimate from a sample of dirty data to be cleaned: Acti 1. Take| aRsample | RRdirty dirty| clean |of data S from g(✓) = · gC (✓)over + the sample· gofS (✓) (1) data (i.e. 2. Calculate the gradient newly clean |R| |R| (t) on t and call the result gS (✓ ) gC can be calculated by applying the gradient to all of the already 3. Calculate the average gradient over all of the already clean read cleaned records: records in Rclean = X R Rdirty , and call the result gC (✓(t) ) clea 1 (c) (c) 4. Apply the following update rule, which a weighted aver- can gC (✓) = r (xi , yi is, ✓) Rclean | i2R age of |the gradient onclean the already clean records and newly clea whe cleaned records: gS can be estimated from a sample by taking the gradient w.r.t Figure 3: (A) A model trained on dirty data can be thought tion | Rdirty | | Rclean | (t+1) (t) (t) (t) the average·gby their)+ respective sam✓ re-weighting ✓ ·( ·gC (✓ )) thro S (✓ of as a sub-optimal point w.r.t to the clean data. (B) The gra-each record, and |the R |gradient, the cleaning | R | funcpling probabilities. Before taking dient gives us the direction to move the suboptimal model to muc 5. Append the newly cleaned records to set of previously clean tion C(·) is applied to each sampled record. Therefore, let S be a approach the true optimum. faste records R = R [ S clean i 2 Sclean sample of data, where each is drawn with probability p(i): Set X 1 1 (c) (c) 4.4 gSAnalysis with Stochastic Descent lear (✓) = r (xi , yGradient i , ✓) clean data, which is not available, and ActiveClean will have to
tially with a rate proportional to the inaccuracy of the sam n from the dirty data Rdirty . Therefore, the sum has two comTo derive a sample-based update rule, the most important propestimate. nts which are the gradient from the already clean data g erty is that sums commute with derivatives and gradients. The conC and 1. Take a sample of data S from Rdirty One key difference with the tpyical application of S gradient estimate from a sample of dirty data to be cleaned: vex loss class of models are sums of losses, so given the current 2. Calculate the gradient over the sample of newly clean data ⇤ a full gradient step on the already ActiveClean takes best model ✓, the true gradient g (✓) is: | Rclean | | Rdirty(t)| (i.e., not sampled) and averages it with a stochastic gr result g(✓) = and call· the gC (✓) + gS (✓ )· gS (✓) (1) N X |R| |R| 1 (c) (c) ⇤ on the dirty data (i.e., sampled). This is because that 3. Calculate the average gradient over all of the already clean g (✓) = r (✓) = r (xi , yi , ✓) は、すべての学習データがクリーンでないと手に入らない n be calculated by applying the gradient to all of the already (t) we N ready clean and assume that the time-consuming step • g*(θ) records in Rclean = R Rdirty , and call the result gC (✓ ) i=1 ed records: cleaning and not the ⇤ numerical operations. The update ActiveClean needs to estimate g (✓) from a sample S, which is 4. Apply the following update rule, which is a weighted averX 1 (c) (c) can be thought of as a variant of SGD that lazily mater gC (✓) = -> r (x , yi already , ✓) drawn from the dirty data Rdirty . Therefore, the sum has two comi the age of the gradient on clean records and newly g*(θ) 手元にある一部クリーンなデータで を近似する必要性 • | Rclean | i2Rclean clean value. As data is sampled at each iteration, data ponents which are the gradient from the already clean data gC and cleaned records: when needed. It is well-known that even for an arbitrary g a gradient estimate from a sample of dirty data to be cleaned: an be estimated from a sample by taking the gradient Sw.r.t | Rdirty | | Rclean tion, |SGD makes (t+1) (t) (t) (t) significant progress in less than one ep |R | )) | Rdirty | record, and re-weighting the clean ✓ ✓ average ·( by their respective ·gS (✓ sam)+ g(✓)through ·g (✓ C the entire dataset) [8]. However, the dirty mo = · g (✓) + · g (✓) (1) C S | Rthe| cleaning func- | R | | R | probabilities. Before taking the gradient, |R| much more accurate than an arbitrary initialization leadi C(·) is applied to each sampled record. Therefore, let S a of 5. Append the newly cleaned records set clean gto can be previously calculated by applying the gradient to all of the already Cbe faster convergence. le of data, where each i R 2 clean S is drawn probability p(i): cleaned records: records = Rwith [ S clean Setting the step size : There is extensive literature i X X 1 (c) (c) 1 1 (c) (c) gC (✓) = , ✓) learning for choosing ther step(xsize ca gS (✓) = r (xi , yi , ✓) i , yi appropriately. | Rclean | i2R | S | p(i) 4.4 Analysis with Stochastic Gradient therDescent to be a constantclean or decayed over time. Many machin i2S frameworks (e.g., MLLib, Sci-kit Learn, Vowpal Wabbit g from a sample by taking the gradient w.r.t S can be estimated at each iteration t, the update becomes: The update algorithm can be formalized as a class of very well each record, and re-weighting the average by their respective samically set this value. In the experiments, we use a techn (t+1) (t) (t) ✓ algorithms ✓ called · g(✓ ) studied Stochastic Gradient Descent (SGD; more pling probabilities. thethere gradient, the cleaning0funcinverseBefore scalingtaking where is a parameter = 0.1, a precisely the mini-batch variant of SGD).tion In SGD, 0 Therefore, let S be a C(·) is random applied tosubsets each sampled record. iteration it decays to = . t |S|t Model Update Algorithm of data, where each iis2 S is drawn with probability p(i): of data are selected at each iteration andsample the average gradient Setting the batch size b: The batch size should be present an outline for one iteration of the update algorithm. 1 X computed for every batch. The basic condition for convergence is 1 (c) (c) Update Problem
Update Problem • 要するに何をやっているか?(t回目の修正時のスナップショット) 全データR irty data can be thought clean data. (B) The grathe suboptimal model to 汚れている可能性のあ The algorithm is as るデータRdirty follows: Rdirtyからサンプリング し、修正したバッチを data S from作成 R 1. Take a sample of dirty t時点目の学 2. Calculate the gradient over the sample of newly clean data 習モデルのパ (t) and call the result gS (✓ ) t ラメタθ 3. Calculate the average gradient over all of the already clean 綺麗なデータRclean (t) records in Rclean = R Rdirty , and call the result gC (✓ ) 4. Apply the following update rule, which is a weighted average of the gradient on the already clean records and newly cleaned records: | Rdirty | | Rclean | (t+1) (t) (t) (t) ✓ ✓ ·( ·gS (✓ )+ ·gC (✓ )) |R| |R|
Prioritization Problem • できるだけ早く最適なモデルを得るために、どのデータを修正対象にする か? => 3つの要素に分解する • Detector: 対象をみつける • Estimator: 対象にサンプル重みをつける • Sampler: 重みに応じサンプリングする
umFor verwell aset, 2 is were odel r of rob- nvex and to a deivedirty s, it d to ocal (e.g., an attribute must not be NULL), and select rows that violate those rules. On the other hand, repair is harder and often requires human involvement (e.g., imputing a value for the NULL attribute). Prioritization Problem : Detector 5.1 Detection Problem • Detectorの仕様 First, we describe the required properties of the dirty data de- tector. The detector returns two important aspects of a record: (1) record is dirty, and r (2) if it is dirty, on which attributes 入力 the : あるレコード •whether there are errors. The sampler can use (1) to select a subset of dirty records to sample at each batch and the estimator can use (2) to 出力:①そのレコードが汚いか綺麗か?②どこが汚いのか? •estimate the value of data cleaning based on other records with the same corruption. D EFINITION 1 (D ETECTOR ). Let r be a record in R. A detector is a function that returns a Boolean of whether the record is dirty and a set of attributes er that are dirty. D(r) = ({0, 1}, er ) From the set of attributes that are dirty, we can find the corresponding features that are dirty fr and labels that are dirty lr since we assume a one-to-one mapping between records and training examples. We will consider two types of detectors: exact rulebased detectors that detect integrity constraint or functional de-
Prioritization Problem : Detector • Detectorの実装 • ルールベース • e.g) ある企業のʼ業界ʼ属性が、業界のマスターに存在しないものになってい るぞ!(フリーテキストでアノテーションされたとして) • マルチクラスの分類機 • すでに修正されたレコードRcleanとその修正箇所Ucleanを学習データにして、 修正箇所を予測するようにモデルを学習する。 • Rdirtyにたいして、予測を行う。
ity of the “clean” class. 6. SELECTING RECORDS TO CLEAN estimator provides more accurate early estimates an tune than the strawman approach until a large amou cleaned (Section 7.4.6). The basic idea is for every feature i to maintain an a i before and after cleaning–estimated from the prev data. This avoids having to design a complex interna late the dirty and clean data, and is based on just estim means. This would work if the gradient was linear i and we could write this estimate as: The algorithm proposed in Section 4.3 will convege for any samPrioritization Problem : Estimator pling distribution where p(·) > 0 for all records, albeit different • distributions will have different convergence rates. The sampling Estimatorが答える問題 algorithm is designed to include records in each batch that are most valuable to the analyst’s model with a higher probability. • • (c) (c) (d) (d) (t) (t) R6.1 dirtyに対して、最適なサンプルリング確率をつけるにはどうすれば良いか? Optimal Sampling Problem r (xi , yi , ✓ ) ⇡ r (xi , yi , ✓ ) + A · Recall that the convergence rate of an SGD algorithm is bounded The problem is that the gradient r (·) can be a v 2 by which is the variance of the gradient. Intuitively, the variance function of the features that couple features together 最適なサンプリング確率 measures how accurately the gradient is estimated from a uniform even in the simple case of linear regression, the gra sample. Other sampling distributions, while preserving the same linear function in x: T expected value, may have a lower variance. Thus, the optimal samその確率でサンプリングしたときのバッチデータの目的関数の値が「きれいなデ r (x, y, ✓) = (✓ x y)x pling problem is defined as a search over sampling distributions to It is not possible to isolate the effect of a change ータ」の目的関数と最も近くなると期待できるとき。 find the minimum variance sampling distribution. on the gradient. Even if one of the features is corrup gradient components can be incorrect. D EFINITION 2 ( OPTIMAL S AMPLING P ROBLEM ). Given a set The way that we address this problem is to linea of candidate dirty data Rdirty , 8r 2 Rdirty find sampling probaent, where the matrix A is found by computing a firs bilities p(r) such that over all samples S of size k it minimizes the Series expansion of the gradient. If d is the dirty variance: the clean value, the Taylor series approximation for ⇤ 2 arg min E(kgS g k ) given as follows: p • • 0 It can be shown [37] that the optimal distribution over records in (c) (c) (t) f (c) = f (d) + f (d) · (d c) + ... In our case, the function f is the gradient r , and t 0
that over all samples S of size k it minimizes the Prioritization arg min E(kgS g k ) p ⇤ 2 Problem : Series expansion of the gradient. the clean value, the Taylor series a Estimator given as follows: 0 f (c) = f (d) + f ( In our case, the function f is the g 0 f (d) · (d c) is a linear function • 最適なサンプリングの条件を満たす重みづけ n [37] that the optimal distribution over records in (c) (c) (t) tional to: pi / kr (xi , yi , ✓ )k Intuitively, • tribution prioritizes records with higher gradients, (c) (c) (t) r (xi , yi , ✓ ) ⇡ (c) The (c) challenge is that r impact during optimization. (x , y ) クリーンなデータ についての目的関数の勾配に比例して重みづける • i i @ ptimal knowing clean (d) (d) natural distribution solution is todepends calculateonthis gradientthe with respect to r (xi , yi , + which is a chicken-and-egg problem: the opti@X ys,values–implicitly assuming that the corruption is not that • しかし, クリーンなデータがない!(これからクリーンにするから) (c) (c) stribution requires knowing (xi , yi ); however, @ (d) (d) + (x , y , ✓ (d) (d) (t) the values so they(xcan, be cleaned. i i pi that / kr y , ✓ )k @Y i i • lution is highly related to the Expected Gradient Length (d) (d) c that has手元にある汚いデータ been proposed before in Active Learning [31]. (x , y ) についての目的関数の勾配に比例して重みづける • i i er, there is additional structure to the data cleaning problem. analyst cleans more data, we can build a model for954 how
Prioritization Problem : Estimator • pi ∝ (d) (d) | | ∇ϕ(xi , yi , θ(t)) | | によってpiを決める方法 • Active Learningの分野では、Expected Gradient Length heuristic • 今回の問題設定では、t > 1時点目のイテレーションでは、すでに綺麗になっ たデータを使用することができる。 • e.g) (c) (c) (d) (d) 修正前の汚いデータ(xi , yi )と修正後の綺麗なデータ(xi , yi )の記録か ら、汚いデータを綺麗に変換するようにモデルを学習して、予測値の勾配に よって重みをつけるなどできる。
Cleanerは? • ルールベースもしくは人手によるアノテーションになる。
検証 • 1万件ぐらい修正すると、それぞれのデータセットで収束してる? Figure 4: (a) We plot the relative model error as a function of the number of records cleaned for each of the alternatives on the IMDB content tagging task. ActiveClean results in the fastest convergence. (b) Faster convergence translates into improved test accuracy. On a holdout set of 20%, ActiveClean is more accurate than the alternatives. (c) We evaluate ActiveClean on the Dollars For Docs fraud detection task and find similar results, where ActiveClean converges faster than the alternatives. (d) ActiveClean improves the false negative rate (detection error) of the classifier resulting in an improved fraud detector. MNIST Fuzzy MNIST Block Removal 5x5 20 12 including edge detection, projection, and raw image patch extraction [3,4]. The cleaning function C() involves replacing a poten-