[DL Reading Group] Conditional Neural Processes


July 27, 2018

Slide overview

2018/07/27
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each page
1.

DEEP LEARNING JP [DL Papers]
Conditional Neural Processes (ICML 2018)
Neural Processes (ICML 2018 Workshop)
Kazuki Fujikawa, DeNA
http://deeplearning.jp/

2.

Papers covered:
• Conditional Neural Processes. Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, S. M. Ali Eslami (DeepMind). ICML 2018.
• Neural Processes. Garnelo et al. (DeepMind). ICML 2018 Workshop.

3.


4.


5.


6.


7.


8.

Problem setting: regression
• Given inputs X = (x_1, …, x_n) and corresponding outputs Y = (y_1, …, y_n), predict the output y_{n+1} at a new input x_{n+1}.
• The quantity of interest is the predictive distribution p(y_{n+1} | Y) = p(y_{n+1}, Y) / p(Y).

9.

Gaussian processes: weight-space view
• Model each output as a linear combination of basis functions: y_i = w^⊤ φ(x_i).
• Place a Gaussian prior on the weights: p(w) = N(w | 0, σ_w² I).
• Stacking the outputs Y = (y_1, …, y_n) gives Y = Φ w, where Φ is the n × D design matrix with entries Φ_{id} = φ_d(x_i).

10.

Gaussian processes: function-space view
• Under the Gaussian prior on w, the output vector Y = (y_1, …, y_n) is itself Gaussian: p(Y) = N(Y | 0, K).
• Mean: E[Y] = Φ E[w] = 0.
• Covariance: cov(Y) = E[Y Y^⊤] = σ_w² Φ Φ^⊤ = K, with entries K_{ij} = k(x_i, x_j) = σ_w² φ(x_i)^⊤ φ(x_j).

11.
Gaussian process regression: observation model
• Observations are noisy function values: y_i = f_i + ε_i, with ε_i ~ N(0, σ²).
• For Y = (y_1, …, y_n) the likelihood is p(Y | F) = N(Y | F, σ² I).
• Marginalising the latent function values gives
  p(Y) = ∫ p(Y | F) p(F) dF = N(Y | 0, C_N),
  where the covariance matrix has entries C_N[i, j] = k(x_i, x_j) + σ² δ_{ij}.

12.
Gaussian process regression: prediction
• To predict y_{n+1}, form the joint distribution with the observed outputs Y:
  p(y_{n+1}, Y) = N(0, C_{N+1}),   C_{N+1} = [[C_N, k], [k^⊤, c]],
  where k = (k(x_1, x_{n+1}), …, k(x_n, x_{n+1}))^⊤ and c = k(x_{n+1}, x_{n+1}) + σ².
• Conditioning the joint Gaussian on Y (the ratio p(y_{n+1}, Y) / p(Y)) gives the predictive distribution
  p(y_{n+1} | Y) = N(y_{n+1} | k^⊤ C_N^{-1} Y,  c − k^⊤ C_N^{-1} k).
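As a concrete illustration of these predictive equations, here is a minimal NumPy sketch of GP regression. The RBF kernel and the hyperparameter values are assumptions for the example, not something specified on the slide.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) evaluated for all pairs of 1-D inputs."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predict(x_obs, y_obs, x_new, noise=0.1):
    """Predictive mean and variance of y_{n+1} given Y, as on the slide:
    mean = k^T C_N^{-1} Y,  var = c - k^T C_N^{-1} k."""
    C_N = rbf_kernel(x_obs, x_obs) + noise**2 * np.eye(len(x_obs))
    k = rbf_kernel(x_obs, x_new)                      # cross-covariances
    c = rbf_kernel(x_new, x_new) + noise**2 * np.eye(len(x_new))
    solve = np.linalg.solve(C_N, k)                   # C_N^{-1} k
    mean = solve.T @ y_obs
    cov = c - k.T @ solve
    return mean, np.diag(cov)

# toy usage on a handful of noisy sine observations
x_obs = np.array([-2.0, -1.0, 0.5, 1.5])
y_obs = np.sin(x_obs)
x_new = np.linspace(-3, 3, 5)
mu, var = gp_predict(x_obs, y_obs, x_new)
```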

13.

Properties of GP regression
• Predictions come with calibrated uncertainty, and prior knowledge is encoded through the kernel.
• Drawbacks: it can be hard to design an appropriate kernel (prior) by hand, and exact inference is expensive; making m predictions from n observations scales as O((n + m)³).

14.
Related: Deep Kernel Learning (Wilson et al., AISTATS 2016)
• Starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, the inputs (predictors) are transformed through a deep architecture g(·, w):
  k(x_i, x_j | θ) → k(g(x_i, w), g(x_j, w) | θ, w)   (5)
  A spectral mixture base kernel (Wilson and Adams, 2013) can be used for added flexibility, and the method builds on recent KISS-GP approximations (Wilson and Nickisch, 2015) for scalability.
• All parameters γ = {w, θ} are learned jointly through the log marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparameters:
  log p(y | γ, X) ∝ −[ y^⊤ (K_γ + σ² I)^{-1} y + log |K_γ + σ² I| ]   (4)
  which pleasingly separates into automatically calibrated model-fit and complexity terms (Rasmussen and Ghahramani, 2001).
• For kernel learning, the chain rule gives the derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters:
  ∂L/∂θ = (∂L/∂K_γ)(∂K_γ/∂θ),   ∂L/∂w = (∂L/∂K_γ)(∂K_γ/∂g(x, w))(∂g(x, w)/∂w)   (6)
  with the implicit derivative
  ∂L/∂K_γ = ½ (K_γ^{-1} y y^⊤ K_γ^{-1} − K_γ^{-1}),   (7)
  where the noise covariance σ² I has been absorbed into the covariance matrix and treated as part of the base kernel hyperparameters θ.
• The computational bottleneck for inference is solving the linear system (K_{X,X} + σ² I)^{-1} y and computing the log determinant log |K_{X,X} + σ² I|; the standard approach is a Cholesky decomposition of the n × n matrix K_{X,X}.
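As a rough sketch of the idea (not the authors' implementation), the code below evaluates the log marginal likelihood of Eq. (4) with an RBF kernel applied to deep features g(x, w) from a small tanh MLP. The network sizes and hyperparameters are invented for illustration, and in practice the gradients of Eqs. (6)-(7) would come from an autodiff framework.

```python
import numpy as np

def mlp_features(X, W1, b1, W2, b2):
    """Deep feature map g(x, w): a two-layer tanh MLP (illustrative)."""
    h = np.tanh(X @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def rbf(A, B, lengthscale=1.0):
    """RBF base kernel on (possibly transformed) inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def deep_kernel_log_marginal(X, y, params, noise=0.1, lengthscale=1.0):
    """log p(y | gamma, X) up to a constant:
    -0.5 * [ y^T (K + s^2 I)^{-1} y + log |K + s^2 I| ],
    with K built on the transformed inputs g(X, w)."""
    Z = mlp_features(X, *params)
    K = rbf(Z, Z, lengthscale) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                  # standard Cholesky approach
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (y @ alpha + logdet)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)); y = np.sin(X).sum(1)
params = (rng.normal(size=(3, 8)), np.zeros(8),
          rng.normal(size=(8, 2)), np.zeros(2))
print(deep_kernel_log_marginal(X, y, params))
```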

15.
Background: Deep Neural Networks as Gaussian Processes (Lee et al., ICLR 2018)
• Review of the correspondence between single-hidden-layer networks and GPs (Neal 1994; Williams 1997). The i-th component of the network output is
  z_i^1(x) = b_i^1 + Σ_{j=1}^{N_1} W_{ij}^1 x_j^1(x),   x_j^1(x) = φ(b_j^0 + Σ_{k=1}^{d_in} W_{jk}^0 x_k)   (1)
  where the weight and bias parameters are independent, randomly drawn with zero mean and variances σ_w²/N_1 and σ_b², respectively, and GP(μ, K) denotes a Gaussian process with mean and covariance functions μ(·), K(·,·).
• Because z_i^1(x) is a sum of i.i.d. terms, the Central Limit Theorem gives, in the limit of infinite width N_1 → ∞, that z_i^1 ~ GP(0, K^1) with μ^1(x) = E[z_i^1(x)] = 0 and
  K^1(x, x') ≡ E[z_i^1(x) z_i^1(x')] = σ_b² + σ_w² E[x_i^1(x) x_i^1(x')] ≡ σ_b² + σ_w² C(x, x')   (2)
  where C(x, x') is obtained by integrating against the distribution of W^0, b^0. Note that any two z_i, z_j for i ≠ j are joint Gaussian with zero covariance, so they are guaranteed to be independent despite utilizing the same features produced by the hidden layer.
• The argument extends to deeper layers by induction, taking the hidden-layer widths to infinity in succession (N_1 → ∞, N_2 → ∞, …), which guarantees that the input to the layer under consideration is already governed by a GP. After l − 1 steps the network computes
  z_i^l(x) = b_i^l + Σ_{j=1}^{N_l} W_{ij}^l x_j^l(x),   x_j^l(x) = φ(z_j^{l-1}(x))   (3)
  and as N_l → ∞, z_i^l ~ GP(0, K^l) with
  K^l(x, x') ≡ E[z_i^l(x) z_i^l(x')] = σ_b² + σ_w² E_{z_i^{l-1} ~ GP(0, K^{l-1})}[ φ(z_i^{l-1}(x)) φ(z_i^{l-1}(x')) ].   (4)
• The expectation in Eq. (4) only involves the joint distribution of z_i^{l-1}(x) and z_i^{l-1}(x'), a zero-mean two-dimensional Gaussian whose covariance matrix has distinct entries K^{l-1}(x, x'), K^{l-1}(x, x) and K^{l-1}(x', x'). Introducing this shorthand, the covariance can be written as a deterministic recursion
  K^l(x, x') = σ_b² + σ_w² F_φ( K^{l-1}(x, x'), K^{l-1}(x, x), K^{l-1}(x', x') )   (5)
  whose form depends only on the nonlinearity φ; this gives an iterative series of computations. The result does not depend on the order of limits in the case of a Gaussian prior on the weights, and a concurrent work (Anonymous, 2018) further derives the convergence rate towards a GP when all layers are taken to infinite width simultaneously.
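The recursion in Eq. (5) is easy to compute once F_φ is known in closed form. The sketch below assumes a ReLU nonlinearity, for which the expectation in Eq. (4) is the degree-1 arc-cosine kernel of Cho & Saul; the depth and the variances σ_w², σ_b² are illustrative values, not taken from the paper.

```python
import numpy as np

def relu_F(k12, k11, k22):
    """Closed-form E[phi(u) phi(v)] for phi = ReLU and
    (u, v) ~ N(0, [[k11, k12], [k12, k22]]) (arc-cosine kernel, degree 1)."""
    norm = np.sqrt(k11 * k22)
    cos_t = np.clip(k12 / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def deep_gp_kernel(x, xp, depth=3, sigma_w2=1.6, sigma_b2=0.1):
    """Recursively compute K^l(x, x') via Eq. (5) for a ReLU network."""
    d_in = len(x)
    # base case: affine kernel on the raw inputs
    k12 = sigma_b2 + sigma_w2 * np.dot(x, xp) / d_in
    k11 = sigma_b2 + sigma_w2 * np.dot(x, x) / d_in
    k22 = sigma_b2 + sigma_w2 * np.dot(xp, xp) / d_in
    for _ in range(depth):
        k12, k11, k22 = (sigma_b2 + sigma_w2 * relu_F(k12, k11, k22),
                         sigma_b2 + sigma_w2 * relu_F(k11, k11, k11),
                         sigma_b2 + sigma_w2 * relu_F(k22, k22, k22))
    return k12

x, xp = np.ones(4), np.array([1.0, -1.0, 0.5, 0.0])
print(deep_gp_kernel(x, xp))
```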

16.

Trends: Learning to Learn (losses)

17.

Trends: Learning to Learn / Meta-Learning
• Model based: Santoro et al. '16; Duan et al. '17; Wang et al. '17; Munkhdalai & Yu '17; Mishra et al. '17
• Metric based: Koch '15; Vinyals et al. '16; Snell et al. '17; Shyam et al. '17
• Optimization based: Schmidhuber '87, '92; Bengio et al. '90, '92; Hochreiter et al. '01; Li & Malik '16; Andrychowicz et al. '16; Ravi & Larochelle '17; Finn et al. '17

18.

Model-based meta-learning (Oriol Vinyals, NIPS 2017)

19.

Metric-based meta-learning: Matching Networks (Vinyals et al., NIPS 2016)
• A query x̂ is classified by attending over the labelled support set: ŷ = Σ_i a(x̂, x_i) y_i, where the attention kernel a(·,·) compares the embedded query f_θ(x̂) with the embedded support examples g_θ(x_i).
• The embeddings f_θ and g_θ are trained end-to-end for the few-shot task (figure from Oriol Vinyals, NIPS 2017); a minimal sketch of the prediction rule follows below.
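A minimal sketch of that prediction rule, with an identity embedding and cosine-similarity attention standing in for the learned encoders f_θ and g_θ of the actual Matching Networks:

```python
import numpy as np

def attention(query, support):
    """Softmax over cosine similarities a(x_hat, x_i)."""
    q = query / np.linalg.norm(query)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    logits = s @ q
    w = np.exp(logits - logits.max())
    return w / w.sum()

def matching_net_predict(query, support_x, support_y_onehot):
    """y_hat = sum_i a(x_hat, x_i) y_i over the support set."""
    a = attention(query, support_x)
    return a @ support_y_onehot        # class probabilities

support_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
support_y = np.eye(2)[[0, 0, 1]]       # two classes, one-hot labels
print(matching_net_predict(np.array([0.8, 0.2]), support_x, support_y))
```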

20.

Optimization-based meta-learning / summing up (Oriol Vinyals, NIPS 2017)
• Examples of optimization-based meta-learning: Finn et al. '17 (MAML), Ravi & Larochelle '17.
• Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al., 2017) optimizes for a representation θ that can quickly adapt to new tasks and makes no assumption on the form of the model. A first-order code sketch follows below.

Algorithm 1: Model-Agnostic Meta-Learning
Require: p(T): distribution over tasks
Require: α, β: step size hyperparameters
1: randomly initialize θ
2: while not done do
3:   Sample batch of tasks T_i ~ p(T)
4:   for all T_i do
5:     Evaluate ∇_θ L_{T_i}(f_θ) with respect to K examples
6:     Compute adapted parameters with gradient descent: θ_i' = θ − α ∇_θ L_{T_i}(f_θ)
7:   end for
8:   Update θ ← θ − β ∇_θ Σ_{T_i ~ p(T)} L_{T_i}(f_{θ_i'})
9: end while
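A small sketch of Algorithm 1 in its first-order form (FOMAML), on toy 1-D linear-regression tasks so the gradients stay analytic; the task distribution, step sizes and model f_θ(x) = θx are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_slope():
    """A task T_i: regress y = a * x with a task-specific slope a."""
    return rng.uniform(-2, 2)

def make_data(a, n=10):
    x = rng.uniform(-1, 1, size=n)
    return x, a * x

def grad(theta, x, y):
    """Gradient of the squared loss L_T(f_theta) for f_theta(x) = theta * x."""
    return np.mean(2.0 * (theta * x - y) * x)

theta, alpha, beta = 0.0, 0.1, 0.01
for _ in range(1000):                              # while not done
    meta_grad = 0.0
    for _ in range(5):                             # sample batch of tasks T_i ~ p(T)
        a = sample_slope()
        x_s, y_s = make_data(a)                    # K examples for adaptation
        x_q, y_q = make_data(a)                    # held-out examples for the meta-update
        theta_i = theta - alpha * grad(theta, x_s, y_s)   # inner step: theta_i' = theta - alpha * grad
        meta_grad += grad(theta_i, x_q, y_q)       # first-order approximation of the meta-gradient
    theta -= beta * meta_grad                      # outer update of the shared initialization
```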

21.


22.

Conditional Neural Processes
Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, S. M. Ali Eslami
arXiv:1807.01613v1 [cs.LG] 4 Jul 2018

Abstract: Deep neural networks excel at function approximation, yet they are typically trained from scratch for each new function. On the other hand, Bayesian methods, such as Gaussian Processes (GPs), exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors. In this paper we propose a family of neural models, Conditional Neural Processes (CNPs), that combine the benefits of both. CNPs are inspired by the flexibility of stochastic processes such as GPs, but are structured as neural networks and trained via gradient descent. CNPs make accurate predictions after observing only a handful of training data points, yet scale to complex functions and large datasets. We demonstrate the performance and versatility of the approach on a range of canonical machine learning tasks, including regression, classification and image completion.

1. Introduction: Deep neural networks have enjoyed remarkable success in recent years, but they require large datasets for effective training (Lake et al., 2017; Garnelo et al., 2016).

Figure 1. Conditional Neural Process. a) Data description b) Training regime of conventional supervised deep learning models c) Our model.

23.
CNP: model (Garnelo et al., 2018, Section 2)
• Observations O = {(x_i, y_i)}_{i=1}^{n}; targets T = {x_i}, the inputs whose function values are to be predicted.
• Given a set of observations O, a CNP is a conditional stochastic process Q_θ that defines distributions over f(x) for inputs x ∈ T; θ is the real vector of all parameters defining Q.
• Inheriting from the properties of stochastic processes, Q_θ is assumed invariant to permutations of O and T. Permutation invariance with respect to T is enforced by assuming a factored structure: Q_θ(f(T) | O, T) = Π_{x∈T} Q_θ(f(x) | O, x). In the absence of assumptions on the output space Y this is the easiest way to ensure a valid stochastic process; the framework can be extended to non-factored distributions, and such a model is considered in the experimental section.
• The defining characteristic of a CNP is that it conditions on O via an embedding of fixed dimensionality. By doing so it gives up the mathematical guarantees associated with stochastic processes, trading this off for functional flexibility and scalability. In more detail, the architecture is

  r_i = h_θ(x_i, y_i)   ∀ (x_i, y_i) ∈ O   (1)
  r = r_1 ⊕ r_2 ⊕ … ⊕ r_{n-1} ⊕ r_n   (2)
  φ_i = g_θ(x_i, r)   ∀ x_i ∈ T   (3)

  where h_θ : X × Y → R^d and g_θ : X × R^d → R^e are neural networks, ⊕ is a commutative operation that takes elements in R^d and maps them into a single element of R^d, and φ_i are the parameters of Q_θ(f(x_i) | O, x_i) = Q(f(x_i) | φ_i). Depending on the task the model learns to parametrize a different output distribution.
• This architecture ensures permutation invariance and O(n + m) scaling for conditional prediction. Since r_1 ⊕ … ⊕ r_n can be computed in O(1) from r_1 ⊕ … ⊕ r_{n-1}, the architecture supports streaming observations with minimal overhead.

24.

CNP: training
• Let O_N = {(x_i, y_i)}_{i=0}^{N} ⊂ O be the first N elements of O. The model is trained by minimizing the negative conditional log probability
  L(θ) = −E_{f∼P} [ E_N [ log Q_θ({y_i}_{i=0}^{n−1} | O_N, {x_i}_{i=0}^{n−1}) ] ]   (4)
• Thus, the targets it scores Q_θ on include both the observed and unobserved values. In practice, Monte Carlo estimates of the gradient of this loss are taken by sampling f and N.
• This approach shifts the burden of imposing prior knowledge from an analytic prior to empirical data. This has the advantage of liberating a practitioner from having to specify an analytic form for the prior, which is ultimately intended to summarize their empirical experience. Still, the Q_θ are not necessarily a consistent set of conditionals for all observation sets, and the training routine does not guarantee that.
• In summary, a CNP (1) is a conditional distribution over functions trained to model the empirical conditional distributions of functions f ∼ P, (2) is permutation invariant in O and T, and (3) is scalable, achieving a running time complexity of O(n + m) for making m predictions with n observations. A code sketch of the architecture and this loss follows below.
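A compact PyTorch sketch of Eqs. (1)-(4): an MLP encoder h_θ, a mean aggregator, and an MLP decoder g_θ that outputs a Gaussian mean and variance for each target. Layer sizes, the 1-D regression setup and the variance parameterisation are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CNP(nn.Module):
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # h_theta: (x_i, y_i) -> r_i                                (Eq. 1)
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
            nn.Linear(r_dim, r_dim))
        # g_theta: (x_target, r) -> (mu, raw sigma)                 (Eq. 3)
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, r_dim), nn.ReLU(),
            nn.Linear(r_dim, 2 * y_dim))

    def forward(self, x_ctx, y_ctx, x_tgt):
        r_i = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1))
        r = r_i.mean(dim=0, keepdim=True)      # permutation-invariant aggregation (Eq. 2)
        r = r.expand(x_tgt.shape[0], -1)
        out = self.decoder(torch.cat([x_tgt, r], dim=-1))
        mu, raw_sigma = out.chunk(2, dim=-1)
        sigma = 0.1 + 0.9 * torch.nn.functional.softplus(raw_sigma)  # bounded variance (assumption)
        return mu, sigma

def cnp_loss(model, x_ctx, y_ctx, x_tgt, y_tgt):
    """Negative conditional log-likelihood, Eq. (4), estimated by Monte Carlo."""
    mu, sigma = model(x_ctx, y_ctx, x_tgt)
    dist = torch.distributions.Normal(mu, sigma)
    return -dist.log_prob(y_tgt).mean()

# one illustrative training step on a random curve
model = CNP(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.linspace(-2, 2, 50).unsqueeze(-1); y = torch.sin(x) + 0.1 * torch.randn_like(x)
idx = torch.randperm(50)
loss = cnp_loss(model, x[idx[:10]], y[idx[:10]], x, y)
opt.zero_grad(); loss.backward(); opt.step()
```

In practice the number of context points N is re-sampled at every training step, so the model is trained across many conditioning-set sizes, matching the sampling of N in the loss above.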

25.
3.2. Meta-Learning (related work)

Deep learning models are generally more scalable and are very successful at learning features and prior knowledge from the data directly. However they tend to be less flexible with regards to input size and order. Additionally, in general they only approximate one function as opposed to distributions over functions. Meta-learning approaches address the latter and share our core motivations. Recently meta-learning has been applied to a wide range of tasks like RL (Wang et al., 2016; Finn et al., 2017) or program induction (Devlin et al., 2017).

Often meta-learning algorithms are implemented as deep generative models that learn to do few-shot estimations of the underlying density of the data. Generative Query Networks (GQN), for example, predict new viewpoints in 3D scenes given some context observations using a similar training regime to NPs (Eslami et al., 2018). As such, NPs can be seen as a generalisation of GQN to few-shot prediction tasks beyond scene understanding, such as regression and classification. Another way of carrying out few-shot density estimation is by updating existing models like PixelCNN (van den Oord et al., 2016) and augmenting them with attention mechanisms (Reed et al., 2017) or including a memory unit in a VAE model (Bornschein et al., 2017). Another successful latent variable approach is to explicitly condition on some context during inference (J. Rezende et al., 2016). Given the generative nature of these models they are usually applied to image generation tasks, but models that include a conditioning class-variable can be used for classification as well. Unlike other generative models, the neural statistician learns to estimate the density of the observed data but does not allow for targeted sampling at what we have been referring to as input positions x_i. Instead, one can only generate i.i.d. samples from the estimated density. Finally, the latent variant of CNP can also be seen as an approximated amortized version of Bayesian DL (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017).

Classification itself is another common task in meta-learning. Few-shot classification algorithms usually rely on some distance metric in feature space to compare target images to the observations provided (Koch et al., 2015; Santoro et al., 2016). Matching networks (Vinyals et al., 2016; Bartunov & Vetrov, 2016) are closely related to CNPs. In their case features of samples are compared with target features using an attention kernel. At a higher level one can interpret this model as a CNP where the aggregator is just the concatenation over all input samples and the decoder g contains an explicitly defined distance kernel. In this sense matching networks are closer to GPs than to CNPs, since they require the specification of a distance kernel that CNPs learn from the data instead. In addition, as MNs carry out all-to-all comparisons they scale with O(n × m).

4. Experimental Results

Figure 2. 1-D Regression. Regression results on a 1-D curve (black line) using 5 (left column) and 50 (right column) context points (black dots). The first two rows show the predicted mean and variance for the regression of a single underlying kernel for GPs (red) and CNPs (blue). The bottom row shows the predictions of CNPs for a curve with switching kernel parameters.

26.

Experiments: 1-D function regression (Section 4.1)

As a first experiment we test CNP on the classical 1D regression task that is used as a common baseline for GPs. We generate two different datasets that consist of functions generated from a GP with an exponential kernel. In the first dataset we use a kernel with fixed parameters, and in the second dataset the function switches at some random point on the real line between two functions each sampled with different kernel parameters.

At every training step we sample a curve from the GP, select a subset of n points (x_i, y_i) as observations, and a subset of points (x_t, y_t) as target points. Using the model described above, the predictions increase in accuracy as the number of context points increases. Furthermore the model performs well on the switching-kernel task, which is not trivial for GPs, whereas for CNPs one only has to change the dataset used for training.

4.2. Image Completion: image completion is treated as regression over functions f : [0, 1]² → [0, 1] for greyscale images (MNIST) or f : [0, 1]² → [0, 1]³ for RGB (CelebA), where the input x is the normalised 2-D pixel coordinate and the output y is the pixel intensity.

Figure 3. Pixel-wise image regression on MNIST. Left: Two examples of image regression with varying numbers of observations. We provide the model with 1, 40, 200 and 728 context points (top row) and query the entire image. The resulting mean (middle row) and variance (bottom row) at each pixel position is shown for each of the context images. Right: model accuracy with increasing number of observations that are either chosen at random (blue) or by selecting the pixel with the highest variance (red).

27.

Image completion: MNIST and CelebA (Section 4.2)

4.2.1. MNIST. We first test CNP on the MNIST dataset and use the test set to evaluate its performance. As shown in Figure 3a the model learns to make good predictions of the underlying digit even for a small number of context points. Crucially, when conditioned only on one non-informative context point (e.g. a black pixel on the edge) the model's prediction corresponds to the average over all MNIST digits. As the number of context points increases the predictions become more similar to the underlying ground truth. This demonstrates the model's capacity to extract dataset specific prior knowledge. It is worth mentioning that even with a complete set of observations the model does not achieve pixel-perfect reconstruction, as we have a bottleneck at the representation level.

Figure 4. Pixel-wise image completion on CelebA. Two examples of CelebA image regression with varying numbers of observations. We provide the model with 1, 10, 100 and 1000 context points (top row) and query the entire image. The resulting mean (middle row) and variance (bottom row) at each pixel position is shown for each of the context images.

We compare CNPs quantitatively to two related models: kNNs and GPs. As shown in Table 1, CNPs outperform the latter when the number of context points is small (empirically when half of the image or less is provided as context). When the majority of the image is given as context, exact methods like GPs and kNN will perform better. From the table we also see that the order in which the context points are provided is less important for CNPs, since providing the context from top to bottom still results in good performance. Both insights point to the fact that CNPs learn a data-specific 'prior' that will generate good samples even when the number of context points is very small.

Table 1. Pixel-wise mean squared error for all the pixels in the image completion task on the CelebA data set with increasing number of context points (10, 100, 1000). The context points are chosen either at random or ordered from the top-left corner to the bottom-right. With fewer context points CNPs outperform kNNs and GPs; in addition CNPs perform well regardless of the order of the context points, whereas GPs and kNNs perform worse when the context is ordered.

          Random context              Ordered context
  #       10       100      1000      10       100      1000
  kNN     0.215    0.052    0.007     0.370    0.273    0.007
  GP      0.247    0.137    0.001     0.257    0.220    0.002
  CNP     0.039    0.016    0.009     0.057    0.047    0.021

4.2.3. Latent variable model. The main model predicts a factored mean and variance for the target outputs. Although the mean is by itself a useful prediction and the variance is a good way to capture the uncertainty, this factored model prevents sampling coherent functions: since the implementation returns factored outputs, the best prediction it can produce given limited context information is to average over all possible predictions that agree with the context. One way to obtain coherent samples is to predict a full multivariate Gaussian, for example by training the model to predict a GP kernel (Wilson et al., 2016), but the difficulty is the need to back-propagate through the sampling, which involves a large matrix inversion (or some approximation of it). An alternative is to add latent variables to the model that can be sampled conditioned on the context to produce predictions with high probability in the data distribution; this latent variable model is considered in section 4.2.3.

Figure 6. Image completion with a latent variable model. The latent variables capture the global uncertainty, allowing the sampling of different coherent images which conform to the observations. As the number of observations increases, uncertainty is reduced and the samples converge to a single estimate.

An important aspect of CNPs, demonstrated in Figure 5, is their flexibility not only in the number of observations and targets but also with regards to their input values: the model can be conditioned on subsets it has not encountered during training (e.g. one half of an image). Being able to model the uncertainty given some context can be helpful for many tasks, one example being active exploration, where observations are chosen by selecting the pixel with the highest predicted variance.

28.

: 8 8 :2 . • 2 0 2 : + -, 1 C – • Conditional I P – » • • Ici c g d ] I • a c x N [or L I C Ai G Iu Aci c I M Conditional Neural Processes » • g d • a c x s gnC N [t 8 [ L I .. I IAil I N eb 6. Image completi Figure 5. Flexible image completion. In contrastFigure to standard con latent variables capture th ditional models, CNPs can be directly conditioned observed pling ofon different coheren Figure 5. Flexible image completion. In contrast to standard convations. As theseen numberin o pixels in models, arbitrary patterns, ones never ditional CNPs can be directlyeven conditioned on which observed were reduced and the samples c in arbitrary even ones which were never in the pixels training set.patterns, Similarly, the model canseen predict values for pixel the training set. Similarly, the model can predict values for pixel coordinates that were never included in the training set, like sub coordinates that were never included in the training set, like subservations and targets. pixel values inin different resolutions. The dotted white lines dotted were pixel values different resolutions. The white were which canlines be used to co added for clarity after generation. maintain this property in added for clarity after generation. and prediction task, and has the capacity to extract domain knowledge from a training set. train the model to predi However the difficulty is the sampling which inv some approximation of compare CNPstask, quantitatively to two models: to extract domain andWeprediction and has therelated capacity kNNs and GPs. As shown in Table 4.2.3 CNPs outperform Random Co knowledge from a oftraining set.is small (empirithe latter when number context points # 10 100 cally when half of the image or less is provided as context).

29.

Latent samples and one-shot classification

We apply this model to MNIST and CelebA (Figure 6). We use the same models as before, but we concatenate the representation r to a vector of latent variables z of size 64 (for CelebA we use bigger models where the sizes of r and z are 1024 and 128 respectively). For both the prior and posterior models, we use three layered MLPs and average their outputs. We emphasize that the difference between the prior and posterior is that the prior only sees the observed pixels, while the posterior sees both the observed and the target pixels: a conditional Gaussian prior p(z|O) conditioned on the observations, and a Gaussian posterior p(z|O, T) also conditioned on the target points. When sampling from this model with a small number of observed pixels, we get coherent samples and we see that the variability of the datasets is captured. As the model is conditioned on more and more observations, the variability of the samples drops and they eventually converge to a single possibility.

4.3. Classification. Finally, we apply the model to one-shot classification using the Omniglot dataset (Lake et al., 2015) (see Figure 7 for an overview of the task). This dataset consists of 1,623 classes of characters from 50 different alphabets. Each class has only 20 examples and as such this dataset is particularly suitable for few-shot learning algorithms. As in (Vinyals et al., 2016) we use 1,200 randomly selected classes as our training set and the remainder as our testing data set. In addition we augment the dataset following the protocol described in (Santoro et al., 2016). This includes cropping the image from 32 × 32 to 28 × 28, applying small random translations and rotations to the inputs, and also increasing the number of classes by rotating every character by 90 degrees and defining that to be a new class.

Figure 7. One-shot Omniglot classification. At test time the model is presented with a labelled example for each class, and outputs the classification probabilities of a new unlabelled example. The model uncertainty grows when the new example comes from an un-observed class.

Table 2. Classification results on Omniglot. Results on the same task for MANN (Santoro et al., 2016), matching networks (MN) (Vinyals et al., 2016) and CNP.

          5-way Acc           20-way Acc
          1-shot   5-shot     1-shot   5-shot     Runtime
  MANN    82.8%    94.9%      -        -          O(nm)
  MN      98.1%    98.9%      93.8%    98.5%      O(nm)
  CNP     95.3%    98.5%      89.9%    96.8%      O(n + m)

CNP achieves these results using a significantly simpler architecture (three convolutional layers for the encoder and a three-layer MLP for the decoder) and with a lower runtime of O(n + m) at test time as opposed to O(nm).

30.

Neural Processes: model (Garnelo et al., 2018b)

The NP implementation adds two desiderata to the CNP: invariance to the order of context points and computational efficiency. The resulting model can be boiled down to three core components (Figure 1b of the NP paper):
• An encoder h from input space into representation space that takes in pairs of (x, y)_i context values and produces a representation r_i = h((x, y)_i) for each of the pairs; h is parameterised as a neural network.
• An aggregator a that summarises the encoded inputs into a single order-invariant global representation r that parameterises the latent distribution z ~ N(μ(r), I σ(r)). The simplest operation that ensures order-invariance and works well in practice is the mean function r = a(r_i) = (1/n) Σ_{i=1}^{n} r_i. Crucially, the aggregator reduces the runtime to O(n + m), where n and m are the number of context and target points respectively.
• A conditional decoder g that takes as input the sampled global latent variable z as well as the new target locations x_T and outputs the predictions ŷ_T for the corresponding values of f(x_T) = y_T.

Unlike the deterministic CNP, the latent variable allows for global sampling: one draw of z yields one coherent function, whereas a CNP that only makes a factored prediction of the means and variances is unable to produce different function samples for the same context data, which can be important if modelling this uncertainty is desirable. In this sense NPs constitute a clear-cut generalisation of the original deterministic CNP, with stronger parallels to other latent variable models and approximate Bayesian methods.

Related work: NPs sit on a spectrum between neural networks and Gaussian processes. Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly, while GPs represent a distribution over a family of functions constrained by an assumption on the functional form of the covariance between two points. NPs and CNPs can also be seen as generalisations of the recently published Generative Query Networks (GQN), which apply a similar training procedure to predict new viewpoints in 3D scenes given some context observations (Eslami et al., 2018); Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).

Figure 2 (NP paper). Graphical models of related models (a-c) and of the neural process (d); gray shading indicates the variable is observed.

31.

NP experiments: pixel-wise image regression

Figure 4 (NP paper). Pixel-wise regression on MNIST and CelebA. The diagram on the left visualises how pixel-wise image completion can be framed as a 2-D regression task where f(pixel coordinates) = pixel brightness. The figures to the right of the diagram show the results on image completion for MNIST (context sizes 10, 100, 300, 784) and CelebA (context sizes 15, 30, 90, 1024). The images on the top correspond to the context points provided to the model; for better clarity the unobserved pixels have been coloured blue for the MNIST images and white for CelebA. Each of the remaining rows corresponds to a different sample.
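Framing image completion as 2-D regression, as in the caption above, just means turning pixels into (coordinate, intensity) pairs and choosing a context subset; a small NumPy sketch (array shapes and normalisation are assumptions):

```python
import numpy as np

def image_to_regression_pairs(image, n_context=100, seed=0):
    """Turn an image into (x, y) pairs: x = normalised 2-D pixel coordinates,
    y = pixel intensity, then pick a random context subset."""
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel() / (h - 1), xs.ravel() / (w - 1)], axis=-1)  # in [0, 1]^2
    values = image.reshape(h * w, -1) / 255.0                                  # in [0, 1]
    idx = np.random.default_rng(seed).choice(h * w, size=n_context, replace=False)
    return (coords[idx], values[idx]), (coords, values)   # (context set), (full target set)

# toy example on a random "image"
img = (np.random.default_rng(1).random((28, 28)) * 255).astype(np.uint8)
(ctx_x, ctx_y), (tgt_x, tgt_y) = image_to_regression_pairs(img)
```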

32.

NP experiments: black-box optimisation with Thompson sampling (Section 4.3)

To showcase the utility of sampling entire consistent trajectories we apply neural processes to Bayesian optimisation on a 1-D function using Thompson sampling (Thompson, 1933). Thompson sampling (also known as randomised probability matching) is an approach to tackle the exploration-exploitation dilemma by maintaining a posterior distribution over model parameters. A decision is taken by drawing a sample of model parameters and acting greedily under the resulting policy. The posterior distribution is then updated and the process is repeated. Despite its simplicity, Thompson sampling has been shown to be highly effective both empirically and in theory. It is commonly applied to black-box optimisation and multi-armed bandit problems (e.g. Agrawal & Goyal, 2012; Shahriari et al., 2016).

Figure 5 (NP paper). Thompson sampling with neural processes on a 1-D objective function. The plots show the optimisation process over five iterations. Each prediction function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points (black circles). The underlying ground truth function is depicted as a black dotted line. The red triangle indicates the next evaluation point, which corresponds to the minimum value of the sampled NP curve. The red circle in the following iteration corresponds to this evaluation point with its underlying ground truth value that serves as a new context point to the NP.

Table 1 (NP paper). Bayesian optimisation using Thompson sampling. Average number of optimisation steps needed to reach the global minimum of a 1-D function generated by a Gaussian process; values are normalised by the number of steps taken using random search. The performance of the Gaussian process with the correct kernel constitutes an upper bound on performance.

  Neural process     0.26
  Gaussian process   0.14
  Random search      1.00

A generic sketch of the Thompson-sampling loop follows below.
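The optimisation loop of Figure 5 can be sketched generically: draw one function sample from the surrogate given the current context, evaluate the true objective at that sample's minimiser, and add the result as a new context point. The `sample_function` argument is a hypothetical stand-in for drawing an NP prediction under a single sampled latent z.

```python
import numpy as np

def thompson_sampling(objective, sample_function, x_grid, n_iters=5):
    """Minimise a 1-D black-box objective by Thompson sampling with a
    function-sampling surrogate (e.g. a Neural Process)."""
    context_x, context_y = [], []
    for _ in range(n_iters):
        # draw one coherent function conditioned on the current context
        f_sample = sample_function(np.array(context_x), np.array(context_y), x_grid)
        x_next = x_grid[np.argmin(f_sample)]       # act greedily under the sample
        context_x.append(x_next)                   # evaluate the true objective...
        context_y.append(objective(x_next))        # ...and grow the context set
    best = int(np.argmin(context_y))
    return context_x[best], context_y[best]

# illustrative run with a trivial surrogate that ignores the context
x_grid = np.linspace(-2, 2, 200)
rng = np.random.default_rng(0)
random_surrogate = lambda cx, cy, xg: rng.normal(size=xg.shape)
print(thompson_sampling(lambda x: (x - 0.3) ** 2, random_surrogate, x_grid))
```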

33.


34.

References
• Garnelo, Marta, et al. "Conditional Neural Processes." In ICML, 2018.
• Garnelo, Marta, et al. "Neural Processes." In ICML Workshop, 2018.
• Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." In ICML, 2017.
• Ravi, Sachin and Hugo Larochelle. "Optimization as a Model for Few-Shot Learning." In ICLR, 2017.
• Santoro, Adam, et al. "Meta-Learning with Memory-Augmented Neural Networks." In ICML, 2016.
• Vinyals, Oriol, et al. "Matching Networks for One Shot Learning." In NIPS, 2016.
• Vinyals, Oriol. "Model vs Optimization Meta Learning." NIPS Metalearning Symposium, 2017.
• "Deep Learning: Practice and Trends." NIPS Tutorial, 2017.
• Kendall, Alex and Yarin Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" In NIPS, 2017.
• Wilson, Andrew Gordon, et al. "Deep Kernel Learning." In Artificial Intelligence and Statistics (AISTATS), 2016.
• Lee, Jaehoon, et al. "Deep Neural Networks as Gaussian Processes." In ICLR, 2018.