[DL Hacks]Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme

August 21, 18

#deep learning #DLHacks #Entity extraction #Relation extraction #Tagging Scheme #LSTM

Slide overview

2018/08/20
Deep Learning JP:
http://deeplearning.jp/hacks/

Text of each page

DLHacks Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme 2018.07.21 山田涼太

Outline: Paper, Implementation, Results, Discussion

Bibliographic information
Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme
Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, Bo Xu
• ACL 2017 outstanding paper
• Found via ymym's ACL summary repository: https://github.com/ymym3412/acl-papers/issues/134
• Video available (poor audio quality): https://vimeo.com/234945423
• GitHub (Keras, sparse documentation): https://github.com/zsctju/triplets-extraction

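Summary
Proposes an end-to-end method for extracting entities and relations.
• Introducing a new tagging scheme lets the extraction task be recast and solved as a tagging task (where LSTMs are effective)
• Introduces a weighted (biased) objective function, trained with RMSProp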

What is extraction of entities and relations?
Extracting entities and semantic relations from unstructured text data.
1. entity: US, Trump, Apple Inc
2. relation: the matching one is chosen from a predefined set of relations
"United States and Trump are in a Country-President relation" is expressed as a triplet:
{United States_e1, Country-President_r, Trump_e2}

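What is extraction of entities and relations?
• Difference from open information extraction (OIE): here the relation is chosen from a predefined set. OIE pulls the relation out of the sentence itself, so it has more freedom, which makes it harder and still very much a developing area.
• For a summary of recent OIE trends, see A Survey on Open Information Extraction (https://arxiv.org/abs/1806.05599)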

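Issues with existing methods
• Pipelined method: 1. named entity recognition, 2. relation classification. The task is split into these two steps, and because the steps are treated independently of each other, errors are frequent.
• Joint learning framework: performs steps 1 and 2 at once. It relies on manually tuned features, which takes considerable effort.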

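Proposed tagging scheme
Non-entity words
・O: other
Entity words
・BIES: position of the word within the entity (begin, inside, end, single)
・CP, CF, …: a predefined relation (CP stands for Country-President)
・1, 2: the entity number
Tags and triplets correspond one-to-one = the extraction task has been replaced by a tagging task = LSTMs are effective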
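The scheme can be illustrated with a few lines of Python. This is a minimal sketch, not the paper's or the repository's code: the function names, the `B-CP-1` tag string layout and the token-span input format are all assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def position_labels(length: int) -> List[str]:
    """BIES position labels for an entity that is `length` tokens long."""
    if length == 1:
        return ["S"]
    return ["B"] + ["I"] * (length - 2) + ["E"]


def tag_sentence(tokens: List[str], e1: Span, relation: str, e2: Span) -> List[str]:
    """Tag one triplet; every token outside the two entities gets 'O'."""
    tags = ["O"] * len(tokens)
    for number, (start, end) in ((1, e1), (2, e2)):
        for offset, pos in enumerate(position_labels(end - start)):
            # Assumed layout: <position>-<relation>-<entity number>
            tags[start + offset] = f"{pos}-{relation}-{number}"
    return tags


tokens = "The United States President Trump will visit the Apple Inc".split()
tags = tag_sentence(tokens, e1=(1, 3), relation="CP", e2=(4, 5))
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each entity token carries position, relation and entity number in a single label, so one sequence tagger is enough to recover the whole triplet.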

10.

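Proposed tagging scheme: supplementary notes
Note 1: when the same relation appears more than once in a sentence, the entities are combined into triplets according to how close the words are.
Note 2: overlapping relations are not handled, e.g. "A is the founder of B and C."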
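Note 1 can be sketched as a decoding step that turns a tag sequence back into triplets. The code below is only one possible reading of "closeness": it greedily pairs each entity 1 with the nearest unused entity 2 of the same relation, measured by start position. The tag layout `<position>-<relation>-<number>` and all function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive
Triplet = Tuple[str, str, str]


def extract_entities(tags: List[str]) -> DefaultDict[Tuple[str, str], List[Span]]:
    """Group entity spans by (relation, entity number).

    Simplification: I/E tags are not checked against the relation of the
    B tag that opened the span.
    """
    entities: DefaultDict[Tuple[str, str], List[Span]] = defaultdict(list)
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        pos, rest = tag.split("-", 1)
        relation, number = rest.rsplit("-", 1)
        if pos == "S":
            entities[(relation, number)].append((i, i + 1))
            start = None
        elif pos == "B":
            start = i
        elif pos == "E" and start is not None:
            entities[(relation, number)].append((start, i + 1))
            start = None
    return entities


def decode_triplets(tokens: List[str], tags: List[str]) -> List[Triplet]:
    """Pair each entity 1 with the nearest unused entity 2 of the same relation."""
    entities = extract_entities(tags)
    triplets = []
    for relation in sorted({relation for relation, _ in entities}):
        seconds = list(entities.get((relation, "2"), []))
        for first in entities.get((relation, "1"), []):
            if not seconds:
                break
            nearest = min(seconds, key=lambda s: abs(s[0] - first[0]))
            seconds.remove(nearest)
            e1 = " ".join(tokens[first[0]:first[1]])
            e2 = " ".join(tokens[nearest[0]:nearest[1]])
            triplets.append((e1, relation, e2))
    return triplets


tokens = "United States President Trump met France President Macron".split()
tags = ["B-CP-1", "E-CP-1", "O", "S-CP-2", "O", "S-CP-1", "O", "S-CP-2"]
print(decode_triplets(tokens, tags))
```

Because each token holds exactly one tag, a sentence such as the one in Note 2, where A would need to take part in two triplets, cannot be represented.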

11.

LSTM
An RNN that handles contextual information (long-term memory) well: https://arxiv.org/pdf/1508.01991.pdf
In recent years it has produced good results on sequential tagging tasks:
・NER (Lample et al., 2016)
・CCG Supertagging (Vaswani et al., 2016)
・Chunking (Zhai et al., 2017)

12.

Outline: Paper, Implementation, Results, Discussion

13.

Overall architecture

14.

Embedding layer: uses Google's word2vec; word embeddings are obtained with skip-gram.

15.

Encoding layer: Bi-LSTM (with peephole connections). Figure labels: f_t (forget gate), i_t (input gate), z_t (candidate cell input), o_t (output gate)
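For reference, the gate labels in the figure correspond to a peephole LSTM cell. The equations below are the standard peephole formulation written with the figure's symbols. The weight and bias names (W, b) and the input symbol w_t are notational assumptions and may differ from the paper.

```latex
\begin{aligned}
i_t &= \sigma(W_{wi} w_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{wf} w_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
z_t &= \tanh(W_{wc} w_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot z_t \\
o_t &= \sigma(W_{wo} w_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The peephole terms are the ones that feed the cell state (c_{t-1} or c_t) into the gates.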

16.

Decoding layer: LSTM. The encoding layer's h_t is fed in as input. Figure labels: f_t, i_t, z_t, o_t (?)

17.

Softmax

18.

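Bias objective function
Optimized with RMSprop (Tieleman and Hinton, 2012).
α is a weight of at least 1 that controls how much emphasis is placed on the useful information, i.e. tags other than 'O'.
With α = 1, 'O' and all other tags are treated equally.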
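The role of α can be illustrated with a small weighted log-likelihood. This is a sketch, not the repository's Keras code: `bias_loss` and its input format are hypothetical, and it assumes the objective multiplies the log-probability of every non-'O' gold tag by α while 'O' tags keep weight 1.

```python
import math
from typing import Dict, List


def bias_loss(probs: List[Dict[str, float]], gold: List[str], alpha: float = 10.0) -> float:
    """Negative weighted log-likelihood of one sentence.

    probs[t][tag] is the predicted probability of `tag` at position t.
    Gold 'O' tags get weight 1; all other gold tags get weight alpha.
    """
    total = 0.0
    for dist, tag in zip(probs, gold):
        weight = 1.0 if tag == "O" else alpha
        total -= weight * math.log(dist[tag])
    return total


probs = [{"O": 0.9, "S-CP-1": 0.1}, {"O": 0.5, "S-CP-1": 0.5}]
gold = ["O", "S-CP-1"]
print(bias_loss(probs, gold, alpha=1.0))   # 'O' and entity tags weighted equally
print(bias_loss(probs, gold, alpha=10.0))  # mistakes on the entity tag cost 10x more
```

Raising α makes the model more willing to emit relation tags instead of 'O'.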

19.

Dataset
NYT, produced by a distant supervision method (Ren et al., 2017)
train data: 353,000 triplets (automatically generated)
test data: 3,880 triplets (manually labeled)
validation data: a random 10% of the test data
entity: 3 types
relation: 24 types
Available for download from https://github.com/shanzhenren/CoType

20.

How to read the dataset
test.json = the conventional approach
{
  "sentId": 135,
  "articleId": "0",
  "relationMentions": [
    {"em1Text": "Bill Elliott", "em2Text": "Dawsonville", "label": "/people/person/place_lived"},
    {"em1Text": "Dawsonville", "em2Text": "Bill Elliott", "label": "None"}
  ],
  "entityMentions": [
    {"start": 0, "text": "Bill Elliott", "label": "PERSON"},
    {"start": 1, "text": "Dawsonville", "label": "LOCATION"}
  ],
  "sentText": "\"And they would never understand why , for Bill Elliott , there was no joy in Dawsonville .\"\r\n"
}
Relation extraction and entity extraction are done separately.
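A record in this format can be read with Python's standard `json` module. The sketch below embeds a shortened copy of the record (the sentence text is abbreviated) and keeps only the relation mentions whose label is not "None". The helper name `gold_triplets` is an assumption for the example.

```python
import json
from typing import List, Tuple

record_text = """
{"sentId": 135, "articleId": "0",
 "relationMentions": [
   {"em1Text": "Bill Elliott", "em2Text": "Dawsonville", "label": "/people/person/place_lived"},
   {"em1Text": "Dawsonville", "em2Text": "Bill Elliott", "label": "None"}],
 "entityMentions": [
   {"start": 0, "text": "Bill Elliott", "label": "PERSON"},
   {"start": 1, "text": "Dawsonville", "label": "LOCATION"}],
 "sentText": "for Bill Elliott , there was no joy in Dawsonville ."}
"""


def gold_triplets(record: dict) -> List[Tuple[str, str, str]]:
    """Return (entity 1, relation, entity 2) for every mention with a real label."""
    return [
        (m["em1Text"], m["label"], m["em2Text"])
        for m in record["relationMentions"]
        if m["label"] != "None"
    ]


record = json.loads(record_text)
print(gold_triplets(record))
print({m["text"]: m["label"] for m in record["entityMentions"]})
```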

21.

How to read the dataset
test_tag.json = the new tagging scheme
{
  "tokens": ["``", "And", "they", "would", "never", "understand", "why", ",", "for", "Bill", "Elliott", ",", "there", "was", "no", "joy", "in", "Dawsonville", ".", "''"],
  "tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "/people/person/place_lived__E1B", "/people/person/place_lived__E1L", "O", "O", "O", "O", "O", "O", "/people/person/place_lived__E2S", "O", "O"]
}
Anatomy of the tag "/people/person/place_lived__E1B":
・/people/person/place_lived: relation
・E1: entity
・B: BIES position (L is used in place of End)
Because the relation is included in the tag, extraction is a single tagging step.
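Splitting such a tag into its three parts takes only a few lines. This is a sketch based on the single sample above: `parse_tag` and `ParsedTag` are hypothetical names, and the `I` position letter is assumed by analogy since only `B`, `L` and `S` occur in the sample.

```python
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    relation: str
    entity: int    # 1 or 2
    position: str  # B, L or S in the sample; I (inside) is assumed


def parse_tag(tag: str) -> Optional[ParsedTag]:
    """Split '<relation>__E<number><position>' into its parts; 'O' gives None."""
    if tag == "O":
        return None
    relation, suffix = tag.rsplit("__", 1)
    # suffix looks like 'E1B': 'E', the entity number, then the position letter
    return ParsedTag(relation, int(suffix[1]), suffix[2])


for tag in ["O", "/people/person/place_lived__E1B", "/people/person/place_lived__E2S"]:
    print(tag, "->", parse_tag(tag))
```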

22.

Dataset statistics (train / test): 3 entity types, 24 relation types

23.

Evaluation
• Evaluated with precision, recall and F-measure
• A triplet counts as correct when the relation and the offsets of both entities match
• Following prior work, triplets whose relation is None are not considered
• How to compute precision, recall and F-measure for tagging:
http://www.wilmina.ac.jp/ojc/edu/kiyo_2011/kiyo_08_PDF/d2011_02.pdf
https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/NaturalLanguageProcessing/NLP2015/NLP_Exercise2.pdf
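The metric can be sketched over sets of triplets. This is an illustration, not the repository's Evaluate.py: a triplet is represented here as (offset of entity 1, relation, offset of entity 2), and the sample offsets and relation labels are made up for the example.

```python
from typing import Set, Tuple

Triplet = Tuple[int, str, int]  # (entity 1 offset, relation, entity 2 offset)


def precision_recall_f1(predicted: Set[Triplet], gold: Set[Triplet]) -> Tuple[float, float, float]:
    """A predicted triplet is correct only if relation and both offsets match."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = {(9, "/people/person/place_lived", 17), (3, "/location/location/contains", 5)}
predicted = {(9, "/people/person/place_lived", 17), (9, "/people/person/nationality", 17)}
print(precision_recall_f1(predicted, gold))
```

The second prediction has the right entities but the wrong relation, so it counts against precision, and the missed gold triplet counts against recall.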

24.

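Code
TaggingScheme.py: converts the NYT tagging scheme into the paper's scheme
PrecessEEdata.py: reads the files and reshapes them for easier handling
End2EndModel.py: trains the end-to-end model
decodelayer.py: the decoding layer, split out into its own file
Evaluate.py: computes precision, recall and F1
Current.py: checks whether the predicted tags are correct; unused (absorbed into Evaluate?)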

25.

Outline: Paper, Implementation, Results, Discussion

26.

Results (table row groups: pipelined methods, jointly extracting methods, end-to-end tagging models, proposed method)

27.

Results: the 2.5% drop relative to Table 1 corresponds to cases where the relation was wrong.

28.

Results: a small α hurts recall and a large α hurts precision. F1 is best at α = 10.

29.

Outline: Paper, Implementation, Results, Discussion

30.

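Discussion
• I had assumed a CRF would be the better choice for decoding a constrained tagging task, but the LSTM's memory mechanism appears to be enough to produce results that satisfy the rules.
• In fact, the joint method of Li and Ji (2014) reaches about the same F-measure.
• Low recall means many misses. Since hits are inherently the rarer case, it may be more practical to raise the detection rate and have people check the output by eye.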

31.

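Next
• Reduce confusion between E1 and E2
• Replace softmax with a classifier that allows multiple tags (to handle overlap)
• There also seems to be an approach using reinforcement learning: https://www.hindawi.com/journals/cin/2017/7643065/
• I will share a PyTorch implementation if I manage to write one
• I would like to read any work that succeeds at extracting relations spanning multiple sentences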

32.

Bonus (omake)

33.

Obtaining word embeddings with fastText
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
Download the test file (322 MB):
mkdir data
wget -c http://mattmahoney.net/dc/enwik9.zip -P data
unzip data/enwik9.zip -d data
Strip the HTML so the text can be fed to fastText:
perl wikifil.pl data/enwik9 > data/fil9
Check the data format:
less data/fil9
The file is plain space-separated text. To obtain embeddings for your own vocabulary, just provide space-separated text.

34.

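Obtaining word embeddings with fastText: training
You can use cbow instead of skipgram, or change the number of dimensions.
mkdir result
./fasttext skipgram -input data/fil9 -output result/fil9 -dim 300
result/fil9 can now be used as the embeddings.
./fasttext skipgram -input data/hoge -output result/hoge -dim 300
When using your own word list, too small a vocabulary produces "Empty vocabulary. Try a smaller -minCount value."; rerun with the option -minCount 0. It takes about 2 minutes for 175,000 words.
.bin and .vec files are written to result.
Inspect the .vec file: each line holds the embedding of one word. Note that the first line is the word count and the number of dimensions.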
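The .vec layout described here (a header line, then one word and its vector per line) can be parsed with plain Python. This is a minimal sketch: `load_vec` is a hypothetical helper, and the in-memory sample stands in for a real file such as result/fil9.vec.

```python
import io
from typing import Dict, List, TextIO


def load_vec(handle: TextIO) -> Dict[str, List[float]]:
    """Parse .vec text: header 'count dim', then 'word v1 ... vdim' per line."""
    count, dim = map(int, handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header promised {count} words, found {len(vectors)}")
    return vectors


# Tiny stand-in for a real file; with fastText output one would instead use
# open("result/fil9.vec", encoding="utf-8") and pass the handle to load_vec.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3 \nof -0.4 0.5 0.6 \n")
vectors = load_vec(sample)
print(vectors["the"])
```

The header check catches the easy mistake of treating the first line as a word vector.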

35.

Obtaining word embeddings with fastText: checking the embeddings
Open result/enwik9.
To load embeddings prepared in advance like this into PyTorch, see:
http://kento1109.hatenablog.com/entry/2018/03/21/195840
http://kento1109.hatenablog.com/entry/2018/03/15/153652