Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / English Ver.)


October 11, 2017

Overview

Presentation slides about the architecture of "Dragon," a distributed object storage at Yahoo! JAPAN.



Text of each slide
1.

Dragon: A Distributed Object Storage @Yahoo! JAPAN (English Ver.)
Sep. 19, 2017, WebDB Forum Tokyo
Yasuharu Goto
Copyright © 2017 Yahoo Japan Corporation. All Rights Reserved.

2.

About me
• Yasuharu Goto
• Yahoo! JAPAN (2008-)
• Software Engineer
• Storage, Distributed Database Systems (Cassandra)
• Twitter: @ono_matope
• Lang: Go

3.

Agenda
• About Dragon
• Architecture
• Issues and Future Works

4.

Dragon

5.

Object Storage
• What is Object Storage?
  • A storage architecture that manages files not as files but as objects.
  • Instead of providing features like file hierarchies, it provides high availability and scalability.
  • (Typically) provides a REST API, so it can be used easily by applications.
• Popular products
  • AWS: Amazon S3
  • GCP: Google Cloud Storage
  • Azure: Azure Blob Storage
• An essential component for modern web development.

6.

Dragon
• A distributed object storage developed at Yahoo! JAPAN.
• Design goals: high { performance, scalability, availability, cost efficiency }
• Written in Go
• Released in Jan 2016 (20 months in production)
• Scale
  • Deployed in 2 data centers in Japan
  • Stores 20 billion objects / 11 PB

7.

Use Cases
• 250+ users in Y!J
• Various usage: media content, user data, log storage, backend for Presto (experimental)
• Yahoo! Auction (images), Yahoo! News/Topics (images), Yahoo! Display Ad Network (images/video), Yahoo! Blog (images), Yahoo! Smartphone Themes (images), Yahoo! Travel (images), Yahoo! Real Estate (images), Yahoo! Q&A (images), Yahoo! Reservation (images), Yahoo! Politics (images), Yahoo! Game (contents), Yahoo! Bookstore (contents), Yahoo! Box (user data), Netallica (images), etc.

8.

S3 Compatible API
• Dragon provides an S3-compatible API
  • aws-sdk, aws-cli, CyberDuck...
• Implemented
  • Basic S3 APIs (Service, Bucket, Object, ACL...)
  • SSE (Server-Side Encryption)
• TODO
  • Multipart Upload API (to upload large objects up to 5 TB)
  • and more...

9.

Performance (vs. Riak CS, for reference)
• Charts: GET Object 10 KB and PUT Object 10 KB throughput (requests/sec) at 1-400 client threads; Dragon sustains higher throughput than Riak CS on both workloads.
• Setup
  • Dragon: API*1, Storage*3, Cassandra*3
  • Riak CS: haproxy*1, stanchion*1, Riak (KV+CS)*3
  • Same hardware except for Cassandra and Stanchion.

10.

Why?

11.

Why did we build a new Object Storage?
• Octagon (2011-2017)
  • Our 1st-generation object storage
  • Up to 7 PB / 7 billion objects / 3,000 nodes at a time
  • Used for personal cloud storage, e-books, etc...
• Problems of Octagon
  • Low performance
  • Unstable
  • Expensive TCO
  • Hard to operate
• We started to consider alternative products.

12.

Requirements
• Our requirements
  • Performance high enough for our services
  • High scalability to respond to rapid increases in data demands
  • High availability with low operation cost
  • High cost efficiency
• Mission
  • To establish a company-wide storage infrastructure

13.

Alternatives
• Existing open source products
  • Riak CS: some of our products introduced it, but it did not meet our performance requirement.
  • OpenStack Swift: concerns about performance degradation as the object count increases.
• Public cloud providers
  • Cost inefficient: we mainly provide our services from our own data centers.
• We needed a highly scalable storage system that runs on-premise.

14.

Alternatives (cont.)
• (Same considerations as the previous slide.)
• OK, let's make it by ourselves!

15.

Architecture

16.

Architecture Overview
• Dragon consists of 3 components: API Nodes, the Storage Cluster, and the MetaDB.
• API Node
  • Provides the S3-compatible API and serves all user requests.
• Storage Node
  • HTTP file servers that store the BLOBs of uploaded objects.
  • 3 nodes make up a VolumeGroup; BLOBs in each group are periodically synchronized.
• MetaDB
  • An Apache Cassandra cluster.
  • Stores the metadata of uploaded objects, including the location of each BLOB.

17.

Architecture (diagram)
• Clients speak HTTP (S3 API) to the API Nodes.
• API Nodes write metadata to the MetaDB and BLOBs to the Storage Cluster.
• The Storage Cluster is organized into VolumeGroups: VolumeGroup 01 = StorageNodes 1-3, VolumeGroup 02 = StorageNodes 4-6, each node with independent disks (HDD1, HDD2, ...).

18.

Architecture (diagram, cont.)
• The API and Storage nodes are written in Go.

19.

Architecture
• API Nodes periodically fetch and cache the VolumeGroup configuration from the MetaDB.
• VolumeGroup configuration:
  id | hosts             | volumes
  01 | node1,node2,node3 | HDD1, HDD2
  02 | node4,node5,node6 | HDD1, HDD2

20.

Upload
1. When a user uploads an object, the API Node first randomly picks a VolumeGroup and transfers the object's BLOB to the nodes in that VolumeGroup using HTTP PUT.
2. It then stores the metadata, including the BLOB location, into the MetaDB.
• Example: PUT bucket1/sample.jpg → metadata { key: bucket1/sample.jpg, size: 1024 bytes, blob: volumegroup01/hdd1/... }
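The two upload steps can be sketched in Go as follows. This is a minimal illustration, not Dragon's actual code: `VolumeGroup`, `Metadata`, and the in-memory `metaDB` map are assumed stand-ins for the real HTTP PUTs and Cassandra writes.

```go
package main

import (
	"fmt"
	"math/rand"
)

// VolumeGroup is a simplified stand-in for Dragon's replica set of 3 storage nodes.
type VolumeGroup struct {
	ID    string
	Hosts []string
}

// Metadata records where an object's BLOB was placed.
type Metadata struct {
	Key  string
	Size int
	Blob string // e.g. "volumegroup01/hdd1/..."
}

// pickVolumeGroup randomly selects a VolumeGroup, as the API Node does on upload.
func pickVolumeGroup(groups []VolumeGroup) VolumeGroup {
	return groups[rand.Intn(len(groups))]
}

// upload sketches the flow: (1) send the BLOB to every node of one VolumeGroup,
// (2) store the metadata, location included, in the MetaDB.
func upload(groups []VolumeGroup, metaDB map[string]Metadata, key string, blob []byte) Metadata {
	vg := pickVolumeGroup(groups)
	for _, host := range vg.Hosts {
		_ = host // in Dragon: an HTTP PUT of the BLOB to this storage node
	}
	md := Metadata{Key: key, Size: len(blob), Blob: vg.ID + "/hdd1/" + key}
	metaDB[key] = md // in Dragon: an INSERT into Cassandra
	return md
}

func main() {
	groups := []VolumeGroup{
		{ID: "volumegroup01", Hosts: []string{"node1", "node2", "node3"}},
		{ID: "volumegroup02", Hosts: []string{"node4", "node5", "node6"}},
	}
	metaDB := map[string]Metadata{}
	md := upload(groups, metaDB, "bucket1/sample.jpg", make([]byte, 1024))
	fmt.Println(md.Size) // 1024
}
```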

21.

Download
1. When a user downloads an object, the API Node retrieves its metadata from the MetaDB.
2. It then issues an HTTP GET to a Storage Node holding the BLOB, based on the metadata, and streams the response back to the user.
• Example: GET bucket1/sample.jpg → metadata { key: bucket1/sample.jpg, size: 1024 bytes, blob: volumegroup01/hdd1/... }

22.

Failure Recovery
• When a hard disk fails... (diagram: one HDD in a VolumeGroup is lost)

23.

Failure Recovery
• The drive is replaced, and the data that should be on it is recovered by transferring it from the other StorageNodes in the VolumeGroup.

24.

Scaling Out
• When you add capacity to the cluster...
• Current VolumeGroup configuration:
  id | hosts             | volumes
  01 | node1,node2,node3 | HDD1, HDD2
  02 | node4,node5,node6 | HDD1, HDD2

25.

Scaling Out
• ...simply set up a new set of StorageNodes and update the VolumeGroup configuration:
  id | hosts             | volumes
  01 | node1,node2,node3 | HDD1, HDD2
  02 | node4,node5,node6 | HDD1, HDD2
  03 | node7,node8,node9 | HDD1, HDD2

26.

Why not Consistent Hashing?
• Dragon's distributed architecture is based on a mapping managed by the DB.
• Q. Why not consistent hashing?
  • Data is distributed uniformly by a hash of the key.
  • Used by many existing distributed systems, e.g. Riak CS and OpenStack Swift.
  • No external DB is needed to manage the map.
(Ring figure quoted from: http://docs.basho.com/riak/kv/2.2.3/learn/concepts/clusters/)

27.

Why not Consistent Hashing?
• A. To be able to add storage capacity without rebalancing.
  • Rebalancing heavily consumes disk I/O and network bandwidth, and often takes a long time.
  • e.g. Adding 1 node to a fully utilized cluster of 10 * 720 TB nodes rebalances to (720 TB * 10) / 11 ≈ 655 TB per node, so 655 TB must be transferred: 655 TB / 2 Gbps ≈ 30 days.
  • Scaling a hash-based DB beyond 1,000 large nodes is very challenging.
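The slide's estimate works out as a back-of-the-envelope check (assuming 1 TB = 10^12 bytes and a sustained 2 Gbps link):

```go
package main

import "fmt"

// rebalanceDays estimates how long moving `tb` terabytes takes at `gbps` gigabits/s.
func rebalanceDays(tb, gbps float64) float64 {
	bits := tb * 1e12 * 8          // total bits to move
	seconds := bits / (gbps * 1e9) // at the given line rate
	return seconds / 86400
}

func main() {
	// Rebalancing 10 fully utilized 720 TB nodes across 11 nodes leaves
	// (720*10)/11 TB per node, so the new node must receive ~655 TB.
	perNode := 720.0 * 10 / 11
	fmt.Printf("%.1f TB\n", perNode)                 // 654.5 TB
	fmt.Printf("%.1f days\n", rebalanceDays(655, 2)) // 30.3 days
}
```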

28.

Other Pros/Cons
• Pros
  • We can scale out the MetaDB and the BLOB storage independently.
  • The backend storage engine is pluggable: we can easily add or change the storage technology/class in the future.
• Cons
  • We need an external database to manage the map.
  • BLOB load can be non-uniform; we'll rebalance periodically.

29.

Storage Node

30.

Storage Hardware
• High-density storage servers for cost efficiency.
• We need to make use of the full potential of the hardware.
(Photo: https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR90L.cfm)

31.

Storage Configuration
• HDDs are configured as independent logical volumes instead of RAID.
• Reason 1: to reduce the time to recover when an HDD fails.
(Diagram: a VolumeGroup of 3 StorageNodes, each with independent HDD1-HDD4.)

32.

Storage Configuration
• Reason 2: RAID is slow for random access.
  Configuration | Requests/sec
  Non-RAID      | 178.9
  RAID 0        | 73.4
  RAID 5        | 68.6
• Non-RAID is 2.4x faster than RAID 0.
(Throughput for a random-access workload served by nginx; 4 HDDs; file size 500 KB.)

33.

File Persistence Strategy
• Dragon's Storage Nodes use one file per BLOB.
  • A strategy to increase robustness by relying on a stable filesystem (ext4).
• But it is known that file systems cannot handle large numbers of files well.
  • It is reported that Swift's write performance degrades as the number of files increases.
• To get over this problem, Dragon uses a unique technique.
ref.1: "OpenStack Swiftによる画像ストレージの運用" (in Japanese) http://labs.gree.jp/blog/2014/12/11746/
ref.2: "画像システムの車窓から" (CyberAgent official engineering blog, in Japanese) http://ameblo.jp/principia-ca/entry-12140148643.html

34.

File Persistence Strategy
• Typical approach: write files evenly into a fixed set of directories created in advance.
  • Swift writes files this way: photo1.jpg, photo2.jpg, ... → hash function → 256 dirs (01, 02, 03, ..., fe, ff).
• As the number of files increases, the number of seeks increases and write throughput decreases.
  • The cost of updating dentries increases.
(Chart: seek count and throughput when randomly writing 3 million files into 256 directories; implemented as a simple HTTP server; measured with ab, blktrace, and seekwatcher.)

35.

Dynamic Partitioning
• Dynamic partitioning approach:
  1. Create sequentially numbered directories (partitions). API Nodes upload files into the latest directory.
  2. Once the number of files in the partition reaches a threshold (1,000 here), the Storage Node creates the next partition and informs the API Nodes about it.
• This keeps the number of files per directory constant by adding directories as needed.
(Diagram: when directory 1 reaches 1,000 files, directory 2 is created and subsequent uploads go there.)

36.

Dynamic Partitioning
• Comparison with the hash strategy: even as the file count increases, the seek count does not increase and throughput stays stable.
(Chart: writing files with the hash-based strategy (blue) vs. dynamic partitioning (green).)

37.

Microbenchmark
• Confirmed that write throughput is maintained up to 10 million files on a single HDD.
(Chart: write throughput while creating up to 10 million files; caches were synced and dropped after every 100,000 files.)

38.

Eventual Consistency
• To achieve high availability, writes to Storage Nodes use eventual consistency with a quorum.
• An upload succeeds if writing to a majority of the 3 nodes succeeds.
• An anti-entropy repair process periodically brings failed nodes back in sync.

39.

Anti-Entropy Repair
• A process that compares data between nodes, detects data that has not been replicated, and restores consistency.
(Diagram: nodes A and C hold file1-file4; node B is missing file3, so file3 is transferred to node B.)

40.

Anti-Entropy Repair
• Detects and corrects inconsistencies between Storage Nodes at partition granularity:
  1. Calculate a hash of the names of the files in each partition.
  2. Compare the hashes between the nodes in a VolumeGroup; if they do not match, there is an inconsistency.
  3. On a mismatch, compare the file lists of the partition and transfer the missing files.
• The comparison is I/O-efficient: the hashes can be cached, since updates are concentrated in the latest partition.
(Example: partitions 01 and 02 hash identically on all three nodes; the hash of partition 03 differs on node1 because file1002.data is missing there, so file1002.data is transferred to node1.)
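Steps 1-3 above can be sketched as follows, assuming (hypothetically) that a partition is summarized by a SHA-1 over its sorted file names; the slide does not show the actual repair protocol or hash function:

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sort"
	"strings"
)

// partitionHash hashes the sorted file names of a partition, so two replicas
// holding the same set of files produce the same digest.
func partitionHash(files []string) string {
	s := append([]string(nil), files...)
	sort.Strings(s)
	sum := sha1.Sum([]byte(strings.Join(s, "\n")))
	return fmt.Sprintf("%x", sum)
}

// missing returns the files present on a healthy replica but absent locally;
// these are what repair would transfer. (Illustrative, not Dragon's code.)
func missing(local, remote []string) []string {
	have := map[string]bool{}
	for _, f := range local {
		have[f] = true
	}
	var out []string
	for _, f := range remote {
		if !have[f] {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	node1 := []string{"file1001.data", "file1003.data"} // file1002 lost
	node2 := []string{"file1001.data", "file1002.data", "file1003.data"}
	if partitionHash(node1) != partitionHash(node2) { // cheap detection step
		fmt.Println(missing(node1, node2)) // [file1002.data]
	}
}
```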

41.

MetaDB

42.

Cassandra
• Apache Cassandra
  • High availability
  • Linear scalability
  • Low operation cost
• Eventual consistency
  • Cassandra does not support ACID transactions.

43.

Cassandra
• Tables
  • VolumeGroup
  • Account
  • Bucket
  • Object
  • ObjectIndex

44.

Object Table
• Table that retains object metadata: size, BLOB location, ACL, Content-Type, ...
• Distributed evenly within the cluster by the partition key, which is composed of (bucket, key):
  bucket | key        | mtime    | status | metadata...
  b1     | photo1.jpg | uuid(t2) | ACTIVE | {size, location, acl, ...}
  b1     | photo2.jpg | uuid(t1) | ACTIVE | {size, location, acl, ...}
  b3     | photo1.jpg | uuid(t3) | ACTIVE | {size, location, acl, ...}

45.

PUT Object
• Updating metadata
  • Within each partition, metadata rows are clustered in descending order by a UUIDv1 (the clustering column) based on creation time.
  • When an object is overwritten, the metadata of the latest version is inserted at the top of the partition.
  • Since records of multiple versions are kept, no inconsistency occurs even if the object is overwritten concurrently.
• Example: concurrent PUT b1/photo2.jpg at t4 and t5 both insert rows; reads converge on the newest row, so t5 wins:
  b1 | photo2.jpg | uuid(t5) | ACTIVE | {size, location, acl, ...}
  b1 | photo2.jpg | uuid(t4) | ACTIVE | {size, location, acl, ...}
  b1 | photo2.jpg | uuid(t1) | ACTIVE | {size, location, acl, ...}

46.

GET Object
• Retrieving metadata
  • Retrieve the first row of the partition with a SELECT query:
    SELECT * FROM Object WHERE bucket = 'b1' AND key = 'photo1.jpg' LIMIT 1;
  • Since the partition is sorted by creation time (descending), the first row always reflects the current state of the object.
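The LIMIT 1 read can be mimicked in Go; `Row` and `currentState` are illustrative stand-ins for the Cassandra rows and query, with an integer standing in for the time-based UUID:

```go
package main

import (
	"errors"
	"fmt"
)

// Row mirrors the slide's Object-table layout: within a (bucket, key)
// partition, rows are clustered newest-first by a time-based UUID.
type Row struct {
	MtimeUUID int64 // stand-in for the UUIDv1 clustering column
	Status    string
}

var errNotFound = errors.New("NoSuchKey")

// currentState returns the first (newest) row of a partition, which is what
// `SELECT ... LIMIT 1` yields; a DELETED head row means the object is
// logically gone.
func currentState(partition []Row) (Row, error) {
	if len(partition) == 0 {
		return Row{}, errNotFound
	}
	head := partition[0] // newest, because of descending clustering
	if head.Status == "DELETED" {
		return Row{}, errNotFound
	}
	return head, nil
}

func main() {
	photo2 := []Row{ // after a DELETE at t7 over a PUT at t1
		{MtimeUUID: 7, Status: "DELETED"},
		{MtimeUUID: 1, Status: "ACTIVE"},
	}
	_, err := currentState(photo2)
	fmt.Println(err) // NoSuchKey
}
```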

47.

DELETE Object
• Requesting deletion of an object
  • A row with DELETED status is inserted; the existing rows are not deleted immediately.
• Example: DELETE b1/photo2.jpg at t7 inserts the row uuid(t7) DELETED above the existing uuid(t1) ACTIVE row:
  b1 | photo2.jpg | uuid(t7) | DELETED | N/A
  b1 | photo2.jpg | uuid(t1) | ACTIVE  | {size, location, acl, ...}

48.

GET Object (deleted)
• Retrieving metadata (in case of deletion)
  • If the retrieved latest row has DELETED status, the object is considered logically deleted and an error is returned:
    SELECT * FROM Object WHERE bucket = 'b1' AND key = 'photo2.jpg' LIMIT 1;
    → uuid(t7) DELETED → error

49.

Object Garbage Collection
• Garbage collection (GC)
  • Periodically deletes the metadata and linked BLOBs of overwritten or deleted objects.
  • Performs a full scan of the Object table: in each partition, the second and subsequent rows are garbage, and GC deletes them.
  • To delete a BLOB, GC uploads a 0-byte tombstone file in its place.
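The per-partition GC rule reduces to: keep the head row, collect the rest. A sketch, with rows represented as plain strings for brevity (an assumption; the real rows carry full metadata):

```go
package main

import "fmt"

// gcPartition implements the rule on the slide: after sorting a (bucket, key)
// partition newest-first, only the head row is live; every later row is
// garbage whose metadata and BLOB (via a 0-byte tombstone) can be deleted.
func gcPartition(rows []string) (live string, garbage []string) {
	if len(rows) == 0 {
		return "", nil
	}
	return rows[0], rows[1:]
}

func main() {
	// b1/photo1.jpg partition, newest first: overwrites left two old versions.
	rows := []string{"uuid(t5) ACTIVE", "uuid(t3) ACTIVE", "uuid(t1) ACTIVE"}
	live, garbage := gcPartition(rows)
	fmt.Println(live)         // uuid(t5) ACTIVE
	fmt.Println(len(garbage)) // 2
}
```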

50.

Object Garbage Collection
• GC completed: only the newest row remains in each partition:
  b1 | photo1.jpg | uuid(t5) | ACTIVE  | {size, location, acl, ...}
  b1 | photo2.jpg | uuid(t7) | DELETED | N/A
• We achieved concurrency control on an eventually consistent database by using partitioning and UUID clustering.

51.

Issues and Future Plans

52.

ObjectIndex Table
• Objects in a bucket are stored in the ObjectIndex table, sorted in ascending order by key name, for the ListObjects API.
• Since the partitions would otherwise get extremely large, the objects in a bucket are split into 16 partitions: the partition key is (bucket, hash 0-15) and the key name is the clustering column.
• To answer a listing request, the API Node retrieves all 16 partitions and merges them.

53.

Issues
• ObjectIndex-related problems
  • Some API requests cause a lot of queries to Cassandra, resulting in high load and high latency.
  • Because of a Cassandra limitation, the number of objects in a bucket is restricted to 32 billion.
• We'd like to eliminate the constraint on the number of objects by introducing a mechanism that dynamically splits the index partitions.

54.

Future Plans
• Improvement of the storage engine
  • A WAL (write-ahead log) based engine? Erasure coding?
• Serverless architecture
  • Push notifications to messaging queues such as Kafka or Pulsar
• Integration with other distributed systems
  • Hadoop, Spark, Presto, etc...

55.

Wrap up
• Yahoo! JAPAN is developing a large-scale object storage named "Dragon".
• "Dragon" is a highly scalable object storage platform.
• We're going to improve it to meet our new requirements.
• Thank you!