
# [Apache BookKeeper](https://bookkeeper.apache.org/)

* A scalable, fault-tolerant, low-latency storage service optimized for real-time workloads
* Written in Java
* An Apache top-level project
* Used as the storage layer by [EMC's Pravega](http://www.pravega.io/) ([github](https://github.com/pravega/pravega)), [Apache Pulsar](http://pulsar.apache.org/) ([github](https://github.com/apache/pulsar)), and others
* Can be clustered (uses ZooKeeper)
* Can be deployed with [DC/OS](https://dcos.io/) and [Kubernetes](https://kubernetes.io/)

> This document is based on version 4.10.0, the latest release at the time of writing.

## [BookKeeper concepts and architecture](https://bookkeeper.apache.org/docs/4.10.0/getting-started/concepts/)

### Basic terms

* entry: a single unit of the log (aka record)
* ledger: a stream of log entries
* bookie: an individual server that stores ledgers of entries

### Entries

* The actual data written to a ledger, along with some metadata

|Field|Java type|Description|
|:---|:---|:---|
|Ledger number|`long`|The ID of the ledger the entry has been written to|
|Entry number|`long`|The unique ID of the entry|
|Last confirmed (LC)|`long`|The ID of the last recorded entry|
|Data|`byte[]`|The entry's data, written by the client application|
|Authentication code|`byte[]`|The message authentication code, which includes all other fields of the entry|

### Ledgers

* The basic unit of storage in BookKeeper
* A sequence of entries
* Each entry is a sequence of bytes
* Entries are written to a ledger:
  * sequentially, and at most once
  * append-only
  * immutable once written
* Proper write ordering is the responsibility of client applications

### Clients and APIs

* Create and delete ledgers
* Read entries from and write entries to ledgers (see the sketch below)
* [ledger API](https://bookkeeper.apache.org/docs/4.10.0/api/ledger-api)
* [DistributedLog API](https://bookkeeper.apache.org/docs/4.10.0/api/distributedlog-api)
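
As a rough sketch of what the ledger API looks like (the ZooKeeper address, digest password, and class name below are illustrative assumptions, not values from this guide), creating a ledger, appending entries, and reading them back could look like this:

```java
import java.util.Enumeration;

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerApiSketch {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper connect string is an assumption for this sketch
        BookKeeper bk = new BookKeeper("127.0.0.1:2181");

        // Create a ledger; DigestType.MAC plus the password let BookKeeper verify entries on read
        LedgerHandle writer = bk.createLedger(BookKeeper.DigestType.MAC, "password".getBytes());
        long ledgerId = writer.getId();

        // Entries are appended sequentially; each addEntry returns the entry ID
        for (int i = 0; i < 10; i++) {
            writer.addEntry(("entry-" + i).getBytes());
        }
        writer.close();

        // Re-open the ledger and read the entries back
        LedgerHandle reader = bk.openLedger(ledgerId, BookKeeper.DigestType.MAC, "password".getBytes());
        Enumeration<LedgerEntry> entries = reader.readEntries(0, reader.getLastAddConfirmed());
        while (entries.hasMoreElements()) {
            LedgerEntry entry = entries.nextElement();
            System.out.println(entry.getEntryId() + " -> " + new String(entry.getEntry()));
        }
        reader.close();
        bk.close();
    }
}
```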

### Bookies

* An individual BookKeeper storage server that handles ledger fragments
* For performance, a bookie stores fragments of ledgers rather than entire ledgers (like a Kafka partition?)

#### Motivation

* Originally motivated by use cases like the HDFS NameNode; now used far beyond that
* Efficient writes
* Fault tolerance through replication of messages

### Metadata storage

* Uses [ZooKeeper](https://zookeeper.apache.org)

### Data management in bookies

* Bookies manage data in a [log-structured](https://en.wikipedia.org/wiki/Log-structured_file_system) way

#### Journals

* Contains the BookKeeper transaction log
* Before an update to a ledger takes place, a transaction describing the update is written to non-volatile storage
* A new journal file is created when the bookie starts or when the previous journal file reaches the journal file size threshold

#### Entry logs

* Manages the entries received from BookKeeper clients
* Entries from different ledgers are aggregated and written sequentially
* Their offsets are kept as pointers in the ledger cache for fast lookup
* A new entry log file is created when the bookie starts or when the previous entry log file reaches the entry log size threshold
* Old entry log files are removed by the garbage collector thread once they are no longer associated with any active ledger

#### Index files

* An index file is created for each ledger; it consists of a header and several fixed-length index pages that record the offsets of the data stored in entry log files
* Since updating index files would introduce random disk I/O, index files are updated lazily by a sync thread running in the background

#### Ledger cache

* Ledger index pages are cached in a memory pool
* This allows more efficient management of disk head scheduling

#### Adding entries

* When a client writes an entry to a ledger, the entry goes through the following steps before being persisted to disk:

1. The entry is appended to an entry log
1. The index of the entry is updated in the ledger cache
1. A transaction corresponding to this entry update is appended to the journal
1. A response is sent to the BookKeeper client

> For performance reasons, the entry log buffers entries in memory and commits them in batches,
> while the ledger cache holds its index pages in memory and flushes them lazily.
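
From the client's perspective, the flow above completes when the response in step 4 arrives. The classic ledger API also exposes this asynchronously; a minimal sketch (the `lh` handle and the latch-based waiting are assumptions for illustration):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.bookkeeper.client.AsyncCallback;
import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.LedgerHandle;

public class AsyncAddSketch {
    // lh is assumed to be an open, writable LedgerHandle
    static void addAsync(LedgerHandle lh, byte[] data) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        lh.asyncAddEntry(data, new AsyncCallback.AddCallback() {
            @Override
            public void addComplete(int rc, LedgerHandle handle, long entryId, Object ctx) {
                // rc is a BKException return code; OK means the bookies acknowledged the entry
                if (rc == BKException.Code.OK) {
                    System.out.println("entry " + entryId + " acknowledged");
                } else {
                    System.err.println("add failed: " + BKException.getMessage(rc));
                }
                latch.countDown();
            }
        }, null);
        latch.await();
    }
}
```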

#### Data flush

Ledger index pages are flushed to index files in the following two cases:

* The ledger cache memory limit is reached. There is no more space available to hold newer index pages. Dirty index pages will be evicted from the ledger cache and persisted to index files.
* A background thread called the sync thread is responsible for flushing index pages from the ledger cache to index files periodically.

Besides flushing index pages, the sync thread is responsible for rolling journal files in case that journal files use too much disk space. The data flush flow in the sync thread is as follows:

* A `LastLogMark` is recorded in memory. The `LastLogMark` indicates that the entries before it have been persisted (to both index and entry log files) and contains two parts:
  1. A `txnLogId` (the file ID of a journal)
  1. A `txnLogPos` (offset in a journal)
* Dirty index pages are flushed from the ledger cache to the index file, and entry log files are flushed to ensure that all buffered entries in entry log files are persisted to disk.

  Ideally, a bookie only needs to flush index pages and entry log files that contain entries written before the `LastLogMark`. There is, however, no such information in the ledger and entry log mapping to journal files. Consequently, the thread flushes the ledger cache and entry log entirely here, and may flush entries written after the `LastLogMark`. Flushing more is not a problem, though, just redundant.
* The `LastLogMark` is persisted to disk, which means that entries added before the `LastLogMark` have had their entry data and index pages persisted to disk as well. It is now safe to remove journal files created earlier than `txnLogId`.

If the bookie has crashed before persisting `LastLogMark` to disk, it still has journal files containing entries for which index pages may not have been persisted. Consequently, when this bookie restarts, it inspects journal files to restore those entries and data isn't lost.

Using the above data flush mechanism, it is safe for the sync thread to skip data flushing when the bookie shuts down. However, in the entry logger it uses a buffered channel to write entries in batches and there might be data buffered in the buffered channel upon a shut down. The bookie needs to ensure that the entry log flushes its buffered data during shutdown. Otherwise, entry log files become corrupted with partial entries.

#### Data compaction

On bookies, entries of different ledgers are interleaved in entry log files. A bookie runs a garbage collector thread to delete un-associated entry log files to reclaim disk space. If a given entry log file contains entries from a ledger that has not been deleted, then the entry log file would never be removed and the occupied disk space never reclaimed. In order to avoid such a case, a bookie server compacts entry log files in a garbage collector thread to reclaim disk space.

There are two kinds of compaction running at different frequencies: minor compaction and major compaction. The difference between minor compaction and major compaction lies in their threshold values and compaction intervals.

* The garbage collection threshold is the size percentage of an entry log file occupied by those undeleted ledgers. The default minor compaction threshold is 0.2, while the major compaction threshold is 0.8.
* The garbage collection interval is how frequently to run the compaction. The default minor compaction interval is 1 hour, while the default major compaction interval is 1 day.

> If either the threshold or interval is set to less than or equal to zero, compaction is disabled.

The data compaction flow in the garbage collector thread is as follows:

* The thread scans entry log files to get their entry log metadata, which records a list of ledgers comprising an entry log and their corresponding percentages.
* With the normal garbage collection flow, once the bookie determines that a ledger has been deleted, the ledger will be removed from the entry log metadata and the size of the entry log reduced.
* If the remaining size of an entry log file reaches a specified threshold, the entries of active ledgers in the entry log will be copied to a new entry log file.
* Once all valid entries have been copied, the old entry log file is deleted.

### ZooKeeper metadata

BookKeeper requires a ZooKeeper installation for storing [ledger](#ledger) metadata. Whenever you construct a [`BookKeeper`](../../api/javadoc/org/apache/bookkeeper/client/BookKeeper) client object, you need to pass a list of ZooKeeper servers as a parameter to the constructor, like this:

```java
String zkConnectionString = "127.0.0.1:2181";
BookKeeper bkClient = new BookKeeper(zkConnectionString);
```

> For more info on using the BookKeeper Java client, see [this guide](../../api/ledger-api#the-java-ledger-api-client).

### Ledger manager

A *ledger manager* handles ledgers' metadata (which is stored in ZooKeeper). BookKeeper offers two types of ledger managers: the [flat ledger manager](#flat-ledger-manager) and the [hierarchical ledger manager](#hierarchical-ledger-manager). Both ledger managers extend the [`AbstractZkLedgerManager`](../../api/javadoc/org/apache/bookkeeper/meta/AbstractZkLedgerManager) abstract class.

> ##### Use the flat ledger manager in most cases
> The flat ledger manager is the default and is recommended for nearly all use cases. The hierarchical ledger manager is better suited only for managing very large numbers of BookKeeper ledgers (> 50,000).

#### Flat ledger manager

The *flat ledger manager*, implemented in the [`FlatLedgerManager`](../../api/javadoc/org/apache/bookkeeper/meta/FlatLedgerManager.html) class, stores all ledgers' metadata in child nodes of a single ZooKeeper path. The flat ledger manager creates [sequential nodes](https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming) to ensure the uniqueness of the ledger ID and prefixes all nodes with `L`. Bookie servers manage their own active ledgers in a hash map so that it's easy to find which ledgers have been deleted from ZooKeeper and then garbage collect them.

The flat ledger manager's garbage collection flow proceeds as follows:

* All existing ledgers are fetched from ZooKeeper (`zkActiveLedgers`)
* All ledgers currently active within the bookie are fetched (`bkActiveLedgers`)
* The currently active ledgers are looped through to determine which ledgers don't currently exist in ZooKeeper. Those are then garbage collected (a conceptual sketch follows below).
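
Conceptually, this is a set difference between the ledgers the bookie holds and the ledgers still registered in ZooKeeper. A hedged sketch of the idea (the sets and the `deleteLedgerData` helper are hypothetical, not BookKeeper's actual internals):

```java
import java.util.Set;

// Conceptual sketch of the flat ledger manager's GC pass.
public class LedgerGcSketch {
    static void gcLedgers(Set<Long> zkActiveLedgers, Set<Long> bkActiveLedgers) {
        for (long ledgerId : bkActiveLedgers) {
            // A ledger held by the bookie but missing from ZooKeeper has been deleted
            // by a client, so its local data can be garbage collected.
            if (!zkActiveLedgers.contains(ledgerId)) {
                deleteLedgerData(ledgerId);
            }
        }
    }

    // Hypothetical helper standing in for the bookie's local cleanup.
    static void deleteLedgerData(long ledgerId) {
        System.out.println("garbage collecting ledger " + ledgerId);
    }
}
```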

#### Hierarchical ledger manager

The *hierarchical ledger manager*, implemented in the [`HierarchicalLedgerManager`](../../api/javadoc/org/apache/bookkeeper/meta/HierarchicalLedgerManager) class, stores ledgers' metadata in two-level [znodes](https://zookeeper.apache.org/doc/current/zookeeperOver.html#Nodes+and+ephemeral+nodes). It first obtains a global unique ID from ZooKeeper using an [`EPHEMERAL_SEQUENTIAL`](https://zookeeper.apache.org/doc/current/api/org/apache/zookeeper/CreateMode.html#EPHEMERAL_SEQUENTIAL) znode. Since ZooKeeper's sequence counter has a format of `%10d` (10 digits with 0 padding, for example `<path>0000000001`), the hierarchical ledger manager splits the generated ID into 3 parts:

```shell
{level1 (2 digits)}{level2 (4 digits)}{level3 (4 digits)}
```

These three parts are used to form the actual ledger node path to store ledger metadata:

```shell
{ledgers_root_path}/{level1}/{level2}/L{level3}
```

For example, ledger 0000000001 is split into three parts, 00, 0000, and 0001, and stored in znode `/{ledgers_root_path}/00/0000/L0001`. Each znode could have as many as 10,000 ledgers, which avoids the problem of the child list being larger than the maximum ZooKeeper packet size (which is the [limitation](https://issues.apache.org/jira/browse/BOOKKEEPER-39) that initially prompted the creation of the hierarchical ledger manager).
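
As a small illustration of this 2-4-4 split (a sketch, not the actual `HierarchicalLedgerManager` code), the ledger-ID-to-znode-path mapping can be reproduced like this:

```java
// Illustrative sketch of the 2-4-4 digit split described above.
public class LedgerPathSketch {
    static String ledgerPath(String ledgersRootPath, long ledgerId) {
        String id = String.format("%010d", ledgerId);   // e.g. 1 -> "0000000001"
        String level1 = id.substring(0, 2);             // "00"
        String level2 = id.substring(2, 6);             // "0000"
        String level3 = id.substring(6, 10);            // "0001"
        return ledgersRootPath + "/" + level1 + "/" + level2 + "/L" + level3;
    }

    public static void main(String[] args) {
        // Prints /ledgers/00/0000/L0001 for ledger ID 1
        System.out.println(ledgerPath("/ledgers", 1L));
    }
}
```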

## Install

* Install ZooKeeper on 3 nodes to form a cluster
* Install BookKeeper on 4 nodes to form a cluster
* Exchange SSH keys between the nodes

### Environment Variable

```shell script
vim ~/.bashrc
```

```shell script
...
export BK_HOME="/usr/local/bookkeeper/default"
...
export PATH="${BK_HOME}/bin:${PATH}"
...
```

### Notes

* Applying the `zookeeper.conf.dynamic` file appears to require ZooKeeper 3.5.0 or later
* After starting ZooKeeper, you must run `./bin/bookkeeper shell metaformat`
* In `bk_server.conf`, `metadataServiceUri` must be of the form `zk://<HOST1>:<PORT1>;<HOST2>:<PORT2>;<HOST3>:<PORT3>/ledgers`
* Listing one host per line in the `${BK_HOME}/conf/bookies` file lets you run the bookies as a cluster
* To start the cluster, run `${BK_HOME}/bin/bookkeeper-cluster.sh start`

### Download & Untar

```shell script
mkdir /usr/local/bookkeeper
cd /usr/local/bookkeeper
wget http://apache.mirror.cdnetworks.com/bookkeeper/bookkeeper-4.10.0/bookkeeper-server-4.10.0-bin.tar.gz
tar zxf bookkeeper-server-4.10.0-bin.tar.gz
rm bookkeeper-server-4.10.0-bin.tar.gz
ln -s bookkeeper-server-4.10.0/ default
```

### Prepare directories

```shell script
mkdir -p /data/bk/{data,journal,ranges}
mkdir -p /data/zk/txlog
chown -R hskimsky:hskimsky /data/*
```

### ${BK_HOME}/conf/bk_server.conf

```shell script
vim ${BK_HOME}/conf/bk_server.conf
```

```
...
httpServerEnabled=true
...
journalDirectories=/data/bk/journal
...
ledgerDirectories=/data/bk/data
...
metadataServiceUri=zk://bk1.sky.local:2181;bk2.sky.local:2181;bk3.sky.local:2181/ledgers
...
zkServers=bk1.sky.local:2181,bk2.sky.local:2181,bk3.sky.local:2181
...
storage.range.store.dirs=/data/bk/ranges
...
```

### ${BK_HOME}/conf/zookeeper.conf

```shell script
vim ${BK_HOME}/conf/zookeeper.conf
```

```
...
dataDir=/data/zk
...
dataLogDir=/data/zk/txlog
...
server.1=bk1.sky.local:2888:3888
server.2=bk2.sky.local:2888:3888
server.3=bk3.sky.local:2888:3888
```

### ${BK_HOME}/conf/bookies

```shell script
vim ${BK_HOME}/conf/bookies
```

```
bk1.sky.local
bk2.sky.local
bk3.sky.local
bk4.sky.local
```

### Execute ZooKeeper

```shell script
${BK_HOME}/bin/bookkeeper-daemon.sh start zookeeper
```

### [Cluster metadata setup](https://bookkeeper.apache.org/docs/4.10.0/deployment/manual#cluster-metadata-setup)

```shell script
${BK_HOME}/bin/bookkeeper shell metaformat
```

### Execute Bookie

* Starts only a single bookie
* To run as a cluster, use `bookkeeper-cluster.sh` instead

```shell script
${BK_HOME}/bin/bookkeeper-daemon.sh start bookie
```

### Execute BookKeeper cluster

```shell script
${BK_HOME}/bin/bookkeeper-cluster.sh start
```

### Verify the installation

```shell script
for i in {1..4} ; do curl http://bk${i}.sky.local:8080/heartbeat ; done
```

```
OK
OK
OK
OK
```

```shell script
curl "http://bk1.sky.local:8080/api/v1/bookie/list_bookies?print_hostnames=true"
```

```
{
"192.168.181.122:3181" : "192.168.181.122",
"192.168.181.123:3181" : "192.168.181.123",
"192.168.181.124:3181" : "192.168.181.124",
"192.168.181.121:3181" : "192.168.181.121"
}
```
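
Beyond the HTTP checks above, a quick end-to-end smoke test with the Java ledger client is also possible. The sketch below assumes the ZooKeeper ensemble configured earlier and default client settings; if the write succeeds, the bookies and the ZooKeeper-backed metadata store are working together.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class ClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        // ZooKeeper connect string matching the ensemble in zookeeper.conf
        BookKeeper bk = new BookKeeper("bk1.sky.local:2181,bk2.sky.local:2181,bk3.sky.local:2181");

        // 3-bookie ensemble, write quorum 3, ack quorum 2: entries are replicated
        // across three of the four bookies started above.
        LedgerHandle lh = bk.createLedger(3, 3, 2, BookKeeper.DigestType.MAC, "test".getBytes());
        long entryId = lh.addEntry("hello bookkeeper".getBytes());
        System.out.println("wrote entry " + entryId + " to ledger " + lh.getId());

        lh.close();
        bk.deleteLedger(lh.getId());
        bk.close();
    }
}
```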

### Reinstall

#### On every node in the cluster

```shell script
# ${BK_HOME}/bin/bookkeeper-daemon.sh stop zookeeper
# rm -rf /data/zk/{txlog,version-2}
rm -rf /data/bk/*/*
rm ${BK_HOME}/bin/bookkeeper-bookie.pid
# mkdir -p /data/zk/txlog
# ${BK_HOME}/bin/bookkeeper-daemon.sh start zookeeper
```

#### On one node only

```shell script
${BK_HOME}/bin/bookkeeper shell metaformat
```

#### Execute BookKeeper cluster

```shell script
${BK_HOME}/bin/bookkeeper-cluster.sh start
```
