| WAPBL(9) | Kernel Developer's Manual | WAPBL(9) | 
WAPBL, wapbl_start, wapbl_stop, wapbl_begin, wapbl_end, wapbl_flush,
  wapbl_discard, wapbl_add_buf, wapbl_remove_buf, wapbl_resize_buf,
  wapbl_register_inode, wapbl_unregister_inode,
  wapbl_register_deallocation, wapbl_jlock_assert,
  wapbl_junlock_assert - write-ahead physical block logging for file systems
#include <sys/wapbl.h>
typedef void (*wapbl_flush_fn_t)(struct mount *, daddr_t *, int *, int);
int
wapbl_start(struct wapbl **wlp, struct mount *mp, struct vnode *devvp,
    daddr_t off, size_t count, size_t blksize, struct wapbl_replay *wr,
    wapbl_flush_fn_t flushfn, wapbl_flush_fn_t flushabortfn);
int
wapbl_stop(struct wapbl *wl, int force);
int
wapbl_begin(struct wapbl *wl, const char *file, int line);
void
wapbl_end(struct wapbl *wl);
int
wapbl_flush(struct wapbl *wl, int wait);
void
wapbl_discard(struct wapbl *wl);
void
wapbl_add_buf(struct wapbl *wl, struct buf *bp);
void
wapbl_remove_buf(struct wapbl *wl, struct buf *bp);
void
wapbl_resize_buf(struct wapbl *wl, struct buf *bp, long oldsz, long oldcnt);
void
wapbl_register_inode(struct wapbl *wl, ino_t ino, mode_t mode);
void
wapbl_unregister_inode(struct wapbl *wl, ino_t ino, mode_t mode);
void
wapbl_register_deallocation(struct wapbl *wl, daddr_t blk, int len);
void
wapbl_jlock_assert(struct wapbl *wl);
void
wapbl_junlock_assert(struct wapbl *wl);
WAPBL, or write-ahead physical block
  logging, is an abstraction for file systems to write physical blocks in
  the buffercache(9) to a
  bounded-size log before they are written to their real destinations on disk.
When a file system using WAPBL issues
    writes (as in bwrite(9) or
    bdwrite(9)), they are grouped
    in batches called transactions in memory, which are
    serialized to be consistent with program order before
    WAPBL submits them to disk atomically.
Thus, within a transaction, after one write, another write need not wait for disk I/O, and if the system is interrupted, e.g. by a crash or by power failure, either both writes will appear on disk, or neither will.
When a transaction is full, it is written to a circular buffer on
    disk called the log. When the transaction has been written
    to disk, every write in the transaction is submitted to disk asynchronously.
    Finally, the file system may issue new writes via
    WAPBL once enough writes submitted to disk have
    completed.
After an interruption, such as a crash or power failure, some writes issued by the file system may not have completed. However, the log is written consistently with program order and before file system writes are submitted to disk. Hence a consistent program-order view of the file system can be attained by using wapbl_replay(9) to resubmit the writes that were successfully stored in the log. This may not be the same as the state just before the interruption: writes in transactions that did not reach the disk will be excluded.
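The ordering described above can be illustrated with a small userland model. This is only a sketch under stated assumptions, not kernel code and not the WAPBL implementation: the names toy_write, toy_log, toy_commit and toy_replay are hypothetical, the "disk" is an array in memory, and the log holds a single transaction rather than a circular buffer.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBLOCKS 8	/* blocks on the toy "disk" (hypothetical) */
#define TOY_LOGMAX  4	/* most writes one transaction can hold */

struct toy_write {
	size_t blkno;	/* destination block on the toy disk */
	int data;	/* new content of that block */
};

struct toy_log {
	struct toy_write w[TOY_LOGMAX];
	size_t n;	/* number of writes stored in the log */
	int committed;	/* nonzero once the whole transaction is logged */
};

/* Store a whole transaction in the log, then mark it committed. */
int
toy_commit(struct toy_log *log, const struct toy_write *tx, size_t n)
{
	assert(log != NULL && tx != NULL);
	if (n > TOY_LOGMAX)
		return -1;
	for (size_t i = 0; i < n; i++)
		if (tx[i].blkno >= TOY_NBLOCKS)
			return -1;
	log->committed = 0;
	for (size_t i = 0; i < n; i++)
		log->w[i] = tx[i];
	log->n = n;
	log->committed = 1;	/* commit point: all writes or none */
	return 0;
}

/* Apply the logged writes to the disk in program order. */
void
toy_replay(const struct toy_log *log, int disk[TOY_NBLOCKS])
{
	assert(log != NULL && disk != NULL);
	if (!log->committed)
		return;		/* uncommitted: none of it appears */
	for (size_t i = 0; i < log->n; i++)
		disk[log->w[i].blkno] = log->w[i].data;
}
```

In this model, calling toy_replay() after a simulated crash yields either every write of a committed transaction or none of an uncommitted one, which is the property the text attributes to the log.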
For a file system to use WAPBL, its
    VFS_MOUNT(9) method should
    first replay any journal on disk using
    wapbl_replay(9), and
    then, if the mount is read/write, initialize WAPBL
    for the mount by calling wapbl_start(). The
    VFS_UNMOUNT(9) method
    should call wapbl_stop().
Before issuing any
    buffercache(9) writes,
    the file system must acquire a shared lock on the current
    WAPBL transaction with
    wapbl_begin(), which may sleep until there is room
    in the transaction for new writes. After issuing the writes, the file system
    must release its shared lock on the transaction with
    wapbl_end(). Either all writes issued between
    wapbl_begin() and
    wapbl_end() will complete, or none of them will.
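The calling pattern can be sketched as follows. The block is a userland illustration, not kernel code: struct wapbl is opaque in the kernel, so the fields shown here are hypothetical, wapbl_begin() and wapbl_end() are stubs that only reuse the documented signatures, and toyfs_update() is an invented file system operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's opaque struct wapbl. */
struct wapbl {
	int shared_holders;	/* threads between begin and end */
	int nwrites;		/* writes issued in this transaction */
};

/* Stub with the documented signature; the real routine may sleep. */
int
wapbl_begin(struct wapbl *wl, const char *file, int line)
{
	(void)file;
	(void)line;
	wl->shared_holders++;
	return 0;
}

/* Stub with the documented signature. */
void
wapbl_end(struct wapbl *wl)
{
	assert(wl->shared_holders > 0);
	wl->shared_holders--;
}

/* Hypothetical operation: two metadata writes in one transaction. */
int
toyfs_update(struct wapbl *wl)
{
	int error;

	error = wapbl_begin(wl, __FILE__, __LINE__);
	if (error)
		return error;
	wl->nwrites += 2;	/* stands in for two bwrite(9) calls */
	wapbl_end(wl);
	return 0;
}
```

Every path that succeeds in wapbl_begin() must reach wapbl_end(), and because the lock is not recursive, toyfs_update() must not be called from code that is already between the two calls.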
File systems may also witness an exclusive lock
    on the current transaction: WAPBL holds one while it is flushing
    the transaction to disk, or aborting a flush, and invoking the file
    system's callback. File systems can assert that the transaction is
    locked with wapbl_jlock_assert(), or that it is not
    exclusively locked with
    wapbl_junlock_assert().
If a file system requires multiple transactions to initialize an
    inode, and needs to destroy partially initialized inodes during replay, it
    can register them by ino_t inode number before
    initialization with wapbl_register_inode() and
    unregister them with wapbl_unregister_inode() once
    initialization is complete. WAPBL does not
    concern itself with whether the objects identified by ino_t
    values are ‘inodes’ or ‘quaggas’ or anything
    else: file systems may use this to list any objects keyed by
    ino_t value in the log.
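The bookkeeping this implies can be modelled in a few lines. This is a hypothetical userland sketch, not the kernel data structure: toy_ino_t stands in for ino_t, and toy_registry, toy_register() and toy_unregister() are invented names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXINO 16	/* capacity of the toy registry (hypothetical) */

typedef unsigned long toy_ino_t;	/* stand-in for ino_t */

struct toy_registry {
	toy_ino_t ino[TOY_MAXINO];
	size_t n;	/* objects whose initialization is in progress */
};

/* Record ino before its multi-transaction initialization starts. */
int
toy_register(struct toy_registry *r, toy_ino_t ino)
{
	assert(r != NULL);
	if (r->n == TOY_MAXINO)
		return -1;
	r->ino[r->n++] = ino;
	return 0;
}

/* Forget ino once its initialization is complete. */
void
toy_unregister(struct toy_registry *r, toy_ino_t ino)
{
	assert(r != NULL);
	for (size_t i = 0; i < r->n; i++) {
		if (r->ino[i] == ino) {
			r->ino[i] = r->ino[--r->n];	/* fill the hole */
			return;
		}
	}
}
```

Whatever is still in the registry when the system is interrupted corresponds to the partially initialized objects that the file system would destroy during replay.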
When a file system frees resources on disk and issues writes to
    reflect the fact, it cannot then reuse the resources until the writes have
    reached the disk. However, as far as the
    buffercache(9) is
    concerned, as soon as the file system issues the writes, they will appear to
    have been written. So the file system must not attempt to reuse the resource
    until the current WAPBL transaction has been flushed
    to disk.
The file system can defer freeing a resource by calling
    wapbl_register_deallocation() to record the disk
    address of the resource and length in bytes of the resource. Then, when
    WAPBL next flushes the transaction to disk, it will
    pass an array of the disk addresses and lengths in bytes to a
    file-system-supplied callback. (Again, WAPBL does
    not care whether the ‘disk address’ or ‘length in
    bytes’ is actually that; it will pass along
    daddr_t and int values.)
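Deferred deallocation can be sketched as a list that is handed to a callback at flush time. Everything in this block is a hypothetical userland model: toy_daddr_t stands in for daddr_t, toy_flush_fn_t mirrors wapbl_flush_fn_t without the struct mount * argument, and unlike wapbl_register_deallocation() the toy registration function returns an error when its fixed-size table is full.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXDEALLOC 8	/* capacity of the toy table (hypothetical) */

typedef long long toy_daddr_t;	/* stand-in for daddr_t */
typedef void (*toy_flush_fn_t)(toy_daddr_t *, int *, int);

struct toy_txn {
	toy_daddr_t blks[TOY_MAXDEALLOC];	/* deferred addresses */
	int lens[TOY_MAXDEALLOC];		/* matching lengths */
	int cnt;
};

/* Remember a resource to free; it is not reusable yet. */
int
toy_register_deallocation(struct toy_txn *t, toy_daddr_t blk, int len)
{
	assert(t != NULL);
	if (t->cnt == TOY_MAXDEALLOC)
		return -1;
	t->blks[t->cnt] = blk;
	t->lens[t->cnt] = len;
	t->cnt++;
	return 0;
}

/* At flush time, pass the whole list to the file system's callback. */
void
toy_flush(struct toy_txn *t, toy_flush_fn_t flushfn)
{
	assert(t != NULL && flushfn != NULL);
	flushfn(t->blks, t->lens, t->cnt);
	t->cnt = 0;
}

static long long toy_freed_bytes;	/* total released by the callback */

/* Example callback: count the bytes that become reusable. */
void
toy_free_blocks(toy_daddr_t *blks, int *lens, int cnt)
{
	(void)blks;
	for (int i = 0; i < cnt; i++)
		toy_freed_bytes += lens[i];
}
```

Nothing is counted as freed until toy_flush() runs, which mirrors the rule that a resource must not be reused before the transaction recording its release has been flushed.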
wapbl_start(wlp, mp, devvp, off, count, blksize, wr, flushfn, flushabortfn)
    Start using WAPBL for the file system mounted at
      mp, storing a log of count
      disk sectors at disk address off on the block device
      devvp, writing blocks in units of
      blksize bytes. On success, stores an opaque
      struct wapbl * cookie in
      *wlp for use with the other
      WAPBL routines and returns zero. On failure,
      returns an error number.
    If the file system had replayed the log with
        wapbl_replay(9),
        then wr must be the struct
        wapbl_replay * cookie used to replay it, and
        wapbl_start() will register any inodes that were
        in the log as if with wapbl_register_inode();
        otherwise wr must be
      NULL.
    flushfn is a callback that
        WAPBL will invoke as
        flushfn(mp,
        deallocblks, dealloclens,
        dealloccnt) just before it flushes a transaction
        to disk, with an exclusive lock held on the transaction, where
        mp is the mount point passed to
        wapbl_start(), deallocblks
        is an array of dealloccnt disk addresses, and
        dealloclens is an array of
        dealloccnt lengths, corresponding to the addresses
        and lengths the file system passed to
        wapbl_register_deallocation(). If flushing the
        transaction to disk fails, WAPBL will call
        flushabortfn with the same arguments to undo any
        effects that flushfn had.
wapbl_stop(wl, force)
    Flush the current transaction to disk and stop using WAPBL. If
      flushing the transaction fails and force is zero, return an error
      number. If flushing the transaction fails and force is nonzero,
      discard the transaction, permanently losing any writes in it. If
      flushing the transaction is successful or if force is nonzero,
      free memory associated with wl and return zero.
wapbl_begin(wl, file, line)
    Acquire a shared lock on the current transaction, waiting if
      necessary until there is room in it for new writes.
    The lock is not exclusive: other threads may acquire shared locks on
      the transaction too. The lock is not recursive: a thread may not
      acquire it again without calling wapbl_end() first.
    May sleep.
    file and line are the file name and line number of the caller, for
      debugging purposes.
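wapbl_end(wl)
    Release the shared lock on the current transaction that was acquired
      with wapbl_begin().
wapbl_flush(wl, wait)
    Flush the current transaction to disk. If wait is nonzero, wait for
      the writes in the transaction to complete.
    The current transaction must not be locked.
wapbl_discard(wl)
    Discard the current transaction, permanently losing any writes in it.
    The current transaction must not be locked.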
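wapbl_add_buf(wl, bp)
    Add the buffer bp to the current transaction.
    This is meant to be called from within buffercache(9), not by file
      systems directly.
wapbl_remove_buf(wl, bp)
    Remove the buffer bp from the current transaction.
    This is meant to be called from within buffercache(9), not by file
      systems directly.
wapbl_resize_buf(wl, bp, oldsz, oldcnt)
    Record that the buffer bp in the current transaction has changed
      size, where oldsz and oldcnt are its previous size and byte count.
    This is meant to be called from within buffercache(9), not by file
      systems directly.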
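wapbl_register_inode(wl, ino, mode)
    Register the inode ino, with mode mode, as being initialized, so
      that it can be destroyed during replay if its initialization did
      not complete.
wapbl_unregister_inode(wl, ino, mode)
    Unregister the inode ino, with mode mode, once its initialization is
      complete.
wapbl_register_deallocation(wl, blk, len)
    Register len bytes at disk address blk for deallocation, so that
      they are passed to the flushfn callback that was given to
      wapbl_start().
wapbl_jlock_assert(wl)
    Assert that the current transaction is locked.
    Note that it might not be locked by the current thread: this
      assertion passes if any thread has it locked.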
wapbl_junlock_assert(wl)
    Assert that the current transaction is not exclusively locked.
    Users of WAPBL observe exclusive locks
        only in the flushfn and
        flushabortfn callbacks to
        wapbl_start(). Outside of such contexts, the
        transaction is never exclusively locked, even between
        wapbl_begin() and
        wapbl_end().
    There is no way to assert that the current transaction is not
        locked at all, i.e., that the caller may acquire a shared lock
        on the transaction with wapbl_begin() without
        danger of deadlock.
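The relationship between the flushfn and flushabortfn callbacks described under wapbl_start() can be sketched as a pair of functions in which the second undoes the first. This is a hypothetical userland model: toy_daddr_t stands in for daddr_t, the allocation map toy_inuse is invented, and each registered address is assumed to name exactly one block, so the lengths are ignored.

```c
#include <assert.h>
#include <stddef.h>

struct mount;			/* opaque here; the callbacks ignore it */
typedef long long toy_daddr_t;	/* stand-in for daddr_t */

#define TOY_NBLOCKS 16
static unsigned char toy_inuse[TOY_NBLOCKS];	/* 1 = allocated */

/* flushfn: release the deferred blocks just before the flush. */
void
toy_flushfn(struct mount *mp, toy_daddr_t *blks, int *lens, int cnt)
{
	(void)mp;
	(void)lens;
	for (int i = 0; i < cnt; i++) {
		assert(blks[i] >= 0 && blks[i] < TOY_NBLOCKS);
		toy_inuse[blks[i]] = 0;
	}
}

/* flushabortfn: the flush failed, so mark the same blocks allocated. */
void
toy_flushabortfn(struct mount *mp, toy_daddr_t *blks, int *lens, int cnt)
{
	(void)mp;
	(void)lens;
	for (int i = 0; i < cnt; i++) {
		assert(blks[i] >= 0 && blks[i] < TOY_NBLOCKS);
		toy_inuse[blks[i]] = 1;
	}
}
```

Because both callbacks receive the same arguments, the abort path needs no extra state to restore what the flush path changed.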
The WAPBL subsystem is implemented in
  sys/kern/vfs_wapbl.c, with hooks in
  sys/kern/vfs_bio.c.
WAPBL works only for file system metadata managed via
  the buffercache(9), and
  provides no way to log writes via the page cache, as in
  VOP_GETPAGES(9),
  VOP_PUTPAGES(9), and
  ubc_uiomove(9), which is
  normally used for file data.
Not only is WAPBL unable to log writes via
    the page cache, it is also unable to defer
    buffercache(9) writes
    until cached pages have been written. This manifests as the well-known
    garbage-data-appended-after-crash bug in FFS: when appending to a file, the
    pages containing new data may not reach the disk before the inode update
    reporting its new size. After a crash, the inode update will be on disk, but
    the new data will not be; instead, whatever garbage data was in the free
    space will appear to have been appended to the file.
    WAPBL exacerbates the problem by increasing the
    throughput of metadata writes, because it can issue many metadata writes
    asynchronously that FFS without WAPBL would need to
    issue synchronously in order for
    fsck(8) to work.
The criteria for when the transaction must be flushed to disk
    before wapbl_begin() returns are heuristic, i.e.
    wrong. There is no way for a file system to communicate to
    wapbl_begin() how many buffers, inodes, and
    deallocations it will issue via WAPBL in the
    transaction.
WAPBL mainly supports write-ahead, and has
    only limited support for rolling back operations, in the form of
    wapbl_register_inode() and
    wapbl_unregister_inode(). Consequently, for example,
    a large write appending to a file, which requires multiple disk block
    allocations and an inode update, must occur in a single transaction:
    there is no way to roll back the disk block allocations if the write fails
    in the middle, e.g. because of a fault in the middle of the user buffer.
wapbl_jlock_assert() does not guarantee
    that the current thread has the current transaction locked.
    wapbl_junlock_assert() does not guarantee that the
    current thread does not have the current transaction locked at all.
There is only one WAPBL transaction for
    each file system at any given time, and only one
    WAPBL log on disk. Consequently, all writes are
    serialized. Extending WAPBL to support multiple logs
    per file system, partitioned according to an appropriate scheme, is left as
    an exercise for the reader.
There is no reason for WAPBL to require
    its own hooks in
    buffercache(9).
The on-disk format used by WAPBL is
    undocumented.
| March 26, 2015 | NetBSD 10.0 |