Sunday, December 13, 2009

Update on HAST Project

Pawel Jakub Dawidek has completed the first milestone of the Highly Available Storage (HAST) project. He writes:

I want to report that the first milestone of the HAST project is complete.

Summary of the work that has been done:

  • Implementation of the hastd daemon.
  • Implementation of the hastctl utility to manage hastd daemons.
  • The GEOM_GATE class was extended so that the caller can specify the name of the GEOM provider. Previously only /dev/ggateX names were supported. HAST will use /dev/hast/<resource_name> names.
  • Implementation of the communication protocol. There is an abstraction layer on top, and below it three protocols are currently implemented:
      1. proto_tcp4 - Used for communication between the primary and secondary nodes.
      2. proto_uds - (UDS - UNIX Domain Socket) Used for communication between hastctl and hastd.
      3. proto_socketpair - Used for communication between the main hastd daemon and the worker processes forked from it.
  • Implementation of the nv (name-value) API, which makes it easy to create packets containing name-value pairs. It is used for all communication over the protocols above. It is also responsible for handling byte order correctly.
  • Implementation of the ebuf (extendable buffer) API, which provides a way to extend a given buffer by adding data at the back, but also at the front, with little or no reallocating and copying of the data.
  • Implementation of the logging API (pjdlog). The API decides whether messages should be logged to standard output/error (before going into the background) or to syslog (once we daemonize). It also provides some shortcuts for logging a message and exiting, etc. It supports the notion of a debug level and can skip messages intended for a higher debug level than requested.
  • Implementation of the configuration file parser in lex/yacc. The configuration file is designed so that it can be kept identical on both nodes.
  • Checksumming and compression of the data are not among the project's goals, but the stubs are there, so they can be added easily.
  • A lot of care was taken to be able to handle more nodes in the future. This is not implemented and is not a project goal, but I wanted to make it ready for future improvements.
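
To illustrate the ebuf idea from the list above, here is a minimal Python sketch of a buffer that keeps spare room at both ends, so that a header can be prepended after the body is written without moving the body. It is not the HAST C API: the class name Ebuf, its methods, the default room sizes, and the 4-byte length header are all assumptions made for the example.

```python
import struct


class Ebuf:
    """Growable byte buffer with spare room at both ends (hypothetical sketch)."""

    def __init__(self, headroom=64, tailroom=192):
        self._buf = bytearray(headroom + tailroom)
        self._start = headroom  # index of the first valid byte
        self._end = headroom    # index one past the last valid byte

    def _reserve(self, head, tail):
        """Ensure at least `head` free bytes in front and `tail` behind."""
        free_head = self._start
        free_tail = len(self._buf) - self._end
        if head <= free_head and tail <= free_tail:
            return  # common case: no reallocation and no copy
        # Rare case: reallocate with doubled room on both sides, copy once.
        data = self.data()
        new_head = max(free_head, head) * 2
        new_tail = max(free_tail, tail) * 2
        self._buf = bytearray(new_head + len(data) + new_tail)
        self._buf[new_head:new_head + len(data)] = data
        self._start = new_head
        self._end = new_head + len(data)

    def add_tail(self, payload):
        """Append bytes at the back."""
        self._reserve(0, len(payload))
        self._buf[self._end:self._end + len(payload)] = payload
        self._end += len(payload)

    def add_head(self, payload):
        """Prepend bytes at the front, using the headroom."""
        self._reserve(len(payload), 0)
        self._start -= len(payload)
        self._buf[self._start:self._start + len(payload)] = payload

    def data(self):
        return bytes(self._buf[self._start:self._end])

    def __len__(self):
        return self._end - self._start


def frame(body):
    """Write the body first, then prepend its length in network byte order."""
    buf = Ebuf()
    buf.add_tail(body)
    buf.add_head(struct.pack("!I", len(body)))  # hypothetical header layout
    return buf.data()
```

Packing the length with "!I" also shows the kind of byte-order handling the nv API is described as taking care of: the header is big-endian regardless of the host's native order.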

HAST can be used by starting hastd daemons on both nodes:

# hastd -h
hastd: [-dh] [-c config]

Then on the secondary node we mark all resources as secondary:

# hastctl -h
hastctl: [-d] [-c config]
hastctl: [-d] [-c config] status [all | name ...]

# hastctl secondary all

On the primary node we mark all resources as primary:

# hastctl primary all

The hastd daemon running on the primary node connects to the secondary node and forks a child to handle communication. There is a socketpair between the parent and the child so that they can communicate. The primary node creates two connections: one for incoming data and one for outgoing data. There are seven threads in total for each working resource:

1. ggate_recv

Thread receives ggate I/O requests from the kernel and passes them to the appropriate threads:
WRITE - always goes to both local_send and remote_send threads
READ (when the block is up-to-date on local component) - only local_send thread
READ (when the block isn't up-to-date on local component) - only remote_send thread
DELETE - always goes to both local_send and remote_send threads
FLUSH - always goes to both local_send and remote_send threads
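
The routing rules above can be written down as a small function. This is only a Python sketch of the decision table, not the hastd code: the Op enum, the route function, and the local_up_to_date flag are names invented for the example.

```python
from enum import Enum


class Op(Enum):
    READ = "READ"
    WRITE = "WRITE"
    DELETE = "DELETE"
    FLUSH = "FLUSH"


def route(op, local_up_to_date=True):
    """Return the names of the threads that should receive a request."""
    if op is Op.READ:
        # A read needs only one copy of the data: the local one when it
        # is up-to-date, the remote one otherwise.
        return ["local_send"] if local_up_to_date else ["remote_send"]
    # WRITE, DELETE and FLUSH must reach both components.
    return ["local_send", "remote_send"]
```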

2. local_send

Thread reads from or writes to the local component.
If a local read fails, it redirects the request to the remote_send thread.
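
A minimal Python sketch of that fallback, assuming requests are dictionaries passed between threads on queues. The function name, the request fields, and the done_queue (standing in for whatever hands finished requests on towards the kernel) are assumptions for illustration only.

```python
import queue


def local_send_step(request, local_read, remote_queue, done_queue):
    """Handle one READ: serve it locally, or hand it to remote_send on error."""
    try:
        request["data"] = local_read(request["offset"], request["length"])
    except OSError as err:
        # Local component failed: remember why and let the remote side
        # satisfy the read instead of failing the request.
        request["local_error"] = err.errno
        remote_queue.put(request)
        return
    done_queue.put(request)
```

With a local_read that raises OSError, the request ends up on remote_queue with local_error set; otherwise it lands on done_queue with its data filled in.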

3. remote_send

Thread sends request to secondary node.

4. remote_recv

Thread receives answer from secondary node and passes it to ggate_send thread.

5. ggate_send

Thread sends answer to the kernel.

6. ctrl

Thread handles control requests from the parent.

7. guard

Thread guards remote connections and reconnects when needed, handles signals, etc.
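
The parent/child link behind the ctrl thread can be sketched in a few lines of Python: create a socketpair, fork, and let the child answer a control request from the parent. This is a simplified illustration for Unix-like systems, not the hastd implementation. The ask_worker function and the "ok:" reply format are invented for the example, and a single recv call is enough only because the messages are tiny.

```python
import os
import socket


def ask_worker(request):
    """Fork a worker, send it one control request, and return its reply."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: plays the worker's ctrl thread for a single request.
        parent_sock.close()
        try:
            msg = child_sock.recv(1024)
            child_sock.sendall(b"ok:" + msg)
        finally:
            child_sock.close()
            os._exit(0)  # leave without running the parent's cleanup code
    # Parent: keep one end, talk to the worker, then reap it.
    child_sock.close()
    try:
        parent_sock.sendall(request)
        reply = parent_sock.recv(1024)
    finally:
        parent_sock.close()
        os.waitpid(pid, 0)
    return reply
```

Each side closes the end it does not use, so that a dead peer shows up as end-of-file on the remaining socket.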

On the secondary node, once both connections are successfully established, hastd forks a worker process, which operates using four threads:

1. recv

Thread receives requests from the primary node.

2. disk

Thread reads from or writes to local component and also handles DELETE and FLUSH requests.

3. send

Thread sends answers back to the primary node.

4. ctrl

Thread handles control requests from the parent.
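
The recv, disk, and send threads form a simple pipeline. The Python sketch below models it with one thread per stage connected by queues; the ctrl thread is left out. The request tuples, the dictionary standing in for the local component, and the run_secondary function are assumptions made for the example, not HAST data structures.

```python
import queue
import threading

STOP = object()  # sentinel that tells the next stage to finish


def run_secondary(requests, disk):
    """Push requests through recv -> disk -> send and return the replies."""
    to_disk, to_send, replies = queue.Queue(), queue.Queue(), []

    def recv_thread():
        for req in requests:  # stands in for reading from the primary
            to_disk.put(req)
        to_disk.put(STOP)

    def disk_thread():
        while (req := to_disk.get()) is not STOP:
            op, offset, payload = req
            result = None
            if op == "WRITE":
                disk[offset] = payload
            elif op == "READ":
                result = disk.get(offset)
            elif op == "DELETE":
                disk.pop(offset, None)
            # FLUSH has nothing to do for an in-memory dictionary.
            to_send.put((op, offset, result))
        to_send.put(STOP)

    def send_thread():
        while (rep := to_send.get()) is not STOP:
            replies.append(rep)  # stands in for writing to the primary

    threads = [threading.Thread(target=fn)
               for fn in (recv_thread, disk_thread, send_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return replies
```

Because each stage is a single thread reading from a FIFO queue, replies come out in the same order the requests went in.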

At this point the primary and secondary nodes can communicate and requests are properly replicated. I/O errors on local reads are handled by redirecting the read request to the remote node. Replicated storage can be accessed through the /dev/hast/<resource_name> GEOM provider.

I'm confident that the first milestone is complete. If you have any questions, I'll be happy to answer them. If you have any suggestions or comments, I'll also be happy to hear them.
