viff

changeset 717:50cf746b5f18

Added documentation on program counters.
author Martin Geisler <mg@daimi.au.dk>
date Wed, 23 Apr 2008 23:12:33 +0200
parents 760b07b6e6ea
children 2ba503807eda
files doc/implementation.txt doc/program-counters.txt
diffstat 2 files changed, 201 insertions(+), 1 deletions(-) [+]
line diff
     1.1 --- a/doc/implementation.txt	Wed Apr 23 23:12:17 2008 +0200
     1.2 +++ b/doc/implementation.txt	Wed Apr 23 23:12:33 2008 +0200
     1.3 @@ -12,4 +12,4 @@
     1.4     runtime
     1.5     prss
     1.6     config
     1.7 -
     1.8 +   program-counters
     2.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     2.2 +++ b/doc/program-counters.txt	Wed Apr 23 23:12:33 2008 +0200
     2.3 @@ -0,0 +1,200 @@
     2.4 +
     2.5 +Program Counters
     2.6 +================
     2.7 +
     2.8 +When two players execute a large computation they need to agree with
     2.9 +each other on a labelling of the individual operations so that they
    2.10 +know where each received result belongs. In VIFF we call these labels
    2.11 +*program counters*. We will try to explain the design of these
    2.12 +counters in this document and list some ideas for alternative
    2.13 +implementations.
    2.14 +
    2.15 +The basic setup in VIFF is a set of players who communicate over
    2.16 +reliable point-to-point links, e.g., TCP or SSL connections. It is
    2.17 +important to remember that these links guarantee that all transmitted
    2.18 +messages arrive unchanged at the destination and that they arrive in
    2.19 +the order sent.
    2.20 +
    2.21 +
    2.22 +The naive solution
    2.23 +------------------
    2.24 +
    2.25 +The very first version of VIFF network data was numbered in the most
    2.26 +naive way possible: using a single counter for each player. This
    2.27 +worked fine most of the time, but once in a while a test would fail to
    2.28 +give the correct result. It was only when one ran thousands of
    2.29 +multiplications that the bug appeared, but its cause was quite simple.
    2.30 +Consider a program like this where we assume that the shares *a*, *b*,
    2.31 +*c*, and *d* have already been correctly defined::
    2.32 +
    2.33 +  x = mul(a, b)
    2.34 +  y = mul(c, d)
    2.35 +
    2.36 +Back then, the :func:`mul` function was implemented like this::
    2.37 +
    2.38 +  def mul(share_a, share_b):
    2.39 +      inc_pc()
    2.40 +      result = gather_shares([share_a, share_b])
    2.41 +      result.addCallback(lambda (a, b): a * b)
    2.42 +      result.addCallback(shamir_share)
    2.43 +      result.addCallback(recombine, threshold=2*threshold)
    2.44 +      return result
    2.45 +
    2.46 +where :func:`inc_pc` took care of incrementing the global program
    2.47 +counter. This simple implementation worked 99.99% of the time with
    2.48 +three players connected on a LAN, but once in a while it would fail to
    2.49 +calculate the correct results.
    2.50 +
    2.51 +In our example program, :func:`shamir_share` is called twice: once
    2.52 +when *a* and *b* are ready, and once when *c* and *d* are ready. Most
    2.53 +of the time *a* and *b* are ready first on all players, and so they
    2.54 +all agree on the program counter value for this call to
    2.55 +:func:`shamir_share`. But when we have bad luck, they one player sees
    2.56 +*c* and *d* arrive first and so the two calls to :func:`shamir_share`
    2.57 +are switched for that player.
    2.58 +
    2.59 +The problem is the asynchronous nature of Twisted: all players agree
    2.60 +on the execution tree, but depending on the exact timing they might
    2.61 +reach the nodes in the tree in a different order. The tree looks like
    2.62 +this in our little example:
    2.63 +
    2.64 +.. code-block:: none
    2.65 +
    2.66 +     x       y
    2.67 +     |       |
    2.68 +    mul     mul
    2.69 +    / \     / \
    2.70 +   a   b   c   d
    2.71 +
    2.72 +and the two :func:`mul` can be called in either order since they do
    2.73 +not depend on each other.
    2.74 +
    2.75 +
    2.76 +The working solution
    2.77 +--------------------
    2.78 +
    2.79 +The solution used now in VIFF has two ingredients. First, callbacks
    2.80 +that depend on the program counter (like `func`:shamir_share in our
    2.81 +example above) are not added with :meth:`addCallback` but instead with
    2.82 +the special :meth:`viff.runtime.BasicRuntime.schedule_callback`
    2.83 +method. This method saves the program counter in effect at the time of
    2.84 +the its call, and ensures that the saved program counter is
    2.85 +temporarily made active when the callback is called.
    2.86 +
    2.87 +Secondly, the program counter is a *list* of counters. This is
    2.88 +necessary to ensure that we can allocate new fresh counters at any
    2.89 +point in the execution tree. The execution tree is never explicitly
    2.90 +constructed in VIFF, so a simple static numbering is not possible.
    2.91 +
    2.92 +Instead we mark methods that need to increment the program counter
    2.93 +with the :func:`viff.runtime.increment_pc` decorator. The program
    2.94 +counter starts at the value ``[0]`` and the decorated method will now
    2.95 +begin by doing::
    2.96 +
    2.97 +  self.program_counter[-1] += 1
    2.98 +  self.program_counter.append(0)
    2.99 +
   2.100 +before it executes its body. When the body is finished, the method
   2.101 +does::
   2.102 +
   2.103 +  self.program_counter.pop()
   2.104 +
   2.105 +before it returns. A method :meth:`foo` defined like this::
   2.106 +
   2.107 +  @increment_pc
   2.108 +  def foo(self):
   2.109 +      print "foo:", self.program_counter
   2.110 +
   2.111 +is thus turned into this::
   2.112 +
   2.113 +  def foo(self):
   2.114 +      self.program_counter[-1] += 1
   2.115 +      self.program_counter.append(0)
   2.116 +      print "foo:", self.program_counter
   2.117 +      self.program_counter.pop()
   2.118 +
   2.119 +and when executed starting from the initial program counter of ``[0]``
   2.120 +we see that it prints ``foo: [1, 0]`` and leaves the program counter
   2.121 +at ``[1]`` after it returns. It is very important that the program
   2.122 +counter is left changed like this, for this means that the next call
   2.123 +to :meth:`foo` will print ``foo: [2, 0]`` and increment the program
   2.124 +counter to ``[2]``.
   2.125 +
   2.126 +If we have a method :meth:`bar` which calls :meth:`foo` several times::
   2.127 +
   2.128 +  @increment_pc
   2.129 +  def bar(self):
   2.130 +      print "bar:", self.program_counter
   2.131 +      self.foo()
   2.132 +      print "bar:", self.program_counter
   2.133 +      self.foo()
   2.134 +      print "bar:", self.program_counter
   2.135 +
   2.136 +then the result of calling :meth:`bar` will be:
   2.137 +
   2.138 +.. code-block:: none
   2.139 +
   2.140 +   bar: [1, 0]
   2.141 +   foo: [1, 1, 0]
   2.142 +   bar: [1, 1]
   2.143 +   foo: [1, 2, 0]
   2.144 +   bar: [1, 2]
   2.145 +
   2.146 +Notice how each sub-call adds another digit to the counter and how it
   2.147 +increments the counter used at the level of the caller. This system
   2.148 +ensures that all program counters are unique.
   2.149 +
   2.150 +
   2.151 +Alternatives
   2.152 +------------
   2.153 +
   2.154 +We have come up with some alternative solutions, which are detailed
   2.155 +below. More good ideas are of course welcome!
   2.156 +
   2.157 +History-based labels
   2.158 +""""""""""""""""""""
   2.159 +
   2.160 +An attractive alternative is to label data sent over the net based on
   2.161 +its *history*. The idea is that we associate a label ``H(x)`` with
   2.162 +each variable *x*. The history is defined when new variables are
   2.163 +defined --- if ``x = a * b``, then we can set ``H(x) = ("mul", H(a),
   2.164 +H(b))``. To avoid growing the history without bounds we can hash it
   2.165 +with a cryptographic hash function to bring it down to a fixed size.
   2.166 +
   2.167 +The problem with this idea is that we sometimes need to assign a
   2.168 +history to a variable that depends on no other variables. An example
   2.169 +of this is the result of a call to :meth:`prss_share
   2.170 +<viff.runtime.Runtime.prss_share>` which takes no useful arguments. A
   2.171 +possible solution would be to add some dummy arguments on which the
   2.172 +history could be based, even though they wont be used by the method.
   2.173 +So if you would normally call the function :func:`hat` with no
   2.174 +arguments to get a :class:`Rabbit` object, you have to change your
   2.175 +code from this::
   2.176 +
   2.177 +  rabbit = hat()
   2.178 +
   2.179 +to this::
   2.180 +
   2.181 +  rabbit = hat(dummy=locals())
   2.182 +
   2.183 +where the call to :func:`locals` gives you access to the local
   2.184 +variables. If the use of :func:`locals` could be hidden this might be
   2.185 +an acceptable solution.
   2.186 +
   2.187 +Using the history of the variables has the big advantage that we label
   2.188 +each piece of transmitted data with just information that is relevant
   2.189 +to it: namely its position in the tree formed by the calculation and
   2.190 +*not* its position in the execution tree formed by the implementation
   2.191 +in VIFF. This is conceptually cleaner than the current solution.
   2.192 +
   2.193 +Program transformation
   2.194 +""""""""""""""""""""""
   2.195 +
   2.196 +Another idea is to solve the labelling problem by having some external
   2.197 +tool transform the program into one with explicit labels. Each send
   2.198 +and each receive operation needs to be labelled and the labels much
   2.199 +match pair-wise.
   2.200 +
   2.201 +It is not entirely clear how this should work in the presence of loops
   2.202 +and if it is possible to implement this in an easier way than the
   2.203 +current program counters.