From ea4e68b769634d236c623996c9a67e425419e6b7 Mon Sep 17 00:00:00 2001
From: josh <josh@45c1a28c-8058-47b2-ae61-ca45b979098e>
Date: Thu, 16 Apr 2009 02:49:37 +0000
Subject: [PATCH] finishing up report

git-svn-id: svn://anubis/gvsu@408 45c1a28c-8058-47b2-ae61-ca45b979098e
---
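Note on the paragraph this patch adds to report.html: the re-issue scheme it describes could be sketched roughly as below. This is only an illustration, not the project's code. The `TaskMaster` class and its method names are hypothetical, the TCP layer is left out, and two details the report does not specify are assumed here: which in-progress task gets handed out again (the lowest ID), and what happens to a duplicate result (the first one wins, later ones are ignored).

```python
from collections import deque


class TaskMaster:
    """Hypothetical master-side bookkeeping for handing out tasks."""

    def __init__(self, num_tasks):
        self.pending = deque(range(num_tasks))  # tasks not yet given out
        self.in_progress = set()                # given out, no result yet
        self.results = {}                       # task ID -> collected data

    def request_task(self):
        """Return a task ID for a requesting slave, or None when all are done."""
        if self.pending:
            task_id = self.pending.popleft()
            self.in_progress.add(task_id)
            return task_id
        if self.in_progress:
            # Everything has been given out: re-issue an unfinished task.
            # Assumption: pick the lowest in-progress ID.
            return min(self.in_progress)
        return None

    def submit_result(self, task_id, data):
        """Record a result; assumption: the first result for a task wins."""
        if task_id in self.in_progress:
            self.in_progress.discard(task_id)
            self.results[task_id] = data
```

With this bookkeeping, a task held by a dead or slow worker stays in `in_progress`, so the next idle slave that asks for work receives it, and no separate liveness connection is needed.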
 cs658/html/report.html | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
diff --git a/cs658/html/report.html b/cs658/html/report.html
index be683a8..6b464c2 100644
--- a/cs658/html/report.html
+++ b/cs658/html/report.html
@@ -206,6 +206,33 @@
         In this way, one worker process was spawned per core on the
         slave node.
     </p>
+    <p>
+        I had originally planned to implement fault tolerance in the
+        distribution architecture by establishing a second TCP connection
+        from each slave node to the master, which would serve as a polling
+        connection to verify that the slaves were still alive.
+        During implementation, I arrived at a more elegant solution.
+        I was already keeping track of the set of tasks that were
+        considered "in progress" as far as the master process was concerned.
+        When the master process received a request from a slave for a task
+        to work on, it would normally respond with the next available task
+        ID until all tasks had been given out, after which it would
+        respond that there were no more tasks to work on.
+        I changed this slightly so that if the master got a request from
+        a slave for a task to work on, and all of the tasks were already
+        given out, then the master would respond to the slave with a
+        task ID from the set of tasks that were currently in progress.
+        That way, whether the original slave node or the new one finished
+        the task, the data for it would be collected.
+        If the original node was dead, then the new slave node would
+        take over the task and return the data.
+        If the original node was alive, but just responding very slowly,
+        then the replacement node could finish the task and return
+        the results before the original node.
+        This worked well in testing: when I killed all of the worker
+        processes on a given slave node, the tasks that they had been
+        working on were later finished by other nodes.
+    </p>
 
     <a name="evaluation" />
     <h4>Evaluation</h4>