Times for big.seq and big.unk on mp:

Sequential: 5.33091 sec
Threaded:
  num_threads   time (sec)
  1             10.7522
  2             11.1476
  4             15.0832
  8             36.2662

The threaded version is slower than the sequential version at every thread
count. Even with one thread it takes about twice as long, because each step
needs extra logic to calculate indices into the matrix and to traverse it
along diagonals instead of row-wise. Adding threads makes it slower still,
because every thread does a barrier wait at the end of each step, and some
threads have little or no work, especially at the beginning and end of the
algorithm, where the diagonals are short. As a result, more time goes to
synchronization than to the actual computation.
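
To make the diagonal traversal and the per-step barrier concrete, here is a
minimal sketch in Python. It is not the program that was timed above: the
names diagonal_cells and lcs_wavefront are hypothetical, and a longest common
subsequence table is used only as a stand-in for whatever scoring the real
matrix holds. Python threads will not show a real speedup; the point is the
structure of the index logic and the one barrier wait per diagonal.

```python
import threading

def diagonal_cells(rows, cols, d):
    """Return the cells (i, j) with i + j == d inside a rows x cols matrix."""
    i_min = max(0, d - (cols - 1))
    i_max = min(rows - 1, d)
    return [(i, d - i) for i in range(i_min, i_max + 1)]

def lcs_wavefront(a, b, num_threads):
    """Fill an LCS table one anti-diagonal at a time (illustrative only)."""
    rows, cols = len(a), len(b)
    # Extra zero row and column act as the boundary of the table.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        for d in range(rows + cols - 1):
            cells = diagonal_cells(rows, cols, d)
            # Each thread takes every num_threads-th cell of the diagonal.
            # On short diagonals some threads get no cells at all.
            for i, j in cells[tid::num_threads]:
                if a[i] == b[j]:
                    table[i + 1][j + 1] = table[i][j] + 1
                else:
                    table[i + 1][j + 1] = max(table[i][j + 1],
                                              table[i + 1][j])
            # Every thread waits here once per diagonal, busy or not.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table[rows][cols]
```

Cells on one diagonal depend only on the two previous diagonals, which is why
they can be filled in parallel, and also why a barrier is needed before the
next diagonal starts.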

In contrast, when the "unknown" search sequence is longer and the known
database sequence is shorter, so that the two sequences are closer in length,
the threaded version scales much better with the thread count, because less
synchronization is required relative to the amount of actual computation.
It is still slower than the sequential version, even with 8 threads:

Times for more-even.seq and more-even.unk on mp:

Sequential: 1.34995 sec
Threaded:
  num_threads   time (sec)
  1             10.0506
  2             4.26101
  4             2.7733
  8             2.64885
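
The effect of sequence shape on synchronization can be estimated by counting
barriers and cells. The sketch below assumes one barrier wait per
anti-diagonal of a rows x cols matrix. The function name wavefront_profile is
hypothetical, and the two sizes are made up for illustration; they are not
the lengths of the big or more-even inputs.

```python
def wavefront_profile(rows, cols):
    """Return (barriers, average cells per barrier, widest diagonal)."""
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be at least 1")
    steps = rows + cols - 1      # one barrier wait per anti-diagonal
    total_cells = rows * cols    # total work is the same for both shapes below
    widest = min(rows, cols)     # no diagonal is longer than the short side
    return steps, total_cells / steps, widest

# Two hypothetical shapes with the same number of cells.
for name, rows, cols in [("lopsided", 10000, 100), ("even", 1000, 1000)]:
    steps, avg, widest = wavefront_profile(rows, cols)
    print(f"{name}: {steps} barriers, {avg:.1f} cells per barrier, "
          f"widest diagonal {widest}")
```

For the same amount of work, the lopsided shape needs about five times as
many barrier waits, and no diagonal holds more cells than the length of the
shorter sequence, so each wait is amortized over far less computation.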