% Preamble
\documentclass[11pt,fleqn]{article}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{fancyhdr}
\oddsidemargin -0.25in
\textwidth 6.75in
\topmargin -0.5in
\headheight 0.75in
\headsep 0.25in
\textheight 8.75in
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\fancyhf{}
\lhead{HW Chap. 7\\\ \\\ }
\rhead{Josh Holtrop\\2008-10-15\\CS 677}
\rfoot{\thepage}
\begin{document}
\noindent
\begin{enumerate}
\item[1.]{
Break the ``parallel region'' out into a function that accepts a
\texttt{void *} parameter.  Before the ``parallel region'', create a
\texttt{for} loop that iterates $n$ times (where $n$ is the number of
threads), invoking \texttt{pthread\_create()} once for each thread.  Any
variables local to the function containing the ``parallel region'' that the
region function needs access to would have to be stored as pointers in a
structure whose address is passed as the argument to the thread function.
Each thread would then run the code of the ``parallel region''.  After the
region, a second \texttt{for} loop would iterate over all the threads created
in the first loop and call \texttt{pthread\_join()} for each one.  A sketch
of this structure appears after these answers.
}
\vskip 2em
\item[2.]{
Each thread could store its result into an array indexed by its ID.  Then,
when the computation is complete, a regular \texttt{for} loop within an
OpenMP parallel region could iterate $\lceil \log_2 n \rceil$ times.  In the
first iteration, threads where $\mathit{ID} \bmod 2 = 0$ would perform the
reduction operation on their array value and the array value at index
$\mathit{ID} + 1$ while the rest of the threads are idle.  In the second
iteration, threads where $\mathit{ID} \bmod 4 = 0$ would perform the
reduction operation on their array value and the array value at index
$\mathit{ID} + 2$ while the rest of the threads are idle.  This process
would repeat (doubling the modulus and the offset each time) until the
reduction operation has produced the final result at index 0 of the array
(sketched after these answers).
}
\vskip 2em
\item[3.]{
My OpenMP solution to Floyd's algorithm was implemented by placing a
\texttt{\#pragma omp parallel for} on the second \texttt{for} loop of the
algorithm.  Thus, for each $k$ value, the rows are divided among the threads,
and the same thread computes an entire row of the matrix.  The run times grow
roughly as $n^3$, as expected for Floyd's algorithm.  On eos24, with
$n \ge 400$, the speedup was $\approx 3.6$.  As the number of threads
increased, the run time decreased roughly in inverse proportion to the number
of threads until $t > 4$; beyond that, more threads gained nothing since
eos24 has only 4 processing cores.  A sketch of this parallelization appears
after these answers.
}
\end{enumerate}
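\vskip 2em
\noindent
The following listing is a minimal sketch of the pthreads structure described
in answer 1.  The names (\texttt{region\_args}, \texttt{region\_body},
\texttt{NUM\_THREADS}) and the doubling work done inside the region are
illustrative assumptions, not part of the assignment.
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Pointers to the enclosing function's locals that the region needs. */
struct region_args {
    int     id;      /* this thread's ID                 */
    double *data;    /* shared array from the caller     */
    int    *n;       /* pointer to a shared local scalar */
};

/* The former "parallel region", now an ordinary thread function. */
static void *region_body(void *arg)
{
    struct region_args *a = (struct region_args *)arg;
    for (int i = a->id; i < *a->n; i += NUM_THREADS)
        a->data[i] *= 2.0;              /* work on this thread's share */
    return NULL;
}

int main(void)
{
    double data[100];
    int n = 100;
    for (int i = 0; i < n; i++) data[i] = i;

    pthread_t threads[NUM_THREADS];
    struct region_args args[NUM_THREADS];

    /* First loop: one pthread_create() per thread. */
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].id = t;  args[t].data = data;  args[t].n = &n;
        pthread_create(&threads[t], NULL, region_body, &args[t]);
    }

    /* Second loop: pthread_join() plays the role of the implicit
       barrier at the end of an OpenMP parallel region. */
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    printf("data[99] = %g\n", data[99]);
    return 0;
}
\end{verbatim}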
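\vskip 2em
\noindent
The next listing sketches the hand-rolled $\lceil \log_2 n \rceil$-step
reduction from answer 2, using a sum as a stand-in reduction operation; the
\texttt{partial} array and the \texttt{MAX\_THREADS} bound are assumptions
made for the example.
\begin{verbatim}
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64

int main(void)
{
    double partial[MAX_THREADS];
    int nthreads;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        /* Each thread stores its result, indexed by its ID. */
        partial[id] = (double)(id + 1);   /* stand-in for real work */
        #pragma omp barrier

        /* ceil(log2(nthreads)) combining passes. */
        for (int step = 1; step < nthreads; step *= 2) {
            /* Pass 1: IDs with ID mod 2 == 0 combine ID and ID+1;
               pass 2: IDs with ID mod 4 == 0 combine ID and ID+2; ... */
            if (id % (2 * step) == 0 && id + step < nthreads)
                partial[id] += partial[id + step];
            #pragma omp barrier   /* wait before doubling the stride */
        }
    }

    /* The final result ends up at index 0. */
    printf("sum = %g\n", partial[0]);
    return 0;
}
\end{verbatim}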
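\vskip 2em
\noindent
Finally, a sketch consistent with the parallelization described in answer 3:
\texttt{\#pragma omp parallel for} on the row loop inside the $k$ loop.  The
matrix size, random edge weights, and timing with \texttt{omp\_get\_wtime()}
are illustrative, not the assignment's actual input or harness.
\begin{verbatim}
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 400

static int d[N][N];

int main(void)
{
    /* Illustrative adjacency matrix: random weights, 0 on the diagonal. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = (i == j) ? 0 : rand() % 100 + 1;

    double start = omp_get_wtime();
    for (int k = 0; k < N; k++) {
        /* Rows are divided among the threads; each thread computes
           entire rows of the matrix for this value of k. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
    }
    printf("elapsed: %f s\n", omp_get_wtime() - start);
    return 0;
}
\end{verbatim}
\end{document}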