% Preamble
\documentclass[11pt,fleqn]{article}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{fancyhdr}
\oddsidemargin -0.25in
\textwidth 6.75in
\topmargin -0.5in
\headheight 0.75in
\headsep 0.25in
\textheight 8.75in
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\fancyhf{}
\lhead{HW Chap. 7\\\ \\\ }
\rhead{Josh Holtrop\\2008-10-15\\CS 677}
\rfoot{\thepage}

\begin{document}

\noindent
\begin{enumerate}
\item[1.]{
Move the body of the ``parallel region'' into a function accepting a
\texttt{void *} parameter, as \texttt{pthread\_create()} requires.
Where the ``parallel region'' used to be, add a \texttt{for} loop which
iterates \textit{n} times (where \textit{n} is the number of threads),
invoking \texttt{pthread\_create()} once for each thread.
Any variables local to the function that contained the ``parallel region''
and that the thread function needs access to
have to be stored as pointers in a structure whose address is
passed as the argument to the thread function.
Each thread then runs the code from the ``parallel region''.
After the creation loop, a second \texttt{for} loop iterates over all
the threads created in the first loop and calls \texttt{pthread\_join()}
on each one.
}
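
The steps above can be sketched in C roughly as follows. This is only a
minimal illustration, not a general translation scheme: the names
\texttt{region\_args}, \texttt{region}, and \texttt{run\_region} are
hypothetical, the ``work'' done by each thread is a placeholder, and
\textit{n} is assumed to be positive.

\begin{verbatim}
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical argument structure: carries the thread's index and
   pointers to the caller's local variables. */
struct region_args {
    int id;        /* thread index, 0..n-1 */
    int n;         /* number of threads */
    int *results;  /* points at an array owned by the caller */
};

/* The former "parallel region", now a thread function. */
static void *region(void *p)
{
    struct region_args *a = p;
    a->results[a->id] = a->id * a->id;  /* placeholder work */
    return NULL;
}

/* Runs the region on n threads; returns 0 on success, -1 on failure. */
int run_region(int n, int *results)
{
    pthread_t *tids = malloc((size_t)n * sizeof *tids);
    struct region_args *args = malloc((size_t)n * sizeof *args);
    int created = 0, rc = 0;

    if (tids == NULL || args == NULL) {
        free(tids);
        free(args);
        return -1;
    }
    /* First loop: one pthread_create() call per thread. */
    for (int i = 0; i < n; i++) {
        args[i].id = i;
        args[i].n = n;
        args[i].results = results;
        if (pthread_create(&tids[i], NULL, region, &args[i]) != 0) {
            rc = -1;
            break;
        }
        created++;
    }
    /* Second loop: join every thread that was actually created. */
    for (int i = 0; i < created; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    free(args);
    return rc;
}
\end{verbatim}

A caller would declare a local array such as \texttt{int r[4]} and invoke
\texttt{run\_region(4, r)}; once the call returns, all threads have been
joined and \texttt{r} holds each thread's result.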

\vskip 2em
\item[2.]{
Each thread could store its result into a shared array indexed by its ID.
Then, when computation is complete, a regular \texttt{for} loop
within an OpenMP parallel region could iterate
$\lceil \log_2 n \rceil$ times (where $n$ is the number of threads),
with a barrier at the end of each iteration.
In the first iteration, threads where $\mathit{ID} \bmod 2 = 0$ would perform
the reduction operation on their array value and the array value
at index $\mathit{ID} + 1$ while the rest of the threads are idle.
In the second iteration, threads where $\mathit{ID} \bmod 4 = 0$ would perform
the reduction operation on their array value and the array value
at index $\mathit{ID} + 2$ while the rest of the threads are idle.
A thread skips its step whenever the partner index is $n$ or greater.
This process would repeat (doubling the mod value and offset index
each time) until the reduction operation has been
performed to produce the final result value at index 0 of the
array.
}
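
The tree reduction described above might look like the following sketch.
It makes several assumptions that are not part of the answer itself: the
reduction operation is addition on \texttt{long} values, the function name
\texttt{tree\_reduce} is hypothetical, and the loop index \texttt{id}
stands in for the thread ID, so that the result is the same no matter how
many threads the OpenMP runtime actually provides. The implicit barrier at
the end of each \texttt{omp for} separates one iteration from the next.

\begin{verbatim}
/* Sums vals[0..n-1] in place; the total ends up in vals[0].
   Assumes n >= 1. */
long tree_reduce(long *vals, int n)
{
    #pragma omp parallel
    {
        /* off is the offset index; 2*off is the "mod value".
           The loop runs ceil(log2(n)) times. */
        for (int off = 1; off < n; off *= 2) {
            #pragma omp for
            for (int id = 0; id < n; id += 2 * off) {
                /* Only indices with id mod (2*off) == 0 get here;
                   skip the step if the partner does not exist. */
                if (id + off < n)
                    vals[id] += vals[id + off];
            }
            /* implicit barrier here, at the end of the omp for */
        }
    }
    return vals[0];
}
\end{verbatim}

For example, with the values 1, 2, 3, 4, 5 the array holds 3, 7, 5 at
indices 0, 2, 4 after the first iteration, 10 at index 0 after the second,
and 15 at index 0 after the third.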

\vskip 2em
\item[3.]{
My OpenMP solution to Floyd's algorithm was implemented by
using a \texttt{\#pragma omp parallel for} on the second \texttt{for}
loop of the algorithm.
Thus, for each $k$ value, the rows are divided among the
threads, and each row of the matrix is computed entirely by one thread.

The run times grow rapidly as $n$ grows linearly (on the order of $n^3$,
as expected for Floyd's $\Theta(n^3)$ algorithm).
On eos24, with $n \geq 400$, the speedup was $\approx 3.6$.

As the number of threads $t$ increased, the run time decreased
roughly in inverse proportion to $t$ up to $t = 4$; beyond that, more
threads did not gain
anything since there were only 4 processing cores.
}
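
The parallelization described above can be sketched as follows. This is an
illustration under stated assumptions rather than the submitted code: the
function name \texttt{floyd} and the flat row-major matrix layout are
hypothetical, the diagonal entries are assumed to be 0, edge weights are
assumed non-negative, and ``no edge'' is assumed to be encoded by a value
no larger than \texttt{INT\_MAX / 2} so that the sum cannot overflow.

\begin{verbatim}
/* d is an n-by-n matrix in row-major order: d[i*n + j] holds the
   shortest known distance from vertex i to vertex j. */
void floyd(int n, int *d)
{
    for (int k = 0; k < n; k++) {
        /* The second loop is split among threads, so each row i is
           updated entirely by one thread.  With d[k*n + k] == 0,
           row k and column k keep their values during step k, so
           reading them from other threads is safe. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int via = d[i*n + k] + d[k*n + j];
                if (via < d[i*n + j])
                    d[i*n + j] = via;
            }
        }
    }
}
\end{verbatim}

Creating the thread team once per $k$ value keeps the code simple; the
implicit barrier at the end of each \texttt{parallel for} ensures that
step $k$ is finished before step $k + 1$ begins.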

\end{enumerate}

\end{document}