<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0012-7353</journal-id>
<journal-title><![CDATA[DYNA]]></journal-title>
<abbrev-journal-title><![CDATA[Dyna rev.fac.nac.minas]]></abbrev-journal-title>
<issn>0012-7353</issn>
<publisher>
<publisher-name><![CDATA[Universidad Nacional de Colombia]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0012-73532015000300030</article-id>
<article-id pub-id-type="doi">10.15446/dyna.v82n191.45513</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Matrix multiplication with a hypercube algorithm on multi-core processor cluster]]></article-title>
<article-title xml:lang="es"><![CDATA[Multiplicación de matrices con un algoritmo hipercubo en un cluster con procesadores multi-core]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Zavala-Díaz]]></surname>
<given-names><![CDATA[José Crispín]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Pérez-Ortega]]></surname>
<given-names><![CDATA[Joaquín]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Salazar-Reséndiz]]></surname>
<given-names><![CDATA[Efraín]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Guadarrama-Rogel]]></surname>
<given-names><![CDATA[Luis César]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Universidad Autónoma del Estado de Morelos, Facultad de Contaduría, Administración e Informática]]></institution>
<addr-line><![CDATA[Cuernavaca ]]></addr-line>
<country>México</country>
</aff>
<aff id="A02">
<institution><![CDATA[Centro Nacional de Investigación y Desarrollo Tecnológico, Departamento de Ciencias Computacionales]]></institution>
<addr-line><![CDATA[Cuernavaca ]]></addr-line>
<country>México</country>
</aff>
<aff id="A">
<institution><![CDATA[,efras.salazar@cenidet.edu.mx  ]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
</aff>
<aff id="A">
<institution><![CDATA[,cesarguadarrama@cenidet.edu.mx  ]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>06</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>06</month>
<year>2015</year>
</pub-date>
<volume>82</volume>
<numero>191</numero>
<fpage>240</fpage>
<lpage>246</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.co/scielo.php?script=sci_arttext&amp;pid=S0012-73532015000300030&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.co/scielo.php?script=sci_abstract&amp;pid=S0012-73532015000300030&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.co/scielo.php?script=sci_pdf&amp;pid=S0012-73532015000300030&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[The algorithm of multiplication of matrices of Dekel, Nassimi and Sahani or Hypercube is analysed, modified and implemented on multi-core processor cluster, where the number of processors used is less than that required by the algorithm n³. 2³, 4³ and 8³ processing units are used to multiply matrices of the order of 10x10, 10²x10² and 10³X10³. The results of the mathematical model of the modified algorithm and those obtained from the computational experiments show that it is possible to reach acceptable speedup and parallel efficiencies, based on the number of used processor units. It also shows that the influence of the external communication link among the nodes is reduced if a combination of the available communication channels among the cores in a multi-core cluster is used.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[Se analiza, modifica e implementa el algoritmo de multiplicación de matrices de Dekel, Nassimi y Sahani o hipercubo en un cluster de procesadores multi-core, donde el número de procesadores utilizado es menor al requerido por el algoritmo de n³. Se utilizan 2³, 4³ y 8³ unidades procesadoras para multiplicar matrices de orden de magnitud de 10X10, 10²X10² y 10³X10³. Los resultados del modelo matemático del algoritmo modificado y los obtenidos de la experimentación computacional muestran que es posible alcanzar rapidez y eficiencias paralelas aceptables, en función del número de unidades procesadoras utilizadas. También se muestra que la influencia del enlace externo de comunicación entre los nodos disminuye si se utiliza una combinación de los canales de comunicación disponibles entre los núcleos en un cluster multi-core.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[Hypercube algorithm]]></kwd>
<kwd lng="en"><![CDATA[multi-core processor cluster]]></kwd>
<kwd lng="en"><![CDATA[Matrix multiplication]]></kwd>
<kwd lng="es"><![CDATA[Algoritmo Hipercubo]]></kwd>
<kwd lng="es"><![CDATA[cluster de procesadores multi-core]]></kwd>
<kwd lng="es"><![CDATA[Multiplicación de matrices]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <p><font size="1" face="Verdana, Arial, Helvetica, sans-serif"><b>DOI: </b><a href="http://dx.doi.org/10.15446/dyna.v82n191.45513" target="_blank">http://dx.doi.org/10.15446/dyna.v82n191.45513</a></font></p>     <p align="center"><font size="4" face="Verdana, Arial, Helvetica, sans-serif"><b>Matrix multiplication with a hypercube algorithm   on multi-core processor cluster</b></font></p>     <p align="center"><b><font size="3" face="Verdana, Arial, Helvetica, sans-serif"><i>Multiplicaci&oacute;n   de matrices con un algoritmo hipercubo en un cluster con procesadores   multi-core </i></font></b></p>     <p align="center">&nbsp;</p>     <p align="center"><b><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Jos&eacute;   Crisp&iacute;n Zavala-D&iacute;az <i><sup>a</sup></i>,   Joaqu&iacute;n P&eacute;rez-Ortega <i><sup>b</sup></i>,   Efra&iacute;n Salazar-Res&eacute;ndiz <i><sup>b</sup></i> &amp; Luis C&eacute;sar Guadarrama-Rogel <i><sup>b</sup></i></font></b></p>     <p align="center">&nbsp;</p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><sup><i>a</i></sup><i> Facultad de Contadur&iacute;a, Administraci&oacute;n e Inform&aacute;tica, Universidad   Aut&oacute;noma del Estado de Morelos, Cuernavaca, M&eacute;xico. <a href="mailto:crispin_zavala@uaem.mx">crispin_zavala@uaem.mx</a>    <br>   <sup>b</sup> Departamento de Ciencias Computacionales, Centro Nacional de   Investigaci&oacute;n y Desarrollo Tecnol&oacute;gico, Cuernavaca, M&eacute;xico. <a href="mailto:jpo_cenidet@yahoo.com.mx">jpo_cenidet@yahoo.com.mx</a>, <a href="mailto:efras.salazar@cenidet.edu.mx">efras.salazar@cenidet.edu.mx</a>, <a href="mailto:cesarguadarrama@cenidet.edu.mx">cesarguadarrama@cenidet.edu.mx</a></i></font></p>     <p align="center">&nbsp;</p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>Received:   September 10<sup>th</sup>, 2014. Received in revised form: March 9<sup>th</sup>,   2015. Accepted: March 17<sup>th</sup>, 2015</b></font></p>     ]]></body>
<body><![CDATA[<p align="center">&nbsp;</p>     <p align="center"><font size="1" face="Verdana, Arial, Helvetica, sans-seriff"><b>This work is licensed under a</b> <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.</font><br />   <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a></p> <hr>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>Abstract    <br>   </b></font><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The algorithm of   multiplication of matrices of Dekel, Nassimi and Sahani or Hypercube is   analysed, modified and implemented on multi-core processor cluster, where the   number of processors used is less than that required by the algorithm <i>n<sup>3</sup></i>. 2<sup>3</sup>, 4<sup>3</sup> and 8<sup>3</sup> processing units are used to multiply matrices of the order   of 10x10, 10<sup>2</sup>x10<sup>2</sup> and 10<sup>3</sup>X10<sup>3</sup>. The   results of the mathematical model of the modified algorithm and those obtained   from the computational experiments show that it is possible to reach acceptable   speedup and parallel efficiencies, based on the number of used processor units.   It also shows that the influence of the external communication link among the   nodes is reduced if a combination of the available communication channels among   the cores in a multi-core cluster is used.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><i>Keywords<b>:</b> Hypercube algorithm; multi-core processor cluster; Matrix multiplication</i></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>Resumen    <br>   </b></font><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Se analiza, modifica e implementa el algoritmo de multiplicaci&oacute;n de   matrices de <i>Dekel, Nassimi y Sahani</i> o hipercubo en un cluster de   procesadores multi-core, donde el n&uacute;mero de procesadores utilizado es menor al   requerido por el algoritmo de <i>n<sup>3</sup></i>.   Se utilizan 2<sup>3</sup>, 4<sup>3</sup> y 8<sup>3</sup> unidades procesadoras   para multiplicar matrices de orden de magnitud de 10X10, 10<sup>2</sup>X10<sup>2 </sup>y 10<sup>3</sup>X10<sup>3</sup>. Los resultados del modelo matem&aacute;tico del   algoritmo modificado y los obtenidos de la experimentaci&oacute;n computacional   muestran que es posible alcanzar rapidez y eficiencias paralelas aceptables, en   funci&oacute;n del n&uacute;mero de unidades procesadoras utilizadas. Tambi&eacute;n se muestra que   la influencia del enlace externo de comunicaci&oacute;n entre los nodos disminuye si   se utiliza una combinaci&oacute;n de los canales de comunicaci&oacute;n disponibles entre los   n&uacute;cleos en un cluster multi-core.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><i>Palabras Clave: Algoritmo Hipercubo, cluster de   procesadores multi-core, Multiplicaci&oacute;n de matrices</i></font></p> <hr>     <p>&nbsp;</p>     <p><font size="3" face="Verdana, Arial, Helvetica, sans-serif"><b>1. Introduction</b></font></p>     ]]></body>
<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The multicore clusters   are formed by either nodes or processors, often heterogeneous, connected with   architecture of dynamic configuration, high speed bus; it allows point to point   connections among each processing unit. The nodes are composed of processors   and the processors in turn are composed of multiple cores, where each of them   has the ability to run more than one process at a time &#91;1&#93;. These features   allow us to propose solutions to various problems using computing parallel and   concurrent &#91;2,3&#93;, such as the one presented in this work: The matrix   multiplication with the DNS algorithm (Dekel, Nassimi, Sahani or hypercube) on   a multicore cluster.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The execution time of   the DNS or Hypercube algorithm for multiplying <i>nxn</i> matrices is polylogarithmic <i>T&infin;</i>(<i>l</i>og<i>n</i>)   and requires a polynomial number of processors <i>Hw</i>(<i>n<sup>3</sup></i>), these   two parameters classify it as an efficiently parallelizable algorithm &#91;4&#93;. But,   in its implementation in a multicore cluster the following limitations are   presented: First, the number of processing units available is finite and is   lesser than those required by the hypercube algorithm for matrice sizes given <i>Hw</i>(<i>n<sup>3</sup></i>);   Second, the processing units are spread over various nodes and processors.   Where the speed of the communication links among the cores is different, they   can be on the same processor, in the same node, either in different nodes or   processors.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Similar modifications &#91;5&#93; are proposed to implement the   hypercube algorithm on multicore clusters. In the modification proposed in this   work the number of processing units remains constant, leaving as a variable the   grain size or the number of the subarrays that divides the <i>nxn </i>matrices. Contrary to what is proposed by &#91;5&#93;, where the number   of processing units is determined by the size of the input.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The implementation of   the algorithm is made by the library Message Passing Interface MPI, where MPI   processes are assigned to the cores. If an MPI process is assigned to a core,   then it will be parallel computation; but if more than one MPI process is assigned   to the same core, then it will be concurrent computation. Computational   experiments took place in the <i>ioevolution</i> multicore cluster. 
where <i>nxn</i> matrices are multiplied using 8, 64 and 512 processing units; the matrix size is a multiple of 72, from 72x72 up to 2304x2304 elements.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Theoretical and experimental results show that the solution proposed in this work allows matrices of sizes on the order of 1,000x1,000 to be multiplied on a multi-core cluster using fewer processors than required by the original algorithm; that with the proposed modifications it is possible to achieve acceptable speedup and good parallel efficiencies, depending on the number of processing units used; and that the mathematical model developed for the modified DNS algorithm predicts the behavior of the parallel implementation on a multi-core cluster.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The article is organized as follows: the second section presents the methodology and foundations of the Hypercube algorithm, together with an analysis of its running time on a MIMD computer with distributed memory and distributed computation; based on this analysis, the modifications needed to multiply matrices on a multi-core cluster are presented. The third section contains the computational experiments and the analysis of results. Finally, the fourth section presents the conclusions drawn from this work.</font></p>     <p>&nbsp;</p>     <p><font size="3" face="Verdana, Arial, Helvetica, sans-serif"><b>2. Methodology</b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Algorithms for matrix multiplication based on the <i>Single Instruction Multiple Data</i> (SIMD) model have been developed since the beginning of parallel computation, such as the Systolic Array or Mesh of Processors, Cannon, Fox, the DNS method (Dekel, Nassimi, Sahani) or hypercube, and meshes of trees &#91;4-8&#93;. In all of them, except for DNS, the parallel solution process consists of sequential steps of data distribution and computation. In general, in the first stage the data are distributed among the processing units. In the second, each processor multiplies the elements allocated to it. In the third, the first and second steps are repeated until all columns and rows of the matrices have been rotated and multiplied. In the fourth stage the resulting products are added. In contrast, in the DNS or Hypercube algorithm the distribution and multiplication of elements (the first and second stages) run only once and, as in the other algorithms, the fourth stage is executed once.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The DNS or Hypercube algorithm is described next.</font></p>     ]]></body>
<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b><i>2.1.   Hypercube algorithm</i></b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The algorithm is based on multiplying two <i>n</i>x<i>n</i> matrices using the model <i>Simple   Instruction Multiple Data</i> (SIMD), the execution time is <i>q</i>(log<i>n</i>) and   requires <i>n<sup>3</sup></i> processors.   This consists of <i>q</i> steps to   distribute and rotate the data among the processor units, the number of stages   is given by <i>5q = 5</i> logn and the   matrix size by <i>n<sup>3</sup> = 2<sup>3q</sup></i>&#91;4&#93;.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The solution strategy is: distribute and rotate the   elements of the matrices (<i>a<sub>ik</sub></i>,<i> b<sub>ik</sub></i>) in the computer nodes,   connected by hypercube architecture. Once the rotation and distribution of   elements is performed, each processing unit has the pair (<i>a<sub>ik</sub></i>, <i>b<sub>ki</sub></i>),   then they are multiplied. After the multiplication, the products are sent to be   added and get the <i>c<sub>ij</sub></i>elements. </font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The matrix multiplication algorithm SIMD Hypercube comprises   three phases: In the first, the elements are distributed and rotated in the   processing units; in the second phase the elements are multiplied; and in the   third, the products are summed to obtain the solution matrix.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b><i>2.2. Runtime hypercube algorithm in a multi-core   cluster</i></b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">For   the execution time on a multi-core cluster it is </font><font size="2" face="Verdana, Arial, Helvetica, sans-serif">necessary to introduce the factors that influence its implementation,   such as: time used: calculation <i>T<sub>cal</sub></i>,   communications <i>T<sub>comm</sub></i>, wait <i>T<sub>wait</sub></i> and local operations <i>T<sub>local</sub></i> &#91;9&#93;. The <i>T<sub>wait</sub></i> and <i>T<sub>local</sub></i>times are assumed to   be zero for the following reasons: First, each processing unit receives the   data and immediately executes the next operation (<i>T<sub>wait</sub></i> = 0); Second, the processor units do not perform   local operations for the distribution of the data (<i>T<sub>local</sub></i> = 0). Since the beginning of the process, the   units neighbouring for sending are defined. Therefore, parallel execution time   is determined by:</font></p>     <p><img src="/img/revistas/dyna/v82n191/v82n191a30eq01.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">To be able to add the <i>T<sub>cal</sub></i> and <i>T<sub>comm</sub></i> times, they must   be expressed in the same units: runtime or in the number of operations of the   processing unit, so it is necessary to standardize the communication time. To   this end the constant <i>g</i>, eq. (2)   &#91;10&#93;, is introduced.</font></p>     <p><img src="/img/revistas/dyna/v82n191/v82n191a30eq02.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The constant calculates the communication time depending   on processor operations &#91;10&#93;. Therefore, the communication time is determined   by the expression:</font></p>     ]]></body>
<body><![CDATA[<p><img src="/img/revistas/dyna/v82n191/v82n191a30eq03.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">To calculate the   execution time in a parallel computer of distributed memory and distributed   process the <i>T<sub>cal</sub></i> and <i>T<sub>comm</sub></i> times are entered in   the algorithm of <a href="#fig01">Fig. 1</a>. The execution time of the phases of Hypercube   algorithm is shown in <a href="#tab01">Table 1</a>.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="fig01"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30fig01.gif"></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="tab01"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30tab01.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Whereas <i>q</i> = log<i>n</i> time parallel execution is: </font></p>     <p><img src="/img/revistas/dyna/v82n191/v82n191a30eq04.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">If eq. (4) is compared with equation unmodified <i>T<sub>p</sub></i>(<i>n</i>) = </font><font size="2" face="Verdana, Arial, Helvetica, sans-serif">1 + 5logn &#91;4&#93;, it shows that the   time of the parallel execution, eq. (4), grows approximately <i>gC</i> multiple, because normally g &gt; 1   and <i>C</i> is always greater than 1. This   indicates that the parallel algorithm is faster than the sequential for the   matrix's certain size. For example, if g = 2 and <i>C</i> is equal to 64 bits, the   matrix of size 14x14 is needed for which the modified hypercube parallel   algorithm is going to be faster than the sequential. Additionally, 2,744   processing units are required to implement it.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Multiplying matrices of dimensions 1,000X1,000 in a   multi-core cluster with this algorithm has the following drawback: the number   of processing units is not available. Consequently, it needs to be modified in   order to multiply matrices of this size, because the number of available   processing units is less than <i>n<sup>3</sup></i>. </font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b><i>2.3. Amendments to the DNS algorithm to it runs on   a multi-core computer </i></b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The modification is to increase the grain size, as follows:   to replace the scalar by sub matrices and that they follow the same procedure   for the distribution and rotation of elements <i>a<sub>ij</sub></i>and <i>b<sub>ij</sub></i>.   Consequently, the distribution and rotation of sub matrices will be a function   of the <i>q</i> variable equal to log<i>n</i>. The value of the variable is   determined by the number that divides the matrices to multiply. If the <i>n</i> dimension of the matrix is divided   into <i>k</i> parts to generate submatrices <img src="/img/revistas/dyna/v82n191/v82n191a30eq012.gif">,   then <i>q</i> is equal to log<i>k</i>. The number of submatrices generated   is 2<sup>3q</sup> = k<sup>3</sup> and the number of processors is given by <i>p = k<sup>3</sup></i>. The k variable can   define the size of the submatrices and the number of processing units that are   required. </font></p>     ]]></body>
<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The necessary   modifications to the algorithm in <a href="#fig01">Fig. 1</a> below:</font></p> <ul>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Replace to <i>a<sub>ij</sub></i> by <b><i>A</i></b><i><sub>ij</sub></i></font></li>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Replace to <i>b<sub>ij</sub></i> by <b><i>B</i></b><i><sub>ij</sub></i></font></li>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Replace to <i>c<sub>ij</sub></i> by <b><i>C</i></b><i><sub>ij</sub></i></font></li>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Replace to <i>a<sub>ik</sub>b<sub>kj</sub></i> by <img src="/img/revistas/dyna/v82n191/v82n191a30eq014.gif"></font></li>     </ul>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The computation and communication times of the modified   algorithm are shown in <a href="#tab02">Table 2</a>.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="tab02"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30tab02.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The execution time of   the parallel algorithm is determined by:</font></p>     <p><img src="/img/revistas/dyna/v82n191/v82n191a30eq05.gif"></p>     ]]></body>
<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Whereas <i>p</i> = <i>k<sup>3</sup></i> and <i>q</i> = log <i>k</i> the expression   (5) is reduced to: </font></p>     <p><img src="/img/revistas/dyna/v82n191/v82n191a30eq06.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The <i>C</i> variable, in the expression (6), is   the type of data that is sent and <i>g</i> is given by eq. (2). <a href="#tab03">Table 3</a> shows the results of eq. (6) for a value of <i>g</i> = 1 and 2, <i>C</i> = 32 and a number of processing units 8, 64 and 512 (<i>k</i> = 2, 4 and 8). The results are   presented in <a href="#tab03">Table 3</a> based on the parallel speedup <i>Sp</i> = (<i>T<sub>1</sub>/T<sub>p</sub></i>).</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="tab03"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30tab03.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The results of the eq.   (6) shows that when <i>g</i> = 1 runtimes   are smallest. This indicates that the algorithm is sensitive to the   characteristics of the communication link. Regarding the parallel speedup, it   is best using eight processing units for multiplying matrices smaller than   1,000X1,000, from 64 to matrices of smaller sizes than 10,000X10,000 and 512   for multiplying matrices over 10,000X10,000. <a href="#fig02">Fig. 2</a> plots the data of <a href="#tab03">Table 3</a>.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="fig02"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30fig02.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">As seen in <a href="#fig02">Fig. 2</a>, the   largest differences occur in large arrays, the multiplication matrix is faster   when more processing units are used. When using as a reference the parallel   efficiency Ep = (Sp/p)X100, the better efficiencies are obtained when less   processing units are used, as shown in <a href="#fig03">Fig. 3</a>.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="fig03"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30fig03.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">As is shown in <a href="#fig03">Fig. 3</a>,   the best efficiency is attained when the minor number of parallel processing   units are used, about 100% for the larger matrices. However, the parallel   efficiency decreases when more processing units are used, at best by 20%.   Although the trend is that the efficiency will increase as the size of the   matrix increases. This indicates that for 512 parallel processing units will   reach its asymptotic value for a given size of matrices.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The parallel efficiency low is related to the time spent   on communications, it grows exponentially when the number of processing units   is higher and, consequently, the data sent concurrently among processing units. </font></p>     ]]></body>
<body><![CDATA[<p>&nbsp;</p>     <p><font size="3" face="Verdana, Arial, Helvetica, sans-serif"><b>3. Computational experiments on the multi-core cluster</b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The computational experiment was conducted on a multi-core cluster with a finite number of processors and memory units. Consequently, the size of the matrices in the computational experiment was limited. </font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Since the cores can be programmed as independent units &#91;1&#93;, the algorithm was encoded in ANSI C and the MPI library was used for sending data. The MPI processes are assigned to the cores. If only one of them is assigned to a core, then it works with parallel computing; but if more than one is assigned, then it works with parallel and concurrent computing. All nodes in the multi-</font><font size="2" face="Verdana, Arial, Helvetica, sans-serif">core cluster are used in the experimentation, and each node has the same load in all tests, since the same number of MPI processes is assigned to each processing unit.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="tab04"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30tab04.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b><i>3.1. Description of the multi-core cluster ioevolution</i></b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The multi-core cluster used has the characteristics shown in <a href="#tab04">Table 4</a>. </font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The cluster's features show that there are different <i>g</i> constants. The first is when two processes are assigned to the same core; the second is when the cores are in the same processor; the third corresponds to the communication between processors in the same node; and the fourth is for communication between nodes in the cluster. Of these, the external link between nodes is considered: it is an optical-fibre link with a bandwidth of 1 Gbit per second, and with the speed of the cores a value of <i>g</i> &asymp; 2.4 is obtained.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b><i>3.2. Computational tests on the cluster ioevolution</i></b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The computational tests consist of multiplying dense matrices (without gaps) of different sizes. Their elements are real 32-bit floating-point numbers and are generated randomly. Different matrices are used in each execution of the program.</font></p>     ]]></body>
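<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">As an illustration only (the exact generator used in the experiments is not described in this work), a test matrix with random 32-bit floating-point elements can be produced as follows; the function name and the use of the standard C generator are assumptions of this sketch.</font></p>     <pre>
#include &lt;stdlib.h&gt;

/* Illustrative generation of an n x n test matrix with random 32-bit
 * floating-point elements in [0, 1), stored row-major.  Changing the
 * seed between executions yields a different matrix per run, as done
 * in the experiments; the generator itself is an assumption here.    */
float *random_matrix(int n, unsigned int seed)
{
    float *m = malloc((size_t)n * n * sizeof *m);
    if (m == NULL)
        return NULL;

    srand(seed);
    for (int i = 0; i &lt; n * n; i++)
        m[i] = (float)rand() / (float)RAND_MAX;

    return m;
}
</pre>     ]]></body>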
<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The sizes of the matrices are multiples of the number 72,   because this is a number divisible by <i>k</i> values (2, 4 and 8). <a href="#tab05">Table 5</a> shows the average execution time obtained in 30   runs and the speedup of the parallel implementation is presented.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="tab05"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30tab05.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">If the theoretical results of <a href="#tab03">Table 3</a> and experimental   results of <a href="#tab05">Table 5</a> are compared, it is noted that the experimental values of   speedup are closer to those calculated when <i>g </i>= 1. In order to determine the value of the <i>g</i> constant that most   approximates the theoretical results to results experimentally, tests were   performed with different values of <i>g</i>,   of them the closest to the experimental is when <i>g</i> = 1.3. This is shown in <a href="#fig04">Figs. 4</a>, <a href="#fig05">5</a> and <a href="#fig06">6</a>.</font></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="fig04"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30fig04.gif"></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="fig05"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30fig05.gif"></p>     <p align="center"><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><a name="fig06"></a></font><img src="/img/revistas/dyna/v82n191/v82n191a30fig06.gif"></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The three <a href="#fig04">Figs. 4</a>, <a href="#fig05">5</a> and <a href="#fig06">6</a> show experimental results, they   are closer to the theoretical when <i>g</i> = 1.3, this value is lower than the <i>g</i> calculated by the external link (<i>g</i> =   2.4). The obtained value of <i>g</i> = 1.3   indicates that the use of the combination of the different communication   channels of multi-core cluster improves the parallel implementation.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The most approximate experimental results to the   theoretical are when 8 and 64 processing units are used. However, when 512   units of processing are used, the more approximate experimental results to the   theoretical is when the matrix's size is higher than 1,000X1,000.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">In <a href="#fig06">Fig. 6</a> the values obtained with g = 2.4 are also   plotted. It is observed that for matrices with a size smaller than 1,000X1,000   the experimental speedup is lower than those calculated using the external   link. In contrast to the matrix size bigger than 1,000X1,000, the experimental   speedup is better than the theoretical considering that same link. At this size   the experimental results follow the trend of theoretical with a g = 1.3. This   shows that there are factors that influence in the reduction of parallel   speedup.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Establishing communication among the processing units is   one of the factors influencing the parallel speedup. 
This follows from the fact that, for the same matrix size and a different number of processing units, what changes are the size of the submatrices and the number of simultaneous communications. The submatrices are largest when eight processing units are used, and a smaller number of concurrent communications is performed. In contrast, the submatrices are smallest with 512 processing units, and the largest number of simultaneous communications of the three tested cases is performed. The time difference between the theoretical and the experimental results increases as the number of simultaneous communications increases, even though smaller matrices are sent (<a href="#fig04">Figs. 4</a> and <a href="#fig06">6</a>). This reveals that sending data from one processing unit to another involves a communication mechanism that requires processor runtime.</font></p>     ]]></body>
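<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The sensitivity to the communication channel can also be illustrated numerically. Eq. (2) appears above only as an image, so the fragment below is a hypothetical reconstruction rather than the authors' formula: it assumes that <i>g</i> is the ratio between the core operation rate and the transfer rate of the channel, so that the 1 Gbit/s external link together with a placeholder core rate of 2.4x10<sup>9</sup> operations per second gives <i>g</i> &asymp; 2.4, while faster internal channels (placeholder bandwidths) give smaller values, consistent with an effective value of 1.3 lower than that of the external link alone.</font></p>     <pre>
#include &lt;stdio.h&gt;

/* Hypothetical reconstruction of the normalization constant g (eq. (2)
 * is only available as an image): g is assumed here to be the ratio of
 * the core operation rate to the channel transfer rate.  The core rate
 * and the internal-channel bandwidths are placeholders; only the
 * 1 Gbit/s external link comes from the cluster description.          */
int main(void)
{
    double core_rate = 2.4e9;                  /* placeholder: operations/s */
    double channel_bps[3] = { 1.0e9, 2.0e9, 1.0e10 };
    const char *label[3]  = { "external link (1 Gbit/s)",
                              "hypothetical inter-processor channel",
                              "hypothetical intra-processor channel" };

    for (int i = 0; i &lt; 3; i++)
        printf("%-40s g = %.2f\n", label[i], core_rate / channel_bps[i]);

    return 0;
}
</pre>     ]]></body>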
<body><![CDATA[<p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">Another factor that influences the running time is the   number of processes that are calculated in each core. When 8 and 64 processing   units are used, a process is assigned to each core and all processes are   communicated simultaneously. The communication channel will be a function of   where the process will be allocated on either the same processor or node.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">On the other hand, when 512 processes are assigned to the   cores, each one of them runs more of a process. In this case the transmission   of data among processes may be sequential. Sequentially because there is a means   of communication among the cores, and it is used for various processes. Therefore,   when a process sends its data using the communication channel, other processes   assigned to the same core have to wait their turn. This sequentiality is   reflected in the experimental results, they follow the trend given by <i>g</i> = 1.3 and not given by <i>g </i>= 2.4.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">It shows that from size matrices 1,152X1,152 the speedup   maximum is obtained when eight processor units are used, for bigger matrices   the speedup parallel rapidly diminishes. With this value a parallel efficiency   of 70.15% is obtained. For the other tested cases, 64 and 512, the execution   time of parallel implementation continued to improve. This coincides with the   theoretical results, the best parallel efficiencies are obtained with 8   processors and matrices smaller than 1,000X1,000</font></p>     <p>&nbsp;</p>     <p><font size="3" face="Verdana, Arial, Helvetica, sans-serif"><b>4. Conclusions</b></font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif">It is concluded that: </font></p> <ul>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">With the proposed changes to DNS or hypercube     algorithm can multiply matrices of different orders of magnitude in a     multi-core cluster, using a number of processors lesser than <i>n<sup>3</sup></i>.</font></li>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The influence of the external communication link     between nodes in the cluster decreases, if a combination of communication     channels available among the cores of a multi-core cluster is used.</font></li>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">The amendment proposed in this paper, the number     of processing units, is a function of the number of submatrices in that the     matrix is divided.</font></li>       <li><font size="2" face="Verdana, Arial, Helvetica, sans-serif">For larger problems it was shown that the     influence of data access between processors affects parallel efficiency, rather     than smaller problems. It is expected that new designs of processors &#91;11, 12&#93;     will optimize access to data and consequently the best parallel efficiencies     will be obtained.</font></li>     ]]></body>
<body><![CDATA[</ul>     <p>&nbsp;</p>     <p><font size="3" face="Verdana, Arial, Helvetica, sans-serif"><b>References</b></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;1&#93;</b> Rauber,   T. and Rünger, G., Parallel programming for multicore and cluster systems. Springer   Heidelberg Dordrecht London New York 2010. DOI 10.1007/978-3-642-04818-0.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000106&pid=S0012-7353201500030003000001&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;2&#93;</b> Muhammad,   A. I., Talat, A. and Mirza, S.H., Parallel matrix multiplication on multi-core   processors using SPC<sup>3</sup> PM, Proceedings of International Conference on   Latest Computational Technologies (ICLCT'2012), Bangkok, March 17-18, 2012 </font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000108&pid=S0012-7353201500030003000002&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;3&#93;</b> L'Excellent,   J.-Y. and Sid-Lakhdar, W.M., A study of shared-memory parallelism in a   multifrontal solver, Parallel Computing 40 (3-4), pp 34-46, 2014. DOI   10.1016/j.parco.2014.02.003 </font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000109&pid=S0012-7353201500030003000003&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --><!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;4&#93;</b> Quinn,   M.J., Parallel computing (2<sup>nd</sup> ed.): Theory and practice.   McGraw-Hill, Inc. New York, NY, USA 1994.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000110&pid=S0012-7353201500030003000004&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;5&#93;</b> Gupta,   A. and Kumar, V., Scalability of parallel algorithms for matrix multiplication,   Proceedings of International Conference on Parallel Processing, 3, pp. 115-123,   1991. DOI 10.1109/ICPP.1993.160.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000112&pid=S0012-7353201500030003000005&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;6&#93;</b> Alqadi,   Z.A.A., Aqel, M. and El Emary, I.M.M., Performance analysis and evaluation of   parallel matrix multiplication algorithms, World Applied Sciences Journal 5   (2), pp. 211-214, 2008.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000114&pid=S0012-7353201500030003000006&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;7&#93;</b> Choi, J., A new parallel matrix multiplication algorithm on distributed-memory concurrent computers, Technical Report CRPC-TR97758, Center for Research on Parallel Computation, Rice University, &#91;Online&#93; 1997. Available at <a href="http://citeseerx.ist.psu.edu/viewdoc/download" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/download?DOI=10.1.1.15.4213&amp;rep=rep1&amp;type=pdf</a>.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000116&pid=S0012-7353201500030003000007&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;8&#93;</b> Solomonik, E. and Demmel, J., Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms, Proceedings of Euro-Par'11, the 17<sup>th</sup> International Conference on Parallel Processing, Volume Part II, Springer-Verlag, Berlin, Heidelberg, 2011.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000118&pid=S0012-7353201500030003000008&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;9&#93;</b> Zavala-D&iacute;az, J.C., Optimizaci&oacute;n con c&oacute;mputo paralelo, teor&iacute;a y aplicaciones, M&eacute;xico, Ed. AmEditores, 2013.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000120&pid=S0012-7353201500030003000009&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;10&#93;</b> S&aacute;nchez, J. and Barral, H., Multiprocessor implementation models for adaptative algorithms, Journal IEEE Transactions on Signal Processing, USA, 44 (9), pp. 2319-2331, 1996.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000122&pid=S0012-7353201500030003000010&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;11&#93;</b> Un Nuevo Proyecto Financiado con Fondos Europeos trata de Lograr los Primeros Chips de Silicio para RAM &Oacute;ptica de 100 Gbps. Dyna, 87 (1), pp. 24, 2012.    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000124&pid=S0012-7353201500030003000011&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --> </font></p>     <!-- ref --><p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>&#91;12&#93;</b> La Polit&eacute;cnica de Valencia Patenta Mejoras en Procesadores. Dyna, 83 (8), pp. 155, 2008.    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[&#160;<a href="javascript:void(0);" onclick="javascript: window.open('/scielo.php?script=sci_nlinks&ref=000126&pid=S0012-7353201500030003000012&lng=','','width=640,height=500,resizable=yes,scrollbars=1,menubar=yes,');">Links</a>&#160;]<!-- end-ref --></font></p>     <p>&nbsp;</p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>J.C. Zavala-D&iacute;az</b>, received his PhD degree in Computational Science from the Monterrey Institute of Technology and Higher Education (ITESM), Mexico, in 1999. Since 1999 he has been a research professor at the Autonomous University of the State of Morelos, Mexico. He worked at the Electrical Research Institute (IIE), M&eacute;xico, from 1986 to 1994. His research interests include: parallel computing; modeling and problem solving of discrete and linear optimization; and optimization using metaheuristics.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>J. P&eacute;rez-Ortega,</b> received his PhD degree in Computational Science from the Monterrey Institute of Technology and Higher Education (ITESM), Mexico, in 1999. Since 2001 he has been a researcher at the National Center for Research and Technological Development (CENIDET). He worked at the Electrical Research Institute (IIE), M&eacute;xico, from 1985 to 2001. His research interests include: optimization using metaheuristics; NP-problems; combinatorial optimization; distributed systems; and software engineering.</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>E. Salazar-Res&eacute;ndiz,</b> received his MSc degree in Computational Science from the National Center for Research and Technological Development (CENIDET) in 2014. He works at the Electrical Research Institute (IIE).</font></p>     <p><font size="2" face="Verdana, Arial, Helvetica, sans-serif"><b>C. Guadarrama-Rogel</b>, received his MSc degree in Computational Science from the National Center for Research and Technological Development (CENIDET) in 2014. He works at the Electrical Research Institute (IIE).</font></p>     ]]></body>
<body><![CDATA[ ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Rauber]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Rünger]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
</person-group>
<source><![CDATA[Parallel programming for multicore and cluster systems.]]></source>
<year>2010</year>
<publisher-name><![CDATA[Springer Heidelberg Dordrecht]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Muhammad]]></surname>
<given-names><![CDATA[A. I.]]></given-names>
</name>
<name>
<surname><![CDATA[Talat]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Mirza]]></surname>
<given-names><![CDATA[S.H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Parallel matrix multiplication on multi-core processors using SPC³ PM]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ International Conference on Latest Computational Technologies (ICLCT'2012)]]></conf-name>
<conf-date>March 17-18, 2012</conf-date>
<conf-loc>Bangkok </conf-loc>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[L'Excellent]]></surname>
<given-names><![CDATA[J.-Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Sid-Lakhdar]]></surname>
<given-names><![CDATA[W.M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A study of shared-memory parallelism in a multifrontal solver]]></article-title>
<source><![CDATA[Parallel Computing]]></source>
<year>2014</year>
<volume>40</volume>
<numero>3-4</numero>
<issue>3-4</issue>
<page-range>34-46</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Quinn]]></surname>
<given-names><![CDATA[M.J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Parallel computing: Theory and practice]]></source>
<year>1994</year>
<publisher-loc><![CDATA[New York, NY]]></publisher-loc>
<publisher-name><![CDATA[McGraw-Hill, Inc.]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gupta]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Kumar]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Scalability of parallel algorithms for matrix multiplication]]></article-title>
<source><![CDATA[]]></source>
<year>1991</year>
<volume>3</volume>
<conf-name><![CDATA[ International Conference on Parallel Processing]]></conf-name>
<conf-loc> </conf-loc>
<page-range>115-123</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Alqadi]]></surname>
<given-names><![CDATA[Z.A.A.]]></given-names>
</name>
<name>
<surname><![CDATA[Aqel]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[El Emary]]></surname>
<given-names><![CDATA[I.M.M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Performance analysis and evaluation of parallel matrix multiplication algorithms]]></article-title>
<source><![CDATA[World Applied Sciences Journal]]></source>
<year>2008</year>
<volume>5</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>211-214</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Choi]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<collab>Rice University, Center for Research on Parallel Computation</collab>
<source><![CDATA[A new parallel matrix multiplication algorithm on distributed-memory concurrent computers: Technical Report CRPC-TR97758]]></source>
<year>1997</year>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Solomonik]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Demmel]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms]]></article-title>
<source><![CDATA[]]></source>
<year></year>
<conf-name><![CDATA[ Euro-Par'11 Proceedings of the 17th international conference on Parallel processing]]></conf-name>
<conf-date>2011</conf-date>
<conf-loc>Heidelberg </conf-loc>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Zavala-Díaz]]></surname>
<given-names><![CDATA[J.C.]]></given-names>
</name>
</person-group>
<source><![CDATA[Optimización con cómputo paralelo, teoría y aplicaciones]]></source>
<year>2013</year>
<publisher-loc><![CDATA[México ]]></publisher-loc>
<publisher-name><![CDATA[AmEditores]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sánchez]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Barral]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Multiprocessor implementation models for adaptative algorithms]]></article-title>
<source><![CDATA[Journal IEEE Transactions on Signal Processing]]></source>
<year>1996</year>
<volume>44</volume>
<numero>9</numero>
<issue>9</issue>
<page-range>2319-2331</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="journal">
<article-title xml:lang="es"><![CDATA[Un Nuevo Proyecto Financiado con Fondos Europeos trata de Lograr los Primeros Chips de Silicio para RAM Óptica de 100 Gbps]]></article-title>
<source><![CDATA[Dyna]]></source>
<year>2012</year>
<volume>87</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>24</page-range></nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="journal">
<article-title xml:lang="es"><![CDATA[La Politécnica de Valencia Patenta Mejoras en Procesadores.]]></article-title>
<source><![CDATA[Dyna]]></source>
<year>2008</year>
<volume>83</volume>
<numero>8</numero>
<issue>8</issue>
<page-range>155</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
