void **env_base;
void (*env_blasth_signal_value)();
void TH_DAXPY();

void blasth_daxpy(const int *n, double *alpha, double *X,
                  const int *incx, double *Y, const int *incy)
{
    /* Realize Y = *alpha * X + Y, where X and Y are vectors of size *n
       with respective increments of *incx and *incy.
       Executed by the master from the application program. */
    env_base = (void **)&n;
    env_blasth_signal_value = TH_DAXPY;
    /* tell the slave there is some job to do */
    blasth_master_sync();
    /* some job */
    /* wait for the slave */
    blasth_master_sync_end();
}

void blasth()
{
    /* Executed by the slave from the environment setup. */
    while (1) {
        /* wait for the master */
        blasth_sync();
        /* call the function set by the master */
        env_blasth_signal_value();
    }
}

void TH_DAXPY()
{
    /* At this point env_base contains a pointer to the first needed
       parameter:
         (int *)env_base[0]    points to the size of the vectors (*n)
         (double *)env_base[1] points to the scaling factor (*alpha)
         (double *)env_base[2] points to the first element of vector X
         ...                                                           */
    /* some job */
    /* tell the master that the job is finished */
    blasth_sync_end();
}

The blasth_daxpy calling sequence is identical to the daxpy calling sequence from a C program (the BLAS library is originally written in Fortran 77, so the API is f77-compliant). The parameters are written before the synchronization variable, so the strong memory ordering (for write operations) of the Pentium processor family ensures that the slave process sees exactly the same parameters in TH_DAXPY as the master does in blasth_daxpy.
Data sharing is done by splitting the result between the master and the slave: if the result is a vector, the master constructs the first half and the slave constructs the second half; if the result is a matrix of size m x n, the master constructs either the first n/2 columns or the first m/2 rows and the slave constructs the remaining columns or rows. We show splitting examples in the figure for dgemv and dgemm (matrix-vector and matrix-matrix products, respectively).
We do not use a cyclic split of the data, in order to avoid cache-line sharing between processors (especially when writing data). The splittings are also chosen to avoid temporary data, which would require dynamic allocation.