Apache Spark 2.x Machine Learning Cookbook

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.{SparkSession}
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
import Array._
import org.apache.spark.mllib.linalg.DenseMatrix
import org.apache.spark.mllib.linalg.SparseVector
  3. Set up the Spark session and application parameters so Spark can run:
val spark = SparkSession
.builder
.master("local[*]")
.appName("myVectorMatrix")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
  4. Create the sparse matrix and the two dense vectors used in the following steps:
val sparseMat33= Matrices.sparse(3,3 ,Array(0, 2, 3, 6) ,Array(0, 2, 1, 0, 1, 2),Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val denseFeatureVector= Vectors.dense(1,2,1)
val denseVec13 = Vectors.dense(5,3,0)
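The later steps also refer to denseMat1, sparseMat1, and denseMat33, which are not defined in this section. The sketch below shows definitions that would be consistent with the output printed in those steps; the exact constructor calls are an assumption reconstructed from that output, not taken from the recipe. Note that dense values are given in column-major order and that sparse matrices use the CSC (compressed sparse column) layout.

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Matrix}

// Assumed definitions, reconstructed from the output shown in later steps.
// Dense values are listed column by column (column-major order).
val denseMat1: Matrix = Matrices.dense(3, 3,
  Array(23.0, 11.0, 17.0,   // column 0
        34.3, 33.0, 24.5,   // column 1
        21.3, 22.6, 22.2))  // column 2

// CSC layout: colPtrs (0, 1, 3) says column 0 stores 1 value and column 1 stores 2;
// rowIndices (0, 1, 2) gives the row of each stored value.
val sparseMat1: Matrix = Matrices.sparse(3, 2,
  Array(0, 1, 3), Array(0, 1, 2), Array(11.0, 22.0, 33.0))

// 3 x 3 dense matrix whose columns are (1,2,3), (4,5,6), (7,8,9).
val denseMat33: DenseMatrix = new DenseMatrix(3, 3,
  Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0))
```

These local matrices do not depend on the Spark session, so the definitions can be placed anywhere before the steps that use them.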
  5. Multiply the matrix by the vector and print the result. Matrix-vector multiplication recurs throughout Spark ML. We use a SparseMatrix here to show that dense and sparse matrices are interchangeable in this API; the choice between them should rest only on density (that is, the percentage of non-zero elements) and performance:
val result0 = sparseMat33.multiply(denseFeatureVector)
println("SparseMat33 =", sparseMat33)
println("denseFeatureVector =", denseFeatureVector)
println("SparseMat33 * DenseFeatureVector =", result0)

The output is as follows:

(SparseMat33 =,3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0)
denseFeatureVector =,[1.0,2.0,1.0]
SparseMat33 * DenseFeatureVector = [5.0,11.0,8.0]
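To illustrate the interchangeability claim, the following sketch multiplies the same matrix by the same vector twice, once in sparse form and once after converting it with toDense. The variable names (sparseForm, denseForm, featureVec) are made up for this illustration; the values are those of sparseMat33 and denseFeatureVector above.

```scala
import org.apache.spark.mllib.linalg.{SparseMatrix, Vectors}

// Same CSC arrays as sparseMat33 in the recipe.
val sparseForm = new SparseMatrix(3, 3,
  Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2),
  Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val featureVec = Vectors.dense(1.0, 2.0, 1.0)

// Dense copy of the same matrix; only the storage format changes.
val denseForm = sparseForm.toDense

val viaSparse = sparseForm.multiply(featureVec)
val viaDense = denseForm.multiply(featureVec)

println(viaSparse)
println(viaDense)
// Both storage formats should yield the same product.
println(viaSparse.toArray.sameElements(viaDense.toArray))
```

Because both forms expose the same multiply method, switching storage format should not require changes to the calling code.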
  6. Multiply a DenseMatrix by a DenseVector.

This is included for completeness; with no sparsity to keep track of, the matrix and vector multiplication is easier to follow:

println("denseVec13 =", denseVec13)
println("denseMat1 =", denseMat1)
val result3 = denseMat1.multiply(denseVec13)
println("denseMat1 * denseVec13 =", result3)

The output is as follows:

    denseVec13 =,[5.0,3.0,0.0]
    denseMat1 =  23.0  34.3  21.3
                 11.0  33.0  22.6
                 17.0  24.5  22.2
    denseMat1 * denseVec13 =,[217.89,154.0,158.5]
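As a cross-check of the arithmetic, the product can be recomputed by hand in plain Scala without any Spark classes: each output element is the dot product of one matrix row with the vector. The rows below are copied from the printed denseMat1, and the names rows, vec, and product are introduced only for this sketch.

```scala
// Rows of denseMat1 as printed above, and the vector denseVec13.
val rows = Array(
  Array(23.0, 34.3, 21.3),
  Array(11.0, 33.0, 22.6),
  Array(17.0, 24.5, 22.2))
val vec = Array(5.0, 3.0, 0.0)

// Each output element is the dot product of one matrix row with the vector,
// for example row 0: 23.0 * 5 + 34.3 * 3 + 21.3 * 0.
val product = rows.map(row => row.zip(vec).map { case (a, b) => a * b }.sum)

println(product.mkString("[", ",", "]"))
```

The first element may differ from 217.9 in its last digits because of floating-point rounding, which would explain the 217.89 shown in the output above.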
  
  7. Transpose a matrix, that is, swap its rows with its columns. Transposition is a routine operation in Spark ML and data engineering work.

Here we demonstrate two steps:

    1. Transposing a SparseMatrix and examining the new resulting matrix via the output:
val transposedMat1 = sparseMat1.transpose
println("Original sparseMat1 =", sparseMat1)
println("transposedMat1=\n", transposedMat1)

The output is as follows:

    
(Original sparseMat1 =,3 x 2 CSCMatrix
(0,0) 11.0
(1,1) 22.0
(2,1) 33.0)
(transposedMat1=,2 x 3 CSCMatrix
(0,0) 11.0
(1,1) 22.0
(1,2) 33.0)
    1. Demonstrating that the transpose of a transpose yields the original matrix:
println("Matrix transposed twice=", denseMat33.transpose.transpose) // we get the original back

The output is as follows:

Matrix transposed twice=
1.0 4.0 7.0
2.0 5.0 8.0
3.0 6.0 9.0

Next, transpose a dense matrix and examine the result. With every element stored, it is easier to see how the row and column indexes are swapped:

val transposedMat2 = denseMat1.transpose
println("Original denseMat1 =", denseMat1)
println("transposedMat2=", transposedMat2)

The output is as follows:

Original denseMat1 =
23.0 34.3 21.3
11.0 33.0 22.6
17.0 24.5 22.2
transposedMat2=
23.0 11.0 17.0
34.3 33.0 24.5
21.3 22.6 22.2
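The two transpose properties shown above can also be checked programmatically rather than by reading the printed matrices. The sketch below rebuilds the dense matrix from the printed values (the names m, mT, indexesSwapped, and roundTrip are introduced only here) and verifies that entry (i, j) of the original equals entry (j, i) of the transpose, and that transposing twice restores the original.

```scala
import org.apache.spark.mllib.linalg.DenseMatrix

// Same values as the dense matrix printed above, in column-major order.
val m = new DenseMatrix(3, 3,
  Array(23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3, 22.6, 22.2))
val mT = m.transpose

// Entry (i, j) of the original sits at (j, i) in the transpose.
val indexesSwapped = (for {
  i <- 0 until m.numRows
  j <- 0 until m.numCols
} yield m(i, j) == mT(j, i)).forall(identity)

// Transposing twice restores the original values.
val roundTrip = mT.transpose.toArray.sameElements(m.toArray)

println(s"indexes swapped: $indexesSwapped, round trip: $roundTrip")
```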
  8. We now look at matrix multiplication and how it looks in code.

We declare two 2x2 dense matrices:

// Matrix multiplication
val dMat1: DenseMatrix = new DenseMatrix(2, 2, Array(1.0, 3.0, 2.0, 4.0))
val dMat2: DenseMatrix = new DenseMatrix(2, 2, Array(2.0, 1.0, 0.0, 2.0))

println("dMat1 =", dMat1)
println("dMat2 =", dMat2)
println("dMat1 * dMat2 =", dMat1.multiply(dMat2)) // A x B
println("dMat2 * dMat1 =", dMat2.multiply(dMat1)) // B x A, not the same as A x B

The output is as follows:

dMat1 =,1.0  2.0
        3.0  4.0
dMat2 =,2.0  0.0
        1.0  2.0
dMat1 * dMat2 =,4.0   4.0
                10.0  8.0
//Note: A x B is not the same as B x A
dMat2 * dMat1 =,2.0  4.0
                7.0  10.0
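The non-commutativity shown above can be confirmed in code, and it ties back to the transpose step through the identity (A x B)^T = B^T x A^T. The following is a minimal sketch using the same two matrices; the names a, b, ab, ba, commutes, and transposeRuleHolds are introduced only for this illustration.

```scala
import org.apache.spark.mllib.linalg.DenseMatrix

val a = new DenseMatrix(2, 2, Array(1.0, 3.0, 2.0, 4.0)) // rows: (1 2), (3 4)
val b = new DenseMatrix(2, 2, Array(2.0, 1.0, 0.0, 2.0)) // rows: (2 0), (1 2)

val ab = a.multiply(b) // A x B
val ba = b.multiply(a) // B x A

// Matrix multiplication is not commutative in general.
val commutes = ab.toArray.sameElements(ba.toArray)

// Transposing a product reverses the order of the factors.
val transposeRuleHolds =
  ab.transpose.toArray.sameElements(b.transpose.multiply(a.transpose).toArray)

println(s"A x B == B x A: $commutes")
println(s"(A x B)^T == B^T x A^T: $transposeRuleHolds")
```

Comparing via toArray avoids depending on how each matrix is stored internally, since toArray always returns the values in column-major order.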