Apache Spark 2.x Machine Learning Cookbook

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
  3. Set up the Spark session and application parameters so Spark can run. See the first recipe in this chapter for more details and variations:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myVectorMatrix")
  .config("spark.sql.warehouse.dir", ".")
  .getOrCreate()
  4. Here we look at creating an MLlib SparseVector that corresponds to its equivalent DenseVector. The call takes three parameters: the size of the vector, the indices of the non-zero entries, and finally, the data itself.

In the following example, we can compare dense versus sparse vector creation. As you can see, the four non-zero elements (5, 3, 8, 9) correspond to positions (0, 2, 18, 19), while the number 20 indicates the total size:

val denseVec1 = Vectors.dense(5,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,9)
val sparseVec1 = Vectors.sparse(20, Array(0, 2, 18, 19), Array(5, 3, 8, 9))
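As a side note, the (size, indices, values) triple that Vectors.sparse() expects can be derived mechanically from any dense array. The following pure-Scala sketch (the helper logic here is our own illustration, not part of Spark's API) shows how the indices and values in the preceding call line up with the dense data:

```scala
// Illustrative sketch: derive the (size, indices, values) triple
// that Vectors.sparse() expects from a plain dense array.
val dense = Array(5.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 9.0)

// Pair each value with its position, keep only the non-zero entries,
// then split back into a values array and an indices array.
val (values, indices) = dense.zipWithIndex.filter(_._1 != 0.0).unzip

println(s"size = ${dense.length}")              // size = 20
println(s"indices = ${indices.mkString(",")}")  // indices = 0,2,18,19
println(s"values = ${values.mkString(",")}")    // values = 5.0,3.0,8.0,9.0
```

These are exactly the three arguments passed to Vectors.sparse(20, ...) above.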
  5. To understand the data structure better, we compare the output and some of the important attributes (size, numActives, and numNonzeros) that are especially helpful when manipulating vectors programmatically.

First we take a look at the printout for the DenseVector to see its representation:


println(denseVec1.size)
println(denseVec1.numActives)
println(denseVec1.numNonzeros)
println("denseVec1 presentation = ", denseVec1)

The output is as follows:

denseVec1.size = 20

denseVec1.numActives = 20

denseVec1.numNonzeros = 4

(denseVec1 presentation = ,[5.0,0.0,3.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,9.0])

  6. Next, we take a look at the printout for the SparseVector to see its internal representation:
println(sparseVec1.size)
println(sparseVec1.numActives)
println(sparseVec1.numNonzeros)
println("sparseVec1 presentation = ",sparseVec1)

If you compare and contrast the internal representations and the total number of elements versus the active and non-zero counts, you will see that the SparseVector stores only the non-zero elements and their indices, which reduces the storage requirement.

The output is as follows:

sparseVec1.size = 20
sparseVec1.numActives = 4
sparseVec1.numNonzeros = 4
(sparseVec1 presentation = ,(20,[0,2,18,19],[5.0,3.0,8.0,9.0]))
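To make the storage saving concrete, here is a rough back-of-the-envelope estimate. The byte sizes below are illustrative assumptions (an 8-byte double per value, a 4-byte int per index), not Spark's exact on-heap layout:

```scala
// Rough, illustrative storage estimate -- not Spark's exact memory layout.
val size = 20   // total elements in the vector
val nnz  = 4    // non-zero elements (numNonzeros)

val denseBytes  = size * 8       // dense: one 8-byte double per element
val sparseBytes = nnz * (8 + 4)  // sparse: 8-byte value + 4-byte index per non-zero

println(s"dense ~ $denseBytes bytes, sparse ~ $sparseBytes bytes")
// dense ~ 160 bytes, sparse ~ 48 bytes
```

The gap widens quickly as vectors get larger and sparser, which is why sparse storage matters for high-dimensional feature vectors.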
  7. We can convert back and forth between sparse and dense vectors as needed. The reason you might want to do this is that external math and linear algebra libraries do not always conform to Spark's internal representation. We made the variable types explicit to make the point, but you can eliminate the extra declarations in actual practice:
val convertedDenseVect: DenseVector = sparseVec1.toDense
val convertedSparseVect: SparseVector = denseVec1.toSparse
println("convertedDenseVect =", convertedDenseVect)
println("convertedSparseVect =", convertedSparseVect)

The output is as follows:

(convertedDenseVect =,[5.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,9.0])

(convertedSparseVect =,(20,[0,2,18,19],[5.0,3.0,8.0,9.0]))
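The BreezeVector alias imported in step 2 is one example of an external linear algebra library you might convert to. The round trip below is a sketch, assuming Spark and Breeze are on the classpath and that denseVec1 is in scope from the earlier steps; toArray is the bridge in both directions:

```scala
// Sketch: move data between Spark's MLlib vectors and Breeze via toArray.
// Assumes the imports from step 2 and denseVec1 from step 4.
val breezeVec: BreezeVector[Double] = new BreezeVector(denseVec1.toArray)

// Do any Breeze-side math here, then hand the data back to Spark.
val backToSpark = Vectors.dense(breezeVec.toArray)

println(breezeVec)
println(backToSpark)
```

Note that both conversions copy the underlying array, so for very large vectors you may want to minimize how often you cross the boundary.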