Apache Spark 2.x Machine Learning Cookbook

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
import Array._
import org.apache.spark.mllib.linalg.SparseVector
  3. Set up the Spark session and application parameters so Spark can run:
val spark = SparkSession
.builder
.master("local[*]")
.appName("myVectorMatrix")
.config("spark.sql.warehouse.dir", ".")
.getOrCreate()
  4. Here we look at creating a local dense matrix from a Scala array. Define a 2x2 dense matrix and initialize it with an array. The values are read in column-major order, so the first two values form the first column:
val MyArray1= Array(10.0, 11.0, 20.0, 30.3)
val denseMat3 = Matrices.dense(2,2,MyArray1)

The output is as follows:

DenseMat3=
10.0 20.0
11.0 30.3
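
The output above shows that the flat array fills the matrix column by column. As a minimal sketch of that layout, the plain Scala snippet below maps a (row, column) pair to a position in the array. The object name `ColumnMajorSketch` and the helper `at` are hypothetical names used for illustration and are not part of Spark:

```scala
// Hypothetical helper (not part of Spark): reads element (i, j) from a
// flat array laid out in column-major order, as the output above suggests.
object ColumnMajorSketch {
  def at(values: Array[Double], numRows: Int, i: Int, j: Int): Double =
    values(i + j * numRows)

  def main(args: Array[String]): Unit = {
    val myArray1 = Array(10.0, 11.0, 20.0, 30.3)
    val numRows = 2
    // Row 0 takes positions 0 and 2 of the flat array; row 1 takes 1 and 3.
    println(s"${at(myArray1, numRows, 0, 0)} ${at(myArray1, numRows, 0, 1)}")
    println(s"${at(myArray1, numRows, 1, 0)} ${at(myArray1, numRows, 1, 1)}")
  }
}
```

Running `main` prints the same two rows as the matrix output, which is a quick way to check the ordering of an input array before passing it to `Matrices.dense`.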

Constructing a dense matrix and assigning values via initialization in a single step:

Construct a dense local matrix directly by defining the array inline. The array has nine elements, which fill a 3x3 matrix column by column, so you can think of it as three column vectors of three elements each:

val denseMat1 = Matrices.dense(3, 3, Array(23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3, 22.6, 22.2))

The output is as follows:

    denseMat1=
    23.0  34.3  21.3  
    11.0  33.0  22.6  
    17.0  24.5  22.2
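
To inspect a local matrix like this one, the following sketch reads its shape, looks up individual elements, and prints its transpose. It assumes the Spark MLlib JAR is on the classpath; the object name `DenseMatrixAccessSketch` is hypothetical, and no Spark session is needed because local matrices live on the driver:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hypothetical wrapper object, used only to make the sketch runnable.
object DenseMatrixAccessSketch {
  // Same nine values as denseMat1, supplied column by column.
  val denseMat1: Matrix = Matrices.dense(3, 3,
    Array(23.0, 11.0, 17.0, 34.3, 33.0, 24.5, 21.3, 22.6, 22.2))

  def main(args: Array[String]): Unit = {
    println(s"shape = ${denseMat1.numRows} x ${denseMat1.numCols}")
    // apply(i, j) takes zero-based row and column indices.
    println(s"row 0, col 1 = ${denseMat1(0, 1)}") // 34.3
    println(s"row 2, col 0 = ${denseMat1(2, 0)}") // 17.0
    // transpose swaps rows and columns, which can help when the
    // source data was written down row by row.
    println(denseMat1.transpose)
  }
}
```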
  

Here is another example of inline instantiation, this time building a dense local matrix from vectors. This is a common case: you collect vectors into a matrix (in column order) and then perform an operation on the entire set. Most often, you collect the vectors and then use a distributed matrix to run the operation in parallel.

In Scala, the ++ operator concatenates arrays:

val v1 = Vectors.dense(5,6,2,5)
val v2 = Vectors.dense(8,7,6,7)
val v3 = Vectors.dense(3,6,9,1)
val v4 = Vectors.dense(7,4,9,2)

val Mat11 = Matrices.dense(4,4,v1.toArray ++ v2.toArray ++ v3.toArray ++ v4.toArray)
println("Mat11=\n" + Mat11)

The output is as follows:

    Mat11=
    5.0  8.0  3.0  7.0
    6.0  7.0  6.0  4.0
    2.0  6.0  9.0  9.0
    5.0  7.0  1.0  2.0
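
When the number of vectors is not fixed, chaining ++ by hand becomes awkward. The sketch below shows one possible way to generalize the idea, treating each vector in a sequence as one column. The object `VectorsToMatrixSketch` and the helper `fromColumns` are hypothetical names, not part of Spark, and the sketch assumes the Spark MLlib JAR is on the classpath:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}

object VectorsToMatrixSketch {
  // Hypothetical helper (not part of Spark): each vector becomes one column.
  def fromColumns(cols: Seq[Vector]): Matrix = {
    require(cols.nonEmpty, "need at least one column")
    val numRows = cols.head.size
    require(cols.forall(_.size == numRows), "all columns must have the same length")
    // Concatenating the columns in order gives the column-major layout
    // that Matrices.dense expects.
    Matrices.dense(numRows, cols.length, cols.flatMap(_.toArray).toArray)
  }

  def main(args: Array[String]): Unit = {
    val mat = fromColumns(Seq(
      Vectors.dense(5.0, 6.0, 2.0, 5.0),
      Vectors.dense(8.0, 7.0, 6.0, 7.0),
      Vectors.dense(3.0, 6.0, 9.0, 1.0),
      Vectors.dense(7.0, 4.0, 9.0, 2.0)))
    println("mat=\n" + mat)
  }
}
```

The length check is a design choice for this sketch: it fails early with a clear message rather than letting a ragged input produce a size mismatch inside `Matrices.dense`.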