Apache Spark 2.x Machine Learning Cookbook

上QQ阅读APP看书，第一时间看更新

There's more...

To reiterate, in a lot of machine learning applications, you end up dealing with sparsity due to the large dimensional nature of the feature space that is not linearly distributed. To illustrate, we take the simplest case in which we have 10 customers indicating their affinity for four themes in the product line:

As you can see, most of the elements are 0 and storing them as a dense matrix is not desirable while we increase the number of customers and themes to tens of millions (M x N). The SparseVector and matrix help with the storage and operation of these sparse structures in an efficient way.