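Vector Data for Machine Learning: DenseVector & SparseVector

Preface

The data structures used by Spark machine learning algorithms, logistic regression for example, are all vectors, that is, a spatial vector such as [0,1,0,0,1,1,0,1] (in a programming language you can also think of it as an array).

DenseVector (dense)

Creating a DenseVector from an array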

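// Import DenseVector
scala> import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.DenseVector
// Import the Vectors factory
// Note: this factory comes from the RDD-based org.apache.spark.mllib.linalg package,
// whose vector types are distinct from those in org.apache.spark.ml.linalg
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
import org.apache.spark.mllib.linalg.{Vectors, Vector}

// Create a DenseVector from an array (it looks just like an ordinary array, but in a data model we treat it as a vector in a vector space)
scala> val denseV = new DenseVector(Array(0,0,1,1,1,0,0))
denseV: org.apache.spark.ml.linalg.DenseVector = [0.0,0.0,1.0,1.0,1.0,0.0,0.0]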

Creating a DenseVector with the factory method

// Create a dense vector with the factory method provided by MLlib's Vectors
scala> Vectors.dense(0,0,1,1,1,0,0)
res2: org.apache.spark.mllib.linalg.Vector = [0.0,0.0,1.0,1.0,1.0,0.0,0.0]

SparseVector (sparse)

// Import SparseVector
scala> import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.linalg.SparseVector
// Import the Vectors factory
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
import org.apache.spark.mllib.linalg.{Vectors, Vector}

// The sparse vector below holds the same values as the dense vector above. Both DenseVector and SparseVector store numeric values (as Double), so every position not explicitly specified in a SparseVector defaults to 0
// The call below is equivalent to an array of length 7 whose entries at indices 2, 3 and 4 are 1, with all other entries defaulting to 0
scala> val sparseV = new SparseVector(7, Array(2, 3, 4), Array(1, 1, 1))
sparseV: org.apache.spark.ml.linalg.SparseVector = (7,[2,3,4],[1.0,1.0,1.0])
scala> sparseV(0)
res11: Double = 0.0
scala> sparseV(2)
res12: Double = 1.0
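
To make the (size, indices, values) representation more concrete, here is a minimal plain-Scala sketch of how a lookup such as sparseV(0) could be resolved. The names SparseSketch and lookup are hypothetical and this is not Spark's implementation; the sketch assumes the indices array is sorted in ascending order.

```scala
import java.util.Arrays

// Hypothetical illustration, not Spark's implementation.
object SparseSketch {
  // Returns the value at position i of a sparse vector described by
  // (size, indices, values); positions absent from `indices` are 0.0.
  def lookup(size: Int, indices: Array[Int], values: Array[Double], i: Int): Double = {
    require(i >= 0 && i < size, s"index $i out of range [0, $size)")
    // binarySearch assumes `indices` is sorted ascending;
    // a negative result means position i is not stored explicitly.
    val pos = Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0
  }
}
```

With the vector from the transcript, SparseSketch.lookup(7, Array(2, 3, 4), Array(1.0, 1.0, 1.0), 0) yields 0.0 and the same call with index 2 yields 1.0.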

Creating a SparseVector with the factory method

scala> val sparseV = Vectors.sparse(7, Array(2, 3, 4), Array(1, 1, 1))
sparseV: org.apache.spark.mllib.linalg.Vector = (7,[2,3,4],[1.0,1.0,1.0])
scala> sparseV(0)
res6: Double = 0.0

Converting between DenseVector and SparseVector

// DenseVector to SparseVector
scala> val denseV = new DenseVector(Array(0,0,1,1,1,0,0))
denseV: org.apache.spark.ml.linalg.DenseVector = [0.0,0.0,1.0,1.0,1.0,0.0,0.0]
scala> val sparseV = denseV.toSparse
sparseV: org.apache.spark.ml.linalg.SparseVector = (7,[2,3,4],[1.0,1.0,1.0])

// SparseVector to DenseVector
scala> val denseV = sparseV.toDense
denseV: org.apache.spark.ml.linalg.DenseVector = [0.0,0.0,1.0,1.0,1.0,0.0,0.0]
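
Conceptually, the two conversions are simple. The following plain-Scala sketch shows one way they could work on raw arrays; ConvertSketch and its two methods are hypothetical names for illustration and do not reproduce Spark's internals.

```scala
// Hypothetical illustration, not Spark's implementation.
object ConvertSketch {
  // Dense -> sparse: keep only the non-zero entries and remember their positions.
  def toSparse(dense: Array[Double]): (Int, Array[Int], Array[Double]) = {
    val nonZero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    (dense.length, nonZero.map(_._2), nonZero.map(_._1))
  }

  // Sparse -> dense: start from all zeros and write each stored value back.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = new Array[Double](size) // every slot is initialised to 0.0
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }
}
```

Applied to Array(0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0), toSparse produces size 7, indices 2, 3, 4 and values 1.0, 1.0, 1.0, the same triple shown in the transcript, and toDense turns that triple back into the original array.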

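Summary

Because a SparseVector stores only the explicitly specified entries and their indices, it can take up considerably less space when most entries are zero. In many deep learning computations most of the data is sparse, so this representation reduces the memory used during computation, which is why many computations in Spark MLlib use the SparseVector data type.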
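
As a rough back-of-the-envelope check of the space argument, the sketch below estimates payload sizes only. StorageSketch and its methods are hypothetical, and the estimate assumes 8 bytes per Double and 4 bytes per Int index while ignoring JVM object and array headers.

```scala
// Hypothetical illustration: rough payload sizes, ignoring JVM object and array headers.
object StorageSketch {
  // A dense vector stores one 8-byte Double per position.
  def denseBytes(size: Int): Long = 8L * size

  // A sparse vector stores a 4-byte Int index plus an 8-byte Double per stored entry.
  def sparseBytes(nonZeros: Int): Long = (4L + 8L) * nonZeros
}
```

For the 7-element example with 3 non-zero entries this gives 56 bytes dense versus 36 bytes sparse. For a vector of length 1,000,000 with 1,000 non-zero entries it gives 8,000,000 bytes versus 12,000 bytes. Under this estimate the sparse form is smaller only while fewer than about two thirds of the entries are non-zero.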
