Apache Spark - Scala - reduceByKey with keys repeating up to twice only
Given the following RDD:

val vectors: RDD[(String, Int)] = ((k1, v1), (k1, v2), (k2, v3), ...)

where keys appear either twice (k1) or once (k2), and never more than that, I want to get:

val uniqVectors: RDD[(String, Int)] = ((k1, v1*v2), (k2, v3), ...)
One approach is to use reduceByKey:

val uniqVectors = vectors.reduceByKey((a, b) => a * b)

However, this is slow for arrays with 7B elements. Is there a faster approach for this specific case?
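For concreteness, here is a minimal, self-contained version of this setup (local master and toy data stand in for the real 7B-element RDD; the names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object UniqVectors {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("uniq-vectors").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Each key appears at most twice.
        val vectors = sc.parallelize(Seq(("k1", 2), ("k1", 3), ("k2", 5)))

        // Multiply the two values of a duplicated key; singleton keys pass through.
        val uniqVectors = vectors.reduceByKey((a, b) => a * b)

        uniqVectors.collect().foreach(println) // (k1,6), (k2,5) -- order may vary

        sc.stop()
      }
    }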
What (probably) takes the time here is shuffling the data: whenever you want to group two or more records together, they must reside within the same partition, so Spark first has to shuffle the records so that all records with the same key end up in a single partition.
Now, even if each key has 2 records at most, the shuffle would still have to take place, unless you can somehow guarantee that each key is already contained in a single partition, for example if you loaded this RDD from HDFS and somehow know that each key resides in a single file part to begin with. In that (unlikely) case, you can use mapPartitions to perform the grouping on each partition separately, saving the shuffle:
vectors.mapPartitions(iter =>
  iter.toList
    .groupBy(_._1)
    .map { case (k, list) => (k, list.map(_._2).reduce(_ * _)) }
    .iterator,
  preservesPartitioning = true
)
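Note that iter.toList materializes the entire partition in memory before grouping, so this only works if each partition comfortably fits on the executor. A sketch of a variant that folds the iterator into a mutable map instead (this variant is my own, not part of the answer above):

    import scala.collection.mutable

    vectors.mapPartitions(iter => {
      // Accumulate per-key products without building an intermediate List:
      // the first occurrence of a key stores its value, a second occurrence
      // multiplies into it.
      val acc = mutable.HashMap.empty[String, Int]
      iter.foreach { case (k, v) =>
        acc(k) = acc.get(k).map(_ * v).getOrElse(v)
      }
      acc.iterator
    }, preservesPartitioning = true)

This still holds one entry per distinct key in the partition, but it skips the intermediate List and the per-key buffers that groupBy builds.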
None of this is special to the case where the maximum repetition of each key is 2, by the way.
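A related option, assuming the RDD is reused across several operations: pre-partition it once with a HashPartitioner and persist it. Spark remembers the partitioner, so subsequent key-based operations such as reduceByKey reuse the existing layout instead of shuffling again. This is only a sketch, and it only pays off when the one up-front shuffle is amortized over multiple operations:

    import org.apache.spark.HashPartitioner

    // One shuffle here; the resulting layout is remembered via the partitioner...
    val partitioned = vectors
      .partitionBy(new HashPartitioner(vectors.partitions.length))
      .persist()

    // ...so this reduceByKey (and any later key-based op on partitioned)
    // avoids a new shuffle.
    val uniqVectors = partitioned.reduceByKey((a, b) => a * b)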