Apache Spark - Scala - reduceByKey - with keys repeating up to twice only


Given the following RDD:

val vectors: RDD[(String, Int)] = // ((k1,v1), (k1,v2), (k2,v3), ...)

where keys appear either twice (k1) or once (k2), never more than that, I want to get:

val uniqVectors: RDD[(String, Int)] = // ((k1, v1*v2), (k2, v3), ...)

One approach is to use reduceByKey:

val uniqVectors = vectors.reduceByKey((a, b) => a * b)

However, it's slow for arrays with 7B elements. Is there a faster approach for this specific case?

What (probably) takes the time here is shuffling the data: when you want to group two or more records together, they must reside within the same partition, so Spark first has to shuffle the records so that all records with the same key end up in a single partition.
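For illustration (a sketch, not part of the original question): reduceByKey reuses an RDD's existing partitioner when one is present, so if you pay the shuffle once up front with partitionBy, subsequent reduceByKey calls on that RDD run without a further shuffle. This only helps if the partitioned RDD is reused or the data already arrives hash-partitioned by key; the names below follow the question.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Pay the shuffle once; reduceByKey then sees a matching partitioner
// and runs as a narrow, shuffle-free stage.
val partitioned: RDD[(String, Int)] =
  vectors.partitionBy(new HashPartitioner(vectors.getNumPartitions)).cache()

val uniqVectors = partitioned.reduceByKey((a, b) => a * b) // no shuffle here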

Now, even if each key has 2 records at most, this shuffle still has to take place, unless you can somehow guarantee that each key is contained in a single partition - for example, if you loaded the RDD from HDFS and somehow know that each key resides in a single file part to begin with. In that (unlikely) case, you can use mapPartitions to perform the grouping on each partition separately, saving the shuffle:

vectors.mapPartitions(
  iter => iter.toList
    .groupBy(_._1)
    .map { case (k, list) => (k, list.map(_._2).reduce(_ * _)) }
    .iterator,
  preservesPartitioning = true)
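If a partition is too large to materialize as a List, a variant of the same idea (a sketch under the same one-partition-per-key assumption, not from the original answer) folds the iterator into a mutable map instead, multiplying values together as duplicate keys are seen:

import scala.collection.mutable

vectors.mapPartitions({ iter =>
  val acc = mutable.HashMap.empty[String, Int]
  iter.foreach { case (k, v) =>
    // multiply into the running product for k, or start it at v
    acc(k) = acc.get(k).map(_ * v).getOrElse(v)
  }
  acc.iterator
}, preservesPartitioning = true)

As above, this only produces correct results when every key is confined to a single partition.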

None of this is specific to the special case where the maximum repetition of each key is 2, by the way.

