pyspark - Spark Delete Rows


I have a dataframe containing 20k rows.

I want to delete 186 rows at random from the dataset.

To understand the context - I am testing a classification model on missing data, and each row has a unix timestamp. 186 rows corresponds to 3 seconds (there are 62 rows of data per second).

My aim is to simulate data going missing for a number of seconds while the data is streaming. Since I am extracting features over a time window, I want to see how the missing data affects model performance.

I think the best approach is to convert to an RDD and use the filter function, like this, and put the logic inside the filter function.

dataframe.rdd.zipWithIndex().filter(lambda x: ...)

But I am stuck on the logic - how do I implement this? (using pyspark)

Try this:

import random

# pick a random starting index and drop a contiguous block of 186 rows
# (3 seconds at 62 rows per second)
startval = random.randint(0, dataframe.count() - 186)

# zipWithIndex pairs each row with its index: (row, index)
dataframe.rdd.zipWithIndex()\
             .filter(lambda x: x[1] not in range(startval, startval + 186))

This should work!
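As a follow-up, here is a minimal end-to-end sketch of the same idea, assuming `spark` is an active SparkSession and `dataframe` is the 20k-row DataFrame from the question (both names are assumptions, not from the original post). It drops the index again after filtering and rebuilds a DataFrame with the original schema so the rest of the pipeline can keep working with DataFrames.

import random

# assumed names: `dataframe` (input DataFrame), `spark` (SparkSession)
rows_to_drop = 186                            # 3 seconds at 62 rows per second
startval = random.randint(0, dataframe.count() - rows_to_drop)

# zipWithIndex yields (row, index) pairs; keep rows outside the gap,
# then strip the index so only the original Row objects remain
filtered_rdd = dataframe.rdd.zipWithIndex()\
    .filter(lambda x: x[1] not in range(startval, startval + rows_to_drop))\
    .map(lambda x: x[0])

# rebuild a DataFrame with the original schema
reduced_df = spark.createDataFrame(filtered_rdd, schema=dataframe.schema)

print(dataframe.count(), reduced_df.count())  # e.g. 20000 -> 19814

One thing to watch: zipWithIndex assigns indices in partition order, so if the rows are not already ordered by the timestamp column, it may be worth sorting on that column first so the dropped block really is 3 contiguous seconds of data.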

