pyspark - Spark Delete Rows
I have a dataframe containing 20k rows.
I want to delete 186 rows randomly from the dataset.
To understand the context - I am testing a classification model on missing data, and each row has a unix timestamp. 186 rows corresponds to 3 seconds (there are 62 rows of data per second).
My aim is to simulate the situation where, while the data is streaming, data goes missing for a number of seconds. I am extracting features over a time window, and I want to see how missing data affects model performance.
I think the best approach is to convert to an RDD and use the filter
function, like this, putting the logic inside the filter function:
dataframe.rdd.zipWithIndex().filter(lambda x: )
But I am stuck on the logic - how do I implement this? (using PySpark)
Try this:
import random

startval = random.randint(0, dataframe.count() - 62)

# zipWithIndex pairs each row with its index, so x[1] is the index;
# this drops a contiguous block of 62 rows (one second of data - use 186 for three seconds)
dataframe.rdd.zipWithIndex() \
    .filter(lambda x: x[1] not in range(startval, startval + 62))
This should work!
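Note that zipWithIndex().filter() leaves you with an RDD of (Row, index) tuples, so if you want to keep working with a DataFrame you have to strip the index and rebuild it from the original schema. A minimal sketch of the full flow, assuming an existing SparkSession named spark and a DataFrame named dataframe, and dropping 186 rows (3 seconds) in one contiguous block:

import random

n_drop = 186  # 3 seconds at 62 rows per second
startval = random.randint(0, dataframe.count() - n_drop)

filtered_rdd = (
    dataframe.rdd
    .zipWithIndex()                                     # (Row, index) pairs
    .filter(lambda x: not (startval <= x[1] < startval + n_drop))
    .map(lambda x: x[0])                                # drop the index, keep the Row
)

# rebuild a DataFrame with the original schema
result = spark.createDataFrame(filtered_rdd, schema=dataframe.schema)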