How to discretize a large dataframe by columns with variable bins in Pandas/Dask
I am able to discretize a Pandas dataframe by columns with this code:
    import numpy as np
    import pandas as pd

    def discretize(x, n_scale=1):
        for c in x.columns:
            # location: the median of the column
            loc = x[c].median()
            # scale: the median absolute deviation of the column
            # (mad() is a helper not shown here)
            scale = mad(x[c])
            bins = [-np.inf, loc - (scale * n_scale),
                    loc + (scale * n_scale), np.inf]
            x[c] = pd.cut(x[c], bins, labels=[-1, 0, 1])
        return x

I want to discretize each column using as parameters: loc (the median of the column) and scale (the median absolute deviation of the column).
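The mad() helper is not defined in the post; a minimal sketch of a compatible version (assuming the plain, unscaled median absolute deviation; statsmodels.robust.mad would also work, up to a constant scaling factor) could be:

    def mad(s):
        # median absolute deviation: median(|s - median(s)|)
        return (s - s.median()).abs().median()

With this in place, a toy call such as discretize(pd.DataFrame({"a": [1.0, 2.0, 3.0, 100.0]})) maps each value to -1, 0 or 1 depending on where it falls relative to the column's median plus or minus one MAD.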
With small dataframes the time required is acceptable (since it is a single-threaded solution).
However, with larger dataframes I want to exploit more threads (or processes) to speed up the computation.
I am no expert in Dask, which should provide the solution to this problem.
However, in my case the discretization should be feasible with code like:
    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    def discretize(x, n_scale=1):
        # I'm using only 2 partitions for this example
        x_dask = dd.from_pandas(x, npartitions=2)

        # FIXME:
        # how can I define the bins, computing loc and scale
        # for each column?
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]

        x = x_dask.apply(pd.cut, axis=1, args=(bins,), labels=[-1, 0, 1]).compute()

        return x

But the problem here is that loc and scale depend on the column values, so they should be computed for each column, either before or during the apply.
How can this be done?
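For reference, the "computed before the apply" route could look roughly like the sketch below. This is only an illustration, not part of the original post: it reuses the mad() helper sketched above, precomputes the per-column statistics in plain pandas, and then lets Dask cut each partition column by column via map_partitions. The name discretize_precomputed and the npartitions default are arbitrary choices.

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    def discretize_precomputed(x, n_scale=1, npartitions=2):
        # per-column location and scale, computed once up front
        loc = x.median()
        scale = x.apply(mad)
        lower = loc - scale * n_scale
        upper = loc + scale * n_scale

        def cut_partition(df):
            # cut every column of the partition with that column's own bins
            return pd.DataFrame(
                {c: pd.cut(df[c], [-np.inf, lower[c], upper[c], np.inf],
                           labels=[-1, 0, 1])
                 for c in df.columns},
                index=df.index)

        x_dask = dd.from_pandas(x, npartitions=npartitions)
        return x_dask.map_partitions(cut_partition).compute()

The bin computation itself is still serial, but it only needs two reductions per column; the pd.cut calls, which dominate for large frames, run per partition.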
I've never used Dask, but I guess you can define a new function to be used in apply.
    import dask.dataframe as dd
    import multiprocessing as mp
    import numpy as np
    import pandas as pd

    def discretize(x, n_scale=1):
        # transpose so that each row of the Dask frame is one original column,
        # then let apply(axis=1) discretize the columns in parallel
        x_dask = dd.from_pandas(x.T, npartitions=mp.cpu_count() + 1)
        x = x_dask.apply(_discretize_series,
                         axis=1, args=(n_scale,),
                         columns=x.columns).compute().T
        return x

    def _discretize_series(x, n_scale=1):
        loc = x.median()
        # mad(): median absolute deviation helper, as above
        scale = mad(x)
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        x = pd.cut(x, bins, labels=[-1, 0, 1])
        return x
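Assuming a mad() helper like the one sketched earlier is in scope, usage would look something like this (the data is purely illustrative; note also that recent Dask releases expect meta= rather than the older columns= keyword in apply):

    df = pd.DataFrame(np.random.randn(10000, 4), columns=list("abcd"))
    binned = discretize(df, n_scale=1)
    print(binned.head())  # every column reduced to the labels -1, 0, 1

The transposition is what makes the per-column parameters possible: after x.T, each row of the Dask frame is one original column, so apply(..., axis=1) hands a whole column to _discretize_series, which computes its own loc and scale before cutting; the final .T restores the original orientation.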