python - How to discretize a large dataframe by columns with variable bins in Pandas/Dask
I am able to discretize a Pandas dataframe by columns with this code:
import numpy as np
import pandas as pd

def discretize(x, n_scale=1):
    for c in x.columns:
        loc = x[c].median()
        # median absolute deviation of the column
        scale = mad(x[c])
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        x[c] = pd.cut(x[c], bins, labels=[-1, 0, 1])
    return x
I want to discretize each column using these parameters: loc (the median of the column) and scale (the median absolute deviation of the column).
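Note that mad is not defined in the snippet above; a minimal sketch of what it could look like, plus a toy call (the helper, the rescaling constant, and the sample values are illustrative, not part of the original code):

import pandas as pd

def mad(s, constant=1.4826):
    # median absolute deviation of a Series; the constant rescales it to be
    # comparable to the standard deviation under normality (optional)
    return constant * (s - s.median()).abs().median()

# toy data, values made up for illustration
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 100.0],
                   'b': [10.0, 11.0, 12.0, 13.0]})
print(discretize(df.copy()))   # each cell becomes one of -1, 0, 1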
With small dataframes the time required is acceptable (since it is a single-threaded solution).
However, with larger dataframes I want to exploit more threads (or processes) to speed up the computation.
I am no expert in Dask, which should provide the solution to this problem.
However, in my case the discretization should be feasible with this code:
import dask.dataframe as dd
import numpy as np
import pandas as pd

def discretize(x, n_scale=1):
    # I'm using only 2 partitions for this example
    x_dask = dd.from_pandas(x, npartitions=2)
    # FIXME:
    # how can I define bins to compute loc and scale
    # for each column?
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    x = x_dask.apply(pd.cut, axis=1, args=(bins,), labels=[-1, 0, 1]).compute()
    return x
But the problem here is that loc and scale depend on the column values, so they should be computed for each column, either before or during the apply.
How can this be done?
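One possibility (a sketch of my own, not from the post): since loc and scale are cheap per-column reductions, compute them up front with plain pandas and only parallelize the binning itself, for example with map_partitions:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def discretize_precomputed(x, n_scale=1, npartitions=2):
    # per-column statistics, computed once up front with plain pandas
    loc = x.median()
    scale = (x - loc).abs().median()   # plain MAD, no rescaling constant

    x_dask = dd.from_pandas(x, npartitions=npartitions)

    def cut_partition(part):
        # partitions split rows, so each one holds full columns and the
        # globally computed bins apply unchanged
        out = part.copy()
        for c in part.columns:
            bins = [-np.inf, loc[c] - scale[c] * n_scale,
                    loc[c] + scale[c] * n_scale, np.inf]
            out[c] = pd.cut(part[c], bins, labels=[-1, 0, 1])
        return out

    return x_dask.map_partitions(cut_partition).compute()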
I've never used Dask, but I guess you can define a new function to be used in apply:
import dask.dataframe as dd
import multiprocessing as mp
import numpy as np
import pandas as pd

def discretize(x, n_scale=1):
    x_dask = dd.from_pandas(x.T, npartitions=mp.cpu_count() + 1)
    x = x_dask.apply(_discretize_series, axis=1, args=(n_scale,),
                     columns=x.columns).compute().T
    return x

def _discretize_series(x, n_scale=1):
    loc = x.median()
    scale = mad(x)
    bins = [-np.inf, loc - (scale * n_scale),
            loc + (scale * n_scale), np.inf]
    x = pd.cut(x, bins, labels=[-1, 0, 1])
    return x
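For what it's worth, the columns= keyword reflects the Dask API at the time this was written; newer Dask releases expect meta= to describe the output of apply instead, so the call may need adjusting. A quick sanity check of the approach (the sample data is made up, and mad is assumed to be defined as sketched earlier):

import numpy as np
import pandas as pd

# small random dataframe just to exercise the function
df = pd.DataFrame(np.random.randn(100, 4), columns=list('abcd'))

binned = discretize(df, n_scale=1)
print(binned.shape == df.shape)   # same shape as the input
print(binned['a'].unique())       # values drawn from {-1, 0, 1}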