python - How to discretize a large dataframe by columns with variable bins in Pandas/Dask


I am able to discretize a Pandas DataFrame by columns with this code:

    import numpy as np
    import pandas as pd
    from scipy.stats import median_abs_deviation as mad  # assuming SciPy's MAD here

    def discretize(x, n_scale=1):
        for c in x.columns:
            loc = x[c].median()
            # median absolute deviation of the column
            scale = mad(x[c])
            bins = [-np.inf, loc - (scale * n_scale),
                    loc + (scale * n_scale), np.inf]
            x[c] = pd.cut(x[c], bins, labels=[-1, 0, 1])
        return x

I want to discretize each column using these parameters: loc (the median of the column) and scale (the median absolute deviation of the column).
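To make the parameters concrete, here is a small worked example on a toy column; the mad helper below is hypothetical (the snippet above leaves mad undefined) and computes the median of absolute deviations from the median:

```python
import numpy as np
import pandas as pd

def mad(s):
    # hypothetical MAD helper matching the undefined mad() above:
    # median of absolute deviations from the column median
    return (s - s.median()).abs().median()

col = pd.Series([1.0, 2.0, 3.0, 100.0])
loc = col.median()    # 2.5
scale = mad(col)      # 1.0: the outlier barely moves the MAD
bins = [-np.inf, loc - scale, loc + scale, np.inf]
labels = pd.cut(col, bins, labels=[-1, 0, 1])
```

Because both loc and scale are robust statistics, the outlier 100.0 simply lands in the upper bin instead of stretching the middle one.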

With small DataFrames the time required is acceptable (since it is a single-threaded solution).

However, with larger DataFrames I want to exploit more threads (or processes) to speed up the computation.

I am no expert on Dask, which should provide a solution for this problem.

However, in my case the discretization should be feasible with this code:

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    def discretize(x, n_scale=1):
        # I'm using 2 partitions as an example
        x_dask = dd.from_pandas(x, npartitions=2)

        # FIXME:
        # how can I define bins to compute loc and scale
        # for each column?
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]

        x = x_dask.apply(pd.cut, axis=1, args=(bins,),
                         labels=[-1, 0, 1]).compute()

        return x

But the problem here is that loc and scale depend on the column values, so they should be computed for each column, either before or during the apply.

How can that be done?

I've never used Dask, but I guess I can define a new function to be used in apply:

    import dask.dataframe as dd
    import multiprocessing as mp
    import numpy as np
    import pandas as pd
    from scipy.stats import median_abs_deviation as mad  # assuming SciPy's MAD here

    def discretize(x, n_scale=1):
        x_dask = dd.from_pandas(x.T, npartitions=mp.cpu_count() + 1)
        x = x_dask.apply(_discretize_series,
                         axis=1, args=(n_scale,),
                         columns=x.columns).compute().T
        return x

    def _discretize_series(x, n_scale=1):
        loc = x.median()
        scale = mad(x)
        bins = [-np.inf, loc - (scale * n_scale),
                loc + (scale * n_scale), np.inf]
        x = pd.cut(x, bins, labels=[-1, 0, 1])
        return x
