After years of the CSV format dominating the Python world, I'm finally seeing more compelling options in the form of local column-oriented data storage. Some offer advantages in read speed, file size, or both.
While I've had good results with these formats for floats and ints, a bottleneck I still have is storing columns of strings. The file format landscape is improving fast, so I won't be surprised if the speed results of the benchmarks here are obsolete before they're published, but they can at least serve as a performance snapshot.
Feather is a storage format based on the Apache Arrow spec that aims for speed and interoperability between languages. It shares a creator, Wes McKinney, with pandas, and integrates nicely into the Python data ecosystem.
Fastparquet is a recent fork of the Python implementation of the parquet file format, with an unsurprising focus on speed. Both parquet and bcolz offer compression as a feature, potentially speeding up IO and shrinking the required size on disk.
String data
A cursory look at these formats shows an obvious performance increase for numerical data, but I quickly found that performance is much more hit-or-miss for data involving strings. My hope here was to get an empirical look at my options when dealing with string data.
A difficulty with using strings in Python is that pandas treats them as arbitrary Python objects, eliminating the performance gains that numerical dtypes get from numpy underneath. A couple of options you have are to convert the series to a categorical data type, or to convert all of the string elements to bytes.
In this post I compare the performance of these different file formats on string data encoded as strings, categoricals, and bytes. While I sometimes treat categorical encoding as a silver bullet for performance, since it is represented internally as integers, the benefits only outweigh the overhead when there are just a few unique strings in a column that are repeated many times. To compare the performance of the different formats in both scenarios, I test them on a series with mostly unique strings, and another with just a few repeated strings.
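As a standalone sketch of that trade-off (the fruit list, column sizes, and variable names here are made up for illustration and aren't part of the benchmark), comparing the in-memory footprint of the three encodings on a low- vs. high-cardinality column shows where categoricals pay off:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
fruits = ['apple', 'banana', 'cherry', 'kiwi', 'mango']

low_card = pd.Series(rng.choice(fruits, size=200000))        # few uniques, many repeats
unique = pd.Series(['id_%07d' % i for i in range(200000)])   # every value distinct

for name, s in [('low_card', low_card), ('unique', unique)]:
    as_str = s.memory_usage(deep=True)                        # object dtype: one Python str per row
    as_cat = s.astype('category').memory_usage(deep=True)     # int codes plus a table of the uniques
    as_bytes = s.str.encode('utf-8').memory_usage(deep=True)  # still object dtype, but bytes objects
    print(name, as_str, as_cat, as_bytes)

With mostly unique values, the categorical's lookup table ends up holding every string anyway, so the savings largely disappear.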
The benchmarks below compare pandas' CSV, feather, fastparquet and bcolz. While the ultimate file size is important, it is secondary to speed for me, so I also ignored compression options like gzip and brotli.
I'm only using series with about 200k rows because I don't have all day to wait on the benchmarks to finish. Also note that with the combinatorial explosion of possible options, these benchmarks only represent a comparison of the narrow range of options that I would consider.
The code for this can be found on github.
# import datetime as dt
import imports; imports.reload(imports); from imports import *
import utils; reload(utils); from utils import (
    Timer, get_dtypes, mod_cols, add_nulls, get_obj_type, plot_scatter, part,
    getsize, INFEASIBLE, nan, check_args, BEnc, StrEnc, combine_rankings)
import sys

def fprint(*a, **k):
    print(*a, **k)
    sys.stdout.flush()

%matplotlib inline
def null_type(s: pd.Series) -> (bool, type):
    "Return whether the series contains nulls, and the type of its elements."
    n = (~(s == s)).sum() > 0
    t1 = s.dtype
    if t1 == object:
        tp = type(s.iloc[0])
        return n, tp
    return n, t1
Load and set up df
String columns
Here's the base dataframe with a column for unique values (Unq) and one for repetitive values (Low_card).
nr.seed(0)
N = 200000
chars = np.array(list(string.ascii_lowercase) + lmap(str, range(10)))
nchars = len(chars)
rix = nr.randint(0, high=nchars, size=(N, 32))
rstrs = [''.join(chars[ix]) for ix in rix]

cols = ['Unq', 'Low_card']
df = DataFrame(dict(Unq=rstrs, Low_card=nr.choice(rstrs[:100], size=N)))[cols]
df.apply(lambda x: x.nunique())
Bytes | categorical | strings & nulls | no nulls
This function takes a named Series with string data, and returns a DataFrame whose columns are combinations of the different string data encodings, and some with nulls randomly inserted, since this can have a big effect on parquet's performance.
def new_cols(s):
    df = s.to_frame()
    c = df.columns[0]
    df = df.assign(
        Str_nls=lambda x: add_nulls(x[c]),
        Bytes=lambda x: x[c].str.encode('utf-8'),
        Cat=lambda x: tocat(x[c]),
    ).assign(
        Bytes_nls=lambda x: add_nulls(x['Bytes']),
        Cat_nls=lambda x: add_nulls(x['Cat']),
    )
    return df.rename(columns={c: 'Str'})

tocat = lambda x: x.astype('category')
Here's a summary:
df_types = new_cols(df.Unq)
DataFrame({'Nulls': df_types.isnull().sum(), 'Dtypes': df_types.dtypes})
Parquet writer/reader
Here are some helper functions to read and write with fastparquet using different options. Specifying the maximum length of a string in the column with fixed_text gives a big speed boost, but comes with the cost of finding this length before saving.
def max_len(s, nulls=True):
    if nulls:
        s = s[s == s]
    return s.map(len).max()

def pq_writer(get_lens=False, **kw):
    def write_pq(df, fn):
        if get_lens:
            lns = {c: max_len(df[c], nulls=True) for c in get_dtypes(df, object)}
        else:
            lns = None
        fastparquet.write(fn, df, fixed_text=lns, **kw)
    return write_pq

def pq_reader(categories=None):
    def read_pq(fn):
        f = fastparquet.ParquetFile(fn)
        return f.to_pandas(categories=categories)
    return read_pq
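As a quick usage sketch (the /tmp path below is just a placeholder, not part of the benchmark setup), the fixed-length writer and a plain reader pair up like this:

# Writer that measures max string lengths first, so fastparquet can use fixed_text
write_fixed = pq_writer(get_lens=True, compression='SNAPPY')
write_fixed(df[['Unq']], '/tmp/unq_fixed.parq')

# Read back; no categories are needed for plain string columns
round_tripped = pq_reader()('/tmp/unq_fixed.parq')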
Bcolz
I was originally going to write off bcolz because of some stunningly bad default performance on the data I initially tried. Simply creating a carray from a column takes about 20x as long as feather's writing to disk, for example:
%time _ = bcolz.carray(df.Unq)
%time feather.write_dataframe(df, '/tmp/derp.ftr')
But after playing around with every idea I could think of, I eventually found out that saving the underlying array after converting its type yields more reasonable results:
%time _ = bcolz.carray(df.Unq.values.astype('U'), mode='r')
As does saving a list rather than a Series:
%time _ = bcolz.carray(df.Unq.tolist(), mode='r')
But because some of the settings are orders of magnitude slower, I had to write a somewhat convoluted feasible_bcolz function to designate which combinations of settings to avoid.
tolist = lambda x: x.tolist()
to_unicode = lambda x: x.values.astype('U')
mk_bcolz_writer = lambda **kw: part(write_bcolz, **kw)

@check_args
def write_bcolz(df, fn=None, method: BEnc=None, str_enc: StrEnc=None, null: bool=None):
    if not feasible_bcolz(method=method, str_enc=str_enc, null=null):
        return INFEASIBLE
    if method == BEnc.df:
        ct = bcolz.ctable.fromdataframe(df, rootdir=fn)
    else:
        converter = {BEnc.list: tolist, BEnc.utf8: to_unicode}[method]
        cs = [converter(col) if (col.dtype == 'O') else col for _, col in df.iteritems()]
        ct = bcolz.ctable(columns=cs, names=list(df), rootdir=fn)
    return ct

def read_bcolz(fn):
    ct = bcolz.open(fn, mode='r')
    return DataFrame(ct[:])

@check_args
def feasible_bcolz(method: BEnc=None, str_enc: StrEnc=None, null: bool=None, **_):
    "Some bcolz settings are unbearably slow."
    if method == BEnc.df:
        return (str_enc, null) == (StrEnc.str, False)
    elif method == BEnc.list:
        return (str_enc in {StrEnc.str, StrEnc.byte}) and (not null)
    elif method == BEnc.utf8:
        return str_enc != StrEnc.cat
    raise TypeError("Shouldn't reach here")

# feasible_bcolz(method=BEnc.df, str_enc=StrEnc.str, null=False)
Benchmarkers
And here are some functions to automate the benchmarking, iterating through different inputs and read/write options. I wrapped them all in a sloppy try/except block in a fit of undisciplined laziness.
def bench(fn, df, writer, reader, desc=''):
    twrite = Timer(start=1)
    res = writer(df, fn)
    if res is INFEASIBLE:
        # print('INFEASIBLE')
        return nan, nan, nan
    twrite.end()
    # print('Written with', writer)

    tread = Timer(start=1)
    rdf = reader(fn)
    tread.end()

    assert df.shape == rdf.shape, '{} != {}'.format(df.shape, rdf.shape)
    return twrite.time, tread.time, getsize(fn) / 10**6

def try_bench(*a, **kw):
    known_errors = [
        'with dtype bytes',
        "'NoneType' object has no attribute 'encode'",
        "'bytes' object has no attribute 'encode'",
    ]
    try:
        return bench(*a, **kw)
    except Exception as e:
        print(e)
        print(a[0])
        if any(s in str(e) for s in known_errors):
            na = float('nan')
            return na, na, na
        else:
            raise(e)
def todf(x):
    d = DataFrame(x).T
    d.columns = ['Write_time', 'Read_time', 'Mb']
    d.index.name = 'Fmt'
    return d.reset_index(drop=0)

def stack_results(res):
    return pd.concat([(df.assign(Enc=type).assign(Null=null))
                      for df, type, null in res], ignore_index=True)
@check_args
def run_writers(df, asdf=True, cats=None, dirname='/tmp/test',
                str_enc: StrEnc=None, null: bool=False):
    """For given df (should be single column), run series of reads/writes"""
    null_, _ = null_type(df.iloc[:, 0])
    assert null_ == null
    obj_tp = get_obj_type(df) if cats is None else 'infer'
    pqr = pq_reader(categories=cats)
    csv_dtype = (None if cats is None else dict(zip(cats, it.repeat('category'))))

    if os.path.exists(dirname):
        print('Deleting', dirname, '... ', end='')
        shutil.rmtree(dirname)
        fprint('Done')
    os.mkdir(dirname)
    dir = lambda x: os.path.join(dirname, x)

    csv_reader = partial(pd.read_csv, dtype=csv_dtype, index_col=0)
    pq_write_enc = pq_writer(compression='SNAPPY', object_encoding=obj_tp)
    pq_write_len = pq_writer(get_lens=True, compression='SNAPPY')

    bc_mkr_mkr = part(mk_bcolz_writer, null=null, str_enc=str_enc)
    blosc_df_wrt = bc_mkr_mkr(method=BEnc.df)
    blosc_uni_wrt = bc_mkr_mkr(method=BEnc.utf8)
    blosc_lst_wrt = bc_mkr_mkr(method=BEnc.list)

    res = {
        'Csv': try_bench(dir('t.csv'), df, DataFrame.to_csv, csv_reader),
        'Bcolz-df': try_bench(dir('t_df.blsc'), df, blosc_df_wrt, read_bcolz),
        'Bcolz-uni': try_bench(dir('t_uni.blsc'), df, blosc_uni_wrt, read_bcolz),
        'Bcolz-lst': try_bench(dir('t_lst.blsc'), df, blosc_lst_wrt, read_bcolz),
        'Feather': try_bench(dir('t.fth'), df, feather.write_dataframe, feather.read_dataframe),
        'Pq-Snappy': try_bench(dir('t_snap.parq'), df, pq_writer(compression='SNAPPY'), pqr),
        'Pq-Snappy-enc': try_bench(dir('t_snap_utf8.parq'), df, pq_write_enc, pqr),
        'Pq-Snappy-ft': try_bench(dir('t_snap_f.parq'), df, pq_write_len, pqr),
        'Pq-Uncompressed': try_bench(dir('t_unc.parq'), df, pq_writer(compression='UNCOMPRESSED'), pqr),
    }
    # try_bench('/tmp/test/t_gzip.parq', df, pq_writer(compression='GZIP'), pqr) + ('Pq-Gzip',),  # <= slow writes
    if asdf:
        return todf(res)
    else:
        return res
def run_dfs(base_srs):
    """For base df with single col, generate new columns based on original with
    different ways of encoding the string type. Run the full battery of
    read/write benchmarks on each of these new columns."""
    d = new_cols(base_srs)
    # global res, resnull, resc, rescnull, resb, resbnull
    res = run_writers(d[['Str']], str_enc=StrEnc.str)
    resnull = run_writers(d[['Str_nls']], str_enc=StrEnc.str, null=True)
    resc = run_writers(d[['Cat']], cats=cols, str_enc=StrEnc.cat)
    rescnull = run_writers(d[['Cat_nls']], cats=cols, str_enc=StrEnc.cat, null=True)
    resb = run_writers(d[['Bytes']], str_enc=StrEnc.byte)
    resbnull = run_writers(d[['Bytes_nls']], str_enc=StrEnc.byte, null=True)

    allres = stack_results([
        (res, 'Str', False),
        (resnull, 'Str', True),
        (resc, 'Cat', False),
        (rescnull, 'Cat', True),
        (resb, 'Byte', False),
        (resbnull, 'Byte', True),
    ])
    return allres.assign(Ran=lambda x: ~x.Mb.isnull())
Actually run benchmarks
%time reslo = run_dfs(df.Low_card)
%time resunq = run_dfs(df.Unq)
Plots
_allreslo = reslo.dropna(axis=0)
g = sns.FacetGrid(_allreslo, row='Enc', col='Null', aspect=1.2, size=4)
g.map(plot_scatter, 'Write_time', 'Read_time', 'Mb', 'Fmt')
_allres = resunq.dropna(axis=0)
g = sns.FacetGrid(_allres, row='Enc', col='Null', aspect=1.2, size=4)
g.map(part(plot_scatter, sz_fact=10), 'Write_time', 'Read_time', 'Mb', 'Fmt')
Ranking and conclusion
The following scores the formats based on the time to write and read (but gives twice the weight to read since that's more important to me). For each string encoding setting, it finds the ratio of each format to the one with the best time.
As an example, suppose we're looking at the total weighted time with non-null byte encoding for just Feather and CSV. If the weighted time for Feather is 2 seconds and for CSV is 4 seconds, then Feather gets a score of 2/2=1 and CSV gets a score of 4/2=2. I then take the median score for each format, aggregating across the encoding scenarios.
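In pandas terms, that toy example boils down to something like this (the numbers are made up to match the example above; the real scoring code follows):

# Toy version of the scoring: read time counts double, then normalize by the best format.
toy = DataFrame({'Fmt': ['Feather', 'Csv'],
                 'Write_time': [1.0, 2.0],
                 'Read_time': [0.5, 1.0]})
weighted = 1 * toy.Write_time + 2 * toy.Read_time   # Feather: 2s, Csv: 4s
toy['Ratio'] = weighted / weighted.min()             # Feather: 1.0, Csv: 2.0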
def ratio(df):
    s = score(df)
    return s.div(s.min())

def apply_ranking(df):
    ranks = pd.concat([ratio(gdf) for k, gdf in df.groupby(['Enc', 'Null'], sort=False)]).astype(int)
    df['Ratio'] = ranks
    return df

score = lambda df: 1 * df.Write_time + 2 * df.Read_time

combine_rankings(reslo, resunq, scoring_func=apply_ranking)
With these criteria, Feather appears to have consistently good speed across settings, and only fails when writing a bytes column without null values (#283). The bcolz DataFrame mode does well in the one case where it doesn't take forever to run, and Parquet with snappy compression and the fixed-length setting seems to do well when there are few unique values. The formats that are robust to all the settings, though, seem to have significantly worse performance.
The format that comes closest to Feather for me would probably be Bcolz with the unicode setting (applying .values.astype('U') to a column before saving). It suffers a bit with lots of unique values, but works under all of the string and bytes conditions, unlike Feather:
combine_rankings(reslo.query('Enc != "Cat"'),resunq.query('Enc != "Cat"'),scoring_func=apply_ranking)
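For reference, the Bcolz-uni path from the tables above boils down to roughly this outside the benchmark harness (the /tmp path is arbitrary, and the target directory is assumed not to exist yet):

# Convert the object column to a fixed-width unicode array before handing it to bcolz
ct = bcolz.ctable(columns=[df.Unq.values.astype('U')], names=['Unq'],
                  rootdir='/tmp/unq_uni.blsc')

# Reading back just wraps the on-disk ctable in a DataFrame again
df_back = DataFrame(bcolz.open('/tmp/unq_uni.blsc', mode='r')[:])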
But despite the occasional and unpredictable worst-case performance scenarios, the options for temporary local DataFrame storage in Python have been improving considerably, and I'm happy that there are fewer and fewer reasons to resort to CSV these days.