This is less of a blog post and more of my annotated progress in implementing CRFs, based on Charles Elkan's excellent video and PDF tutorials, to get a better understanding of log-linear models.
This notebook contains a bunch of the core functions, though there are also some in crf.py. The full repo is here.
import warnings; warnings.filterwarnings('ignore')
from py3k_imports import *
from project_imports3 import *
pu.psettings(pd)
pd.options.display.width = 150  # 200
%matplotlib inline
%%javascript
IPython.keyboard_manager.command_shortcuts.add_shortcut('Ctrl-k', 'ipython.move-selected-cell-up')
IPython.keyboard_manager.command_shortcuts.add_shortcut('Ctrl-j', 'ipython.move-selected-cell-down')
IPython.keyboard_manager.command_shortcuts.add_shortcut('Shift-m', 'ipython.merge-selected-cell-with-cell-after')
from collections import defaultdict, Counter
import inspect
from typing import List, Dict, Tuple

Df = Dict
Y = str
if sys.version_info.major > 2:
    unicode = str
import utils; from utils import *
import crf; from crf import *
FeatUtils.bookend = False
Series.__matmul__ = Series.dot
DataFrame.__matmul__ = DataFrame.dot
from matmul_new import test_matmul
test_matmul()
Probabilistic model¶
Given a sequence $\bar x$, the linear chain CRF model gives the probability of a corresponding sequence $\bar y$ as follows, for feature functions $F_j$, where each $F_j$ is a sum of a corresponding lower level feature function $f_j$ over every element of the sequence:
$$ p(\bar y | \bar x;w) = \frac {1} {Z(\bar x, w)} \exp \sum_j w_j F_j(\bar x, \bar y) $$

$$ F_j(\bar x, \bar y) = \sum_{i=1}^n f_j(y_{i-1}, y_i, \bar x, i) $$

$Z(\bar x, w)$ is the partition function, which sums the exponentiated scores of all possible sequences so that the result is a proper probability:
$$ Z(\bar x, w) = \sum_{\bar y' \in Y} \exp \sum_{j=1}^J w_j F_j (\bar x, \bar y'). $$

This way of summing feature functions along a sequence can be seen as extending logistic regression from a single (or multiclass) output to a model that outputs entire sequences.
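To make this concrete, each low-level feature function in this notebook takes the previous tag, the current tag, the whole input sequence, and a position, and returns an indicator value. Here is a minimal sketch with made-up feature names and weights (the indexing is simplified relative to the augmented START/END convention used later):

# Hypothetical low-level feature functions with the f(yprev, y, xbar, i)
# signature used throughout this notebook; names and weights are made up.
fs_demo = {
    'wd_the_dt': lambda yprev, y, xbar, i: (xbar[i] == 'the') and (y == 'DT'),
    'dt_nn':     lambda yprev, y, xbar, i: (yprev == 'DT') and (y == 'NN'),
}
ws_demo = {'wd_the_dt': 1.5, 'dt_nn': 1.0}

def score_demo(xbar, ybar, fs=fs_demo, ws=ws_demo, start='<START>'):
    "Unnormalized log-score sum_j w_j F_j(xbar, ybar), padding ybar with a start tag."
    yaug = [start] + list(ybar)
    return sum(ws[name] * sum(f(yaug[i], yaug[i + 1], xbar, i)
                              for i in range(len(xbar)))
               for name, f in fs.items())

score_demo(['the', 'dog'], ['DT', 'NN'])  # 1.5*1 + 1.0*1 = 2.5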
Argmax¶
Computing the most likely sequence $\text{argmax}_{\bar y} p(\bar y | \bar x;w)$ naively involves iterating through the space of every possible sequence that can be built from the tag vocabulary, which grows exponentially with the length of the sequence, rendering the computation impractical for all but the smallest tag vocabularies and shortest sequences.
Since the scoring function only depends on two consecutive elements of $\bar y$ at a time, the argmax can be computed in polynomial time with a table $U \in ℝ^{|Y| \times |\bar y|}$ (tags by positions), where $U(k, v)$ is the highest score attainable by a sequence ending in tag $v$ at position $k$. It is convenient to express the recurrence in terms of $g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, \bar x, i)$, which sums all the weighted lower level feature functions evaluated at position $i$:
Generate maximum score matrix U¶
$$U(k, v) = \max_u [U(k-1, u) + g_k(u,v)]$$

$$U(1, v) = \max_{y_0} [U(0, y_0) + g_1(y_0, v)]$$

This implementation is pretty slow: every low level feature function $f_j$ is evaluated at each combination of $i$, $y_{i-1}$ and $y_i$, which is $\mathcal{O}(m^2 n J)$ evaluations, where $J$ is the number of feature functions, $m$ the number of possible tags, and $n$ the length of the sequence $\bar y$. Calling Python functions in the inner loop adds to the cost. This could be reduced significantly if the feature functions were arranged so that they are only evaluated for the relevant combinations of $x_i$, $y_{i-1}$ and $y_i$. I started arranging them this way in dependency.py, but the complexity got a bit too unwieldy for a toy educational project.
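For intuition about the recurrence itself, here is a small standalone sketch (not part of crf.py; the array layout is my own) that assumes the $g_i$ matrices have already been materialized as a NumPy array, so no feature function is called in the inner loop:

import numpy as np

def viterbi_demo(g):
    """Max-score decoding over precomputed scores.
    g: array of shape (n, m, m); g[i, u, v] = g_{i+1}(u, v) for tags u -> v.
    Returns (best_score, best_path) as a list of tag indices."""
    n, m, _ = g.shape
    U = np.zeros(m)                       # U(0, v): scores before any position
    backptrs = np.empty((n, m), dtype=int)
    for i in range(n):
        scores = U[:, None] + g[i]        # scores[u, v] = U(i, u) + g_{i+1}(u, v)
        backptrs[i] = scores.argmax(axis=0)
        U = scores.max(axis=0)            # U(i+1, v)
    best_last = int(U.argmax())
    path = [best_last]
    for i in range(n - 1, 0, -1):         # follow back-pointers to recover the path
        path.append(int(backptrs[i, path[-1]]))
    return U.max(), path[::-1]

# toy example with 2 tags and 3 positions
viterbi_demo(np.random.RandomState(0).randn(3, 2, 2))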
def init_score(tags, tag=START, sort=True):
    "Base case for recurrent score calculation U"
    i = Series(0, index=sorted(tags) if sort else tags)
    i.loc[tag] = 1
    return i


def get_u(k: int=None, gf: "int -> (Y, Y') -> float"=None,
          collect=True, verbose=False) -> '([max score], [max ix])':
    """Recursively build up g_i matrices bottom up, adding y-1 score
    to get max y score. Returns score.
    - k is in terms of y vector, which is augmented with beginning and end tags
    - also returns indices yprev that maximize y at each level to help reconstruct
      the most likely sequence
    """
    pt = testprint(verbose)
    imx = len(gf.xbar) + 1
    if k is None:
        pt(gf.xbar)
        return get_u(imx, gf=gf, collect=1, verbose=verbose)
    if k == 0:
        return [init_score(gf.tags, START)], []
    uprevs, ixprevs = get_u(k - 1, gf=gf, collect=False, verbose=verbose)
    gmat = getmat(gf(k))
    uadd = gmat.add(uprevs[-1], axis='index')
    if k > 0:
        # START tag only possible at beginning.
        # There should be a better way of imposing these constraints
        uadd[START] = -1
    if k < imx:
        uadd[END] = -1  # END only possible at the...end
    if k == 1:
        idxmax = Series(START, index=gf.tags)  # uadd.ix[START].idxmax()
    else:
        idxmax = uadd.idxmax()
    pt('idxmax:', idxmax, sep='\n')
    retu, reti = uprevs + [uadd.max()], ixprevs + [idxmax]
    if not collect:
        return retu, reti
    return s2df(retu), s2df(reti)


def mlp(idxs, i: int=None, tagsrev: List[Y]=[END]) -> List[Y]:
    "Most likely sequence"
    if i is None:
        return mlp(idxs, i=int(idxs.columns[-1]), tagsrev=tagsrev)
    elif i < 0:
        return tagsrev[::-1]
    tag = tagsrev[-1]
    yprev = idxs.loc[tag, i]
    return mlp(idxs, i=i - 1, tagsrev=tagsrev + [yprev])
def predict(xbar=None, fs=None, tags=None, ws=None, gf=None):
    "Return argmax_y with corresponding score"
    if gf is None:
        ws = ws or mkwts1(fs)
        gf = G(ws=ws, fs=fs, tags=tags, xbar=xbar)
    u, i = get_u(gf=gf, collect=True, verbose=0)
    path = mlp(i)
    return path, u.ix[END].iloc[-1]

# path2, score2 = predict(xbar=EasyList(['wd1', 'pre-end', 'whatevs']),
#                         fs=no_test_getu3.fs,
#                         tags=[START, 'TAG1', 'PENULTAG', END])
import test; reload(test); from test import *

no_test_getu1(get_u, mlp)
no_test_getu2(get_u, mlp)
no_test_getu3(get_u, mlp)
test_corp()
Gradient¶
$$\frac{\partial}{\partial w_j} \log p(y | x;w) = F_j (x, y) - \frac1 {Z(x, w)} \sum_{y'} F_j (x, y') \left[\exp \sum_{j'} w_{j'} F_{j'} (x, y')\right]$$

$$= F_j (x, y) - E_{y' \sim p(y | x;w) } [F_j(x,y')]$$

The first term is the gradient of $\sum_j w_j F_j(x, y)$, and the second is $\partial \log Z / \partial w_j$, which works out to the expected value of $F_j$ under the model; the forward-backward machinery below is what makes that expectation tractable.

Forward-backward algorithm¶
- The partition function $Z(\bar x, w) = \sum_{\bar y} \exp \sum _{j=1} ^ J w_j F_j (\bar x, \bar y) $ can be intractable if calculated naively (similar to the argmax); the forward and backward vectors make it easier to compute
Compute partition function $Z$ from either forward or backward vectors
$$ Z(\bar x, w) = \beta(START, 0) $$

$$ Z(\bar x, w) = \alpha(n+1, END) $$

[It seems there could be an error in the notes, which state that $Z(\bar x, w) = \sum_v \alpha(n, v)$. If this is the case, $Z$ calculated with $\alpha$ will never get a contribution from $g_{n+1}$, while $Z$ calculated with $\beta$ will in the $\beta(u, n)$ step.]
Check correctness of forward and backward vectors.
- $ Z(\bar x, w) = \beta(START, 0) = \alpha(n+1, END) $
- For all positions $k=0...n+1$, $\sum_u \alpha(k, u) \beta(u, k) = Z(\bar x, w)$
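Before looking at the notebook's memoized implementation below, here is a standalone NumPy sketch of the two recurrences and both checks, again over precomputed $g$ matrices; the tag-index convention (0 = START, last = END) is my own, not something from crf.py:

import numpy as np

def forward_backward_demo(g):
    """g: array of shape (n+1, m, m); g[k-1, u, v] = g_k(u, v) for k = 1..n+1,
    with tag index 0 = START and index m-1 = END (a convention for this demo).
    Returns (alphas, betas, Z)."""
    n1, m, _ = g.shape
    alphas = [np.eye(m)[0]]               # alpha(0, v): 1 at START, 0 elsewhere
    for k in range(n1):                   # alpha(k, v) = sum_u alpha(k-1, u) exp g_k(u, v)
        alphas.append(alphas[-1] @ np.exp(g[k]))
    betas = [np.eye(m)[-1]]               # beta(u, n+1): 1 at END, 0 elsewhere
    for k in range(n1 - 1, -1, -1):       # beta(u, k) = sum_v exp g_{k+1}(u, v) beta(v, k+1)
        betas.append(np.exp(g[k]) @ betas[-1])
    betas = betas[::-1]
    Z = alphas[-1][-1]                    # alpha(n+1, END)
    assert np.isclose(Z, betas[0][0])     # equals beta(START, 0)
    for a, b in zip(alphas, betas):       # sum_u alpha(k, u) beta(u, k) = Z for every k
        assert np.isclose(a @ b, Z)
    return alphas, betas, Z

# toy example: 4 tags, sequence of length 2 (so n+1 = 3 g matrices)
forward_backward_demo(np.random.RandomState(0).randn(3, 4, 4))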
def mk_asum(gf, vb=False):
    n = len(gf.xbar)
    tags = gf.tags
    p = testprint(vb)

    @memoize
    def get_asum(knext=None):
        if knext is None:
            # The first use of the forward vectors is to compute Z via alpha(n+1, END)
            return get_asum(n + 1)
        if knext < 0:
            raise ValueError('k ({}) cannot be negative'.format(knext))
        if knext == 0:
            return init_score(tags, tag=START)
        k = knext - 1
        gnext = gf(knext).mat
        ak = get_asum(k)
        if vb:
            names = 'exp[g{k1}] g{k1} a_{k}'.format(k1=knext, k=k).split()
            p(side_by_side(np.exp(gnext), gnext, ak, names=names))
        # expsum = Series([sum([ak[u] * np.exp(gnext.loc[u, v])
        #                       for u in tags]) for v in tags], index=tags)
        # vectorizing is much faster:
        expsum = np.exp(gnext).mul(ak, axis=0).sum(axis=0)
        return expsum
    return get_asum  # (knext, vb=vb)


def mk_bsum(gf, vb=False):
    p = testprint(vb)
    n = len(gf.xbar)
    tags = gf.tags

    @memoize
    def get_bsum(k=None):
        if k is None:
            return get_bsum(0)
        if k > n + 1:
            raise ValueError('{} > length of x {} + 1'.format(k, n))
        if k == n + 1:
            return init_score(gf.tags, tag=END)
        gnext = gf(k + 1).mat
        bnext = get_bsum(k + 1)
        if vb:
            names = ['exp[g{}]'.format(k + 1), 'g{}'.format(k + 1), 'b_{}'.format(k + 1)]
            p(side_by_side(np.exp(gnext), gnext, bnext, names=names))
        # expsum = Series([sum([np.exp(gnext.loc[u, v]) * bnext[v]
        #                       for v in tags]) for u in tags], index=tags)
        expsum = np.exp(gnext).mul(bnext, axis=1).sum(axis=1)
        return expsum
    return get_bsum
def test_fwd_bkwd():
    tgs = [START, 'TAG1', END]
    x = EasyList(['wd1', 'pre-end'])
    fs = {
        # 'eq_wd1': mk_word_tag('wd1', 'TAG1'),
        'pre_endx': lambda yp, y, x, i: ((x[i - 1] == 'pre-end') and (y == END)),
    }
    ws = z.merge(mkwts1(fs), {'pre_endx': 1})
    gf = G(fs=fs, tags=tgs, xbar=x, ws=ws)

    amkr = mk_asum(gf)
    bmkr = mk_bsum(gf)
    za = amkr().END
    zb = bmkr().START
    assert za == zb
    for k in range(len(x) + 2):
        assert amkr(k) @ bmkr(k) == za
    return za

test_fwd_bkwd()
Calculate expected value of feature function¶
Weighted by conditional probability of $y'$ given $x$
$$ E_{\bar y \sim p(\bar y | \bar x;w) } [F_j(\bar x, \bar y)] = \sum _{i=1} ^n \sum _{y_{i-1}} \sum _{y_i} f_j(y_{i-1}, y_i, \bar x, i) \frac {\alpha (i-1, y_{i-1}) [\exp g_i(y_{i-1}, y_i)] \beta(y_i, i) } {Z(\bar x, w)} $$

def sdot(s1: Series, s2: Series):
    """It's quite a bit faster to take the product of
    raw numpy arrays rather than of the Series"""
    # (m, 1) @ (1, m) -> (m, m): outer product of alpha(i-1) and beta(i)
    d1, d2 = s1.values[:, None], s2.values[:, None]
    return d1 @ d2.T
def expectation2(gf, fj):
    "Faster matrix multiplication version"
    tags = gf.tags
    n = len(gf.xbar)
    asummer = mk_asum(gf)
    bsummer = mk_bsum(gf)
    za = partition(asummer=asummer)

    def sumi(i):
        gfix = np.exp(gf(i).mat.values)
        alpha_vec = asummer(i - 1)
        beta_vec = bsummer(i)
        fmat = np.array([[fj(yprev, y, gf.xbar, i) for y in tags]
                         for yprev in tags])
        smat = sdot(alpha_vec, beta_vec) * gfix * fmat
        return smat.sum()
    return sum([sumi(i) for i in range(1, n + 2)]) / za


def expectation_(gf, fj):
    "Slow, looping version"
    n = len(gf.xbar)
    tags = gf.tags
    asummer, bsummer = mk_asum(gf), mk_bsum(gf)
    za = asummer().END
    ss = 0
    for i in range(1, n + 2):
        gfix = np.exp(gf(i).mat)
        alpha_vec = asummer(i - 1)
        beta_vec = bsummer(i)
        ss += sum([fj(yprev, y, gf.xbar, i)
                   * alpha_vec[yprev] * gfix.loc[yprev, y] * beta_vec[y]
                   for yprev in tags for y in tags])
    return ss / za
def partial_d(gf, fj, y, Fj=None) -> float:
    f = fj if callable(fj) else gf.fs[fj]
    if Fj is None:
        Fj = FeatUtils.mk_sum(f)
    # ex1 = expectation(gf, f)
    ex2 = expectation2(gf, f)
    # assert np.allclose(ex1, ex2)
    return Fj(gf.xbar, y) - ex2


def prob(gf, y, norm=True):
    Fs = z.valmap(FeatUtils.mk_sum, gf.fs)
    p = np.exp(sum([Fj(gf.xbar, y) * gf.ws[fname] for fname, Fj in Fs.items()]))
    if not norm:
        return p
    za = partition(gf=gf)
    return p / za


def partition(gf=None, asummer=None):
    assert asummer or gf, 'Supply at least one argument'
    asummer = asummer or mk_asum(gf)
    return asummer().END
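As a quick usage sketch (not from the original notebook), here is a single manual gradient-ascent step for one feature, reusing the toy setup from test_fwd_bkwd above. The target sequence y_toy is made up, and this assumes AugmentY accepts a plain list of tags, as it does for the corpus data loaded later.

# Toy setup borrowed from test_fwd_bkwd; y_toy is an arbitrary target sequence.
tgs_toy = [START, 'TAG1', END]
x_toy = EasyList(['wd1', 'pre-end'])
y_toy = AugmentY(['TAG1', 'TAG1'])
fs_toy = {'pre_endx': lambda yp, y, x, i: (x[i - 1] == 'pre-end') and (y == END)}
ws_toy = mkwts1(fs_toy)
gf_toy = G(fs=fs_toy, tags=tgs_toy, xbar=x_toy, ws=ws_toy)

prob(gf_toy, y_toy)                          # p(y_toy | x_toy; w) under current weights
pder = partial_d(gf_toy, 'pre_endx', y_toy)  # F_j(x, y) - E[F_j]
ws_toy['pre_endx'] += 1.0 * pder             # one ascent step (train_ below uses rate λ)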
Test Partial¶
Train¶
λ = 1

def train_(zs: List[Tuple[EasyList, AugmentY]], fjid='ly_VBZ', fs=None, ws=None,
           vb=True, tgs=None, rand=None):
    fj = fs[fjid]
    Fj = FeatUtils.mk_sum(fj)
    pt = testprint(vb)
    for x, y in zs:
        gf_ = G(fs=fs, tags=tgs, xbar=x, ws=ws)
        if not Fj(x, y):  # TODO: is this always right?
            continue
        pder = partial_d(gf_, fj, y, Fj=Fj)
        wj0 = ws[fjid]
        ws[fjid] += λ * pder
        pt('wj: {} -> {}'.format(wj0, ws[fjid]))
        pt('pder: {:.2f}'.format(pder), Fj(x, y))
    return ws


def train_j(zs: List[Tuple[EasyList, AugmentY]], fjid='ly_VBZ', fs=None, ws=None,
            tol=.01, maxiter=10, vb=True, tgs=None, sec=None):
    ws1 = ws
    pt = testprint(vb)
    st = time.time()
    for i in count(1):
        nr.shuffle(zs)
        pt('Iter', i)
        wj1 = ws1[fjid]
        ws2 = train_(zs, fjid=fjid, fs=fs, ws=ws1, vb=vb, tgs=tgs)
        wj2 = ws2[fjid]
        if (abs((wj2 - wj1) / wj1) < tol
                or (i >= maxiter)
                or (sec is not None and (time.time() - st > sec))):
            return ws, i
        ws1 = ws2


def train(zs_, gf, ws=None, tol=.001, maxiter=10, vb=False, sec=None, seed=1):
    wst = (ws or gf.ws).copy()
    nr.seed(seed)
    zs = zs_.copy()
    for fname, f in gf.fs.items():
        itime = time.time()
        wst, i = train_j(zs, fjid=fname, fs=gf.fs, ws=wst, tol=tol, maxiter=maxiter,
                         vb=vb, tgs=gf.tags, sec=sec)
        print(fname, 'trained in', i, 'iters: {:.2f} ({:.2f}s)'
              .format(wst[fname], time.time() - itime))
        sys.stdout.flush()
    return wst

# %time ws1c = train(zs, gf, mkwts1(gf.fs, 1), maxiter=100, tol=.005)
Evaluation¶
Since I'm maximizing the log-likelihood during training, that would seem a natural measure of improvement. But I'm a bit suspicious of bugs in my implementation, so I'd also like to evaluate the Hamming distance between the actual $y$ and the predicted sequence to see how much the predictions improve.
Load data¶
with open('data/pos.train.txt', 'r') as f:
    txt = f.read()

sents = filter(None, [zip(*[e.split() for e in sent.splitlines()])
                      for sent in txt[:].split('\n\n')])
X = map(itg(0), sents)
Y_ = map(itg(1), sents)
Xa = map(EasyList, X)
Ya = map(AugmentY, Y_)

tags = sorted({tag for y in Y_ for tag in y if tag.isalpha()})
# common bigrams
bigs = defaultdict(lambda: defaultdict(int))
for y in Y_:
    for t1, t2 in zip(y[:-1], y[1:]):
        bigs[t1][t2] += 1

bigd = DataFrame(bigs).fillna(0)[tags].ix[tags]

wcts_all = defaultdict(Counter)
for xi, yi in zip(X, Y_):
    for xw, yw in zip(xi, yi):
        wcts_all[xw][yw] += 1
# Split training and testing examples
Zs = zip(Xa, Ya)
print(len(Zs), 'examples')
Zstrn = Zs[:100]
Ztst = Zs[100:201]  # it's too slow right now,
                    # so 100 examples in each set should do
def hamming(y, ypred, norm=True):
    sm = sum(a != b for a, b in zip(y, ypred))
    return sm / len(y) if norm else sm
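A quick sanity check of the metric: one wrong tag out of two gives a normalized error of 0.5.

hamming(['DT', 'NN'], ['DT', 'VB'])              # -> 0.5
hamming(['DT', 'NN'], ['DT', 'VB'], norm=False)  # -> 1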
%%time
fs0 = crf.fs
ws0 = rand_weights(fs0, seed=0)
gf0 = G(fs=fs0, tags=sorted([START, END] + tags),
xbar=EasyList(['']), ws=ws0)
hams0 = [
hamming(y.aug[1:-2], predict(gf=gf0._replace(xbar=x))[0][1:-2])
for x, y in Ztst[:]]
print('Initial error rate with random weights: {:.2%}'
.format(np.mean(hams0)))
This training takes forever...not recommended
%time ws_trn = train(Zstrn[:], gf, ws1e, maxiter=100, tol=.0005, sec=None, seed=3)
ws_trn = {'cap_nnp': 5.42, 'dig_cd': 6.2, 'dt_in': 3.26, 'fst_dt': 4.44,
          'fst_nnp': 1.49, 'last_nn': 7.34, 'post_mr': 6.68, 'wd_a': 10.17,
          'wd_and': 10.64, 'wd_for': 10.51, 'wd_in': 10.50, 'wd_of': 10.64,
          'wd_the': 12.9, 'wd_to': 11.18}
%%time
gf_trn = gf0._replace(ws=ws_trn)
hams_trn = [hamming(y.aug[1:-2],
predict(gf=gf_trn._replace(xbar=x))[0][1:-2])
for x, y in Ztst[:]]
print('Error rate after training weights: {:.2%}'
.format(np.mean(hams_trn)))
The drop in error rate from 78% to 64% seems like a decent improvement, considering the small number of feature functions.
!osascript -e beep