Attributes
- entropydual
An alias for the dual function:
entropydual = dual
Method summary
- __init__(self)
- beginlogging(self, filename, freq = 10)
- clearcache(self)
- crossentropy(self, fx, log_prior_x = None, base = numpy.e)
- dual(self, params = None, ignorepenalty = False, ignoretest = False)
- endlogging(self)
- fit(self, K, algorithm = 'CG')
- grad(self, params = None, ignorepenalty = False)
- log(self, params)
- logparams(self)
- normconst(self)
- reset(self, numfeatures = None)
- setcallback(self, callback = None, callback_dual = None, callback_grad = None)
- setparams(self, params)
- setsmooth(sigma)
Methods
- __init__(self)
- beginlogging(self, filename, freq = 10)
Enables logging of the parameter values at function evaluations. A file is written every 'freq' iterations, named 'filename.freq.pickle', 'filename.(2*freq).pickle', and so on.
- clearcache(self)
Clears the interim results of computations depending on the parameters and the sample.
- crossentropy(self, fx, log_prior_x = None, base = numpy.e)
Returns the cross entropy H(q, p) of the empirical distribution q of the data (with the given feature matrix fx) with respect to the model p. For discrete distributions this is defined as:
H(q, p) = - n^{-1} sum_{j=1}^n log p(x_j)
where x_j are the data elements assumed drawn from q whose features are given by the matrix fx = {f(x_j)}, j=1,...,n.
The 'base' argument specifies the base of the logarithm, which defaults to e.
This definition is not meaningful for continuous distributions.
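A minimal sketch of the formula above for a small enumerated sample space, using only the standard library. The helper names (`log_partition`, `cross_entropy`) and the toy features f(x) = (x, x^2) are illustrative assumptions, not part of the library, and the sketch ignores the 'log_prior_x' argument.

```python
import math

def log_partition(theta, features):
    """log Z = log sum_x exp(theta . f(x)) over an enumerated sample space."""
    scores = [sum(t * f for t, f in zip(theta, fx)) for fx in features]
    top = max(scores)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in scores))

def cross_entropy(theta, space_features, data_features, base=math.e):
    """H(q, p) = -(1/n) * sum_j log p(x_j), with p(x) = exp(theta . f(x)) / Z."""
    log_z = log_partition(theta, space_features)
    total = 0.0
    for fx in data_features:
        total += sum(t * f for t, f in zip(theta, fx)) - log_z
    # Dividing by log(base) converts natural logarithms to the requested base.
    return -total / (len(data_features) * math.log(base))

# Toy sample space {0, 1, 2, 3} with hypothetical features f(x) = (x, x^2).
space_features = [(x, x * x) for x in [0, 1, 2, 3]]
data_features = [(x, x * x) for x in [0, 1, 1, 3]]

# With theta = 0 the model is uniform over four points.
h_nats = cross_entropy([0.0, 0.0], space_features, data_features)
h_bits = cross_entropy([0.0, 0.0], space_features, data_features, base=2)
```

For the uniform model over four points, the cross entropy is log 4 in nats, or 2 bits, whatever the data.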
- dual(self, params = None, ignorepenalty = False, ignoretest = False)
Computes the Lagrangian dual L(theta) of the entropy of the model, for the given vector theta=params. Minimizing this function (without constraints) should fit the maximum entropy model subject to the given constraints. These constraints are specified as the desired (target) values self.K for the expectations of the feature statistic.
This function is computed as:
L(theta) = log(Z) - theta^T . K
For 'bigmodel' objects, this method estimates the entropy dual without actually computing p_theta. This is important if the sample space is continuous or too large to enumerate in practice. We approximate the normalization constant Z using importance sampling as in [Rosenfeld01whole]. This estimator is deterministic for any given sample. Note that the gradient of this estimator equals the importance sampling ratio estimator of the gradient of the entropy dual [see my thesis], which justifies using this estimator together with grad() in optimization methods that use both the function and its gradient. Note, however, that convergence guarantees break down for most optimization algorithms in the presence of stochastic error.
For 'bigmodel' objects, the dual estimate is given as:
L_est = log Z_est - sum_i{theta_i K_i}
where
Z_est = 1/m sum_{x in sample S_0} p_dot(x) / aux_dist(x),
m is the number of observations in the sample S_0, and K_i is the empirical expectation E_p_tilde f_i(X) = sum_x {p_tilde(x) f_i(x)}.
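The estimator above can be sketched as follows and compared against the exact dual on a space small enough to enumerate. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, aux_dist is taken to be uniform, and the sketch works with probabilities directly rather than log pdf values.

```python
import math
import random

def dot(theta, fx):
    return sum(t * f for t, f in zip(theta, fx))

def dual_exact(theta, space_features, K):
    """L(theta) = log Z - theta . K, with Z summed over the whole space."""
    log_z = math.log(sum(math.exp(dot(theta, fx)) for fx in space_features))
    return log_z - dot(theta, K)

def dual_estimate(theta, sample_features, sample_aux_probs, K):
    """L_est = log Z_est - theta . K, where
    Z_est = (1/m) * sum_{x in S_0} p_dot(x) / aux_dist(x)."""
    m = len(sample_features)
    z_est = sum(math.exp(dot(theta, fx)) / q
                for fx, q in zip(sample_features, sample_aux_probs)) / m
    return math.log(z_est) - dot(theta, K)

# Toy setup (all values hypothetical): space {0,...,4}, one feature f(x) = x.
space = [0, 1, 2, 3, 4]
space_features = [(float(x),) for x in space]
K = (2.5,)
theta = (0.3,)

# Draw the sample S_0 once from a uniform auxiliary distribution.
rng = random.Random(0)
sample = [rng.choice(space) for _ in range(20000)]
sample_features = [(float(x),) for x in sample]
sample_aux_probs = [1.0 / len(space)] * len(sample)

exact = dual_exact(theta, space_features, K)
estimate = dual_estimate(theta, sample_features, sample_aux_probs, K)
# Re-evaluating on the same sample gives the same value: the estimator
# is deterministic once S_0 is fixed.
repeat = dual_estimate(theta, sample_features, sample_aux_probs, K)
```

Because the sample is drawn once and reused, an optimizer sees a fixed, smooth function of theta; the stochastic error enters only through the choice of S_0.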
- endlogging(self)
Stop logging param values whenever setparams() is called.
- fit(self, K, algorithm = 'CG')
Fit the maxent model p whose feature expectations are given by the vector K.
Model expectations are computed either exactly or using Monte Carlo simulation, depending on whether the object is a 'model' or a 'bigmodel' instance.
For 'model' instances, expectations are computed exactly, by summing over the given sample space. If the sample space is continuous or too large to iterate over, use the 'bigmodel' class instead.
For 'bigmodel' instances, the model expectations are not computed exactly (by summing or integrating over a sample space) but approximately (by Monte Carlo simulation). Simulation is necessary when the sample space is too large to sum or integrate over in practice, like a continuous sample space in more than about 4 dimensions or a large discrete space like all possible sentences in a natural language.
Approximating the expectations by sampling requires an instrumental distribution that should be close to the model for fast convergence. The tails should be fatter than the model. This instrumental distribution is specified by calling setsampleFgen() with a user-supplied generator function that yields a matrix of features of a random sample and its log pdf values.
The algorithm can be 'CG', 'BFGS', 'LBFGSB', 'Powell', or 'Nelder-Mead'.
The CG (conjugate gradients) method is the default; it is quite fast and requires only linear space in the number of parameters (not quadratic, like Newton-based methods).
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm is a quasi-Newton (variable metric) method. It is perhaps faster than the CG method but requires O(N^2) instead of O(N) memory, so it is infeasible for more than about 10^3 parameters.
The Powell algorithm doesn't require gradients. For small models it is slow but robust. For big models (where func and grad are simulated) with large variance in the function estimates, this may be less robust than the gradient-based algorithms.
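To illustrate what fitting means for a 'model'-style object with exact expectations, here is a minimal sketch that minimizes the entropy dual on a tiny enumerated space. Plain fixed-step gradient descent stands in for the CG, BFGS, and other optimizers listed above, and every name in it (`model_expectations`, `fit_by_gradient_descent`) is hypothetical rather than part of the library.

```python
import math

def model_expectations(theta, space_features):
    """Exact E_p[f(X)] for p(x) proportional to exp(theta . f(x))."""
    scores = [sum(t * f for t, f in zip(theta, fx)) for fx in space_features]
    top = max(scores)  # stabilise the exponentials
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [sum(w * fx[i] for w, fx in zip(weights, space_features)) / z
            for i in range(len(theta))]

def fit_by_gradient_descent(K, space_features, step=0.2, iterations=500):
    """Minimise L(theta) = log Z - theta . K; its gradient is E_p[f] - K."""
    theta = [0.0] * len(K)
    for _ in range(iterations):
        expect = model_expectations(theta, space_features)
        grad = [e - k for e, k in zip(expect, K)]
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta

# Toy problem: space {0,...,4}, feature f(x) = x, target expectation K = 3.
space_features = [(float(x),) for x in range(5)]
K = [3.0]
theta = fit_by_gradient_descent(K, space_features)
fitted = model_expectations(theta, space_features)
```

At the minimum the gradient vanishes, so the fitted model's feature expectations match the targets K. The step size 0.2 is chosen for this toy problem only; a line-search method such as CG avoids having to pick one.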
- grad(self, params = None, ignorepenalty = False)
Computes or estimates the gradient of the entropy dual.
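For the exact case, differentiating L(theta) = log(Z) - theta^T . K gives the standard result that the i-th component of the gradient is E_p f_i(X) - K_i. The sketch below checks this against a central finite difference on a small enumerated space; the function names and toy values are hypothetical and not the library's code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual(theta, space_features, K):
    """L(theta) = log Z - theta . K on an enumerated space."""
    log_z = math.log(sum(math.exp(dot(theta, fx)) for fx in space_features))
    return log_z - dot(theta, K)

def grad(theta, space_features, K):
    """Analytic gradient: model expectation of each feature minus its target."""
    weights = [math.exp(dot(theta, fx)) for fx in space_features]
    z = sum(weights)
    return [sum(w * fx[i] for w, fx in zip(weights, space_features)) / z - K[i]
            for i in range(len(theta))]

def numeric_grad(theta, space_features, K, h=1e-6):
    """Central finite-difference approximation, used only as a check."""
    out = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        down = list(theta)
        down[i] -= h
        out.append((dual(up, space_features, K)
                    - dual(down, space_features, K)) / (2 * h))
    return out

space_features = [(float(x), float(x * x)) for x in range(4)]
K = [1.5, 3.0]
theta = [0.2, -0.1]

analytic = grad(theta, space_features, K)
numeric = numeric_grad(theta, space_features, K)
```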
- log(self, params)
This method is called every iteration during the optimization process. It calls the user-supplied callback function (if any), logs the evolution of the entropy dual and gradient norm, and checks whether the process appears to be diverging, which would indicate inconsistent constraints (or, for bigmodel instances, too large a variance in the estimates).
- logparams(self)
Saves the model parameters if logging has been enabled and the number of iterations since the last save has reached self.paramslogfreq.
- normconst(self)
Returns the normalization constant, or partition function, for the current model. Warning -- this may be too large to represent; if so, this will result in numerical overflow. In this case use lognormconst() instead.
For 'bigmodel' instances, estimates the normalization term as Z = E_aux_dist [{exp (params.f(X))} / aux_dist(X)] using a sample from aux_dist.
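The overflow warning can be illustrated with a short sketch. The two functions below are stand-ins written for this example (they only borrow the names of the library's methods): the first sums exponentials directly, and the second uses the log-sum-exp trick to stay in log space.

```python
import math

def normconst_naive(scores):
    """Z = sum_x exp(theta . f(x)), computed directly; may overflow."""
    return sum(math.exp(s) for s in scores)

def lognormconst(scores):
    """log Z via log-sum-exp: factor out the largest score first."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))

# Hypothetical values of theta . f(x), too large for a double once exponentiated.
scores = [1000.0, 999.0, 998.0]

try:
    normconst_naive(scores)
    overflowed = False
except OverflowError:
    overflowed = True

log_z = lognormconst(scores)
```

Here the direct sum raises OverflowError, while the log-space version returns a finite value slightly above 1000.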
- reset(self, numfeatures = None)
Resets the parameters self.params to zero, clearing the cache variables dependent on them. Also resets the number of function and gradient evaluations to zero.
- setcallback(self, callback = None, callback_dual = None, callback_grad = None)
Sets callback functions to be called every iteration, every function evaluation, or every gradient evaluation. All callback functions are passed one argument, the current model object.
Note that line search algorithms, such as the one used by CG, may make several function and gradient evaluations per iteration, some of which we expect to be poor.
- setparams(self, params)
Set the parameter vector to params, replacing the existing parameters. params must be a list or numpy array of the same length as the model's feature vector f.
- setsmooth(sigma)
Specifies that the entropy dual and gradient should be computed with a quadratic penalty term on the magnitude of the parameters. This 'smooths' the model to account for noise in the target expectation values or to improve robustness when using simulation to fit models and when the sampling distribution has high variance. The smoothing mechanism is described in Chen and Rosenfeld, 'A Gaussian prior for smoothing maximum entropy models' (1999).
The parameter 'sigma' will be squared and stored as self.sigma2.
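A minimal sketch of such a penalty, assuming the usual Gaussian-prior form in which sum_i theta_i^2 / (2 sigma^2) is added to the dual and theta_i / sigma^2 to each gradient component. The exact form used by the library is not stated here, so treat the formula and the function names as assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual(theta, space_features, K):
    """Unpenalised dual L(theta) = log Z - theta . K."""
    log_z = math.log(sum(math.exp(dot(theta, fx)) for fx in space_features))
    return log_z - dot(theta, K)

def penalized_dual(theta, space_features, K, sigma):
    """Dual plus an assumed quadratic penalty sum_i theta_i^2 / (2 * sigma^2)."""
    sigma2 = sigma ** 2  # sigma is squared once, as with self.sigma2
    penalty = sum(t * t for t in theta) / (2.0 * sigma2)
    return dual(theta, space_features, K) + penalty

def penalty_grad(theta, sigma):
    """Gradient contribution of the assumed penalty: theta_i / sigma^2."""
    sigma2 = sigma ** 2
    return [t / sigma2 for t in theta]

space_features = [(float(x), float(x * x)) for x in range(4)]
K = [1.5, 3.0]
theta = [1.0, -2.0]
sigma = 2.0

plain = dual(theta, space_features, K)
smoothed = penalized_dual(theta, space_features, K, sigma)
extra_grad = penalty_grad(theta, sigma)
```

A smaller sigma gives a stronger pull of the parameters toward zero; a very large sigma recovers the unpenalized dual.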
