API Documentation Generated by Endo, 2006-08-14
stats.py module
#################################################
####### Written by: Gary Strangman ###########
####### Last modified: Apr 13, 2000 ###########
#################################################
A collection of basic statistical functions for python. The function
names appear below.
*** Some scalar functions defined here are also available in the scipy.special
package where they work on arbitrary sized arrays. ****
Disclaimers: The function list is obviously incomplete and, worse, the
functions are not optimized. All functions have been tested (some more
so than others), but they are far from bulletproof. Thus, as with any
free software, no warranty or guarantee is expressed or implied. :-) A
few extra functions that don't appear in the list below can be found by
interested treasure-hunters. These functions don't necessarily have
both list and array versions but were deemed useful.
CENTRAL TENDENCY: gmean (geometric mean)
hmean (harmonic mean)
mean
median
medianscore
mode
MOMENTS: moment
variation
skew
kurtosis
normaltest (for arrays only)
ALTERED VERSIONS: tmean
tvar
tstd
tsem
describe
FREQUENCY STATS: freqtable
itemfreq
scoreatpercentile
percentileofscore
histogram
cumfreq
relfreq
VARIABILITY: obrientransform
samplevar
samplestd
signaltonoise (for arrays only)
var
std
stderr
sem
z
zs
TRIMMING FCNS: threshold (for arrays only)
trimboth
trim1
around (round all vals to 'n' decimals)
CORRELATION FCNS: paired
pearsonr
spearmanr
pointbiserialr
kendalltau
linregress
INFERENTIAL STATS: ttest_1samp
ttest_ind
ttest_rel
chisquare
ks_2samp
mannwhitneyu
ranksums
wilcoxon
kruskal
friedmanchisquare
PROBABILITY CALCS: chisqprob
erfcc
zprob
fprob
betai
## Note that scipy.stats.distributions has many more statistical probability
## functions defined.
ANOVA FUNCTIONS: f_oneway
f_value
SUPPORT FUNCTIONS: ss
square_of_sums
shellsort
rankdata
References
----------
[CRCProbStat2000] Zwillinger, D. and Kokoska, S. _CRC Standard Probability and
Statistics Tables and Formulae_. Chapman & Hall: New York. 2000.
erfc = special.erfc
fprob = special.fdtrc
ksprob = special.kolmogorov
zprob = special.ndtr
Returns the incomplete beta function.
I_x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt)
where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma
function of a.
The standard broadcasting rules apply to a, b, and x.
Parameters
----------
a : array or float > 0
b : array or float > 0
x : array or float
x will be clipped to be no greater than 1.0.
Returns
-------
Returns the (1-tail) probability value associated with the provided chi-square value and degrees of freedom.
Broadcasting rules apply.
chisq : array or float > 0 df : array or float, probably int >= 1
The area from chisq to infinity under the Chi^2 probability distribution with degrees of freedom df.
Calculates a one-way chi square for array of observed frequencies and returns the result. If no expected frequencies are given, the total N is assumed to be equally distributed across all groups.
Returns: chisquare-statistic, associated p-value
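A minimal pure-Python sketch of the statistic described above, assuming the usual definition sum((observed - expected)^2 / expected). The name chisquare_sketch is hypothetical, the p-value is left out, and this is not the module's own implementation:

```python
def chisquare_sketch(observed, expected=None):
    """One-way chi-square statistic: sum((O - E)**2 / E)."""
    n_groups = len(observed)
    if expected is None:
        # No expected frequencies given: spread the total N equally over groups.
        total = float(sum(observed))
        expected = [total / n_groups] * n_groups
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Total N = 60 over 3 groups, so each expected frequency is 20.
statistic = chisquare_sketch([10, 20, 30])
```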
Returns the computed median value of an array.
All of the values in the input array are used. The input array is first
histogrammed using numbins bins. The bin containing the median is
selected by searching for the halfway point in the cumulative histogram.
The median value is then computed by linearly interpolating across that bin.
Parameters
----------
a : array
numbins : int
The number of bins used to histogram the data. More bins give greater
accuracy to the approximation of the median.
Returns
-------
A floating point value approximating the median.
References
----------
[CRCProbStat2000] Section 2.2.6
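The histogram-and-interpolate procedure described above can be sketched in pure Python. This is an illustration under stated assumptions (equal-width bins spanning min to max, with the maximum clamped into the last bin), not the module's code; cmedian_sketch is a hypothetical name:

```python
def cmedian_sketch(values, numbins=1000):
    """Approximate the median by interpolating inside one histogram bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return float(lo)
    width = (hi - lo) / float(numbins)
    counts = [0] * numbins
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), numbins - 1)
        counts[idx] += 1
    half = len(values) / 2.0
    cumulative = 0
    for idx, count in enumerate(counts):
        if cumulative + count >= half:
            # This bin crosses the halfway point of the cumulative histogram.
            fraction = (half - cumulative) / count
            return lo + (idx + fraction) * width
        cumulative += count
```

The result is an approximation: more bins narrow the interval in which the interpolation happens.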
The correlation coefficients formed from 2-d array x, where the rows are the observations, and the columns are variables.
corrcoef(x,y) where x and y are 1d arrays is the same as corrcoef(transpose([x,y]))
If rowvar is True, then each row is a variable with observations in the columns.
Estimate the covariance matrix.
If m is a vector, return the variance. For matrices where each row is an observation, and each column a variable, return the covariance matrix. Note that in this case diag(cov(m)) is a vector of variances for each column.
cov(m) is the same as cov(m, m)
Normalization is by (N-1) where N is the number of observations (unbiased estimate). If bias is True then normalization is by N.
If rowvar is True, then each row is a variable with observations in the columns.
Returns a cumulative frequency histogram, using the histogram function. Defaultreallimits can be None (use all data), or a 2-sequence containing lower and upper limits on values to include.
Returns: array of cumfreq bin values, lowerreallimit, binsize, extrapoints
Computes several descriptive statistics of the passed array.
a : array axis : int or None
Performs a 1-way ANOVA, returning an F-value and probability given
any number of groups. From Heiman, pp.394-7.
Usage: f_oneway (*args) where *args is 2 or more arrays, one per
treatment group
Returns: f-value, probability
Returns an F-statistic given the following:
ER = error associated with the null hypothesis (the Restricted model)
EF = error associated with the alternate hypothesis (the Full model)
dfR = degrees of freedom associated with the Restricted model
dfF = degrees of freedom associated with the Full model
where ER and EF are matrices from a multivariate F calculation.
Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell & Delaney p.657.
Sort an array and provide the argsort.
Parameters
----------
a : array
Returns
-------
(sorted array, indices into the original array)
Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA. This function calculates the Friedman Chi-square test for repeated measures and returns the result, along with the associated probability value. It assumes 3 or more repeated measures. Only 3 levels requires a minimum of 10 subjects in the study. Four levels requires 5 subjects per level(??).
Returns: chi-square statistic, associated p-value
Calculates a linear model fit ... anova/ancova/lin-regress/t-test/etc. Taken
from:
Peterson et al. Statistical limitations in functional neuroimaging
I. Non-inferential methods and statistical models. Phil Trans Royal Soc
Lond B 354: 1239-1260.
Returns: statistic, p-value ???
Calculates the geometric mean of the values in the passed array.
That is: n-th root of (x1 * x2 * ... * xn)
a : array axis : int or None
The geometric mean computed over a single dimension of the input array or all values in the array if axis==None.
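A short pure-Python sketch of the n-th root formula above for a flat sequence of positive values. It averages logarithms instead of forming the raw product, which is an implementation choice made here and not something the source states; gmean_sketch is a hypothetical name:

```python
import math


def gmean_sketch(values):
    """Geometric mean of positive values: exp(mean(log(x)))."""
    # Summing logs avoids overflow that the product x1 * x2 * ... * xn can hit.
    return math.exp(sum(math.log(v) for v in values) / len(values))


result = gmean_sketch([1, 2, 4])  # cube root of 8
```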
Returns (i) an array of histogram bin counts, (ii) the smallest value of the histogram binning, (iii) the bin width, and (iv) the number of points that fall outside the binned range (items ii and iii are not necessarily integers). The default number of bins is 10. defaultlimits can be None (the routine picks bins spanning all the numbers in a) or a 2-sequence (lowerlimit, upperlimit).
Returns: (array of bin counts, bin-minimum, bin-width, #-points-outside-range)
histogram2(a,bins) -- Compute histogram of a using divisions in bins
Description:
Count the number of times values from array a fall into
numerical ranges defined by bins. Range x is given by
bins[x] <= range_x < bins[x+1] where x =0,N and N is the
length of the bins array. The last range is given by
bins[N] <= range_N < infinity. Values less than bins[0] are
not included in the histogram.
Arguments:
a -- 1D array. The array of values to be divided into bins
bins -- 1D array. Defines the ranges of values to use during
histogramming.
Returns:
1D array. Each value represents the occurrences for a given
bin (range) of values.
Caveat:
This should probably have an axis argument that would histogram
along a specific axis (kinda like matlab)
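The binning rule above (left-closed ranges, an open-ended last range, and values below bins[0] dropped) can be sketched with the standard bisect module. This assumes bins is sorted in ascending order; histogram2_sketch is a hypothetical name and not the module's implementation:

```python
from bisect import bisect_right


def histogram2_sketch(values, bins):
    """Count values per range bins[x] <= v < bins[x+1]; last range is open-ended."""
    counts = [0] * len(bins)
    for v in values:
        # Index of the rightmost bin edge that is <= v, or -1 if v < bins[0].
        idx = bisect_right(bins, v) - 1
        if idx >= 0:
            counts[idx] += 1
    return counts


print(histogram2_sketch([0, 1, 1.5, 2, 5, 10], [1, 2, 4]))  # [2, 1, 2]
```

The value 0 is below bins[0] and is not counted, while 5 and 10 both land in the final open-ended range.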
Calculates the harmonic mean of the values in the passed array.
That is: n / (1/x1 + 1/x2 + ... + 1/xn)
a : array axis : int or None
The harmonic mean computed over a single dimension of the input array or all values in the array if axis=None.
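A direct pure-Python sketch of the formula n / (1/x1 + ... + 1/xn) for a flat sequence of nonzero values; hmean_sketch is a hypothetical name:

```python
def hmean_sketch(values):
    """Harmonic mean: the count divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


result = hmean_sketch([1, 2, 4])  # 3 / (1 + 0.5 + 0.25)
```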
Returns a 2D array of item frequencies.
Column 1 contains item values, column 2 contains their respective counts. Assumes a 1D array is passed.
a : array
A 2D frequency table (col [0:n-1]=scores, col n=frequencies)
Calculates Kendall's tau, a correlation measure for ordinal data, and an associated p-value.
Returns: Kendall's tau, two-tailed p-value
The Kruskal-Wallis H-test is a non-parametric ANOVA for 2 or more groups, requiring at least 5 subjects in each group. This function calculates the Kruskal-Wallis H and associated p-value for 2 or more independent samples.
Returns: H-statistic (corrected for ties), associated p-value
Computes the Kolmogorov-Smirnov statistic on 2 samples. Modified from Numerical Recipes in C, page 493. Not ufunc-like.
Returns: KS D-value, p-value
Returns the D-value and the p-value for a Kolmogorov-Smirnov test of the null hypothesis that N random variates generated by rvs follow the distribution given by cdf and the extra arguments. If rvs is a function, it needs to accept the size= keyword. rvs can also be a vector of RVs.
cdf can be a function or a string indicating the distribution type.
If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.
Computes the kurtosis (Fisher or Pearson) of a dataset.
Kurtosis is the fourth central moment divided by the square of the variance.
If Fisher's definition is used, then 3.0 is subtracted from the result to
give 0.0 for a normal distribution.
If bias is False then the kurtosis is calculated using k statistics to
eliminate bias coming from biased moment estimators.
Use kurtosistest() to see if result is close enough to normal.
Parameters
----------
a : array
axis : int or None
fisher : bool
If True, Fisher's definition is used (normal ==> 0.0). If False,
Pearson's definition is used (normal ==> 3.0).
bias : bool
If False, then the calculations are corrected for statistical bias.
Returns
-------
The kurtosis of values along an axis, returning 0 where all values are
equal.
References
----------
[CRCProbStat2000] section 2.2.25
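The Fisher and Pearson definitions differ only by the constant 3.0. A minimal sketch of the biased estimator (the bias=True case only, with no k-statistic correction) for a flat sequence; kurtosis_sketch is a hypothetical name:

```python
def kurtosis_sketch(values, fisher=True):
    """Biased kurtosis: fourth central moment over the squared variance."""
    n = float(len(values))
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # All values equal: report 0 rather than dividing by zero.
        return 0.0
    pearson = m4 / m2 ** 2
    return pearson - 3.0 if fisher else pearson
```

For [1, 2, 3, 4, 5] the moments are m2 = 2 and m4 = 6.8, so the Pearson value is 1.7 and the Fisher value is -1.3.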
Tests whether a dataset has normal kurtosis (i.e., kurtosis=3(n-1)/(n+1)). Valid only for n>20.
Parameters
----------
a : array
axis : int or None
Returns
-------
(Z-score, 2-tail Z-probability)
The Z-score is set to 0 for bad entries.
Calculates a regression line on two arrays, x and y, corresponding to x,y pairs. If a single 2D array is passed, linregress finds dim with 2 levels and splits data into x,y pairs along that dim.
Returns: slope, intercept, r, two-tailed prob, stderr-of-the-estimate
Calculates a Mann-Whitney U statistic on the provided scores and returns the result. Use only when the n in each condition is < 20 and you have 2 independent samples of ranks. REMEMBER: Mann-Whitney U is significant if the u-obtained is LESS THAN or equal to the critical value of U.
Returns: u-statistic, one-tailed p-value (i.e., p(z(U)))
Mask an array for values outside of given limits.
This is primarily a utility function.
Parameters
----------
a : array
limits : (float or None, float or None)
A tuple consisting of the (lower limit, upper limit). Values in the
input array less than the lower limit or greater than the upper limit
will be masked out. None implies no limit.
inclusive : (bool, bool)
A tuple consisting of the (lower flag, upper flag). These flags
determine whether values exactly equal to lower or upper are allowed.
Returns
-------
A MaskedArray.
Raises
------
A ValueError if there are no values within the given limits.
Returns the arithmetic mean of a along the given dimension.
That is: (x1 + x2 + ... + xn) / n
a : array axis : int or None
The arithmetic mean computed over a single dimension of the input array or all values in the array if axis=None. The return value will have a floating point dtype even if the input data are integers.
Returns the median of the passed array along the given axis.
If there is an even number of entries, the mean of the 2 middle values is returned.
a : array axis=0 : int
The median of each remaining axis, or of all of the values in the array if axis is None.
Returns an array of the modal (most common) value in the passed array.
If there is more than one such value, only the first is returned. The bin-count for the modal bins is also returned.
a : array axis=0 : int
(array of modal values, array of counts for each mode)
Calculates the nth moment about the mean for a sample.
Generally used to calculate coefficients of skewness and kurtosis.
a : array moment : int axis : int or None
The appropriate moment along the given axis or over all values if axis is None.
Compute the mean over the given axis ignoring nans.
Compute the median along the given axis ignoring nan values
Compute the standard deviation over the given axis ignoring nans
Tests whether skew and/or kurtosis of dataset differs from normal curve.
This is the omnibus test of D'Agostino and Pearson, 1973
a : array axis : int or None
Computes a transform on input data (any number of columns). Used to test for homogeneity of variance prior to running one-way stats. Each array in *args is one level of a factor. If an f_oneway() run on the transformed data is found significant, the variances are unequal. From Maxwell and Delaney, p.112.
Returns: transformed data for use in an ANOVA
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson's correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
x : 1D array y : 1D array the same length as x
Note: result of this function depends on the values used to histogram the data(!).
Returns: percentile-position of score (0-100) relative to a
Calculates a point biserial correlation coefficient and the associated p-value.
The point biserial correlation is used to measure the relationship between a binary variable, x, and a continuous variable, y. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply a determinative relationship.
x : array of bools y : array of floats
Ranks the data in a, dealing with ties appropriately.
Equal values are assigned a rank that is the average of the ranks that would have been otherwise assigned to all of the values within that set. Ranks begin at 1, not 0.
In [15]: stats.rankdata([0, 2, 2, 3])
Out[15]: array([ 1. , 2.5, 2.5, 4. ])
An array of length equal to the size of a, containing rank scores.
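The tie-averaging rule can be sketched in pure Python as follows. This is an illustration of the rule described above, not the module's code; rankdata_sketch is a hypothetical name and it returns a plain list instead of an array:

```python
def rankdata_sketch(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end hold the 1-based ranks start+1..end+1.
        average = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


print(rankdata_sketch([0, 2, 2, 3]))  # [1.0, 2.5, 2.5, 4.0]
```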
Calculates the rank sums statistic on the provided scores and returns the result.
Returns: z-statistic, two-tailed p-value
Returns a relative frequency histogram, using the histogram function. Defaultreallimits can be None (use all data), or a 2-sequence containing lower and upper limits on values to include.
Returns: array of relfreq bin values, lowerreallimit, binsize, extrapoints
Returns the sample standard deviation of the values in the passed array (i.e., using N). Axis can equal None (ravel array first), or an integer (the axis over which to operate).
Returns the sample variance of the values in the passed array (i.e., using N). Axis can equal None (ravel array first), or an integer (the axis over which to operate).
Calculates the score at the given 'per' percentile of the sequence a. For example, the score at per=50 is the median.
If the desired quantile lies between two data points, we interpolate between them.
If the parameter 'limit' is provided, it should be a tuple (lower, upper) of two values. Values of 'a' outside this (closed) interval will be ignored.
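A minimal sketch of the interpolation step. The position convention per/100 * (n - 1) on the sorted data is an assumption made for this example, since the text above does not spell it out; scoreatpercentile_sketch is a hypothetical name:

```python
def scoreatpercentile_sketch(values, per, limit=None):
    """Score at percentile `per`, interpolating between neighboring data points."""
    if limit is not None:
        lower, upper = limit
        # Closed interval: values equal to either bound are kept.
        values = [v for v in values if lower <= v <= upper]
    ordered = sorted(values)
    position = per / 100.0 * (len(ordered) - 1)  # assumed position convention
    below = int(position)
    fraction = position - below
    if fraction == 0:
        return float(ordered[below])
    return ordered[below] + (ordered[below + 1] - ordered[below]) * fraction
```

With [1, 2, 3, 4] and per=50 the position is 1.5, so the result lies halfway between 2 and 3.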
Returns the standard error of the mean (i.e., using N) of the values in the passed array. Axis can equal None (ravel array first), or an integer (the axis over which to operate)
Calculates signal-to-noise. Axis can equal None (ravel array first), or an integer (the axis over which to operate).
Computes the skewness of a data set.
For normally distributed data, the skewness should be about 0. A skewness
value > 0 means that there is more weight in the right tail of the
distribution. The function skewtest() can be used to determine if the
skewness value is close enough to 0, statistically speaking.
Parameters
----------
a : array
axis : int or None
bias : bool
If False, then the calculations are corrected for statistical bias.
Returns
-------
The skewness of values along an axis, returning 0 where all values are
equal.
References
----------
[CRCProbStat2000] section 2.2.24.1
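A minimal sketch of the biased skewness estimator, assuming the standard definition (third central moment divided by the 1.5 power of the second) and covering only the bias=True case for a flat sequence; skew_sketch is a hypothetical name:

```python
def skew_sketch(values):
    """Biased skewness: m3 / m2**1.5, or 0 when all values are equal."""
    n = float(len(values))
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


right_tailed = skew_sketch([1, 2, 3, 4, 10])  # one large value on the right
symmetric = skew_sketch([1, 2, 3])
```

The sample with a single large value on the right gets a positive skewness, and the symmetric sample gets zero.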
Tests whether the skew is significantly different from a normal distribution. The size of the dataset should be >= 8.
Parameters
----------
a : array
axis : int or None
Returns
-------
(Z-score, 2-tail Z-probability)
Calculates a Spearman rank-order correlation coefficient and the p-value
to test for non-correlation.
The Spearman correlation is a nonparametric measure of the monotonicity of the
relationship between two datasets. Unlike the Pearson correlation, the
Spearman correlation does not assume that both datasets are normally
distributed. Like other correlation coefficients, this one varies between -1
and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an
exact monotonic relationship. Positive correlations imply that as x increases,
so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Spearman correlation at least as extreme as
the one computed from these datasets. The p-values are not entirely reliable
but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
The lengths of both arrays must be > 2.
Returns
-------
(Spearman correlation coefficient,
2-tailed p-value)
References
----------
[CRCProbStat2000] section 14.7
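One common way to obtain the coefficient is to rank both datasets (averaging ties) and take the Pearson correlation of the ranks. The sketch below follows that route in pure Python and omits the p-value; it is not necessarily how the module computes it, and all three function names are hypothetical:

```python
import math


def average_ranks(values):
    """1-based ranks with ties replaced by their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_sketch(x, y):
    """Spearman coefficient as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For x = [1, 2, 3, 4, 5] and y = [1, 4, 9, 16, 25] the relationship is increasing but not linear, and the rank-based coefficient is still 1.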
Adds the values in the passed array, squares that sum, and returns the result.
Returns: the square of the sum over axis.
Squares each value in the passed array, adds these squares, and returns the result.
a : array axis : int or None
The sum along the given axis for (a*a).
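The two support functions are easy to confuse, so here is a side-by-side sketch for a flat sequence. The _sketch names are hypothetical stand-ins and not the module's functions:

```python
def square_of_sums_sketch(values):
    """Add the values first, then square the total."""
    return float(sum(values)) ** 2


def ss_sketch(values):
    """Square each value first, then add the squares."""
    return sum(v * v for v in values)


data = [1, 2, 3]
print(square_of_sums_sketch(data))  # 36.0, i.e. (1 + 2 + 3) ** 2
print(ss_sketch(data))              # 14, i.e. 1 + 4 + 9
```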
Returns the estimated population standard deviation of the values in the passed array (i.e., N-1). Axis can equal None (ravel array first), or an integer (the axis over which to operate).
Returns the estimated population standard error of the values in the passed array (i.e., N-1). Axis can equal None (ravel array first), or an integer (the axis over which to operate).
Like numpy.clip() except that values <threshmin or >threshmax are replaced by newval instead of by threshmin/threshmax (respectively).
Returns: a, with values <threshmin or >threshmax replaced with newval
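A list-based sketch of the replacement rule, assuming that a limit of None means no bound on that side and that newval defaults to 0. Both defaults are assumptions made for this example, and threshold_sketch is a hypothetical name:

```python
def threshold_sketch(values, threshmin=None, threshmax=None, newval=0):
    """Replace values below threshmin or above threshmax with newval."""
    result = []
    for v in values:
        too_low = threshmin is not None and v < threshmin
        too_high = threshmax is not None and v > threshmax
        result.append(newval if (too_low or too_high) else v)
    return result


print(threshold_sketch([1, 5, 10, 15], threshmin=5, threshmax=10, newval=-1))
# [-1, 5, 10, -1]
```

Values equal to a limit are kept, which is where this differs from clipping only in what gets written for the out-of-range entries.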
Tie-corrector for ties in Mann Whitney U and Kruskal Wallis H tests. See Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill. Code adapted from |Stat rankind.c code. Returns: T correction factor for U or H
Returns the maximum value of a, along axis, including only values less than (or equal to, if inclusive is True) upperlimit. If the limit is set to None, a limit larger than the max value in the array is used.
Returns the arithmetic mean of all values in an array, ignoring values
strictly outside given limits.
Parameters
----------
a : array
limits : None or (lower limit, upper limit)
Values in the input array less than the lower limit or greater than the
upper limit will be masked out. When limits is None, then all values are
used. Either of the limit values in the tuple can also be None
representing a half-open interval.
inclusive : (bool, bool)
A tuple consisting of the (lower flag, upper flag). These flags
determine whether values exactly equal to lower or upper are allowed.
Returns
-------
A float.
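How limits and inclusive interact can be sketched for a flat sequence as follows. Raising ValueError when nothing survives the limits is an assumption made for this example; tmean_sketch is a hypothetical name and not the module's implementation:

```python
def tmean_sketch(values, limits=None, inclusive=(True, True)):
    """Mean of the values that fall inside the optional (lower, upper) limits."""
    lower, upper = limits if limits is not None else (None, None)

    def keep(v):
        # A bound of None leaves that side of the interval open-ended.
        if lower is not None and (v < lower or (v == lower and not inclusive[0])):
            return False
        if upper is not None and (v > upper or (v == upper and not inclusive[1])):
            return False
        return True

    kept = [v for v in values if keep(v)]
    if not kept:
        # Assumption: fail loudly instead of returning a meaningless mean.
        raise ValueError("no values within the given limits")
    return sum(kept) / float(len(kept))
```

For example, tmean_sketch([1, 2, 3, 4, 100], limits=(None, 10)) ignores the value 100 and averages the remaining four.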
Returns the minimum value of a, along axis, including only values greater than (or equal to, if inclusive is True) lowerlimit. If the limit is set to None, all values in the array are used.
Slices off the passed proportion of items from ONE end of the passed array (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost' 10% of scores). Slices off LESS if proportion results in a non-integer slice index (i.e., conservatively slices off proportiontocut).
Returns: trimmed version of array a
Return mean with proportiontocut chopped from each of the lower and upper tails.
Slices off the passed proportion of items from BOTH ends of the passed array (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND 'rightmost' 10% of scores). You must pre-sort the array if you want "proper" trimming. Slices off LESS if proportion results in a non-integer slice index (i.e., conservatively slices off proportiontocut).
Returns: trimmed version of array a
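The conservative rounding can be sketched with plain list slicing: the cut count is truncated toward zero, so a fractional index never removes an extra item. trimboth_sketch is a hypothetical name for illustration only:

```python
def trimboth_sketch(values, proportiontocut):
    """Drop int(proportiontocut * n) items from each end of the sequence."""
    # int() truncates, so a non-integer cut index slices off LESS, not more.
    cut = int(proportiontocut * len(values))
    return values[cut:len(values) - cut]


scores = sorted([9, 3, 7, 1, 5, 0, 8, 2, 6, 4])  # pre-sort for "proper" trimming
print(trimboth_sketch(scores, 0.1))   # [1, 2, 3, 4, 5, 6, 7, 8]
print(trimboth_sketch(scores, 0.15))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Both calls remove one item per end, because 0.15 * 10 = 1.5 truncates to 1.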
Returns the standard error of the mean for the values in an array, (i.e., using N for the denominator), ignoring values strictly outside the sequence passed to 'limits'. Note: either limit in the sequence, or the value of limits itself, can be set to None. The inclusive list/tuple determines whether the lower and upper limiting bounds (respectively) are open/exclusive (0) or closed/inclusive (1).
Returns the standard deviation of all values in an array, ignoring values strictly outside the sequence passed to 'limits'. Note: either limit in the sequence, or the value of limits itself, can be set to None. The inclusive list/tuple determines whether the lower and upper limiting bounds (respectively) are open/exclusive (0) or closed/inclusive (1).
Calculates the t-obtained for the one-sample T-test on ONE group of scores a, given a population mean.
Returns: t-value, two-tailed prob
Calculates the t-obtained for the T-test on TWO INDEPENDENT samples of scores, a and b. From Numerical Recipes, p.483. Axis can equal None (ravel array first), or an integer (the axis over which to operate on a and b).
Returns: t-value, two-tailed p-value
Calculates the t-obtained for the T-test on TWO RELATED samples of scores, a and b. From Numerical Recipes, p.483. Axis can equal None (ravel array first), or an integer (the axis over which to operate on a and b).
Returns: t-value, two-tailed p-value
Returns the sample variance of values in an array, (i.e., using N-1), ignoring values strictly outside the sequence passed to 'limits'. Note: either limit in the sequence, or the value of limits itself, can be set to None. The inclusive list/tuple determines whether the lower and upper limiting bounds (respectively) are open/exclusive (0) or closed/inclusive (1).
Returns the estimated population variance of the values in the passed array (i.e., N-1). Axis can equal None (ravel array first), or an integer (the axis over which to operate).
Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
a : array axis : int or None
[CRCProbStat2000] section 2.2.20
Returns the z-score of a given input score, given the array from which that score came. Not appropriate for population calculations, nor for arrays > 1D.
Returns an array of z-scores the shape of scores (e.g., [x,y]), compared to array passed to compare (e.g., [time,x,y]). Assumes collapsing over dim 0 of the compare array.
Returns a 1D array of z-scores, one for each score in the passed array, computed relative to the passed array.
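A minimal sketch of computing z-scores for a flat sequence relative to itself. It assumes the standard deviation uses N in the denominator, and it does not guard against a constant input (which would divide by zero); zs_sketch is a hypothetical name:

```python
import math


def zs_sketch(values):
    """z-score of every value relative to the mean and std of the same data."""
    n = float(len(values))
    mean = sum(values) / n
    # Assumption: standard deviation normalized by N, not N-1.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


print(zs_sketch([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The example data has mean 5 and standard deviation 2, so each z-score is (value - 5) / 2.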
| Local name | Refers to |
|---|---|
| distributions | scipy.stats.distributions |
| linalg | scipy.linalg |
| math | numpy.core.umath |
| np | numpy |
| scipy.stats | scipy.stats |
| sp | scipy |
| special | scipy.special |
| sys | sys |
| warnings | warnings |
| _move_axis_to_0 | numpy.core.numeric._move_axis_to_0 |
| _support | scipy.stats._support |