Regression Techniques

MathCAD provides a variety of tools for fitting a curve to a given set of data. In all cases, you must have an idea of the general form of the equation that fits the data set in question. We will discuss each of the following techniques:

Linear regression finds the line that best fits a set of data points

slope(vx,vy) to find the slope of the line and
intercept(vx,vy) to find the intercept of the line that best fits the data in vectors vx and vy.

Polynomial regression finds a polynomial that best fits a set of data points in vectors vx and vy

regress(vx,vy,k) to find the kth order polynomial that best fits
the x and y data values in vx and vy.

loess(vx,vy,span) to help smooth data noise. Fits a series of
second order polynomials within a region defined
by 'span' to best fit the x and y data values.

interp(vs,vx,vy,x) based on dataset vx and vy, interp returns an
interpolated or extrapolated y value corresponding
to x. Vector 'vs' is generated by regress.

Fitting a linear combination of functions using linfit(vx,vy,F) to determine the coefficients of this combination of functions, as defined in matrix F, which best approximates the data in vx and vy.

Fitting an arbitrary series of functions to data using genfit(vx,vy,vg,F).
This returns a vector containing the parameters that make a function f of x and n parameters best approximate the data in vx and vy.

Consider the dataset defined in vectors x & y and plotted as shown below.

If we assume a linear fit, we can use the slope and intercept functions to find the equation of the best-fit straight line through these points.

A plot showing the fitted curve, f(t) plotted against the actual data.
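For readers without MathCAD at hand, the calculation behind slope and intercept can be sketched in Python. This is a minimal illustration of the ordinary least-squares formulas, not MathCAD's own implementation, and the data values are hypothetical because the worksheet's dataset is not reproduced in this text.

```python
def slope(vx, vy):
    """Least-squares slope of the line that best fits (vx, vy)."""
    n = len(vx)
    mean_x = sum(vx) / n
    mean_y = sum(vy) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(vx, vy))
    sxx = sum((x - mean_x) ** 2 for x in vx)
    return sxy / sxx


def intercept(vx, vy):
    """Least-squares intercept of the line that best fits (vx, vy)."""
    n = len(vx)
    return sum(vy) / n - slope(vx, vy) * sum(vx) / n


# Hypothetical data, chosen only for illustration.
vx = [0.0, 1.0, 2.0, 3.0, 4.0]
vy = [1.1, 2.9, 5.2, 6.8, 9.0]

m = slope(vx, vy)
b = intercept(vx, vy)


def f(t):
    """Best-fit straight line through the data."""
    return m * t + b


print("slope =", m, "intercept =", b)
```

Plotting f(t) against the original points gives the same kind of comparison as the plot described above.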

Using the same dataset from above, we can also fit a curve to this dataset using the regress function.

The arguments for the regress function include data vectors x & y as well as the order of the polynomial you wish to fit. The example shows regress being used to fit three different polynomials of orders one, two and three.

The first three elements of each vector are used by the function, 'interp', which we'll discuss shortly. The last two, three and four elements of vectors b, c and d respectively are the coefficients we need to write the polynomials.

<-- As described by vector 'b'
<-- As described by vector 'c'
<-- As described by vector 'd'
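As a rough illustration of what regress computes, the sketch below finds least-squares polynomial coefficients in Python by solving the normal equations. It returns only the coefficients (constant term first, which is an assumption about ordering), not the extra leading elements that MathCAD stores for interp. The function names and the data are hypothetical.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def polyfit_sketch(vx, vy, k):
    """Coefficients c[0..k] of the least-squares polynomial of order k."""
    A = [[sum(x ** (i + j) for x in vx) for j in range(k + 1)]
         for i in range(k + 1)]
    rhs = [sum(y * x ** i for x, y in zip(vx, vy)) for i in range(k + 1)]
    return solve(A, rhs)


# Hypothetical data lying exactly on y = 1 + 2x - 0.5x^2.
vx = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
vy = [1.0, 2.5, 3.0, 2.5, 1.0, -1.5]

b = polyfit_sketch(vx, vy, 1)  # first order
c = polyfit_sketch(vx, vy, 2)  # second order
d = polyfit_sketch(vx, vy, 3)  # third order
print(b, c, d, sep="\n")
```

Because the sample points lie exactly on a parabola, the second-order fit recovers its coefficients and the third-order fit adds a cubic term that is essentially zero. The normal equations become ill-conditioned for high orders, so this approach is only suitable for small k.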

Each result is plotted on the graph above. Note that each result provides a different degree of fit. If you need a fit that is predictive only within the limits of the data provided (interpolative), almost any function that correlates well will do. However, if you need a function that describes some physical phenomenon over a range that forces the function to extrapolate beyond the data you provided, then you must fit a function, or some combination of functions, known to describe that phenomenon. The graph above shows why. Once the axes are expanded beyond the limits of the data provided, one can see that each function behaves quite differently.

Once the regress function has been used, you can apply the interp function to either interpolate or extrapolate a value of 'y' for a given value of 'x'. Using the above example, find the values of 'y' for values of x equal to -10, 3.5 and 12.

From above:

Of course, you can use regress to find the coefficients, write the equation as a function, then determine the value at each point as usual. However, the interp function is an easier method to do so if you do not actually need the function defined.
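Both routes amount to evaluating the fitted polynomial at the requested points. A minimal Python sketch of that evaluation follows; the coefficient values are hypothetical stand-ins for a regress result (constant term first), and interp_poly is an illustrative name, not a MathCAD function.

```python
def interp_poly(coeffs, x):
    """Evaluate c0 + c1*x + ... + ck*x^k using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Hypothetical second-order coefficients: y = 1 + 2x - 0.5x^2.
coeffs = [1.0, 2.0, -0.5]

# The same three x values used in the example above.
values = {x: interp_poly(coeffs, x) for x in (-10.0, 3.5, 12.0)}
for x, y in values.items():
    print("x =", x, "y =", y)
```

The values at -10 and 12 are extrapolations if they lie outside the fitted data, so they deserve the caution discussed earlier.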

The regress function is a convenient function to use when the dataset can be described with a single polynomial. However, it is not unusual for a set of data to require modeling with a linear combination of arbitrary functions. In such a case, the linfit function should be used. To use our original dataset as an example:

<-- Original Data Set (Transposed)

To use linfit, create a vector, G, of the functions you wish to combine. In this case, the functions describe a straight line. Assign the result of linfit to a vector variable to store the coefficients.

<-- Display vector S (Transposed). The coefficients held in this vector correspond to the functions in vector G. The equation is shown as function G(xx). Note these are the same coefficients as found above using slope and intercept or regress for a linear fit.

The biggest advantage of linfit is the ability to determine the coefficients of a linear combination of arbitrary functions as shown in the vector of functions F(xz).

The resulting function is indicated as F(aa) and is plotted below.

Viewing the above plot, the arbitrary combination of functions specified by F(aa) fits the data well within the limits of the dataset. However, it becomes obvious the function is likely invalid outside of the limits of the dataset. In this particular case, the function is asymptotic at x=-1. The linear function, G(aa), is shown as a green dashed plot for reference.
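The idea behind linfit can be sketched in Python as a least-squares solve over a list of basis functions. This is an illustration under stated assumptions rather than MathCAD's algorithm: the basis functions (1, x and e^-x), the data and the name linfit_sketch are all hypothetical.

```python
import math


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linfit_sketch(vx, vy, funcs):
    """Coefficients of the best-fit linear combination of funcs."""
    n = len(funcs)
    rows = [[f(x) for f in funcs] for x in vx]  # one row per data point
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    rhs = [sum(r[i] * y for r, y in zip(rows, vy)) for i in range(n)]
    return solve(A, rhs)


# Hypothetical basis: a constant, a linear term and a decaying exponential.
basis = [lambda x: 1.0, lambda x: x, lambda x: math.exp(-x)]

# Hypothetical data generated from 2 + 0.5x + 3e^(-x).
vx = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
vy = [2.0 + 0.5 * x + 3.0 * math.exp(-x) for x in vx]

S = linfit_sketch(vx, vy, basis)
print(S)
```

Each basis function may be nonlinear in x; what matters is that the unknown coefficients multiply the functions and are summed.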

The function linfit() is used to solve a LINEAR combination of ARBITRARY functions. As such, it will solve for the coefficients of nonlinear functions as long as they are combined in a linear fashion. In other words, we can fit the following function to our sample set of data points.

However, linfit CANNOT solve for the coefficients of nonlinear functions of the following nature.

In this case, the independent variable, 'z', which is an argument of both the Napierian base and the natural log, has coefficients 'b' and 'd' respectively. To solve for the coefficients of the function indicated above, we must use genfit(). If the function we wish to fit has one of the forms listed below, we can use the appropriate fitting function.

expfit(vx, vy, vg) to find the coefficients of an equation of the form a·e^(bx) + c fitting the dataset vx and vy. Vector vg is optional and consists of guesses for coefficients a, b and c.

lgsfit(vx, vy, vg) to find the coefficients of an equation of the form a / (1 + b·e^(-cx)) fitting the dataset vx and vy. Vector vg is required and consists of guesses for coefficients a, b and c.

lnfit(vx, vy) to find the coefficients of an equation of the form a·ln(x) + b fitting the dataset vx and vy. Note that guesses for the coefficients are not required.

logfit(vx, vy, vg) to find the coefficients of an equation of the form a·ln(x + b) + c fitting the dataset vx and vy. Vector vg is required and consists of guesses for coefficients a, b and c.

pwrfit(vx, vy, vg) to find the coefficients of an equation of the form a·x^b + c fitting the dataset vx and vy. Vector vg is required and consists of guesses for coefficients a, b and c.

sinfit(vx, vy, vg) to find the coefficients of an equation of the form a·sin(x + b) + c fitting the dataset vx and vy. Vector vg is required and consists of guesses for coefficients a, b and c.

genfit(vx, vy, a, F) to find the coefficients of a nonlinear combination of functions against dataset vx and vy. Argument 'a' is a vector of guesses for the coefficients. Vector 'F' is a vector containing the function we wish to fit and its partial derivatives.
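Of the functions listed above, lnfit is the one that needs no guesses. A plausible reason is that a·ln(x) + b is linear in both coefficients, so it reduces to a straight-line fit of y against ln(x). The Python sketch below illustrates that reduction; it is not MathCAD's implementation, and the data and the name lnfit_sketch are hypothetical.

```python
import math


def lnfit_sketch(vx, vy):
    """Fit y = a*ln(x) + b by a straight-line fit of y against u = ln(x)."""
    u = [math.log(x) for x in vx]  # requires every x > 0
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(vy) / n
    suy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, vy))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    a = suy / suu
    b = mean_y - a * mean_u
    return a, b


# Hypothetical data generated from y = 3*ln(x) + 2.
vx = [1.0, 2.0, 4.0, 8.0]
vy = [3.0 * math.log(x) + 2.0 for x in vx]

a, b = lnfit_sketch(vx, vy)
print("a =", a, "b =", b)
```

The other forms in the list have a coefficient inside the exponential, logarithm, power or sine, so no such transformation makes them linear in every coefficient and an iterative search starting from guesses is needed.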

Let's compare each of the above by trying to fit each function to our data set. Vectors 'x' and 'y' are shown as a reminder. Let's define vector 'g' as our vector of guesses for coefficients a, b and c.

Exponential

Logistic

Logarithmic

Power

Sinusoidal

The above graph plots all six functions on top of the original data.

A final method of curve fitting in MathCAD is the genfit command. This is a generalized curve fitting command that will do what all the previous methods will do, albeit somewhat less conveniently. The genfit command requires that you identify the general form of the equation and its partial derivatives with respect to each coefficient. The biggest advantage of this command is that it will fit a function that is nonlinear in its coefficients; regress and linfit work only when the coefficients enter linearly, and the specialized functions listed above are each limited to one fixed form.

We will use the same set of data as in the previous examples to demonstrate genfit(). The genfit() function uses the syntax genfit(x,y,a,F). The first two arguments are the vectors of 'x' and 'y' coordinates. The arguments 'a' and 'F' are also vectors. Vector 'a' contains guesses for each of the unknown coefficients of the function you are fitting. Vector 'F' contains the function you are fitting and each of its partial derivatives. There are as many partial derivatives as there are coefficients.

<-- Data set to which we will curve fit a linear set of functions.

First, establish a function F(s,a) equal to a vector of functions. The first element is the equation you wish to fit. All other elements are the partial derivatives of that function with respect to each coefficient. The order of the arguments for vector 'F' is critical.

<-- General equation to fit

<-- Partial derivatives of the equation

You can use MathCAD to aid in finding the partial derivatives. Simply apply the derivative operator to the general equation as shown below. The derivative operator can be accessed from the Calculus Palette (or Shift + /). The symbolic evaluation symbol (-->) is accessed from the Boolean Palette (or Ctrl + .).
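If the partial derivatives are worked out by hand instead, it is worth checking them numerically before passing them to a fitting routine. The Python sketch below compares hand-written partial derivatives with central finite differences. The model a0·e^(a1·x) + a2 is a hypothetical example, not the equation used in the worksheet, and all names are illustrative.

```python
import math


def f(x, a):
    """Hypothetical model: a0 * e^(a1 * x) + a2."""
    return a[0] * math.exp(a[1] * x) + a[2]


def partials(x, a):
    """Hand-derived partial derivatives of f with respect to a0, a1, a2."""
    e = math.exp(a[1] * x)
    return [e, a[0] * x * e, 1.0]


def numeric_partials(x, a, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    grads = []
    for i in range(len(a)):
        up = list(a)
        down = list(a)
        up[i] += h
        down[i] -= h
        grads.append((f(x, up) - f(x, down)) / (2 * h))
    return grads


a = [2.0, 0.5, 1.0]  # hypothetical coefficient values
x = 1.5              # hypothetical evaluation point

analytic = partials(x, a)
numeric = numeric_partials(x, a)
max_error = max(abs(p - q) for p, q in zip(analytic, numeric))
print("largest disagreement:", max_error)
```

A large disagreement usually points to an algebra slip or to derivatives listed in the wrong order.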

<-- Genfit requires guesses for the coefficients a₁, a₂, a₃ and a₄. These guesses must be placed in vector 'a'. Note: If the array origin (ORIGIN) is set to zero instead of one, then the guesses would be for a₀ through a₃.

<-- As mentioned above, the arguments for the genfit command are vectors x and y representing the x and y coordinates of the dataset, vector 'a' representing the guesses for the coefficients and vector 'F', the function to be fit along with its partial derivatives. The components of vector 'ans' are the values of each of the coefficients.
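To show why genfit needs both guesses and partial derivatives, here is a minimal Python sketch of an iterative nonlinear least-squares fit. It uses Gauss-Newton steps with step halving, which is an assumption made for illustration and not necessarily the algorithm MathCAD uses. The model a0·e^(a1·x), the data, the guesses and the name genfit_sketch are all hypothetical.

```python
import math


def F(x, a):
    """Hypothetical model a0*e^(a1*x) followed by its partial derivatives."""
    e = math.exp(a[1] * x)
    return [a[0] * e,      # the function to fit
            e,             # partial derivative with respect to a0
            a[0] * x * e]  # partial derivative with respect to a1


def sse(vx, vy, a, F):
    """Sum of squared residuals for coefficients a."""
    return sum((y - F(x, a)[0]) ** 2 for x, y in zip(vx, vy))


def solve2(A, r):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(r[0] * A[1][1] - A[0][1] * r[1]) / det,
            (A[0][0] * r[1] - A[1][0] * r[0]) / det]


def genfit_sketch(vx, vy, guess, F, iterations=50):
    """Gauss-Newton iteration with step halving, for two coefficients."""
    a = list(guess)
    for _ in range(iterations):
        JTJ = [[0.0, 0.0], [0.0, 0.0]]
        JTr = [0.0, 0.0]
        for x, y in zip(vx, vy):
            vals = F(x, a)
            resid = y - vals[0]
            grad = vals[1:]
            for i in range(2):
                JTr[i] += grad[i] * resid
                for j in range(2):
                    JTJ[i][j] += grad[i] * grad[j]
        step = solve2(JTJ, JTr)
        current = sse(vx, vy, a, F)
        t = 1.0
        trial = [ai + t * si for ai, si in zip(a, step)]
        # Halve the step until the fit is no worse than the current one.
        while t > 1e-10 and sse(vx, vy, trial, F) > current:
            t /= 2
            trial = [ai + t * si for ai, si in zip(a, step)]
        if t <= 1e-10:
            break  # no improving step found
        a = trial
        if max(abs(t * s) for s in step) < 1e-12:
            break  # converged
    return a


# Hypothetical data generated from 2*e^(0.5*x).
vx = [0.0, 1.0, 2.0, 3.0, 4.0]
vy = [2.0 * math.exp(0.5 * x) for x in vx]

ans = genfit_sketch(vx, vy, [1.0, 0.3], F)
print(ans)
```

The partial derivatives supply the direction of each update and the guesses supply the starting point, which is why a poor guess can lead an iterative fit to a different answer or to none at all.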