Maths help?

Get your science fix here: research, quackery, activism and all the rest
Post Reply
User avatar
Formerly AvP
Stargoon
Posts: 119
Joined: Sun Oct 16, 2022 6:42 pm

Maths help?

Post by Formerly AvP » Thu Dec 08, 2022 12:26 pm

I have a number of data sets which look like this (in two columns i.e. 47.5 is the first entry, corresponding to 4.5 in the second column)

47.5 4.5
57.5 6.3333
67.5 6.66667
72.5 7.6
77.5 8
82.5 8
87.5 8
92.5 8

These are very close to lying on an ogive of a normal distribution. Using Excel, I can draw a normal distribution, and adjust the mean and sd so it passes through these points to a close approximation, but there must be a more elegant way of doing it! Any help gratefully acknowledged...
Was Allo V Psycho, but when my laptop died, I lost all the info on it...

monkey
After Pie
Posts: 1906
Joined: Wed Nov 13, 2019 5:10 pm

Re: Maths help?

Post by monkey » Thu Dec 08, 2022 1:20 pm

You want to do a least squares fit of the normal/Gaussian to the data. Not sure how to do that in Excel, or even if you can.

IvanV
Stummy Beige
Posts: 2660
Joined: Mon May 17, 2021 11:12 am

Re: Maths help?

Post by IvanV » Thu Dec 08, 2022 3:35 pm

monkey wrote:
Thu Dec 08, 2022 1:20 pm
You want to do a least squares fit of the normal/Gaussian to the data. Not sure how to do that in Excel, or even if you can.
Typically, if you want to do an OLS determination of a function A(x; m, s) with two parameters m and s we want to fit, I have called those two parameters m and s for mean and std dev, we need to transform that function into a form so that it looks like B f(x) + C g(x), where f(x) and g(x) are some new functions of the independent variable, and B and C are functions of m and s such that we can solve them simultaneously for m and s.

So, for example, if our function A were m exp{-sx], we could take logs of it and get log m - sx, and regress that, log m would be the intercept and -s the slope term.

The pdf of a normal distribution of mean m and std dev s is exp{-(x-m)2/2s2}/sqrt{2pi}.

The cumulative distribution is obtained by integrating that. That isn't an integral you can perform symbolically. It's a function you can have tables for, or get a computer to tell you its value, like sin or log. With the two parameters deeply embedded into it, it certainly isn't a function you can transform to get into the form needed for an OLS.

But if we know what the mean is, then there is a simple transformation we can perform to get the std out of it.

Do you know what you have got there? We can see you have the top bit where it flattens off at 8. Do you know where it flattens out at the bottom? If so then the mean is exactly half way between the top and bottom flats. Or maybe you know that 47.5 is the mean? Then at, say, 57.5 you can look at the value and work out how many stds from the mean you should be at that point, by looking up in the normal cum dist table. You can do this for each of the other points where the value is less than 8. That would give you a line to fit to estimate the std, or just some numbers you can average to estimate it.

But if that 47.5 is some random point in mid-distribution, I think you have little prospect of knowing what the mean is with much accuracy, and thus not the std dev either. Trying to fit the shape of a curve when you have so few points on it is a mug's game.

User avatar
shpalman
Princess POW
Posts: 8241
Joined: Mon Nov 11, 2019 12:53 pm
Location: One step beyond
Contact:

Re: Maths help?

Post by shpalman » Thu Dec 08, 2022 5:08 pm

monkey wrote:
Thu Dec 08, 2022 1:20 pm
You want to do a least squares fit of the normal/Gaussian to the data. Not sure how to do that in Excel, or even if you can.
Excel does linear fits but can also do multiple linear fits at the same time. An example based on something I've done is a polynomial fit. Say you want to fit a + b*x + c*x^2 = y you basically have a column with your x data and another column with your y data, and then you also make a column with x*2. Then it works like a linear fit of x and x*2 to y. It's still a bit fiddly, you need to select the data set in the right way and also select a range to put the multi-cell output into.

Here's someone fitting a Gaussian in Excel with a different strategy.

Maybe some of the statistical functions in Excel will essentially give you the important parameters of the normal distribution.

I tend to use scipy.optimize.curve_fit a lot these days instead, you can define whatever function you like.
having that swing is a necessary but not sufficient condition for it meaning a thing
@shpalman@mastodon.me.uk

User avatar
shpalman
Princess POW
Posts: 8241
Joined: Mon Nov 11, 2019 12:53 pm
Location: One step beyond
Contact:

Re: Maths help?

Post by shpalman » Thu Dec 08, 2022 5:12 pm

Oh and the integral of the Gaussian is called the error function although something like the logistic function looks similar enough so it depends on what your underlying process actually is, which one you should choose.
having that swing is a necessary but not sufficient condition for it meaning a thing
@shpalman@mastodon.me.uk

User avatar
Formerly AvP
Stargoon
Posts: 119
Joined: Sun Oct 16, 2022 6:42 pm

Re: Maths help?

Post by Formerly AvP » Thu Dec 08, 2022 8:02 pm

I love this forum. Thanks all!
Ivan, I know the bottom value will be (0,0) and the top value is (100,8). This is why I reckoned it was a Gaussian in the first place. Shpalman, that link to how to do it in Excel looks as if it would solve it, once I've down loaded Solver.
Gratitude again...

AvP
Was Allo V Psycho, but when my laptop died, I lost all the info on it...

monkey
After Pie
Posts: 1906
Joined: Wed Nov 13, 2019 5:10 pm

Re: Maths help?

Post by monkey » Thu Dec 08, 2022 8:54 pm

shpalman wrote:
Thu Dec 08, 2022 5:08 pm
I tend to use scipy.optimize.curve_fit a lot these days instead, you can define whatever function you like.
Similarly, I normally use the fitting/analysis functions in Matlab.

I used to use a free graphing package for fits back when I was doing my MSc, can't remember what it was called though. It was basically a free version of Sigmaplot. Started using it because it made nicer graphs were nicer than Excel, but then found the fitting tools when I needed them. I bet you could find something like it with a bit of googling.

User avatar
shpalman
Princess POW
Posts: 8241
Joined: Mon Nov 11, 2019 12:53 pm
Location: One step beyond
Contact:

Re: Maths help?

Post by shpalman » Thu Dec 08, 2022 9:12 pm

monkey wrote:
Thu Dec 08, 2022 8:54 pm
shpalman wrote:
Thu Dec 08, 2022 5:08 pm
I tend to use scipy.optimize.curve_fit a lot these days instead, you can define whatever function you like.
Similarly, I normally use the fitting/analysis functions in Matlab.

I used to use a free graphing package for fits back when I was doing my MSc, can't remember what it was called though. It was basically a free version of Sigmaplot. Started using it because it made nicer graphs were nicer than Excel, but then found the fitting tools when I needed them. I bet you could find something like it with a bit of googling.
Is QTIplot still a thing? It's more like Origin. I used to use Origin but now do fitting in python and plotting either there (or more usually) gnuplot.
having that swing is a necessary but not sufficient condition for it meaning a thing
@shpalman@mastodon.me.uk

User avatar
nekomatic
Dorkwood
Posts: 1376
Joined: Mon Nov 11, 2019 3:04 pm

Re: Maths help?

Post by nekomatic » Thu Dec 08, 2022 9:17 pm

Yeah this is how I’d do it, if the data was in excel. Plotting the graph is a good idea as it gives you a visual check that the answer is sensible.
Move-a… side, and let the mango through… let the mango through

monkey
After Pie
Posts: 1906
Joined: Wed Nov 13, 2019 5:10 pm

Re: Maths help?

Post by monkey » Thu Dec 08, 2022 10:05 pm

shpalman wrote:
Thu Dec 08, 2022 9:12 pm
monkey wrote:
Thu Dec 08, 2022 8:54 pm
shpalman wrote:
Thu Dec 08, 2022 5:08 pm
I tend to use scipy.optimize.curve_fit a lot these days instead, you can define whatever function you like.
Similarly, I normally use the fitting/analysis functions in Matlab.

I used to use a free graphing package for fits back when I was doing my MSc, can't remember what it was called though. It was basically a free version of Sigmaplot. Started using it because it made nicer graphs were nicer than Excel, but then found the fitting tools when I needed them. I bet you could find something like it with a bit of googling.
Is QTIplot still a thing? It's more like Origin. I used to use Origin but now do fitting in python and plotting either there (or more usually) gnuplot.
Looking at screenshots, that looks pretty similar to what I was using, so that could've been it. No longer open source though. Clicking around, I just found labplot (clicky), which has been updated recently, and should do the job.


ETA: Just to make sure, I have never used Labplot. No idea how well it works or how intuitive it is.

User avatar
shpalman
Princess POW
Posts: 8241
Joined: Mon Nov 11, 2019 12:53 pm
Location: One step beyond
Contact:

Re: Maths help?

Post by shpalman » Thu Dec 08, 2022 10:39 pm

Oh and I use LibreOffice instead of Excel as much as possible even if I get use of Office via my institution.
having that swing is a necessary but not sufficient condition for it meaning a thing
@shpalman@mastodon.me.uk

User avatar
dyqik
Princess POW
Posts: 7526
Joined: Wed Sep 25, 2019 4:19 pm
Location: Masshole
Contact:

Re: Maths help?

Post by dyqik » Thu Dec 08, 2022 10:50 pm

My reply has gone missing, but you can use moments to estimate the parameters of a normal/gaussian distribution.

The mean is approximated by the sample mean, calculated with AVERAGE in Excel, and the std dev can estimated as the sample variance with PAR.V.

User avatar
dyqik
Princess POW
Posts: 7526
Joined: Wed Sep 25, 2019 4:19 pm
Location: Masshole
Contact:

Re: Maths help?

Post by dyqik » Fri Dec 09, 2022 12:59 am

You can pretty easily do least squares fitting on arbitrary functions in Excel, using the Solver (far right of the data tab bar).

Here SumSq is the sum of the squares of the residuals. A, mu and sigma are the Gaussian parameters (height, mean, std dev) for the ERF integration.

You then ask the solver to minimize the sumsq of the residuals by varying A, mu and sigma. You can add constraints e.g. to keep A >= 8.

(I hadn't looked hard at shpalman's link above when I did this)
Excel least squares.png
Excel least squares.png (69.99 KiB) Viewed 3892 times
Last edited by dyqik on Fri Dec 09, 2022 1:10 am, edited 4 times in total.

User avatar
dyqik
Princess POW
Posts: 7526
Joined: Wed Sep 25, 2019 4:19 pm
Location: Masshole
Contact:

Re: Maths help?

Post by dyqik » Fri Dec 09, 2022 1:04 am

dyqik wrote:
Thu Dec 08, 2022 10:50 pm
My reply has gone missing, but you can use moments to estimate the parameters of a normal/gaussian distribution.

The mean is approximated by the sample mean, calculated with AVERAGE in Excel, and the std dev can estimated as the sample variance with PAR.V.
For the Pythoneers, there's a version of this in the gaussfitter package, which uses the moments to provide the starting point for a least squares fit. And if you are feeling brave, I have a Github repo that extends that package to arbitrary functions that are roughly gaussian shaped (I'm using 2d truncated Bessel functions to fit measured multimode Gaussian beams in my instruments). That restriction is so that moments can be used to give a reasonable starting point, but you can supply your own "moments" for non-Gaussian looking functions.

IvanV
Stummy Beige
Posts: 2660
Joined: Mon May 17, 2021 11:12 am

Re: Maths help?

Post by IvanV » Fri Dec 09, 2022 11:29 am

Formerly AvP wrote:
Thu Dec 08, 2022 8:02 pm
I love this forum. Thanks all!
Ivan, I know the bottom value will be (0,0) and the top value is (100,8). This is why I reckoned it was a Gaussian in the first place. Shpalman, that link to how to do it in Excel looks as if it would solve it, once I've down loaded Solver.
Gratitude again...

AvP
So you know that the cumulative function runs from 0 to 8, so you know that the mean is whatever number on the x axis has a value of 4 on the graph, according to the best estimate of that.

So actually you can fit it very simply as I show in this spreadsheet. I get a different answer from dyqik, similar std but very different mean. That is because I have used that piece of information above, and he hasn't.

First I normalised the y values to a 0 to 1 range by dividing by 8, the actual range. Then the transformation function I used is the inverse of the cumulative normal distribution curve with mean 0 and std dev 1. Then I have graphed that against the x values to show you what is happening. I found the intercept and slope of the best fit straight line. The functions SLOPE and INTERCEPT provide that.

The best fit line crosses the x axis at the fitted mean, and it crosses y=1 at the mean plus one fitted std.dev.

I have the analysis add-in, which does stats and regressions etc, added in to my excel routinely. I'm not quite sure if these functions are in bare excel, or if you need the add-in.

I think is easier than dyqik's method, and more reliable if you can rely on the piece of information that the range of the cumulative function is 8 and so when y=4, that is where the mean is. Btw, that is my entire spreadsheet, I've hidden nothing.

And btw, you can't use points where the normalised y value is 0 or 1, because. the NORM.INV function is undefined at those values.
Fitting Normal.jpg
Fitting Normal.jpg (88.16 KiB) Viewed 3856 times

User avatar
dyqik
Princess POW
Posts: 7526
Joined: Wed Sep 25, 2019 4:19 pm
Location: Masshole
Contact:

Re: Maths help?

Post by dyqik » Fri Dec 09, 2022 11:38 am

IvanV wrote:
Fri Dec 09, 2022 11:29 am
Formerly AvP wrote:
Thu Dec 08, 2022 8:02 pm
I love this forum. Thanks all!
Ivan, I know the bottom value will be (0,0) and the top value is (100,8). This is why I reckoned it was a Gaussian in the first place. Shpalman, that link to how to do it in Excel looks as if it would solve it, once I've down loaded Solver.
Gratitude again...

AvP
So you know that the cumulative function runs from 0 to 8, so you know that the mean is whatever number on the x axis has a value of 4 on the graph, according to the best estimate of that.

So actually you can fit it very simply as I show in this spreadsheet. I get a different answer from dyqik, similar std but very different mean. That is because I have used that piece of information above, and he hasn't.
Adding a constraint that A <= 8, or fixing A=8 in the solver in my spreadsheet does significantly change the mean and sigma, and along with explicitly adding (0,0) and (100, 8) to the data set, will achieve this.

The nice thing about doing it this way is that you've documented the priors in the sheet, and can relax or alter them as you see fit.

You can also weight the residuals before squaring and summing them.

User avatar
dyqik
Princess POW
Posts: 7526
Joined: Wed Sep 25, 2019 4:19 pm
Location: Masshole
Contact:

Re: Maths help?

Post by dyqik » Fri Dec 09, 2022 11:52 am

It's worth noting they the Error function doesn't go exactly to 0 or 1 at the bounds of the data, so it's never going to fit that well.

And I'm not sure that the mean of the cumulative function is 4 if the distribution is skewed, or significantly offset to one side of the window.

User avatar
shpalman
Princess POW
Posts: 8241
Joined: Mon Nov 11, 2019 12:53 pm
Location: One step beyond
Contact:

Re: Maths help?

Post by shpalman » Fri Dec 09, 2022 12:35 pm

Fitting the position and width of both the error function and the logistic function, in both cases constrained to go from 0 to 8.
AvP.dat.png
AvP.dat.png (30.26 KiB) Viewed 3842 times
Or without adding 0,0 and 100,8:
AvP_orig.dat.png
AvP_orig.dat.png (29.78 KiB) Viewed 3841 times

Code: Select all

import numpy as np
from scipy.special import erf

def erfavp(d,d0,w):
	return 8.0*0.5*(1+erf((d-d0)/(np.sqrt(2)*w)))

def logavp(d,d0,w):
    return 8.0/(1+np.exp(-((d-d0)/w)))
having that swing is a necessary but not sufficient condition for it meaning a thing
@shpalman@mastodon.me.uk

IvanV
Stummy Beige
Posts: 2660
Joined: Mon May 17, 2021 11:12 am

Re: Maths help?

Post by IvanV » Fri Dec 09, 2022 2:11 pm

dyqik wrote:
Fri Dec 09, 2022 11:52 am
And I'm not sure that the mean of the cumulative function is 4 if the distribution is skewed, or significantly offset to one side of the window.
Indeed not. The question is, what do we "know"? If we know something, then that makes it easy. If we just have the data points, then we can be can be making wrong assumptions that produce wrong answers. This is the risk of doing stats when all you have is the data.

It reminds me of the time I was doing a project in collaboration with a professor of engineering, concerned with how payments by mobile operators for mobile telephony frequencies in another country. We had some data about something or other, I forget what. The professor's assistant went, oooh, I can fit a quadratic to that data. Look, it fits really nicely. And I went, it is completely implausible that the underlying relationship in that data is quadratic. Whilst that might act as a practical interpolation method within the data range, it is completely implausible that it acts as any predictor of what it might be outside the data range. So, please do not mention or show that to the client.

User avatar
shpalman
Princess POW
Posts: 8241
Joined: Mon Nov 11, 2019 12:53 pm
Location: One step beyond
Contact:

Re: Maths help?

Post by shpalman » Fri Dec 09, 2022 2:22 pm

I originally set up that erf and logistic fitting script to deal with this.

It's somewhere on the second page where we realized that it was complete nonsense to fit a quadratic or even linear function to that data.
having that swing is a necessary but not sufficient condition for it meaning a thing
@shpalman@mastodon.me.uk

User avatar
dyqik
Princess POW
Posts: 7526
Joined: Wed Sep 25, 2019 4:19 pm
Location: Masshole
Contact:

Re: Maths help?

Post by dyqik » Fri Dec 09, 2022 3:00 pm

IvanV wrote:
Fri Dec 09, 2022 2:11 pm
dyqik wrote:
Fri Dec 09, 2022 11:52 am
And I'm not sure that the mean of the cumulative function is 4 if the distribution is skewed, or significantly offset to one side of the window.
Indeed not. The question is, what do we "know"? If we know something, then that makes it easy. If we just have the data points, then we can be can be making wrong assumptions that produce wrong answers. This is the risk of doing stats when all you have is the data.
Here we don't have any data below the likely mean value of the cumulative function for a symmetric underlying distribution, so we can't make assumptions about symmetry or skew, of the underlying distribution. There are lots of options with a low enough parameter count to not be overfitting that can fit this data pretty well.

ControlFreak
Ghastly Pink
Posts: 4
Joined: Sun Dec 01, 2019 3:31 pm

Re: Maths help?

Post by ControlFreak » Sat Dec 10, 2022 9:04 am

I don't usually try and plot this sort of thing on a "linear" scaled plot. If I thought the data was normal I would usually create a normal plot (a cumulative distribution plot with warped axes to make normal data a straight line) and then run a significance test from that. Fortunately at work I have access to minitab that has this as a built in function. The chart plots actual vs predicted points on the cumulative distribution based on the data estimates for mean and standard deviation, so if its normal you get a straight line. You can usually "eye-ball" normality but also run significance tests if you need to quote a p-value.

The plot is fairly simple to create manually in a spreadsheet using the rank function and an in-built normal function. My go-to place for this sort of stuff is the NIST statistics guide.

Normal plots are here https://www.itl.nist.gov/div898/handboo ... rmprpl.htm

IvanV
Stummy Beige
Posts: 2660
Joined: Mon May 17, 2021 11:12 am

Re: Maths help?

Post by IvanV » Sat Dec 10, 2022 9:31 am

ControlFreak wrote:
Sat Dec 10, 2022 9:04 am
I don't usually try and plot this sort of thing on a "linear" scaled plot. If I thought the data was normal I would usually create a normal plot (a cumulative distribution plot with warped axes to make normal data a straight line) and then run a significance test from that.
Mathematically, those amount to the same thing. And you will need to estimate the mean first to locate the plot in the right place on the warped axes.

Post Reply