# Calculate Linear Regression and Graph Scatter Plot and Line of Best Fit

### What is simple linear regression?

Simple linear regression describes the relationship between two variables with the equation of a straight line, called the line of best fit, that most closely models that relationship.

A common form of a linear equation in the two variables x and y is
y=mx+b
where m and b are constants. The name "linear" comes from the fact that the set of solutions of such an equation forms a straight line in the plane. In this equation, the constant m determines the slope or gradient of that line, and the constant term b determines the point at which the line crosses the y-axis, otherwise known as the y-intercept.

Given any set of n data points in the form (x_i, y_i), the method of least squares chooses the line that minimizes the sum of squared errors between the data and the line. The line of best fit is obtained when
m\sum_{i=1}^{n} x_i+bn=\sum_{i=1}^{n} y_i
m\sum_{i=1}^{n} x_i^2+b\sum_{i=1}^{n} x_i=\sum_{i=1}^{n} x_iy_i
where the summations are taken over the entire data set. These equations are known as the normal equations.
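
As a brief sketch of where the normal equations come from, write the sum of squared errors as a function S(m, b) (the symbol S is introduced here only for illustration) and set both partial derivatives to zero:

```latex
S(m, b) = \sum_{i=1}^{n} (y_i - m x_i - b)^2

\frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} (y_i - m x_i - b) = 0
\;\Rightarrow\; m\sum_{i=1}^{n} x_i + bn = \sum_{i=1}^{n} y_i

\frac{\partial S}{\partial m} = -2\sum_{i=1}^{n} x_i (y_i - m x_i - b) = 0
\;\Rightarrow\; m\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i
```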

From the normal equations above, we can find the expressions for the unknown coefficients m and b.
For a line described by y = mx + b, the linear regression formulae for the slope m and intercept b are given by
slope m,
m = \frac{n\sum_{i=1}^{n} x_iy_i-\sum_{i=1}^n x_i\sum_{i=1}^n y_i}{n\sum_{i=1}^{n} x_i^2-(\sum_{i=1}^n x_i)^2}

intercept b,
b = \frac{\sum_{i=1}^{n} y_i - m\sum_{i=1}^{n} x_i}{n}
where the summations are again taken over the entire data set.
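
The slope and intercept formulas translate directly into a few running sums. The following is a minimal Python sketch, not the code behind this page's tool; the function name `fit_line` and the sample data are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # The four sums that appear in the slope and intercept formulas.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Shared denominator: n * sum(x^2) - (sum(x))^2.
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - m * sum_x) / n
    return m, b


# Illustrative data lying exactly on y = 2x + 1.
m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)  # prints 2.0 1.0
```

Note that the intercept is computed from the slope, so m must be evaluated first.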

Line of best fit (trend line): a line drawn on a scatter plot close to the data points to show the trend between two sets of data more clearly.

• A line of best fit that rises from left to right indicates a positive correlation.
• A line of best fit that falls from left to right indicates a negative correlation.
• Strong positive and negative correlations have data points very close to the line of best fit.
• Weak positive and negative correlations have data points that are not clustered near or on the line of best fit.
• Data points that lie far from the line of best fit compared with the rest of the data are called outliers.
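
The direction and strength of a correlation are commonly quantified with the Pearson correlation coefficient r, which ranges from -1 to +1. This page does not define r, so the sketch below is an assumption about how one might measure what the list above describes; the function name `pearson_r` and the data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if den == 0:
        raise ValueError("r is undefined when x or y has no spread")
    return num / den


# Same x values, y values in rising and then falling order.
rising = pearson_r([1, 2, 3, 4], [2, 4, 5, 8])
falling = pearson_r([1, 2, 3, 4], [8, 5, 4, 2])
print(rising, falling)  # positive for the rising data, negative for the falling data
```

Values of r near +1 or -1 correspond to points lying close to the line of best fit; values near 0 correspond to a weak linear relationship.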

### Online Tool to Calculate Linear Regression and Graph Scatter Plot and Line of Best Fit

Regression Data
Columns: x, y, x \times y, x^2
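
The regression data table lists, for each data point, the values x, y, the product x × y, and x², and the column totals are the sums used in the slope and intercept formulas. A minimal sketch of building such a table follows; the helper name `regression_table` and the three sample points are hypothetical, not taken from the tool.

```python
def regression_table(points):
    """Return (rows, totals), where each row is (x, y, x*y, x^2)."""
    rows = [(x, y, x * y, x * x) for x, y in points]
    # Column-wise sums: (sum x, sum y, sum x*y, sum x^2).
    totals = tuple(sum(column) for column in zip(*rows))
    return rows, totals


rows, totals = regression_table([(1, 2), (2, 3), (3, 5)])
for row in rows:
    print(row)
print("sums:", totals)  # prints sums: (6, 10, 23, 14)
```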

Regression Calculations
Sample mean for X,

\bar{X} = \frac{ \sum_{i=1}^{n} X_i }{n}

\bar{X} = [XMEAN]

Sample mean for Y,

\bar{Y} = \frac{ \sum_{i=1}^{n} Y_i }{n}

\bar{Y} = [YMEAN]

slope m,

m = \frac{n\sum_{i=1}^{n} x_iy_i-\sum_{i=1}^n x_i\sum_{i=1}^n y_i}{n\sum_{i=1}^{n} x_i^2-(\sum_{i=1}^n x_i)^2}

m = \frac{[NCOUNT]\times[SUMXY] - [SUMX]\times[SUMY]}{[NCOUNT]\times[SUMXSQ] - ([SUMX])^2}

m = [SLOPE]

intercept b,

b = \frac{\sum_{i=1}^{n} y_i - m\sum_{i=1}^{n} x_i}{n}

b = \frac{[SUMY] - [SLOPE]\times[SUMX]}{[NCOUNT]}

b = [INTERCEPT]

Regression / line of best fit linear equation,

y = m \times x + b

y = [SLOPE] \times x + ([INTERCEPT])
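
As an illustration of these steps with concrete numbers, take the three assumed data points (1, 2), (2, 3), (3, 5), which are not from the tool. Then n = 3, \sum x_i = 6, \sum y_i = 10, \sum x_iy_i = 23 and \sum x_i^2 = 14, and the calculation would run as follows:

```latex
\bar{X} = \frac{6}{3} = 2, \qquad \bar{Y} = \frac{10}{3} \approx 3.333

m = \frac{3\times23 - 6\times10}{3\times14 - (6)^2} = \frac{69 - 60}{42 - 36} = \frac{9}{6} = 1.5

b = \frac{10 - 1.5\times6}{3} = \frac{1}{3} \approx 0.333

y = 1.5 \times x + 0.333
```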

Scatter Plot and Line of Best Fit

### The Standard Deviation

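It is also helpful to have a measure of the average uncertainty of the measurements, which is given by the standard deviation.
The deviation of the measurement x_i from the mean is d_i = x_i - \bar{x}, and the sample standard deviation is computed from these deviations with n - 1 in the denominator:

\sigma_x = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n-1}}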
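
A short Python sketch of this calculation is given below, assuming the sample form of the standard deviation with an n - 1 divisor; the function name `sample_std` and the data are illustrative.

```python
import math


def sample_std(values):
    """Return the sample standard deviation (n - 1 divisor)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(values) / n
    # Deviation of each measurement from the mean: d_i = x_i - mean.
    deviations = [v - mean for v in values]
    return math.sqrt(sum(d * d for d in deviations) / (n - 1))


sd = sample_std([2, 4, 4, 4, 5, 5, 7, 9])
print(sd)
```

For the data shown, the mean is 5 and the squared deviations sum to 32, so the result is the square root of 32/7.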

### Uncertainties in the Slope and Intercept

The slope and the intercept are computed from data values that have uncertainties associated with them. These uncertainties can be propagated through the calculations for the slope and intercept by the standard methods of differential error analysis.
The standard error in the slope \sigma_m:

\sigma_m^2 = \frac{n\sigma_y^2}{n\sum_{i=1}^{n} x_i^2-(\sum_{i=1}^n x_i)^2}

The standard error in the intercept \sigma_b:

\sigma_b^2 = \frac{\sigma_y^2\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} x_i^2-(\sum_{i=1}^n x_i)^2}

where \sigma_y is the standard deviation of the y_i about the fitted line, given by

\sigma_i^2 = \sigma_y^2 = \frac{\sum_{i=1}^{n} (y_i-b-mx_i)^2}{n-2}
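
The three formulas above can be evaluated together once the line has been fitted. The following Python sketch is one way to do this under the same assumptions (equal uncertainty on every y_i, n - 2 degrees of freedom); the function name `fit_with_uncertainties`, the dictionary keys, and the sample data are hypothetical.

```python
import math


def fit_with_uncertainties(xs, ys):
    """Fit y = m*x + b and return the fit with its standard errors."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - m * sum_x) / n

    # Variance of y about the fitted line, with n - 2 degrees of freedom.
    residual_sq = sum((y - b - m * x) ** 2 for x, y in zip(xs, ys))
    var_y = residual_sq / (n - 2)

    var_m = n * var_y / denom        # sigma_m^2
    var_b = var_y * sum_xx / denom   # sigma_b^2

    return {
        "m": m,
        "b": b,
        "sigma_y": math.sqrt(var_y),
        "sigma_m": math.sqrt(var_m),
        "sigma_b": math.sqrt(var_b),
    }


result = fit_with_uncertainties([1, 2, 3, 4, 5], [2.2, 4.1, 6.2, 7.9, 10.1])
for name, value in result.items():
    print(name, value)
```

At least three points are required because the n - 2 divisor is zero for two points, where the line passes through both exactly.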

Uncertainty Calculations
The standard deviation of y_i about the fitted line:

\sigma_y^2 = \frac{\sum_{i=1}^{n} (y_i-b-mx_i)^2}{n-2}

\sigma_y^2 = \frac{[YBMX]}{[NCOUNT]-2}

\sigma_y^2 = [SDYSQ]

\sigma_y = \pm[SDY]

The standard error in the slope \sigma_m:

\sigma_m^2 = \frac{n\sigma_y^2}{n\sum_{i=1}^{n} x_i^2-(\sum_{i=1}^n x_i)^2}

\sigma_m^2 = \frac{[NCOUNT]\times[SDYSQ]}{[NCOUNT]\times[SUMXSQ] - ([SUMX])^2}

\sigma_m^2 = \frac{[NMSQ]}{[UCD]}

\sigma_m^2 = [SDMSQ]

\sigma_m = \pm[SDM]

The standard error in the intercept \sigma_b:

\sigma_b^2 = \frac{\sigma_y^2\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} x_i^2-(\sum_{i=1}^n x_i)^2}

\sigma_b^2 = \frac{[SDYSQ]\times[SUMXSQ]}{[NCOUNT]\times[SUMXSQ] - ([SUMX])^2}

\sigma_b^2 = \frac{[NBSQ]}{[UCD]}

\sigma_b^2 = [SDBSQ]

\sigma_b = \pm[SDB]