Monday 14 February 2022

Regression analysis

 

first, what is regression analysis

regression analysis is a statistical method that analyzes "eye-catching variables" and "dependent variables" and clarifies the relationship between the two. at this point, we refer to the factor variable as the "description variable" and the attention variable as the "target variable address (the description variable)". after clarifying the purpose of regression analysis, let's explain what regression analysis is by taking the steps of regression analysis prediction method:



regression analysis is the mathematical statistical analysis of influencing factors (independent variables) and predictors (dependent variables) with causal relationships. 


 

The established regression equation makes sense only if there is indeed some relationship between the variable and the dependent variable. therefore, whether the factors as independent variables are related to the predictor object as the dependent variable, how relevant they are, and how sure they are to judge this degree of correlation, become problems that must be solved for regression analysis. correlation analysis generally requires correlation, and the magnitude of the correlation coefficient is used to determine the degree of correlation between the independent variable and the dependent variable.  

second, the purpose of regression analysis
the purpose of regression analysis can be broadly divided into two types:

first, "forecasting". predict the target variable, solve for the target variable y and the explanatory variable (x1, x2,...) equations.

y=a0+b1x1+b2x2+...+bkxk+error (Equation A)

Call Equation A the (multiple) regression equation or the (multiple) regression model. a0 is the y-intercept, b1, b2,..., bk are the regression coefficients. When k=l, there is only 1 descriptive variable, called the unary regression equation. The minimum error sum of squares is solved according to the least flat method, and the y-intercept and regression coefficients are not found. If the regression equation is solved. Substitute x1, x2,... The value of xk, the value of the predicted y.

second, "factor analysis". factor analysis is based on the results of regression analysis, the influence of each independent variable on the target variable is derived, so the degree of influence of each independent variable needs to be determined.

I HOPE THAT BEGINNERS WILL FIRST LEARN UNARY REGRESSION ANALYSIS, CORRELATION ANALYSIS, MULTIPLE REGRESSION ANALYSIS, QUANTITATIVE THEORY I AND OTHER KNOWLEDGE BEFORE READING THE NEXT ARTICLE.

According to the least flat method, use Excel to solve for a and b in y=a+bx. So what is the minimum flat method?

from each data marker point of the scatter plot, make a parallel line parallel to the y-axis, and intersect the straight line in the figure (as shown below)



the length of parallel lines is statistically called "error" or 'residuals'. error (residual) is the difference between the arithmetic value and the actual value of the analysis result. take this and find the curve squared value of the length of the parallel line. you can think of the square value as the square area with an edge length equal to the length of a parallel line (as shown below)



finally, solve for the sum of the areas of all the squares. determine the values that minimize the sum of the areas a (intercept) and b (regression coefficients) (as shown below).



Solve regression equations using Excel; Tools → Data Analysis → Regression, which will be explained in a later article.

the steps of linear regression are the same whether univariate or multivariate, as follows:

1. scatter plot to determine the relationship between variables (simple linear);
2. correlation coefficient and linearity verification;
3. find the regression coefficient and establish the regression equation;
4. regression equation test;
5. interval estimation of parameters;
6. forecasting;
 

univariate linear regression operations and interpretation

summary


univariate linear regression can be said to be a very simple knowledge point in data analysis, and people with a little experience in statistics, analysis, and modeling know the meaning of this analysis, and will also use various tools to do this analysis. here i want to explain the details behind this analysis, that is, the mathematical principles behind it.

what is univariate linear regression

Regression Analysis is a statistical analysis method that determines the quantitative relationship between two or more variables that are interdependent. In regression analysis, which includes only one independent variable and one dependent variable, and the relationship between the two can be approximated by a straight line, this regression analysis is called univariate linear regression analysis. Let's take an example:

let's say there is a company with monthly advertising costs and sales, as shown in the following table:


case data

if we draw advertising costs and sales in two-dimensional coordinates, we can get a scatter plot, and if we want to explore the relationship between advertising costs and sales, we can use univariate linear regression to make a fitted straight line:


fit the line
how this line was drawn
For univariate linear regression, it can be seen that the value of Y is with the value of X changes, every actual X will have an actual Y value, we call it Y actual, then we are asking for a straight line, every actual X will have a straight line prediction of the Y value, we call it Y prediction, the regression line makes the difference between the actual value of each Y and the predicted value the sum of squares is the smallest, that is, (Y1 real -Y1 prediction) ^ 2 + (Y2 real -Y2 prediction) ^ 2 + + (Yn real -Yn prediction) ^ ^ The sum of 2 and the minimum (this sum is called SSE, which will be discussed later).

now let's actually find this line:


 
We all know that lines can be represented as Y = aX + b in coordinate systems, so (Y real-Y prediction) can be written (Y real - (aX real + b)), so the sum of squares can be written as a function of a and b. Simply ask for the values of a and b, the smallest of Q, and the regression line will be solved.

simply interpolate how to find the minimum value of the function:

FIRST, THE DERIVATIVE OF THE MINIMUM POINT OF THE UNARY FUNCTION IS ZERO, FOR EXAMPLE, Y = X^2, THE DERIVATIVE OF X^2 IS 2X, SO THAT 2X = 0, WHEN X = 0, Y TAKES THE MINIMUM VALUE.


then essentially binary functions can also be analogous. imagine the binary function image as a surface, the minimum as a depression, then at the bottom of this depression, from any direction, the partial derivative is 0.

Thus, for functions Q, the partial derivatives for a and b, respectively, and then the partial derivative equal to 0, yields a system of binary equations about a and b, and a and b can be solved. This method is called least squares. The following is the specific mathematical calculus process, do not want to see can directly look at the conclusions behind.

let's expand the formula first:


Q FUNCTION EXPRESSION EXPANDS
then use the average to further simplify the content in each parenthesis in the above equation. for example


AVERAGE OF Y^2
rule:


the upper formula is × n on both sides
 

thereupon


Q FINAL SIMPLIFIES THE RESULT

Then find the partial derivative of a and the partial derivative of b for Q, respectively, so that the partial derivative is equal to 0.


Q takes the partial derivative for a and b, respectively, so that the partial derivative is 0
further simplification, 2n can be eliminated, and finally the binary equation system about a, b is obtained


about the binary equation system a, b
finally, the solution formulas for a and b are obtained:


the least squares method finds the slope a and slope b of the line
with this formula, for the example of advertising cost and sales, we can calculate what the fitted line is, find out the various averages in the formula, and then bring in, and finally calculate a = 1.98, b = 2.25

the final regression fitting line is y = 1.98x + 2.25, using the regression line can make some predictions, such as if the advertising cost of 20,000, then the expected sales of 62,000


evaluate the degree of fit of the regression line
The fitted line we draw is only an approximation, because surely many points have not landed on the line, so how well does our line fit? In statistics there is a term called R^2 (coefficient ofdetermination, Chinese called coefficient of determination, good fit, coefficient of determination, the system can not be superscripted, here R^2 is the square of "R"), used to determine the degree of fit of the regression equation.

first of all, let's clarify the following concepts:

Sum of Squares Total Deviations (also known as Sum of Squares Total): Is the sum of squares of the difference between the actual value of each dependent variable (all Y at a given point) and the mean value of the dependent variable (the average of all Y at a given point), that is, reflecting the overall fluctuation of the value of the dependent variable. as follows:


SST FORMULA

Regression Sum of Squares (SSR), Sum of Squares forRegression: The sum of squares of the difference between the regression value of the dependent variable (the Y value on a straight line) and its mean (the average Y value at a given point), i.e., it is the change in y caused by the change in the independent variable x, reflecting the change in the total deviation of y due to the linear relationship between x and y, which can be explained by the regression line.


SSR FORMULA

 

Sum of Squares of Residuals (also known as Sum of Squares Error): The sum of squares of the difference between the actual observed values of the dependent variable (the Y value of a given point) and the regression value (the Y value on the regression line), which is the effect of other factors on the y change other than the linear influence of x on y, and cannot be explained by the regression line.

these concepts are still a bit obscure, and i personally understand them like this:

take the example of advertising costs and sales, in fact, advertising costs are only one of the more important factors affecting sales, there may be many factors that are difficult to say such as economic level, product quality, customer service level, etc. affect the final sales, then the actual sales are the final result of the interaction of many factors, because sales are fluctuating, so use the sum of the squares of the difference between the sales of each month and the average sales mentioned above (that is, the sum of total squares) to represent the overall fluctuations.

the regression line only indicates the impact of the change of a variable in advertising cost on total sales, so it will inevitably cause deviations, so there will be a difference between the actual value and the regression value, so the regression line can only explain part of the impact

then the difference between the actual value and the regression value, which is the result of the combination of countless other factors in addition to advertising costs, cannot be explained by the regression line.

THUS SST (TOTAL DEVIATION) = SSR (DEVIATION THAT THE REGRESSION LINE CAN EXPLAIN) + SSE (DEVIATION THAT THE REGRESSION LINE CANNOT EXPLAIN)


the degree of fit of the drawn regression line is actually to see how well the line (and this linear relationship between x and y) can reflect (or explain) the change in the y value


r^2=ssr/sst or r^2=1-sse/sst, r^2 takes a value between 0,1, the closer to 1 indicates the better the degree of fit


if all the points are on the regression line, indicating that the sse is 0, then r^2 =1 means that 100% of the change in y is caused by the change in x, and no other factor affects y, and the regression line can fully explain the change in y. if r^2 is low, there may not be a linear relationship between x and y


going back to the original example of advertising costs and sales, the r^2 of this regression line is 0.73, indicating that the fitting degree is ok.


fourth, the difference between the correlation coefficient r and the decision coefficient r^2


the coefficient of determination r^2 determines the degree of fit of the regression equation, indicating to what extent the fitted line can reflect the fluctuation of y.


there is a similar concept in statistics called the correlation coefficient r (this has no square, the scientific name is pearson correlation coefficient, because this is not the only correlation coefficient, but the most common and commonly used one), which is used to indicate the degree of linear correlation between x and y as two random variables, with a value range of [-1,1].


When R = 1, it means that X and Y are completely positively correlated, that is, all sample points (x, y) can be strung together with a straight line, and the slope is positive.

WHEN R= -1, IT MEANS COMPLETELY NEGATIVE CORRELATION, AND ALL POINTS CAN BE STRUNG TOGETHER WITH A STRAIGHT LINE WITH A NEGATIVE SLOPE.

IF R = 0, IT MEANS THAT X AND Y HAVE NO LINEAR RELATIONSHIP, NOTE THAT THERE IS NO LINEAR RELATIONSHIP, AND THERE MAY BE OTHER RELATIONSHIPS.

as with the symbolic representations of these two concepts, it is mathematically possible to prove that the squared of the correlation coefficient r is the coefficient of determination.


significance test for variables


The purpose of the significance test for variables is to exclude insignificant explanatory variables (i.e., X) from the regression coefficients to make the model more concise. In a univariate linear model, we have only one independent variable, X, which is to determine whether X has a significant effect on Y; in multiple linear regression, to verify whether each Xi itself really has a significant effect on Y, if it is not significant, it should be removed from the model.

The idea of a significance test for variables: the idea of hypothesis testing in pure mathematical statistics is used. 

 

Make a hypothesis about the actual value of the Xi parameter, and then, in the case that this hypothesis holds, use the known sample information to construct a statistic that conforms to a certain distribution (such as normal distribution, T distribution, and F distribution), and then theoretically calculate the probability of this statistic, if the probability is very low (less than 5%), according to the statistical basic principle of "small probability events cannot happen in one experiment", now it actually happens! (Because our statistic is calculated from known samples, which are an experiment) there must be a problem with the initial hypothesis, so we can reject the initial hypothesis, and if the probability is not low, then the hypothesis is not a problem.



in fact, it involves the content of mathematical statistics, it is really difficult to say clearly, i will give an inappropriate example: for example, there is a pocket containing a total of 20 balls of black and white colors, and then you want to know whether the number of black and white balls is consistent, then if the idea of using hypothesis testing is to do this: first assume that the number of black and white is the same, and then randomly draw 10 balls, but find that 10 are white, if the initial assumption that the number of black and white is the same, then the probability of drawing 10 white is very small. but such a small probability of something actually happened, so we have reason to believe that the assumption is wrong, and the amount of black and white should be different...

in short, for all regression model software, the final result will have a parameter significance test, ignoring the difficult mathematics, we only need to understand the following conclusions:

The T test is used for the linear significance of an independent variable Xi to Y, if a certain Xi is not significant, it means that this variable can be excluded from the model, making the model more concise.

THE F TEST IS USED FOR LINEAR SIGNIFICANCE FOR Y FOR ALL INDEPENDENT VARIABLES X AS A WHOLE

The results of the T test look at the P-value, the F test looks at the Significant F value, generally less than 0.05, the smaller the more significant (this 0.05 is actually the significance level, is artificially set, if more stringent, can be set to 0.01, but also bring some other problems, not to elaborate)

The following figure is the result of regression analysis using excel's example of advertising costs and sales (EXCEL is really a very powerful tool, and it can become a supernatural god with the same magic), and it can be seen that the F test is significant (Significance F is 0.0017), and the T test for variable X is significant (P-value is 0.0017), which is exactly the same and understandable, because we are a univariate regression, only one independent variable X.


Do linear regression analysis with Excel


Another point is that the T test of intercept (that is, the b in Y = aX + b) does not pass, it is not significant, in general, as long as the F test and the T test of the key variables pass, the predictive power of the model is OK.

finally recommend a good statistics course
khan academy open class: statistics program details - netease cloud classroom

this course is an introductory course in statistics and will cover all the main knowledge of statistics, including: random variables, mean variance standard deviation, statistical charts, probability density, binomial distribution, poisson distribution, normal distribution, law of large numbers, central limit theorem, sample and sampling distribution, parameter estimation, confidence intervals, bernoulli distribution, hypothesis testing and p-value, analysis of variance, regression analysis, etc.

a total of more than 80 videos, a total of more than 800 minutes long, can only say that this is the best statistics tutorial i have ever seen, there is no one, if the undergraduate can see this level of tutorial, i will not rely on the note on the exam room to pass this course.

in fact, if you understand a little real statistics, there will be more perspectives in life, which is still quite interesting.

uni-variate linear regression is here, thank you!

transferred from: http://www.jianshu.com/p/fcd220697182#

 

 

Multiple regression analysis using Excel data analysis tools
 

the data analysis tool gallery appears in the data toolbar, as shown in the following figure:

 

 Multiple regression analysis using Excel data analysis tools

 

Given the raw data, the value of the argument is in the cell range A2:I21 and the value of the dependent variable is in J2:J21, as shown in the following figure:

 

 Multiple regression analysis using Excel data analysis tools


 

suppose the regression estimate expression is:

 

 Multiple regression analysis using Excel data analysis tools

 

Try using the Regression Analysis tool in the Excel Data Analysis Tool Gallery to estimate its regression coefficients and perform regression analysis:

 

click the data analysis tool gallery in the data toolbar, as shown in the following figure:

 

 Multiple regression analysis using Excel data analysis tools

 

select "regression" in the pop-up "data analysis" - "analysis tools" multi-line text box, and then click "ok", as shown in the following figure:

 

 Multiple regression analysis using Excel data analysis tools

 

the "regression" dialog box pops up and makes the selection as shown below:

 

 Multiple regression analysis using Excel data analysis tools


 

the specific methods of the above choices are:

 

In the "Y Value Input Area", click the collapse button on the right side, select the cell range J2:J21 where the function Y data is located, and then click the collapse button to return after the selection is completed; this process can also directly enter J2:J21 in the "Y Value Input Area" text box;

 

In the "X Value Input Area", click the collapse button on the right, select the cell range A2:I21 where the argument data is located, and then click the collapse button to return after selecting it; this procedure can also be directly entered in the "X Value Input Area" text box

 

the confidence level can be selected as the default of 95%.

 

Select New Worksheet in the Output area to output the results of the statistical analysis to the new table. For comparison, I selected the blank area in this table, the starting cell in the upper left corner is K10. After clicking OK, the output result is as follows:

 

Multiple regression analysis using Excel data analysis tools
 

 

The first table is the Regression Statistics Table (K12:L17):

 

there into:


 

Multiple R: (Complex correlation coefficient R) The square root of R2, also known as the correlation coefficient, is used to measure the degree of correlation between the independent variables x and y. In this example, R=0.9134 indicates that the relationship between them is highly positively correlated. (Multiple: Composite, Multiple).

 

R Square: The complex determination coefficient, the square of the above complex correlation coefficient R. It is used to illustrate the degree to which the independent variable explains the difference in the dependent variable y, in order to determine the fitting effect of the dependent variable y. The compound measurement coefficient in this case is 0.8343, indicating that 83.43% of the variation of the dependent variable can be explained by using independent variables

 

Adjusted R Square: The adjusted complex coefficient R2, which is 0.6852, indicates that the independent variable can explain 68.52% of the dependent variable y, and 31.48% of the dependent variable y is explained by other factors. (Adjusted: Adjusted).

 

standard error: used to measure the degree of fit, and also used to calculate other statistics related to regression, the smaller this value, the better the fit

 

observations: 

the number of observations used to estimate the data for the regression equation.

 

The second table is the "Analysis of Variance Table": the main role is to determine the regression effect of the regression model through the F test.

 

The P-value of Signalance F (F significance statistic) in this case is 0.00636, which is less than the significance level of 0.05, so the regression equation has a significant regression effect, and at least one of the regression coefficients in the equation is significantly not 0. (Significance: significant).

 

the third table is the "regression parameters table":

 

K26:K35 is the default indication for the ordering of constant terms and b1 to b9



L26: L35 is the constant term and the value of b1 to b9, from which the estimated regression equation is:


Multiple regression analysis using Excel data analysis tools



Important in this table is the O column, where the P-value in O26:O35 is the P value of the regression coefficient t statistic.

No comments:

Post a Comment