Motivation
To produce better results than the EM algorithm and to avoid getting trapped in local optima.
Line Score
To describe how well a set of points fits a line, I developed a score function for the line model. A higher score indicates that the points lie closer to a straight line; a lower score indicates that they are less likely to form a line. The score can be raised either by adding a point or by removing one.
R-square is the proportion of variation explained by the regression model. It indicates how well the fitted line matches the data; in general, a higher value means a better fit. For a simple linear regression, R-square is the square of the Pearson correlation.
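As a minimal sketch of this relationship, the snippet below computes R-square from the residuals of a least-squares line and compares it with the squared Pearson correlation. The function name `r_squared` and the sample data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def r_squared(x, y):
    """R-square of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)      # degree-1 least-squares fit
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)             # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)        # total variation (zero if y is constant)
    return 1.0 - ss_res / ss_tot

# Made-up sample data lying close to y = 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])

r2 = r_squared(x, y)
pearson_r = np.corrcoef(x, y)[0, 1]
print(r2, pearson_r ** 2)                       # the two values agree up to rounding
```

Note that this form of R-square is undefined when all y values are equal, so a perfectly horizontal set of points would need separate handling.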
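Leverage: measures the impact a point has on the fitted line.
Studentized residual: the residual divided by an estimate of its standard deviation.
Jackknife residual: the studentized residual computed with the error variance estimated from the fit that leaves the point out. For a simple linear regression, the jackknife residual follows a t distribution with (n-3) degrees of freedom.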
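The three diagnostics can be sketched with the standard textbook formulas for a simple linear regression, written out in the comments below. This is an illustration under that assumption, not necessarily the exact form used on the original slides; the function name `regression_diagnostics` and the sample data are hypothetical.

```python
import numpy as np

def regression_diagnostics(x, y):
    """Leverage, studentized and jackknife residuals for y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    p = 2                                        # fitted parameters: intercept and slope
    slope, intercept = np.polyfit(x, y, 1)
    e = y - (intercept + slope * x)              # raw residuals

    # Leverage: h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2
    dx = x - x.mean()
    h = 1.0 / n + dx ** 2 / np.sum(dx ** 2)

    # Studentized residual: r_i = e_i / (s * sqrt(1 - h_i)), with s^2 = SSE / (n - p)
    sse = np.sum(e ** 2)
    s = np.sqrt(sse / (n - p))
    student = e / (s * np.sqrt(1.0 - h))

    # Jackknife residual: same form, but s is replaced by s_(-i), the error
    # estimate without point i: s_(-i)^2 = (SSE - e_i^2 / (1 - h_i)) / (n - p - 1)
    s_del = np.sqrt((sse - e ** 2 / (1.0 - h)) / (n - p - 1))
    jackknife = e / (s_del * np.sqrt(1.0 - h))
    return h, student, jackknife

# Made-up data near y = 2x + 1, with one deliberate outlier at index 4.
x = np.arange(8, dtype=float)
y = np.array([1.0, 3.1, 4.9, 7.2, 14.0, 11.1, 12.9, 15.0])
h, student, jackknife = regression_diagnostics(x, y)
suspect = int(np.argmax(np.abs(jackknife)))      # index of the most extreme point
```

A point could then be flagged as an outlier when the absolute value of its jackknife residual exceeds a chosen quantile of the t distribution with (n-3) degrees of freedom; the cutoff itself is a design choice the slides do not specify.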
Line Score
The Line Score function combines two factors: R-square and the proportion of the input points that lie on the line, n/N. N is the total number of points in the input, and n is the number of points in the current line.
The Algorithm
1. Divide the area into a finite number of stripes.
2. Calculate the line score for each stripe.
3. Pick the stripe with the highest score and filter out its outliers.
4. Recalculate the stripe area so that the fitted line runs through its middle.
5. Recalculate the line score using the points inside the new stripe.
6. If the new line score is higher, continue to the next step; otherwise go back to step 3 and pick the stripe with the next-highest score.
7. Recalculate the stripe with the newly fitted line in the middle and go to step 5. Repeat until no more points are added to the stripe.
8. Remove the points in the final stripe from the input and repeat from step 1.
9. Finalize the results (noise detection, etc.).
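The re-centering loop in the steps above (fit a line, rebuild the stripe around it, refit) can be sketched as follows. This is only an illustration of that one loop: it leaves out the scoring comparison and the outlier filtering, stops when the set of points inside the stripe no longer changes, and assumes a fixed stripe half-width measured perpendicular to the line. The names `refine_stripe`, `half_width`, and `max_iter` are hypothetical.

```python
import numpy as np

def refine_stripe(points, slope, intercept, half_width, max_iter=100):
    """Re-center a stripe on its own fitted line until its point set is stable.

    points           : (N, 2) array of x, y coordinates
    slope, intercept : initial center line y = intercept + slope * x
    half_width       : assumed stripe half-width (perpendicular distance)
    Vertical lines are not handled by this slope/intercept form.
    """
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    inside = np.zeros(len(points), dtype=bool)
    for _ in range(max_iter):
        # Perpendicular distance from every point to the current center line.
        dist = np.abs(slope * x - y + intercept) / np.hypot(slope, 1.0)
        new_inside = dist <= half_width
        converged = np.array_equal(new_inside, inside)
        inside = new_inside
        if converged or inside.sum() < 2:        # stable, or too few points to fit
            break
        # Refit the line using only the points currently inside the stripe.
        slope, intercept = np.polyfit(x[inside], y[inside], 1)
    return slope, intercept, inside

# Made-up input: ten points on y = 2x + 1 followed by three noise points.
xs = np.arange(10, dtype=float)
line_pts = np.column_stack([xs, 2.0 * xs + 1.0])
noise_pts = np.array([[2.0, 9.0], [7.0, 3.0], [5.0, 20.0]])
pts = np.vstack([line_pts, noise_pts])

# Start from a deliberately rough guess for the center line.
slope, intercept, inside = refine_stripe(pts, slope=1.5, intercept=2.0, half_width=1.0)
```

Starting from the rough guess, the first stripe captures only part of the line; each refit pulls the stripe closer to the true line until it contains all ten line points and none of the noise points.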
Future Improvement
- A better scoring function: are there more factors to consider in this function (PRESS statistic, Cp statistic, p-value, CVSS, etc.)?
- Adjusted R-square vs. R-square vs. correlation.
Conclusion
- The algorithm works in some cases.
- It does not require initialization.
- It works best when the line is perfectly straight.
- It can detect noise.
- It will not work in every case, since it is probability-based.