Jekyll2019-03-15T18:04:00+00:00http://www.pwyqspace.com/atom.xmlPWYQ SpaceA Space with ThoughtsYanqing WuStudy Note of Machine Learning (IV)2018-03-13T21:42:00+00:002018-03-13T21:42:00+00:00http://www.pwyqspace.com/study/2018/03/13/stanford-ml-note-four<h1 id="chap-6---advice-for-applying-machine-learning">Chap 6 - Advice for Applying Machine Learning</h1> <h2 id="evaluating-a-hypothesis">Evaluating a Hypothesis</h2> <p>A hypothesis may have low error for the training examples but still be inaccurate (because of overfitting).</p> <p>With a given dataset of training examples, we can split up the data into two sets: a training set and a test set.</p> <p>The new procedure using these two sets is then:</p> <ol> <li>Learn <script type="math/tex">\Theta</script> and minimize <script type="math/tex">J_{train}(\Theta)</script> using the training set</li> <li>Compute the test set error <script type="math/tex">J_{test}(\Theta)</script></li> </ol> <h3 id="the-test-set-error">The test set error</h3> <ol> <li>For linear regression:</li> </ol> <script type="math/tex; mode=display">J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2</script> <ol> <li>For classification ~ Misclassification error (aka 0/1 misclassification error):</li> </ol> <script type="math/tex; mode=display">% <![CDATA[ err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } h_\Theta(x) \geq 0.5 \text{ and } y = 0,\ \text{or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases} %]]></script> <p>This gives us a binary 0 or 1 error result based on a misclassification.</p> <p>The average test error for the test set is</p> <script type="math/tex; mode=display">\text{Test Error} = \dfrac{1}{m_{test}} \sum^{m_{test}}_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})</script> <p>This gives us the proportion of the test data that was misclassified.</p> <h2 id="model-selection-and-trainvalidationtest-sets">Model Selection and Train/Validation/Test Sets</h2> <ul>
<li>Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis.</li> <li>The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.</li> </ul> <p>In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.</p> <h3 id="without-the-validation-set-note-this-is-a-bad-method---do-not-use-it">Without the Validation Set (note: this is a bad method - do not use it)</h3> <ol> <li>Optimize the parameters in Θ using the training set for each polynomial degree.</li> <li>Find the polynomial degree d with the least error using the test set.</li> <li>Estimate the generalization error also using the test set with <script type="math/tex">J_{test}(\Theta^{(d)})</script> (where d is the degree of the polynomial with the lowest error).</li> </ol> <p>In this case, we have trained one variable, d, the degree of the polynomial, using the test set. This will cause our error value to be greater for any other set of data.</p> <h3 id="use-of-the-cv-set">Use of the CV set</h3> <p>To solve this, we can introduce a third set, the Cross Validation Set, to serve as an intermediate set that we can train d with.
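As a concrete illustration, here is a minimal sketch of this procedure in Python with NumPy. The data, split sizes, and candidate degrees are all made up for illustration and are not from the course: the degree d is chosen on the cross validation set, and the test set is touched only once for the final error estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data (made up for illustration).
m = 100
x = rng.uniform(-3, 3, m)
y = 0.5 * x**2 - x + 1 + rng.normal(0, 0.3, m)

# 60% / 20% / 20% split into training, cross validation, and test sets.
idx = rng.permutation(m)
tr, cv, te = idx[:60], idx[60:80], idx[80:]

def cost(theta, xs, ys):
    # Squared-error cost J = (1/2m) * sum((h(x) - y)^2)
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# Fit one model per polynomial degree d on the training set only,
# then pick the degree with the lowest cross validation error.
degrees = range(1, 9)
thetas = {d: np.polyfit(x[tr], y[tr], d) for d in degrees}
best_d = min(degrees, key=lambda d: cost(thetas[d], x[cv], y[cv]))

# The test set is used once, for the final generalization estimate.
test_error = cost(thetas[best_d], x[te], y[te])
print(best_d, test_error)
```

Because d is selected on the cross validation set, the reported test error remains an unbiased estimate of generalization error.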
Then our test set will give us an accurate, non-optimistic error.</p> <p>One example way to break down our dataset into the three sets is:</p> <ul> <li>Training set: 60%</li> <li>Cross validation set: 20%</li> <li>Test set: 20%</li> </ul> <p>We can now calculate three separate error values for the three different sets.</p> <h3 id="with-the-validation-set-note-this-method-presumes-we-do-not-also-use-the-cv-set-for-regularization">With the Validation Set (note: this method presumes we do not also use the CV set for regularization)</h3> <ol> <li>Optimize the parameters in Θ using the training set for each polynomial degree.</li> <li>Find the polynomial degree d with the least error using the cross validation set.</li> <li>Estimate the generalization error using the test set with <script type="math/tex">J_{test}(\Theta^{(d)})</script> (where d is the degree of the polynomial with the lowest cross validation error).</li> </ol> <p>This way, the degree of the polynomial d has not been trained using the test set.</p> <p>(Mentor note: be aware that using the CV set to select ‘d’ means that we cannot also use it for the validation curve process of setting the lambda value.)</p> <h2 id="diagnosing-bias-vs-variance">Diagnosing Bias vs. Variance</h2> <p>In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.</p> <ul> <li>We need to distinguish whether bias or variance is the problem contributing to bad predictions.</li> <li>High bias is underfitting and high variance is overfitting.
We need to find a golden mean between these two.</li> </ul> <p>The training error will tend to decrease as we increase the degree d of the polynomial.</p> <p>At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.</p> <p>High bias (underfitting): both <script type="math/tex">J_{train}(\Theta)</script> and <script type="math/tex">J_{CV}(\Theta)</script> will be high. Also, <script type="math/tex">J_{CV}(\Theta) \approx J_{train}(\Theta)</script>.</p> <p>High variance (overfitting): <script type="math/tex">J_{train}(\Theta)</script> will be low and <script type="math/tex">J_{CV}(\Theta)</script> will be much greater than <script type="math/tex">J_{train}(\Theta)</script>.</p> <p><img src="/assets/images/posts/Stanford-ML/stanford-ml-w6-1.png" alt="alt_text" title="Bias vs Variance" /></p> <h2 id="regularization-and-biasvariance">Regularization and Bias/Variance</h2> <p>Instead of looking at the degree d contributing to bias/variance, now we will look at the regularization parameter λ.</p> <ul> <li>Large λ: High bias (underfitting)</li> <li>Intermediate λ: just right</li> <li>Small λ: High variance (overfitting)</li> </ul> <p>A large lambda heavily penalizes all the Θ parameters, which greatly simplifies the line of our resulting function, thus causing underfitting.</p> <p>The relationship of λ to the training set and the cross validation set is as follows:</p> <p>Low λ: <script type="math/tex">J_{train}(\Theta)</script> is low and <script type="math/tex">J_{CV}(\Theta)</script> is high (high variance/overfitting).</p> <p>Intermediate λ: <script type="math/tex">J_{train}(\Theta)</script> and <script type="math/tex">J_{CV}(\Theta)</script> are somewhat low and <script type="math/tex">J_{train}(\Theta) \approx J_{CV}(\Theta)</script>.</p> <p>Large λ: both <script type="math/tex">J_{train}(\Theta)</script> and <script type="math/tex">J_{CV}(\Theta)</script> will be high
(underfitting/high bias)</p> <p><img src="/assets/images/posts/Stanford-ML/stanford-ml-w6-2.png" alt="alt_text" title="lambda vs hypothesis" /></p> <p>In order to choose the model and the regularization λ, we need to:</p> <ol> <li>Create a list of lambdas (e.g. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24}).</li> <li>Create a set of models with different degrees or any other variants.</li> <li>Iterate through the <script type="math/tex">\lambda</script>s and for each <script type="math/tex">\lambda</script> go through all the models to learn some <script type="math/tex">\Theta</script>.</li> <li>Compute the cross validation error <script type="math/tex">J_{CV}(\Theta)</script> using the learned Θ (trained with λ), evaluating the error without regularization (λ = 0).</li> <li>Select the best combo that produces the lowest error on the cross validation set.</li> <li>Using the best combo Θ and λ, apply it on <script type="math/tex">J_{test}(\Theta)</script> to see whether it generalizes well.</li> </ol> <h2 id="learning-curves">Learning Curves</h2> <p>Training on 3 examples will easily give 0 error because we can always find a quadratic curve that exactly touches 3 points.</p> <p>As the training set gets larger, the error for a quadratic function increases.
The error value will plateau out after a certain m, or training set size.</p> <h3 id="with-high-bias">With high bias</h3> <p>Low training set size: causes <script type="math/tex">J_{train}(\Theta)</script> to be low and <script type="math/tex">J_{CV}(\Theta)</script> to be high.</p> <p>Large training set size: causes both <script type="math/tex">J_{train}(\Theta)</script> and <script type="math/tex">J_{CV}(\Theta)</script> to be high with <script type="math/tex">J_{train}(\Theta)</script>≈<script type="math/tex">J_{CV}(\Theta)</script>.</p> <p>If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.</p> <p>For high variance, we have the following relationships in terms of the training set size:</p> <h3 id="with-high-variance">With high variance</h3> <p>Low training set size: <script type="math/tex">J_{train}(\Theta)</script> will be low and <script type="math/tex">J_{CV}(\Theta)</script> will be high.</p> <p>Large training set size: <script type="math/tex">J_{train}(\Theta)</script> increases with training set size and <script type="math/tex">J_{CV}(\Theta)</script> continues to decrease without leveling off. 
Also, <script type="math/tex">J_{train}(\Theta)</script>&lt;<script type="math/tex">J_{CV}(\Theta)</script> but the difference between them remains significant.</p> <p>If a learning algorithm is suffering from high variance, getting more training data is likely to help.</p> <p><img src="/assets/images/posts/Stanford-ML/stanford-ml-w6-3.png" alt="alt_text" title="Learning curves - 1" /> <img src="/assets/images/posts/Stanford-ML/stanford-ml-w6-4.png" alt="alt_text" title="Learning curves - 2" /></p> <h2 id="deciding-what-to-do-next-revisited">Deciding What to Do Next Revisited</h2> <p>Our decision process can be broken down as follows:</p> <table> <thead> <tr> <th style="text-align: center">Action</th> <th style="text-align: center">Result</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">Getting more training examples</td> <td style="text-align: center">Fixes high variance</td> </tr> <tr> <td style="text-align: center">Trying smaller sets of features</td> <td style="text-align: center">Fixes high variance</td> </tr> <tr> <td style="text-align: center">Adding features</td> <td style="text-align: center">Fixes high bias</td> </tr> <tr> <td style="text-align: center">Adding polynomial features</td> <td style="text-align: center">Fixes high bias</td> </tr> <tr> <td style="text-align: center">Decreasing λ</td> <td style="text-align: center">Fixes high bias</td> </tr> <tr> <td style="text-align: center">Increasing λ</td> <td style="text-align: center">Fixes high variance</td> </tr> </tbody> </table> <h3 id="diagnosing-neural-networks">Diagnosing Neural Networks</h3> <p>A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper. A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting. Using a single hidden layer is a good starting default. 
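The learning-curve diagnostic from the previous section can be sketched numerically. This is a hypothetical Python/NumPy example: a straight line is deliberately fit to quadratic data so the model has high bias; the data, noise level, and subset sizes are all made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: quadratic ground truth, deliberately fit with a
# straight line so the model suffers from high bias.
x = rng.uniform(-3, 3, 200)
y = x**2 + rng.normal(0, 0.5, 200)
x_tr, y_tr, x_cv, y_cv = x[:150], y[:150], x[150:], y[150:]

def cost(theta, xs, ys):
    # Squared-error cost J = (1/2m) * sum((h(x) - y)^2)
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# Learning curve: train on the first m examples for growing m, recording
# the training error (on those m examples) and the cross validation error.
sizes = range(5, 151, 5)
j_train, j_cv = [], []
for m in sizes:
    theta = np.polyfit(x_tr[:m], y_tr[:m], 1)   # degree-1 fit: underfits
    j_train.append(cost(theta, x_tr[:m], y_tr[:m]))
    j_cv.append(cost(theta, x_cv, y_cv))

# With high bias, both errors end up high and close to each other,
# so adding more data does not help much.
print(j_train[-1], j_cv[-1])
```

Plotting `j_train` and `j_cv` against `sizes` would reproduce the high-bias learning curve shown above: both errors plateau at a high value as m grows.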
You can use your cross validation set to select the number of hidden layers.</p> <h3 id="model-selection">Model Selection:</h3> <p>Choosing M, the order of the polynomial.</p> <p>How can we tell which parameters Θ to leave in the model (known as “model selection”)?</p> <p>There are several ways to solve this problem:</p> <ul> <li>Get more data (very difficult).</li> <li>Choose the model which best fits the data without overfitting (very difficult).</li> <li>Reduce the opportunity for overfitting through regularization.</li> </ul> <p>Bias: approximation error (difference between expected value and optimal value)</p> <ul> <li>High Bias = UnderFitting (BU)</li> <li><script type="math/tex">J_{train}(\Theta)</script> and <script type="math/tex">J_{CV}(\Theta)</script> both will be high and <script type="math/tex">J_{train}(\Theta)</script> ≈ <script type="math/tex">J_{CV}(\Theta)</script></li> </ul> <p>Variance: estimation error due to finite data</p> <ul> <li>High Variance = OverFitting (VO)</li> <li><script type="math/tex">J_{train}(\Theta)</script> is low and <script type="math/tex">J_{CV}(\Theta)</script> ≫ <script type="math/tex">J_{train}(\Theta)</script></li> </ul> <p>Intuition for the bias-variance trade-off:</p> <ul> <li>Complex model =&gt; sensitive to data =&gt; much affected by changes in X =&gt; high variance, low bias.</li> <li>Simple model =&gt; more rigid =&gt; does not change as much with changes in X =&gt; low variance, high bias.</li> </ul> <p>One of the most important goals in learning: finding a model that is just right in the bias-variance trade-off.</p> <p>Regularization Effects:</p> <ul> <li>Small values of λ allow the model to become finely tuned to noise, leading to large variance =&gt; overfitting.</li> <li>Large values of λ pull weight parameters to zero, leading to large bias =&gt; underfitting.</li> </ul> <p>Model Complexity Effects:</p> <ul> <li>Lower-order polynomials (low model complexity) have high bias and low variance.
In this case, the model consistently fits the data poorly.</li> <li>Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.</li> <li>In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.</li> </ul> <p>A typical rule of thumb when running diagnostics is:</p> <ul> <li>More training examples fix high variance but not high bias.</li> <li>Fewer features fix high variance but not high bias.</li> <li>Additional features fix high bias but not high variance.</li> <li>The addition of polynomial and interaction features fixes high bias but not high variance.</li> <li>When using gradient descent, decreasing lambda can fix high bias and increasing lambda can fix high variance (lambda is the regularization parameter).</li> <li>When using neural networks, small neural networks are more prone to under-fitting and big neural networks are prone to over-fitting. Cross-validation of network size is a way to choose alternatives.</li> </ul> <h2 id="mlmachine-learning-system-design">ML:Machine Learning System Design</h2> <h3 id="prioritizing-what-to-work-on">Prioritizing What to Work On</h3> <p>Different ways we can approach a machine learning problem:</p> <ul> <li>Collect lots of data (for example, the “honeypot” project, though it doesn’t always work).</li> <li>Develop sophisticated features (for example: using email header data in spam emails).</li> <li>Develop algorithms to process your input in different ways (recognizing misspellings in spam).</li> <li>It is difficult to tell which of the options will be helpful.</li> </ul> <h3 id="error-analysis">Error Analysis</h3> <p>The recommended approach to solving machine learning problems is:</p> <ul> <li>Start with a simple algorithm, implement it quickly, and test it early.</li> <li>Plot learning curves to decide if more data, more features, etc.
will help.</li> <li>Error analysis: manually examine the errors on examples in the cross validation set and try to spot a trend.</li> </ul> <p>It’s important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm’s performance.</p> <p>You may need to process your input before it is useful. For example, if your input is a set of words, you may want to treat the same word with different forms (fail/failing/failed) as one word, so you can use “stemming software” to recognize them all as one word.</p> <h3 id="error-metrics-for-skewed-classes">Error Metrics for Skewed Classes</h3> <p>It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.</p> <p>For example: In predicting a cancer diagnosis where 0.5% of the examples have cancer, we find our learning algorithm has a 1% error. However, if we were to simply classify every single example as a 0, then our error would reduce to 0.5% even though we did not improve the algorithm.
This usually happens with skewed classes; that is, when our class is very rare in the entire data set.</p> <p>Or to say it another way, when we have a lot more examples from one class than from the other class.</p> <p>For this we can use Precision/Recall.</p> <ul> <li>Predicted: 1, Actual: 1 — True positive</li> <li>Predicted: 0, Actual: 0 — True negative</li> <li>Predicted: 0, Actual: 1 — False negative</li> <li>Predicted: 1, Actual: 0 — False positive</li> </ul> <p>Precision: of all patients for whom we predicted y=1, what fraction actually has cancer?</p> <script type="math/tex; mode=display">\dfrac{\text{True Positives}}{\text{Total number of predicted positives}} = \dfrac{\text{True Positives}}{\text{True Positives}+\text{False positives}}</script> <p>Recall: Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?</p> <script type="math/tex; mode=display">\dfrac{\text{True Positives}}{\text{Total number of actual positives}}= \dfrac{\text{True Positives}}{\text{True Positives}+\text{False negatives}}</script> <p>These two metrics give us a better sense of how our classifier is doing. We want both precision and recall to be high.</p> <p>In the example at the beginning of the section, if we classify all patients as 0, then our recall will be <script type="math/tex">\dfrac{0}{0 + f} = 0</script> (where f is the number of false negatives), so despite having a lower error percentage, we can quickly see it has worse recall.</p> <p>Accuracy = <script type="math/tex">\dfrac{\text{true positives} + \text{true negatives}}{\text{total population}}</script></p> <p>Note: if an algorithm predicts only negatives, as it does in one of the exercises, the precision is undefined, since computing it would require dividing by 0. The F1 score will then be undefined as well.</p> <h2 id="trading-off-precision-and-recall">Trading Off Precision and Recall</h2> <p>We might want a confident prediction of two classes using logistic regression.
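The definitions above can be sketched directly. This is a hypothetical Python/NumPy example with made-up labels; the 0.5% positive rate mirrors the cancer example, and the classifier predicts every example as negative.

```python
import numpy as np

# Hypothetical skewed data set: 0.5% positives (mirroring the cancer
# example above) and a classifier that predicts every example as 0.
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1
y_pred = np.zeros(1000, dtype=int)

def precision_recall(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Precision is undefined (0/0) when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    return precision, recall

p, r = precision_recall(y_true, y_pred)
accuracy = np.mean(y_true == y_pred)
print(accuracy, p, r)   # 99.5% accuracy, undefined precision, 0 recall
```

The 99.5% accuracy looks impressive, yet recall is 0: the classifier never detects a positive, which is exactly the failure mode precision and recall expose.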
One way is to increase our threshold:</p> <ul> <li>Predict 1 if: <script type="math/tex">h_\theta(x) \geq 0.7</script></li> <li>Predict 0 if: <script type="math/tex">% <![CDATA[ h_\theta(x) < 0.7 %]]></script></li> </ul> <p>This way, we predict cancer only when the hypothesis is at least 70% confident.</p> <p>Doing this, we will have higher precision but lower recall (refer to the definitions in the previous section).</p> <p>In the opposite example, we can lower our threshold:</p> <ul> <li>Predict 1 if: <script type="math/tex">h_\theta(x) \geq 0.3</script></li> <li>Predict 0 if: <script type="math/tex">% <![CDATA[ h_\theta(x) < 0.3 %]]></script></li> </ul> <p>That way, we get a very safe prediction. This will cause higher recall but lower precision.</p> <p>The greater the threshold, the greater the precision and the lower the recall.</p> <p>The lower the threshold, the greater the recall and the lower the precision.</p> <p>In order to turn these two metrics into one single number, we can take the F value.</p> <p>One way is to take the average:</p> <script type="math/tex; mode=display">\dfrac{P+R}{2}</script> <p>This does not work well. A classifier that almost never predicts y=1 can have very high precision, which brings the average up despite having recall near 0. If we predict all examples as y=1, then the very high recall will bring up the average despite having very low precision.</p> <p>A better way is to compute the F Score (or F1 score):</p> <script type="math/tex; mode=display">\text{F Score} = 2\dfrac{PR}{P + R}</script> <p>In order for the F Score to be large, both precision and recall must be large.</p> <p>We should tune the threshold using precision and recall on the cross validation set, so as not to bias our test set.</p> <h2 id="data-for-machine-learning">Data for Machine Learning</h2> <p>How much data should we train on?</p> <p>In certain cases, an “inferior algorithm,” if given enough data, can outperform a superior algorithm with less data.</p> <p>We must choose our features to have enough information.
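Returning to the threshold discussion above, the precision/recall trade-off and the F score can be sketched as a threshold sweep on the cross validation set. This is a hypothetical Python/NumPy example: the labels, scores, and threshold grid are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical classifier scores: positives tend to score higher
# (labels and scores are made up for illustration).
y_true = (rng.uniform(size=500) < 0.3).astype(int)
scores = np.clip(0.35 * y_true + 0.7 * rng.uniform(size=500), 0, 1)

def f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Sweep thresholds on the cross validation set, keeping the one with
# the highest F score rather than the highest raw accuracy.
thresholds = np.arange(0.1, 0.95, 0.05)
best_t = max(thresholds, key=lambda t: f1(y_true, (scores >= t).astype(int)))
best_f1 = f1(y_true, (scores >= best_t).astype(int))
print(best_t, best_f1)
```

Selecting the threshold by F score, instead of eyeballing precision and recall separately, gives a single number to optimize on the cross validation set.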
A useful test is: Given input x, would a human expert be able to confidently predict y?</p> <p>Rationale for large data: if we have a low bias algorithm (many features or hidden units making a very complex function), then the larger the training set we use, the less we will have overfitting (and the more accurate the algorithm will be on the test set).</p> <h1 id="reference">Reference</h1> <ul> <li><a href="https://www.coursera.org/learn/machine-learning">Stanford Machine Learning by Andrew Ng</a></li> </ul>Study Note of Machine Learning (V)2018-03-13T21:42:00+00:002018-03-13T21:42:00+00:00http://www.pwyqspace.com/study/2018/03/13/stanford-ml-note-five<h1 id="chap-7---optimization-objective">Chap 7 - Optimization Objective</h1> <p>The <strong>Support Vector Machine</strong> (SVM) is yet another type of supervised machine learning algorithm.
Compared to logistic regression, it sometimes gives a cleaner and more powerful way of learning complex non-linear functions.</p> <p>Recall that in logistic regression, we use the following rules:</p> <ul> <li>if y=1, then <script type="math/tex">h_\theta(x) \approx 1</script> and <script type="math/tex">\Theta^Tx \gg 0</script></li> <li>if y=0, then <script type="math/tex">h_\theta(x) \approx 0</script> and <script type="math/tex">\Theta^Tx \ll 0</script></li> </ul> <p>Recall the cost function for (unregularized) logistic regression:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}J(\theta) & = \frac{1}{m}\sum_{i=1}^m -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\\ & = \frac{1}{m}\sum_{i=1}^m -y^{(i)} \log\Big(\dfrac{1}{1 + e^{-\theta^Tx^{(i)}}}\Big) - (1 - y^{(i)})\log\Big(1 - \dfrac{1}{1 + e^{-\theta^Tx^{(i)}}}\Big)\end{align*} %]]></script> <p>To make a support vector machine, we will modify the first term of the cost function <script type="math/tex">-\log(h_{\theta}(x)) = -\log\Big(\dfrac{1}{1 + e^{-\theta^Tx}}\Big)</script> so that when <script type="math/tex">θ^Tx</script> (from now on, we shall refer to this as z) is greater than 1, it outputs 0. Furthermore, for values of z less than 1, we shall use a straight decreasing line instead of the sigmoid curve. (In the literature, this is called a <a href="https://en.wikipedia.org/wiki/Hinge_loss">hinge loss function</a>.)</p> <p><img src="/assets/images/posts/Stanford-ML/stanford-ml-w7-1.png" alt="alt_text" /></p> <p>Similarly, we modify the second term of the cost function <script type="math/tex">-\log(1 - h_{\theta}(x)) = -\log\Big(1 - \dfrac{1}{1 + e^{-\theta^Tx}}\Big)</script> so that when z is less than -1, it outputs 0.
We also modify it so that for values of z greater than -1, we use a straight increasing line instead of the sigmoid curve.</p> <p><img src="/assets/images/posts/Stanford-ML/stanford-ml-w7-2.png" alt="alt_text" /></p> <p>We shall denote these as <script type="math/tex">\text{cost}_1(z)</script> and <script type="math/tex">\text{cost}_0(z)</script> (respectively, note that <script type="math/tex">\text{cost}_1(z)</script> is the cost for classifying when y=1, and <script type="math/tex">\text{cost}_0(z)</script> is the cost for classifying when y=0), and we may define them as follows (where k is an arbitrary constant defining the magnitude of the slope of the line):</p> <script type="math/tex; mode=display">z = \theta^Tx</script> <script type="math/tex; mode=display">\text{cost}_0(z) = \max(0, k(1+z))</script> <script type="math/tex; mode=display">\text{cost}_1(z) = \max(0, k(1-z))</script> <p>Recall the full cost function from (regularized) logistic regression:</p> <script type="math/tex; mode=display">J(\theta) = \frac{1}{m} \sum_{i=1}^m y^{(i)}(-\log(h_\theta(x^{(i)}))) + (1 - y^{(i)})(-\log(1 - h_\theta(x^{(i)}))) + \dfrac{\lambda}{2m}\sum_{j=1}^n \Theta^2_j</script> <p>Note that the negative sign has been distributed into the sum in the above equation.</p> <p>We may transform this into the cost function for support vector machines by substituting <script type="math/tex">\text{cost}_0(z)</script> and <script type="math/tex">\text{cost}_1(z)</script>:</p> <script type="math/tex; mode=display">J(\theta) = \frac{1}{m} \sum_{i=1}^m y^{(i)} \ \text{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)}) \ \text{cost}_0(\theta^Tx^{(i)}) + \dfrac{\lambda}{2m}\sum_{j=1}^n \Theta^2_j</script> <p>We can optimize this a bit by multiplying this by m (thus removing the m factor in the denominators). 
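The substituted cost function just described can be sketched as follows. This is a minimal Python/NumPy illustration with the slope constant k set to 1 (k is arbitrary, as noted above) and tiny made-up data; it implements the form of J(θ) before the multiplication by m.

```python
import numpy as np

# Hinge-style piecewise costs from above, with slope constant k = 1
# (k is arbitrary; only the shape matters).
def cost1(z):            # cost for y = 1: zero once z >= 1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):            # cost for y = 0: zero once z <= -1
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, lam):
    """J(theta) = (1/m) * sum(y*cost1(z) + (1-y)*cost0(z))
                  + (lambda/(2m)) * sum_{j>=1} theta_j^2"""
    m = len(y)
    z = X @ theta
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z)) / m
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 not regularized
    return data_term + reg_term

# Tiny hypothetical check: confidently correct examples incur zero data cost,
# leaving only the regularization term.
theta = np.array([0.0, 2.0])
X = np.array([[1.0, 1.0], [1.0, -1.0]])   # first column is the bias feature x_0 = 1
y = np.array([1.0, 0.0])
print(svm_cost(theta, X, y, lam=1.0))
```

With z = ±2 both examples land in the flat region of their hinge costs, so the data term is exactly zero and only the regularization term contributes.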
Note that this does not affect our optimization, since we’re simply multiplying our cost function by a positive constant (for example, minimizing <script type="math/tex">(u-5)^2 + 1</script> gives us u = 5; multiplying it by 10 to make it <script type="math/tex">10(u-5)^2 + 10</script> still gives us u = 5 when minimized).</p> <script type="math/tex; mode=display">J(\theta) = \sum_{i=1}^m y^{(i)} \ \text{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)}) \ \text{cost}_0(\theta^Tx^{(i)}) + \dfrac{\lambda}{2}\sum_{j=1}^n \Theta^2_j</script> <p>Furthermore, convention dictates that we regularize using a factor C, instead of λ, like so:</p> <script type="math/tex; mode=display">J(\theta) = C\sum_{i=1}^m y^{(i)} \ \text{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)}) \ \text{cost}_0(\theta^Tx^{(i)}) + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j</script> <p>This is equivalent to multiplying the equation by <script type="math/tex">C = \dfrac{1}{\lambda}</script>, and thus results in the same values when optimized. Now, when we wish to regularize more (that is, reduce overfitting), we decrease C, and when we wish to regularize less (that is, reduce underfitting), we increase C.</p> <p>Finally, note that the hypothesis of the Support Vector Machine is not interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression). Instead, it outputs either 1 or 0.
(In technical terms, it is a discriminant function.)</p> <script type="math/tex; mode=display">% <![CDATA[ h_\theta(x) =\begin{cases} 1 & \text{if} \ \Theta^Tx \geq 0 \\ 0 & \text{otherwise}\end{cases} %]]></script> <h2 id="large-margin-intuition">Large Margin Intuition</h2> <p>A useful way to think about Support Vector Machines is to think of them as Large Margin Classifiers.</p> <ul> <li>If y=1, we want <script type="math/tex">\Theta^Tx \geq 1</script> (not just ≥0)</li> <li>If y=0, we want <script type="math/tex">\Theta^Tx \leq -1</script> (not just &lt;0)</li> </ul> <p>Now when we set our constant C to a very large value (e.g. 100,000), our optimizing function will constrain Θ such that the equation A (the summation of the cost of each example) equals 0. We impose the following constraints on Θ:</p> <p><script type="math/tex">\Theta^Tx \geq 1</script> if y=1 and <script type="math/tex">\Theta^Tx \leq -1</script> if y=0.</p> <p>If C is very large, we must choose Θ parameters such that:</p> <script type="math/tex; mode=display">\sum_{i=1}^m y^{(i)}\text{cost}_1(\Theta^Tx) + (1 - y^{(i)})\text{cost}_0(\Theta^Tx) = 0</script> <p>This reduces our cost function to:</p> <script type="math/tex; mode=display">\begin{align*} J(\theta) = C \cdot 0 + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j \newline = \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j \end{align*}</script> <p>Recall the decision boundary from logistic regression (the line separating the positive and negative examples). In SVMs, the decision boundary has the special property that it is <strong>as far away as possible</strong> from both the positive and the negative examples.</p> <p>The distance of the decision boundary to the nearest example is called the <strong>margin</strong>. 
Since SVMs maximize this margin, it is often called a Large Margin Classifier.</p> <p>The SVM will separate the negative and positive examples by a <strong>large margin</strong>.</p> <p>This large margin is only achieved when <strong>C is very large</strong>.</p> <p>Data is <strong>linearly separable</strong> when a <strong>straight line</strong> can separate the positive and negative examples.</p> <p>If we have <strong>outlier</strong> examples that we don’t want to affect the decision boundary, then we can <strong>reduce</strong> C.</p> <p>Increasing and decreasing C is similar to respectively decreasing and increasing λ, and can simplify our decision boundary.</p> <h2 id="mathematics-behind-large-margin-classification-optional">Mathematics Behind Large Margin Classification (Optional)</h2> <h3 id="vector-inner-product">Vector Inner Product</h3> <p>Say we have two vectors, u and v:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} u = \begin{bmatrix} u_1 \newline u_2 \end{bmatrix} & v = \begin{bmatrix} v_1 \newline v_2 \end{bmatrix} \end{align*} %]]></script> <p>The length of vector v is denoted</p> <script type="math/tex; mode=display">||v||</script> <p>and it is the length of the line on a graph from the origin (0,0) to <script type="math/tex">(v_1,v_2)</script>.</p> <p>The length of vector v can be calculated with <script type="math/tex">\sqrt{v_1^2 + v_2^2}</script> by the Pythagorean theorem.</p> <p>The <strong>projection</strong> of vector v onto vector u is found by taking a right angle from u to the end of v, creating a right triangle.</p> <ul> <li>p = length of the projection of v onto the vector u.</li> <li> <script type="math/tex; mode=display">u^Tv= p \cdot ||u||</script> </li> </ul> <p>Note that</p> <script type="math/tex; mode=display">u^Tv = ||u|| \cdot ||v|| \cos \theta</script> <p>where θ is the angle between u and v.</p> <p>Also</p> <script type="math/tex; mode=display">p = ||v|| \cos \theta</script> <p>If you substitute p for</p> <script 
type="math/tex; mode=display">||v|| \cos \theta</script> <p>you get</p> <script type="math/tex; mode=display">u^Tv= p \cdot ||u||</script> <p>So the product <script type="math/tex">u^Tv</script> is equal to the length of the projection times the length of vector u.</p> <p>Since u and v are vectors of the same dimension, <script type="math/tex">u^Tv = v^Tu</script>.</p> <script type="math/tex; mode=display">u^Tv = v^Tu = p \cdot ||u|| = u_1v_1 + u_2v_2</script> <p>If the angle between the lines for v and u is greater than 90 degrees, then the projection p will be negative.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}&\min_\Theta \dfrac{1}{2}\sum_{j=1}^n \Theta_j^2 \newline&= \dfrac{1}{2}(\Theta_1^2 + \Theta_2^2 + \dots + \Theta_n^2) \newline&= \dfrac{1}{2}(\sqrt{\Theta_1^2 + \Theta_2^2 + \dots + \Theta_n^2})^2 \newline&= \dfrac{1}{2}||\Theta ||^2 \newline\end{align*} %]]></script> <p>We can use the same rules to rewrite <script type="math/tex">\Theta^Tx^{(i)}</script>:</p> <script type="math/tex; mode=display">\Theta^Tx^{(i)} = p^{(i)} \cdot ||\Theta || = \Theta_1x_1^{(i)} + \Theta_2x_2^{(i)} + \dots + \Theta_n x_n^{(i)}</script> <p>So we now have a new optimization objective by substituting</p> <script type="math/tex; mode=display">p^{(i)} \cdot ||\Theta ||</script> <p>in for <script type="math/tex">\Theta^Tx^{(i)}</script></p> <ul> <li>If y=1, we want</li> </ul> <script type="math/tex; mode=display">p^{(i)} \cdot ||\Theta || \geq 1</script> <ul> <li>If y=0, we want</li> </ul> <script type="math/tex; mode=display">p^{(i)} \cdot ||\Theta || \leq -1</script> <p>The reason this causes a “large margin” is because: the vector for Θ is perpendicular to the decision boundary.
In order for our optimization objective (above) to hold true, we need the absolute value of our projections <script type="math/tex">p^{(i)}</script> to be as large as possible.</p> <p>If <script type="math/tex">\Theta_0 =0</script>, then all our decision boundaries will intersect (0,0). If <script type="math/tex">\Theta_0 \neq 0</script>, the support vector machine will still find a large margin for the decision boundary.</p> <h2 id="kernels-i">Kernels I</h2> <p>Why use kernels?</p> <ul> <li>Kernels allow us to make complex, non-linear classifiers using Support Vector Machines.</li> <li>To deal with a non-linear SVM decision boundary</li> </ul> <p>Given x, compute new features depending on proximity to landmarks <script type="math/tex">l^{(1)},\ l^{(2)},\ l^{(3)}</script>.</p> <p>To do this, we find the “similarity” of x and some landmark <script type="math/tex">l^{(i)}</script>:</p> <script type="math/tex; mode=display">f_i = similarity(x, l^{(i)}) = \exp(-\dfrac{||x - l^{(i)}||^2}{2\sigma^2})</script> <p>This “similarity” function is called a Gaussian Kernel.
It is a specific example of a kernel.</p> <p>The similarity function can also be written as follows:</p> <script type="math/tex; mode=display">f_i = similarity(x, l^{(i)}) = \exp(-\dfrac{\sum^n_{j=1}(x_j-l_j^{(i)})^2}{2\sigma^2})</script> <p>There are a couple properties of the similarity function:</p> <p>If <script type="math/tex">x \approx l^{(i)}</script>, then <script type="math/tex">f_i = \exp(-\dfrac{\approx 0^2}{2\sigma^2}) \approx 1</script></p> <p>If x is far from <script type="math/tex">l^{(i)}</script>, then <script type="math/tex">f_i = \exp(-\dfrac{(large\ number)^2}{2\sigma^2}) \approx 0</script></p> <p>In other words, if x and the landmark are close, then the similarity will be close to 1, and if x and the landmark are far away from each other, the similarity will be close to 0.</p> <p>Each landmark gives us the features in our hypothesis:</p> <script type="math/tex; mode=display">\begin{align*}l^{(1)} \rightarrow f_1 \newline l^{(2)} \rightarrow f_2 \newline l^{(3)} \rightarrow f_3 \newline\dots \newline h_\Theta(x) = \Theta_1f_1 + \Theta_2f_2 + \Theta_3f_3 + \dots\end{align*}</script> <p><script type="math/tex">\sigma^2</script> is a parameter of the Gaussian Kernel, and it can be modified to increase or decrease the drop-off of our feature <script type="math/tex">f_i</script>. Combined with looking at the values inside Θ, we can choose these landmarks to get the general shape of the decision boundary.</p> <h2 id="kernels-ii">Kernels II</h2> <p>One way to get the landmarks is to put them in the exact same locations as all the training examples. 
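The similarity function above, with landmarks placed at the training examples as just described, can be sketched in Python (a NumPy stand-in for the course's Octave; all names and values here are illustrative):

```python
import numpy as np

def gaussian_similarity(x, l, sigma=1.0):
    # f = exp(-||x - l||^2 / (2 sigma^2)): ~1 for nearby points, ~0 for far ones
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]])  # toy "training set"
landmarks = X.copy()  # one landmark per training example

x = np.array([1.1, 2.0])  # a query point near the first landmark
f = np.array([gaussian_similarity(x, l) for l in landmarks])  # feature vector
```

The query point sits almost on the first landmark, so its first feature is close to 1 while the feature for the distant landmark is close to 0.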
This gives us m landmarks, with one landmark per training example.</p> <p>Given example x:</p> <p><script type="math/tex">f_1 = similarity(x,l^{(1)})</script>, <script type="math/tex">f_2 = similarity(x,l^{(2)})</script>, <script type="math/tex">f_3 = similarity(x,l^{(3)})</script>, and so on.</p> <p>This gives us a “feature vector,” <script type="math/tex">f^{(i)}</script>, of all our features for example <script type="math/tex">x^{(i)}</script>. We may also set <script type="math/tex">f_0 = 1</script> to correspond with <script type="math/tex">Θ_0</script>. Thus given training example <script type="math/tex">x^{(i)}</script>:</p> <script type="math/tex; mode=display">x^{(i)} \rightarrow \begin{bmatrix}f_1^{(i)} = similarity(x^{(i)}, l^{(1)}) \newline f_2^{(i)} = similarity(x^{(i)}, l^{(2)}) \newline\vdots \newline f_m^{(i)} = similarity(x^{(i)}, l^{(m)}) \newline\end{bmatrix}</script> <p>Now to get the parameters Θ we can use the SVM minimization algorithm but with <script type="math/tex">f^{(i)}</script> substituted in for <script type="math/tex">x^{(i)}</script>:</p> <script type="math/tex; mode=display">\min_{\Theta} C \sum_{i=1}^m y^{(i)}\text{cost}_1(\Theta^Tf^{(i)}) + (1 - y^{(i)})\text{cost}_0(\Theta^Tf^{(i)}) + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j</script> <p>Using kernels to generate <script type="math/tex">f^{(i)}</script> is not exclusive to SVMs and may also be applied to logistic regression. 
However, because of computational optimizations on SVMs, kernels combined with SVMs are much faster than with other algorithms, so kernels are almost always found combined only with SVMs.</p> <h3 id="choosing-svm-parameters">Choosing SVM Parameters</h3> <p>Choosing C (recall that <script type="math/tex">C = \dfrac{1}{\lambda}</script>):</p> <ul> <li>If C is large, then we get higher variance/lower bias</li> <li>If C is small, then we get lower variance/higher bias</li> </ul> <p>The other parameter we must choose is <script type="math/tex">σ^2</script> from the Gaussian Kernel function:</p> <p>With a large <script type="math/tex">σ^2</script>, the features <script type="math/tex">f_i</script> vary more smoothly, causing higher bias and lower variance.</p> <p>With a small <script type="math/tex">σ^2</script>, the features <script type="math/tex">f_i</script> vary less smoothly, causing lower bias and higher variance.</p> <h3 id="using-an-svm">Using An SVM</h3> <p>There are lots of good SVM libraries already written. A. Ng often uses ‘liblinear’ and ‘libsvm’. In practical application, you should use one of these libraries rather than rewrite the functions.</p> <p>In practical application, the choices you do need to make are:</p> <ul> <li>Choice of parameter C</li> <li>Choice of kernel (similarity function): <ul> <li>No kernel (“linear” kernel) – gives a standard linear classifier; choose when n is large and m is small</li> <li>Gaussian Kernel (above) – need to choose <script type="math/tex">σ^2</script>; choose when n is small and m is large</li> </ul> </li> </ul> <p>The library may ask you to provide the kernel function.</p> <p>Note: do perform feature scaling before using the Gaussian Kernel.</p> <p>Note: not all similarity functions are valid kernels. 
They must satisfy “Mercer’s Theorem,” which guarantees that the SVM package’s optimizations run correctly and do not diverge.</p> <p>You want to train C and the parameters for the kernel function using the training and cross-validation datasets.</p> <h3 id="multi-class-classification">Multi-class Classification</h3> <p>Many SVM libraries have multi-class classification built-in.</p> <p>You can use the one-vs-all method just like we did for logistic regression, where <script type="math/tex">y \in \lbrace 1,2,3,\dots,K\rbrace</script> with <script type="math/tex">\Theta^{(1)}, \Theta^{(2)}, \dots,\Theta^{(K)}</script>. We pick class i with the largest <script type="math/tex">(\Theta^{(i)})^Tx</script>.</p> <h3 id="logistic-regression-vs-svms">Logistic Regression vs. SVMs</h3> <ul> <li>If n is large (relative to m), then use logistic regression, or SVM without a kernel (the “linear kernel”)</li> <li>If n is small and m is intermediate, then use SVM with a Gaussian Kernel</li> <li>If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel.</li> </ul> <p>In the first case, we don’t have enough examples to need a complicated polynomial hypothesis. In the second case, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase our features so that logistic regression becomes applicable.</p> <p>Note: a neural network is likely to work well for any of these situations, but may be slower to train.</p> <h1 id="reference">Reference</h1> <ul> <li><a href="https://www.coursera.org/learn/machine-learning">Stanford Machine Learning by Andrew Ng</a></li> </ul>Yanqing WuChap 7 - Optimization Objective The Support Vector Machine (SVM) is yet another type of supervised machine learning algorithm. 
It is sometimes cleaner and more powerful.Study Note of Machine Learning (III)2018-03-12T20:28:00+00:002018-03-12T20:28:00+00:00http://www.pwyqspace.com/study/2018/03/12/stanford-ml-note-three<h1 id="chap-4---neural-networks-representation">Chap 4 - Neural Networks: Representation</h1> <h2 id="non-linear-hypotheses">Non-linear Hypotheses</h2> <p>Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted to create a hypothesis from three (3) features that included all the quadratic terms:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& g(\theta_0 + \theta_1x_1^2 + \theta_2x_1x_2 + \theta_3x_1x_3 \newline& + \theta_4x_2^2 + \theta_5x_2x_3 \newline& + \theta_6x_3^2 )\end{align*} %]]></script> <p>That gives us 6 features. The exact way to calculate how many features for all polynomial terms is the combination function with repetition <a href="http://www.mathsisfun.com/combinatorics/combinations-permutations.html">link</a>. <script type="math/tex">\frac{(n+r-1)!}{r!(n-1)!}</script>. In this case we are taking all two-element combinations of three features: <script type="math/tex">\frac{(3 + 2 - 1)!}{(2!\cdot (3-1)!)}</script> = <script type="math/tex">\frac{4!}{4} = 6</script>. (Note: you do not have to know these formulas, I just found it helpful for understanding).</p> <p>For 100 features, if we wanted to make them quadratic we would get <script type="math/tex">\frac{(100 + 2 - 1)!}{(2\cdot (100-1)!)} = 5050</script> resulting new features.</p> <p>We can approximate the growth of the number of new features we get with all quadratic terms with <script type="math/tex">\mathcal{O}(n^2/2)</script>. And if you wanted to include all cubic terms in your hypothesis, the features would grow asymptotically at <script type="math/tex">\mathcal{O}(n^3)</script>. 
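The counting argument above can be sanity-checked with a short plain-Python sketch (`math.comb` stands in for the combinations-with-repetition formula):

```python
from math import comb

def quadratic_terms(n):
    # Degree-2 monomials in n features: combinations with repetition,
    # C(n + 2 - 1, 2)
    return comb(n + 1, 2)

print(quadratic_terms(3))    # 6, as in the three-feature example
print(quadratic_terms(100))  # 5050
```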
These are very steep growths; as the number of features increases, the number of quadratic or cubic features increases very rapidly and quickly becomes impractical.</p> <p>Example: let our training set be a collection of 50 x 50 pixel black-and-white photographs, and our goal will be to classify which ones are photos of cars. Our feature set size is then n = 2500, one feature per pixel intensity.</p> <p>Now let’s say we need to make a quadratic hypothesis function. With quadratic features, our growth is <script type="math/tex">\mathcal{O}(n^2/2)</script>. So our total features will be about <script type="math/tex">2500^2 / 2 = 3125000</script>, which is very impractical.</p> <p>Neural networks offer an alternate way to perform machine learning when we have complex hypotheses with many features.</p> <h2 id="neurons-and-the-brain">Neurons and the Brain</h2> <p>Neural networks are limited imitations of how our own brains work. They’ve had a big recent resurgence because of advances in computer hardware.</p> <p>There is evidence that the brain uses only one “learning algorithm” for all its different functions. 
Scientists have severed (in an animal brain) the connection between the ears and the auditory cortex and rewired the optic nerve to the auditory cortex, finding that the auditory cortex literally learns to see.</p> <p>This principle is called “neuroplasticity” and is supported by many experiments.</p> <h2 id="model-representation-i">Model Representation I</h2> <p>Let’s examine how we will represent a hypothesis function using neural networks.</p> <p>At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals (called “spikes”) that are channeled to outputs (axons).</p> <p>In our model, our dendrites are like the input features <script type="math/tex">x_1\cdots x_n</script>, and the output is the result of our hypothesis function:</p> <p>In this model our <script type="math/tex">x_0</script> input node is sometimes called the “bias unit.” It is always equal to 1.</p> <p>In neural networks, we use the same logistic function as in classification: <script type="math/tex">\frac{1}{1 + e^{-\theta^Tx}}</script>. 
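That logistic function can be sketched as follows (Python with NumPy as an illustrative stand-in, not course code):

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^{-z}), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                      # 0.5
print(sigmoid(np.array([-10.0, 10.0])))  # ≈ [0, 1]
```

Large negative inputs saturate near 0 and large positive inputs near 1, which is what makes it usable as an activation.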
In neural networks however we sometimes call it a sigmoid (logistic) activation function.</p> <p>Our “theta” parameters are sometimes instead called “weights” in the neural networks model.</p> <p>Visually, a simplistic representation looks like:</p> <script type="math/tex; mode=display">\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline \end{bmatrix}\rightarrow\begin{bmatrix}\ \ \ \newline \end{bmatrix}\rightarrow h_\theta(x)</script> <p>Our input nodes (layer 1) go into another node (layer 2), and are output as the hypothesis function.</p> <p>The first layer is called the “input layer” and the final layer the “output layer,” which gives the final value computed on the hypothesis.</p> <p>We can have intermediate layers of nodes between the input and output layers called the “hidden layer.”</p> <p>We label these intermediate or “hidden” layer nodes <script type="math/tex">a^2_0 \cdots a^2_n</script> and call them “activation units.”</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& a_i^{(j)} = \text{"activation" of unit $i$ in layer $j$} \newline& \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}\end{align*} %]]></script> <p>If we had one hidden layer, it would look visually something like:</p> <script type="math/tex; mode=display">\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline x_3\end{bmatrix}\rightarrow\begin{bmatrix}a_1^{(2)} \newline a_2^{(2)} \newline a_3^{(2)} \newline \end{bmatrix}\rightarrow h_\theta(x)</script> <p>The values for each of the “activation” nodes is obtained as follows:</p> <script type="math/tex; mode=display">\begin{align*} a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \newline a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \newline a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \newline 
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \newline \end{align*}</script> <p>This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix <script type="math/tex">\Theta^{(2)}</script> containing the weights for our second layer of nodes.</p> <p>Each layer gets its own matrix of weights, <script type="math/tex">\Theta^{(j)}</script>.</p> <p>The dimensions of these matrices of weights is determined as follows:</p> <script type="math/tex; mode=display">\text{If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.}</script> <p>The +1 comes from the addition in <script type="math/tex">\Theta^{(j)}</script> of the “bias nodes,” <script type="math/tex">x_0</script> and <script type="math/tex">\Theta_0^{(j)}</script>. In other words the output nodes will not include the bias nodes while the inputs will.</p> <p>Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. Dimension of <script type="math/tex">\Theta^{(1)}</script> is going to be 4×3 where <script type="math/tex">s_j = 2</script> and <script type="math/tex">s_{j+1} = 4</script>, so <script type="math/tex">s_{j+1} \times (s_j + 1) = 4 \times 3</script>.</p> <h2 id="model-representation-ii">Model Representation II</h2> <p>In this section we’ll do a vectorized implementation of the above functions. We’re going to define a new variable <script type="math/tex">z_k^{(j)}</script> that encompasses the parameters inside our g function. 
In our previous example if we replaced the variable z for all the parameters we would get:</p> <script type="math/tex; mode=display">\begin{align*}a_1^{(2)} = g(z_1^{(2)}) \newline a_2^{(2)} = g(z_2^{(2)}) \newline a_3^{(2)} = g(z_3^{(2)}) \newline \end{align*}</script> <p>In other words, for layer j=2 and node k, the variable z will be:</p> <script type="math/tex; mode=display">z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n</script> <p>The vector representation of x and <script type="math/tex">z^{j}</script> is:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}x = \begin{bmatrix}x_0 \newline x_1 \newline\cdots \newline x_n\end{bmatrix} &z^{(j)} = \begin{bmatrix}z_1^{(j)} \newline z_2^{(j)} \newline\cdots \newline z_n^{(j)}\end{bmatrix}\end{align*} %]]></script> <p>Setting <script type="math/tex">x = a^{(1)}</script>, we can rewrite the equation as:</p> <script type="math/tex; mode=display">z^{(j)} = \Theta^{(j-1)}a^{(j-1)}</script> <p>We are multiplying our matrix <script type="math/tex">\Theta^{(j-1)}</script> with dimensions <script type="math/tex">s_j\times (n+1)</script> (where <script type="math/tex">s_j</script> is the number of our activation nodes) by our vector <script type="math/tex">a^{(j-1)}</script> with height (n+1). This gives us our vector <script type="math/tex">z^{(j)}</script> with height <script type="math/tex">s_j</script>.</p> <p>Now we can get a vector of our activation nodes for layer j as follows:</p> <script type="math/tex; mode=display">a^{(j)} = g(z^{(j)})</script> <p>Where our function g can be applied element-wise to our vector <script type="math/tex">z^{(j)}</script>.</p> <p>We can then add a bias unit (equal to 1) to layer j after we have computed <script type="math/tex">a^{(j)}</script>. 
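The vectorized step just described (multiply by the weight matrix, apply g element-wise, then prepend the bias unit) can be sketched in Python with NumPy; the shapes and values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(theta, a_prev):
    # theta maps layer j-1 (plus bias) to layer j: shape s_j x (s_{j-1} + 1)
    z = theta @ a_prev                 # z^{(j)} = Theta^{(j-1)} a^{(j-1)}
    a = sigmoid(z)                     # a^{(j)} = g(z^{(j)}), element-wise
    return np.concatenate(([1.0], a))  # prepend the bias unit a_0^{(j)} = 1

theta1 = np.zeros((3, 4))             # 3 units fed by 3 inputs + bias (toy weights)
a1 = np.array([1.0, 0.5, -0.5, 2.0])  # input x with bias x_0 = 1
a2 = forward_layer(theta1, a1)        # bias + 3 activations
```

With all-zero toy weights, every z is 0, so every activation comes out as g(0) = 0.5.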
This will be element <script type="math/tex">a_0^{(j)}</script> and will be equal to 1.</p> <p>To compute our final hypothesis, let’s first compute another z vector:</p> <script type="math/tex; mode=display">z^{(j+1)} = \Theta^{(j)}a^{(j)}</script> <p>We get this final z vector by multiplying the next theta matrix after <script type="math/tex">\Theta^{(j-1)}</script> with the values of all the activation nodes we just got.</p> <p>This last theta matrix <script type="math/tex">\Theta^{(j)}</script> will have only one row so that our result is a single number.</p> <p>We then get our final result with:</p> <script type="math/tex; mode=display">h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})</script> <p>Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression.</p> <p>Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.</p> <h2 id="examples-and-intuitions-i">Examples and Intuitions I</h2> <p>A simple example of applying neural networks is by predicting <script type="math/tex">x_1</script> AND <script type="math/tex">x_2</script>, which is the logical ‘and’ operator and is only true if both <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> are 1.</p> <p>The graph of our functions will look like:</p> <script type="math/tex; mode=display">\begin{align*}\begin{bmatrix}x_0 \newline x_1 \newline x_2\end{bmatrix} \rightarrow\begin{bmatrix}g(z^{(2)})\end{bmatrix} \rightarrow h_\Theta(x)\end{align*}</script> <p>Remember that <script type="math/tex">x_0</script> is our bias variable and is always 1.</p> <p>Let’s set our first theta matrix as:</p> <script type="math/tex; mode=display">% <![CDATA[ \Theta^{(1)} =\begin{bmatrix}-30 & 20 & 20\end{bmatrix} %]]></script> <p>This will cause the output of our hypothesis to only be positive if both <script type="math/tex">x_1</script> and <script 
type="math/tex">x_2</script> are 1. In other words:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& h_\Theta(x) = g(-30 + 20x_1 + 20x_2) \newline \newline & x_1 = 0 \ \ and \ \ x_2 = 0 \ \ then \ \ g(-30) \approx 0 \newline & x_1 = 0 \ \ and \ \ x_2 = 1 \ \ then \ \ g(-10) \approx 0 \newline & x_1 = 1 \ \ and \ \ x_2 = 0 \ \ then \ \ g(-10) \approx 0 \newline & x_1 = 1 \ \ and \ \ x_2 = 1 \ \ then \ \ g(10) \approx 1\end{align*} %]]></script> <p>So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates.</p> <h2 id="examples-and-intuitions-ii">Examples and Intuitions II</h2> <p>The <script type="math/tex">\Theta^{(1)}</script> matrices for AND, NOR, and OR are:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}AND:\newline\Theta^{(1)} &=\begin{bmatrix}-30 & 20 & 20\end{bmatrix} \newline NOR:\newline\Theta^{(1)} &= \begin{bmatrix}10 & -20 & -20\end{bmatrix} \newline OR:\newline\Theta^{(1)} &= \begin{bmatrix}-10 & 20 & 20\end{bmatrix} \newline\end{align*} %]]></script> <p>We can combine these to get the XNOR logical operator (which gives 1 if <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> are both 0 or both 1).</p> <script type="math/tex; mode=display">\begin{align*}\begin{bmatrix}x_0 \newline x_1 \newline x_2\end{bmatrix} \rightarrow\begin{bmatrix}a_1^{(2)} \newline a_2^{(2)} \end{bmatrix} \rightarrow\begin{bmatrix}a^{(3)}\end{bmatrix} \rightarrow h_\Theta(x)\end{align*}</script> <p>For the transition between the first and second layer, we’ll use a <script type="math/tex">\Theta^{(1)}</script> matrix that combines the values for AND and NOR:</p> <script type="math/tex; mode=display">% <![CDATA[ \Theta^{(1)} =\begin{bmatrix}-30 & 20 & 20 \newline 10 & -20 & -20\end{bmatrix} %]]></script> <p>For the transition between the second and third layer, we’ll
use a <script type="math/tex">\Theta^{(2)}</script> matrix that uses the value for OR:</p> <script type="math/tex; mode=display">% <![CDATA[ \Theta^{(2)} =\begin{bmatrix}-10 & 20 & 20\end{bmatrix} %]]></script> <p>Let’s write out the values for all our nodes:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& a^{(2)} = g(\Theta^{(1)} \cdot x) \newline& a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)}) \newline& h_\Theta(x) = a^{(3)}\end{align*} %]]></script> <p>And there we have the XNOR operator using a single hidden layer with two nodes!</p> <h2 id="multiclass-classification">Multiclass Classification</h2> <p>To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four final resulting classes:</p> <script type="math/tex; mode=display">\begin{align*}\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline\cdots \newline x_n\end{bmatrix} \rightarrow\begin{bmatrix}a_0^{(2)} \newline a_1^{(2)} \newline a_2^{(2)} \newline\cdots\end{bmatrix} \rightarrow\begin{bmatrix}a_0^{(3)} \newline a_1^{(3)} \newline a_2^{(3)} \newline\cdots\end{bmatrix} \rightarrow \cdots \rightarrow\begin{bmatrix}h_\Theta(x)_1 \newline h_\Theta(x)_2 \newline h_\Theta(x)_3 \newline h_\Theta(x)_4 \newline\end{bmatrix}\end{align*}</script> <p>Our final layer of nodes, when multiplied by its theta matrix, will result in another vector, on which we will apply the g() logistic function to get a vector of hypothesis values.</p> <p>Our resulting hypothesis for one set of inputs may look like:</p> <script type="math/tex; mode=display">h_\Theta(x) =\begin{bmatrix}0 \newline 0 \newline 1 \newline 0 \newline\end{bmatrix}</script> <p>In which case our resulting class is the third one down, or <script type="math/tex">h_\Theta(x)_3</script>.</p> <p>We can define our set of resulting classes as y:</p> <p>Our final value of our hypothesis for a set of inputs will be one of the elements in y.</p> <h1
id="chap-5---mlneural-networks-learning">Chap 5 - ML:Neural Networks: Learning</h1> <h2 id="cost-function">Cost Function</h2> <p>Let’s first define a few variables that we will need to use:</p> <p>a) L= total number of layers in the network</p> <p>b) <script type="math/tex">s_l</script> = number of units (not counting bias unit) in layer l</p> <p>c) K= number of output units/classes</p> <p>Recall that in neural networks, we may have many output nodes. We denote <script type="math/tex">h_\Theta(x)_k</script> as being a hypothesis that results in the <script type="math/tex">k^{th}</script> output.</p> <p>Our cost function for neural networks is going to be a generalization of the one we used for logistic regression.</p> <p>Recall that the cost function for regularized logistic regression was:</p> <script type="math/tex; mode=display">J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2</script> <p>For neural networks, it is going to be slightly more complicated:</p> <script type="math/tex; mode=display">\begin{gather*}\large J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}</script> <p>We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, between the square brackets, we have an additional nested summation that loops through the number of output nodes.</p> <p>In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). 
The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.</p> <p>Note:</p> <ul> <li>the double sum simply adds up the logistic regression costs calculated for each cell in the output layer; and</li> <li>the triple sum simply adds up the squares of all the individual Θs in the entire network.</li> <li>the i in the triple sum does not refer to training example i</li> </ul> <h2 id="backpropagation-algorithm">Backpropagation Algorithm</h2> <p>“Backpropagation” is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.</p> <p>Our goal is to compute:</p> <script type="math/tex; mode=display">\min_\Theta J(\Theta)</script> <p>That is, we want to minimize our cost function J using an optimal set of parameters in theta.</p> <p>In this section we’ll look at the equations we use to compute the partial derivative of J(Θ):</p> <script type="math/tex; mode=display">\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)</script> <p>In back propagation we’re going to compute for every node:</p> <p><script type="math/tex">\delta_j^{(l)}</script> = “error” of node j in layer l</p> <p>Recall that <script type="math/tex">a_j^{(l)}</script> is activation node j in layer l.</p> <p>For the last layer, we can compute the vector of delta values with:</p> <script type="math/tex; mode=display">\delta^{(L)} = a^{(L)} - y</script> <p>Where L is our total number of layers and <script type="math/tex">a^{(L)}</script> is the vector of outputs of the activation units for the last layer. 
So our “error values” for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y.</p> <p>To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:</p> <script type="math/tex; mode=display">\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ g'(z^{(l)})</script> <p>The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g’, or g-prime, which is the derivative of the activation function g evaluated with the input values given by z(l).</p> <p>The g-prime derivative terms can also be written out as:</p> <script type="math/tex; mode=display">g'(u) = g(u)\ .*\ (1 - g(u))</script> <p>The full back propagation equation for the inner nodes is then:</p> <script type="math/tex; mode=display">\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})</script> <p>A. Ng states that the derivation and proofs are complicated and involved, but you can still implement the above equations to do back propagation without knowing the details.</p> <p>We can compute our partial derivative terms by multiplying our activation values and our error values for each training example t:</p> <script type="math/tex; mode=display">\dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = \frac{1}{m}\sum_{t=1}^m a_j^{(t)(l)} {\delta}_i^{(t)(l+1)}</script> <p>This however ignores regularization, which we’ll deal with later.</p> <p>Note: <script type="math/tex">\delta^{l+1}</script> and <script type="math/tex">a^{l+1}</script> are vectors with <script type="math/tex">s_{l+1}</script> elements. Similarly, <script type="math/tex">\ a^{(l)}</script> is a vector with <script type="math/tex">s_l</script> elements. 
Multiplying them produces a matrix that is <script type="math/tex">s_{l+1}</script> by <script type="math/tex">s_l</script> which is the same dimension as <script type="math/tex">\Theta^{(l)}</script>. That is, the process produces a gradient term for every element in <script type="math/tex">\Theta^{(l)}</script>. (Actually, <script type="math/tex">\Theta^{(l)}</script> has <script type="math/tex">s_{l}</script> + 1 column, so the dimensionality is not exactly the same).</p> <p>We can now take all these equations and put them together into a backpropagation algorithm:</p> <h3 id="back-propagation-algorithm">Back propagation Algorithm</h3> <p>Given training set <script type="math/tex">\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace</script></p> <p>Set <script type="math/tex">\Delta^{(l)}_{i,j}</script> := 0 for all (l,i,j)</p> <p>For training example t =1 to m:</p> <ul> <li>Set <script type="math/tex">a^{(1)} := x^{(t)}</script></li> <li>Perform forward propagation to compute <script type="math/tex">a^{(l)}</script> for l=2,3,…,L</li> <li>Using <script type="math/tex">y^{(t)}</script>, compute <script type="math/tex">\delta^{(L)} = a^{(L)} - y^{(t)}</script></li> <li>Compute <script type="math/tex">\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}</script> using <script type="math/tex">\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})</script></li> <li><script type="math/tex">\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}</script> or with vectorization, <script type="math/tex">\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T</script></li> <li><script type="math/tex">D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)</script> If j≠0 NOTE: Typo in lecture slide omits outside parentheses. 
This version is correct.</li> <li><script type="math/tex">D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}</script> If j=0</li> </ul> <p>The capital-delta matrix is used as an “accumulator” to add up our values as we go along and eventually compute our partial derivative.</p> <p>The actual proof is quite involved, but the <script type="math/tex">D^{(l)}_{i,j}</script> terms are the partial derivatives and the results we are looking for:</p> <script type="math/tex; mode=display">D_{i,j}^{(l)} = \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.</script> <h2 id="backpropagation-intuition">Backpropagation Intuition</h2> <p>The cost function is:</p> <script type="math/tex; mode=display">\begin{gather*}J(\theta) = - \frac{1}{m} \sum_{t=1}^m\sum_{k=1}^K \left[ y^{(t)}_k \ \log (h_\theta (x^{(t)}))_k + (1 - y^{(t)}_k)\ \log (1 - h_\theta(x^{(t)})_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \theta_{j,i}^{(l)})^2\end{gather*}</script> <p>If we consider simple non-multiclass classification (K = 1) and disregard regularization, the cost is computed with:</p> <script type="math/tex; mode=display">cost(t) = y^{(t)} \ \log (h_\theta (x^{(t)})) + (1 - y^{(t)})\ \log (1 - h_\theta(x^{(t)}))</script> <p>More intuitively you can think of that equation roughly as:</p> <script type="math/tex; mode=display">cost(t) \approx (h_\theta(x^{(t)})-y^{(t)})^2</script> <p>Intuitively, <script type="math/tex">\delta_j^{(l)}</script> is the “error” for <script type="math/tex">a^{(l)}_j</script> (unit j in layer l).</p> <p>More formally, the delta values are actually the derivative of the cost function:</p> <script type="math/tex; mode=display">\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)</script> <p>Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are.</p> <p>Note: In lecture, sometimes i is used to index a training example. 
Sometimes it is used to index a unit in a layer. In the Back Propagation Algorithm described here, t is used to index a training example rather than overloading the use of i.</p> <h2 id="implementation-note-unrolling-parameters">Implementation Note: Unrolling Parameters</h2> <p>With neural networks, we are working with sets of matrices:</p> <script type="math/tex; mode=display">\begin{align*} \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots \newline D^{(1)}, D^{(2)}, D^{(3)}, \dots \end{align*}</script> <p>In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector:</p> <p>If Theta1 is 10x11, Theta2 is 10x11, and Theta3 is 1x11, then we can get back our original matrices from the “unrolled” versions as follows:</p> <p>NOTE: The lecture slides show an example neural network with 3 layers. However, 3 theta matrices are defined: Theta1, Theta2, Theta3. There should be only 2 theta matrices: Theta1 (10 x 11), Theta2 (1 x 11).</p> <h2 id="gradient-checking">Gradient Checking</h2> <p>Gradient checking will assure that our backpropagation works as intended.</p> <p>We can approximate the derivative of our cost function with:</p> <script type="math/tex; mode=display">\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}</script> <p>With multiple theta matrices, we can approximate the derivative with respect to <script type="math/tex">Θ_j</script> as follows:</p> <script type="math/tex; mode=display">\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}</script> <p>A suitably small value of <script type="math/tex">{\epsilon}</script> (epsilon) makes this approximation accurate; if the value is made much smaller, we may end up with numerical problems. 
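The two-sided approximation can be sketched in Python (a NumPy stand-in for the Octave loop the lecture shows; the quadratic `J` is a toy cost, not the network's):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Two-sided difference (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        bump = np.zeros_like(theta)
        bump[j] = eps
        grad[j] = (J(theta + bump) - J(theta - bump)) / (2 * eps)
    return grad

J = lambda t: np.sum(t ** 2)           # toy cost with known gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
approx = numerical_gradient(J, theta)  # ≈ [2, -4, 6]; compare to deltaVector
```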
Professor Andrew Ng usually uses the value <script type="math/tex">{\epsilon = 10^{-4}}</script>.</p> <p>We only add or subtract epsilon to one element of the <script type="math/tex">\Theta_j</script> matrix at a time; in Octave this is done with a loop over the elements.</p> <p>We then want to check that gradApprox ≈ deltaVector.</p> <p>Once you’ve verified that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox is very slow.</p> <h2 id="random-initialization">Random Initialization</h2> <p>Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly.</p> <p>Instead we can randomly initialize our weights:</p> <p>Initialize each <script type="math/tex">\Theta^{(l)}_{ij}</script> to a random value in <script type="math/tex">[-\epsilon,\epsilon]</script>:</p> <script type="math/tex; mode=display">\epsilon = \dfrac{\sqrt{6}}{\sqrt{\mathrm{Loutput} + \mathrm{Linput}}}</script> <script type="math/tex; mode=display">\Theta^{(l)} = 2 \epsilon \; \mathrm{rand}(\mathrm{Loutput}, \mathrm{Linput} + 1) - \epsilon</script> <p>rand(x,y) will initialize a matrix of random real numbers between 0 and 1. (Note: this epsilon is unrelated to the epsilon from Gradient Checking.)</p> <p>Why use this method? <a href="https://web.stanford.edu/class/ee373b/nninitialization.pdf">This paper</a> may be useful.</p> <h2 id="putting-it-together">Putting it Together</h2> <p>First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers total.</p> <ul> <li>Number of input units = dimension of features <script type="math/tex">x^{(i)}</script></li> <li>Number of output units = number of classes</li> <li>Number of hidden units per layer = usually, the more the better (but this must be balanced against the cost of computation, which increases with more hidden units)</li> <li>Defaults: 1 hidden layer. If you use more than 1 hidden layer, give every hidden layer the same number of units.</li> </ul> <h3 id="training-a-neural-network">Training a Neural Network</h3> <ol> <li>Randomly initialize the weights</li> <li>Implement forward propagation to get <script type="math/tex">h_\theta(x^{(i)})</script></li> <li>Implement the cost function</li> <li>Implement backpropagation to compute the partial derivatives</li> <li>Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.</li> <li>Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.</li> </ol> <p>When we perform forward and back propagation, we loop over every training example.</p> <h2 id="explanation-of-derivatives-used-in-backpropagation">Explanation of Derivatives Used in Backpropagation</h2> <p>We know that for a logistic regression classifier (which is what all of the output neurons in a neural network are), we use the cost function, <script type="math/tex">J(\theta) = -y\log(h_{\theta}(x)) - (1-y)\log(1-h_{\theta}(x))</script>, and apply this over the K output neurons, and for all m examples.
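<p>As a quick numeric sanity check of applying that cost over the m examples (single output, k = 1; the hypothesis values and labels below are made up):</p>

```python
import math

# Toy illustration of the (unregularized) logistic cost averaged over
# m examples. h holds h_theta(x^(i)) for each example; y holds the labels.
h = [0.9, 0.2, 0.8]
y = [1,   0,   1  ]

m = len(y)
J = -(1.0 / m) * sum(
    yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
    for hi, yi in zip(h, y)
)
# Confident, correct predictions give a small cost.
assert J < 0.25
```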
The equation to compute the partial derivatives of the theta terms in the output neurons:</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}</script> <p>And the equation to compute partial derivatives of the theta terms in the [last] hidden layer neurons (layer L-1):</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}</script> <p>Clearly they share some pieces in common, so a delta term (<script type="math/tex">δ^{(L)}</script>) can be used for the common pieces between the output layer and the hidden layer immediately before it (with the possibility that there could be many hidden layers if we wanted):</p> <script type="math/tex; mode=display">\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}</script> <p>And we can go ahead and use another delta term (<script type="math/tex">δ^{(L−1)}</script>) for the pieces that would be shared by the final hidden layer and a hidden layer before that, if we had one.
Regardless, this delta term will still serve to make the math and implementation more concise.</p> <script type="math/tex; mode=display">\delta^{(L-1)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}</script> <script type="math/tex; mode=display">\delta^{(L-1)} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}</script> <p>With these delta terms, our equations become:</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}</script> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}</script> <p>Now it’s time to evaluate these derivatives. Let’s start with the output layer:</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}</script> <p>Using <script type="math/tex">\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}</script>, we need to evaluate both partial derivatives.</p> <p>Given <script type="math/tex">J(\theta) = -y\log(a^{(L)}) - (1-y)\log(1-a^{(L)})</script>, where <script type="math/tex">a^{(L)} = h_{\theta}(x)</script>, the partial derivative is:</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial a^{(L)}} = \frac{1-y}{1-a^{(L)}} - \frac{y}{a^{(L)}}</script> <p>And given a=g(z), where <script type="math/tex">g = \frac{1}{1+e^{-z}}</script>, the partial derivative is:</p> <script type="math/tex; mode=display">\frac{\partial a^{(L)}}{\partial z^{(L)}} = a^{(L)}(1-a^{(L)})</script> <p>So, let’s substitute these in for <script type="math/tex">δ^{(L)}</script>:</p> <script type="math/tex;
\delta^{(L)}">
mode=display">\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}</script> <script type="math/tex; mode=display">\delta^{(L)} = (\frac{1-y}{1-a^{(L)}} - \frac{y}{a^{(L)}}) (a^{(L)}(1-a^{(L)}))</script> <script type="math/tex; mode=display">\delta^{(L)} = a^{(L)} - y</script> <p>So, for a 3-layer network (L=3),</p> <script type="math/tex; mode=display">\delta^{(3)} = a^{(3)} - y</script> <p>Note that this is the correct equation, as given in our notes. Now, given z=θ∗input, and in layer L the input is <script type="math/tex">a^{(L−1)}</script>, the partial derivative is:</p> <script type="math/tex; mode=display">\frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} = a^{(L-1)}</script> <p>Put it together for the output layer:</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}</script> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = (a^{(L)} - y) (a^{(L-1)})</script> <p>Let’s continue on for the hidden layer (let’s assume we only have 1 hidden layer):</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}</script> <p>Let’s figure out <script type="math/tex">\delta^{(L-1)}</script>.
Once again, given z=θ∗input, the partial derivative is:</p> <script type="math/tex; mode=display">\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \theta^{(L-1)}</script> <p>And: <script type="math/tex">\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} = a^{(L-1)}(1-a^{(L-1)})</script></p> <p>So, let’s substitute these in for <script type="math/tex">δ^{(L−1)}</script>:</p> <script type="math/tex; mode=display">\delta^{(L-1)} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}</script> <script type="math/tex; mode=display">\delta^{(L-1)} = \delta^{(L)} (\theta^{(L-1)}) (a^{(L-1)}(1-a^{(L-1)}))</script> <script type="math/tex; mode=display">\delta^{(L-1)} = \delta^{(L)} \theta^{(L-1)} a^{(L-1)}(1-a^{(L-1)})</script> <p>So, for a 3-layer network,</p> <script type="math/tex; mode=display">\delta^{(2)} = \delta^{(3)} \theta^{(2)} a^{(2)}(1-a^{(2)})</script> <p>Put it together for the [last] hidden layer:</p> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}</script> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = (\delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}) (a^{(L-2)})</script> <script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = ((a^{(L)} - y) (\theta^{(L-1)})(a^{(L-1)}(1-a^{(L-1)}))) (a^{(L-2)})</script> <h2 id="deriving-the-sigmoid-gradient-function">Deriving the Sigmoid Gradient Function</h2> <p>We let the sigmoid function be <script type="math/tex">\sigma(x) = \frac{1}{1 + e^{-x}}</script></p> <p>Differentiating the equation above yields <script type="math/tex">-(\frac{1}{1 + e^{-x}})^2 \frac{d}{dx} (1 + e^{-x})</script></p> <p>Which is equal to <script type="math/tex">-(\frac{1}{1 + e^{-x}})^2 e^{-x} (-1)</script></p> <script type="math/tex; mode=display">(\frac{1}{1 + 
e^{-x}}) (\frac{1}{1 + e^{-x}}) (e^{-x})</script> <script type="math/tex; mode=display">(\frac{1}{1 + e^{-x}}) (\frac{e^{-x}}{1 + e^{-x}})</script> <script type="math/tex; mode=display">\sigma(x)(1 - \sigma(x))</script> <h3 id="additional-resources-for-backpropagation">Additional Resources for Backpropagation</h3> <ul> <li>Very thorough conceptual <a href="https://web.archive.org/web/20150317210621/https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf">example</a></li> <li><a href="http://pandamatak.com/people/anand/771/html/node37.html">Short derivation of the backpropagation algorithm</a></li> <li><a href="http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm">Stanford University Deep Learning notes</a></li> <li><a href="http://neuralnetworksanddeeplearning.com/chap2.html">Very thorough explanation and proof</a></li> </ul> <h1 id="reference">Reference</h1> <ul> <li><a href="https://www.coursera.org/learn/machine-learning">Stanford Machine Learning by Andrew Ng</a></li> </ul>Yanqing WuChap 4 - Neural Networks: Representation Non-linear Hypotheses Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted to create a hypothesis from three (3) features that included all the quadratic terms:Study Note of Machine Learning (II)2018-03-12T00:34:00+00:002018-03-12T00:34:00+00:00http://www.pwyqspace.com/study/2018/03/12/stanford-ml-note-two<h1 id="chap-3---logistic-regression">Chap 3 - Logistic Regression</h1> <p>Now we are switching from regression problems to classification problems.
Don’t be confused by the name “Logistic Regression”; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.</p> <h2 id="binary-classification">Binary Classification</h2> <p>Instead of our output vector y being a continuous range of values, it will only be 0 or 1.</p> <p>y∈{0,1}</p> <p>Here, 0 is usually taken as the “negative class” and 1 as the “positive class”, but you are free to assign any representation to them.</p> <p>We’re only doing two classes for now, called a “Binary Classification Problem.”</p> <p>One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. This method doesn’t work well because classification is not actually a linear function.</p> <h3 id="hypothesis-representation">Hypothesis Representation</h3> <p>Our hypothesis should satisfy: <script type="math/tex">0 \leq h_\theta (x) \leq 1</script></p> <p>Our new form uses the “Sigmoid Function,” also called the “Logistic Function”:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*} %]]></script> <p>The function g(z) maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Try playing with the interactive plot of the <a href="https://www.desmos.com/calculator/bgontvxotm">sigmoid function</a>.</p> <p>We start with our old hypothesis (linear regression), except that we want to restrict the range to 0 and 1. This is accomplished by plugging <script type="math/tex">\theta^Tx</script> into the Logistic Function.</p> <p><script type="math/tex">h_\theta</script> will give us the probability that our output is 1.
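<p>A minimal Python sketch of the sigmoid hypothesis above (the parameter and feature values are made up):</p>

```python
import math

def g(z):
    """Sigmoid / logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return g(z)

# Made-up parameters and features, for illustration only.
theta = [-1.0, 0.5, 2.0]
x = [1.0, 2.0, 0.5]   # x[0] = 1 is the usual bias feature

p = h(theta, x)
assert 0.0 < p < 1.0   # always a valid probability
assert g(0) == 0.5     # z = 0 maps to exactly 0.5
```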
For example, <script type="math/tex">h_\theta(x)=0.7</script> gives us a 70% probability that our output is 1.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \newline& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{align*} %]]></script> <p>Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).</p> <h2 id="decision-boundary">Decision Boundary</h2> <p>In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*} %]]></script> <p>The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& g(z) \geq 0.5 \newline& when \; z \geq 0\end{align*} %]]></script> <p>Remember:</p> <script type="math/tex; mode=display">\begin{align*}z=0, e^{0}=1 \Rightarrow g(z)=1/2\newline z \to \infty, e^{-\infty} \to 0 \Rightarrow g(z)=1 \newline z \to -\infty, e^{\infty}\to \infty \Rightarrow g(z)=0 \end{align*}</script> <p>So if our input to g is <script type="math/tex">\theta^T X</script>, then that means:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& h_\theta(x) = g(\theta^T x) \geq 0.5 \newline& when \; \theta^T x \geq 0\end{align*} %]]></script> <p>From these statements we can now say:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*} %]]></script> <p>The decision boundary is the line that separates the area where y = 0 and where y = 1.
It is created by our hypothesis function.</p> <p>Example:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& \theta = \begin{bmatrix}5 \newline -1 \newline 0\end{bmatrix} \newline & y = 1 \; if \; 5 + (-1) x_1 + 0 x_2 \geq 0 \newline & 5 - x_1 \geq 0 \newline & - x_1 \geq -5 \newline& x_1 \leq 5 \newline \end{align*} %]]></script> <p>In this case, our decision boundary is a straight vertical line placed on the graph where <script type="math/tex">x_1 = 5</script>, and everything to the left of that denotes <script type="math/tex">y = 1</script>, while everything to the right denotes <script type="math/tex">y = 0</script>.</p> <p>Again, the input to the sigmoid function g(z) (e.g. <script type="math/tex">\theta^T X</script>) doesn’t need to be linear, and could be a function that describes a circle (e.g. <script type="math/tex">z = \theta_0 + \theta_1 x_1^2 +\theta_2 x_2^2</script>) or any shape to fit our data.</p> <h2 id="cost-function">Cost Function</h2> <p>We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.</p> <p>Instead, our cost function for logistic regression looks like:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*} %]]></script> <p>The more our hypothesis is off from y, the larger the cost function output. 
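<p>A quick numeric sketch of those two cost cases in plain Python (the hypothesis values are made up):</p>

```python
import math

def cost(h, y):
    # -log(h) when y = 1, and -log(1 - h) when y = 0
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# When y = 1: a confident, correct hypothesis is cheap,
# and the cost blows up as h_theta(x) approaches 0.
assert cost(0.99, 1) < cost(0.5, 1) < cost(0.01, 1)
# Symmetrically for y = 0.
assert cost(0.01, 0) < cost(0.5, 0) < cost(0.99, 0)
```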
If our hypothesis is equal to y, then our cost is 0:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*} %]]></script> <p>If our correct answer ‘y’ is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.</p> <p>If our correct answer ‘y’ is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.</p> <p>Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.</p> <h2 id="simplified-cost-function-and-gradient-descent">Simplified Cost Function and Gradient Descent</h2> <p>We can compress our cost function’s two conditional cases into one case:</p> <script type="math/tex; mode=display">\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))</script> <p>Notice that when y is equal to 1, then the second term <script type="math/tex">(1-y)\log(1-h_\theta(x))</script> will be zero and will not affect the result. 
If y is equal to 0, then the first term <script type="math/tex">-y \log(h_\theta(x))</script> will be zero and will not affect the result.</p> <p>We can fully write out our entire cost function as follows:</p> <script type="math/tex; mode=display">J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]</script> <p>A vectorized implementation is:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} & h = g(X\theta)\newline & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*} %]]></script> <h3 id="gradient-descent">Gradient Descent</h3> <p>Remember that the general form of gradient descent is:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& Repeat \; \lbrace \newline & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \newline & \rbrace\end{align*} %]]></script> <p>We can work out the derivative part using calculus to get:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} & Repeat \; \lbrace \newline & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \newline & \rbrace \end{align*} %]]></script> <p>Notice that this algorithm is identical to the one we used in linear regression. 
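<p>One step of the update rule above can be sketched in plain Python on a tiny made-up dataset; note that every gradient is computed before any theta value is touched:</p>

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]   # first column is the bias feature
y = [1, 0, 1]
theta = [0.0, 0.0]
alpha = 0.1
m = len(y)

# Compute all gradients first, then update every theta_j at once
# (the "simultaneous update" the notes insist on).
grad = [0.0, 0.0]
for i in range(m):
    h_i = g(sum(t * x for t, x in zip(theta, X[i])))
    for j in range(len(theta)):
        grad[j] += (h_i - y[i]) * X[i][j] / m
theta = [t - alpha * gj for t, gj in zip(theta, grad)]
```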
We still have to simultaneously update all values in theta.</p> <p>A vectorized implementation is:</p> <script type="math/tex; mode=display">\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})</script> <h3 id="partial-derivative-of-jθ">Partial derivative of J(θ)</h3> <p>First, calculate the derivative of the sigmoid function (it will be useful when finding the partial derivative of J(θ)):</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}\sigma(x)'&=\left(\frac{1}{1+e^{-x}}\right)'=\frac{-(1+e^{-x})'}{(1+e^{-x})^2}=\frac{-1'-(e^{-x})'}{(1+e^{-x})^2}=\frac{0-(-x)'(e^{-x})}{(1+e^{-x})^2}=\frac{-(-1)(e^{-x})}{(1+e^{-x})^2}=\frac{e^{-x}}{(1+e^{-x})^2} \newline &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right)=\sigma(x)\left(\frac{+1-1 + e^{-x}}{1+e^{-x}}\right)=\sigma(x)\left(\frac{1 + e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)=\sigma(x)(1 - \sigma(x))\end{align*} %]]></script> <p>Now we are ready to find the resulting partial derivative:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} \frac{\partial}{\partial \theta_j} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} \log (1 - h_\theta(x^{(i)}))\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - \sigma(\theta^T x^{(i)}))}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [
\frac{y^{(i)} \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} + \frac{- (1-y^{(i)}) \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1-y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) x^{(i)}_j - (1-y^{(i)}) h_\theta(x^{(i)}) x^{(i)}_j\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) - (1-y^{(i)}) h_\theta(x^{(i)}) \right ] x^{(i)}_j \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right ] x^{(i)}_j \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j \newline&= \frac{1}{m}\sum_{i=1}^m \left [ h_\theta(x^{(i)}) - y^{(i)} \right ] x^{(i)}_j\end{align*} %]]></script> <p>The vectorized version:</p> <script type="math/tex; mode=display">\nabla J(\theta) = \frac{1}{m} \cdot X^T \cdot \left(g\left(X\cdot\theta\right) - \vec{y}\right)</script> <h2 id="advanced-optimization">Advanced Optimization</h2> <p>“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. Andrew Ng suggests not to write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. 
Octave provides them.</p> <h2 id="multiclass-classification-one-vs-all">Multiclass Classification: One-vs-all</h2> <p>Now we will approach the classification of data into more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1…n}.</p> <p>In this case we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that ‘y’ is a member of one of our classes.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& y \in \lbrace0, 1 ... n\rbrace \newline& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline& \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\newline\end{align*} %]]></script> <p>We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.</p> <h1 id="chap-35---regularization">Chap. 3.5 - Regularization</h1> <h3 id="the-problem-of-overfitting">The Problem of Overfitting</h3> <p>Regularization is designed to address the problem of overfitting.</p> <p>High bias or underfitting is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. E.g., if we take <script type="math/tex">h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2</script> then we are making an initial assumption that a linear model will fit the training data well and will be able to generalize, but that may not be the case.</p> <p>At the other extreme, overfitting or high variance is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.
It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.</p> <p>This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:</p> <ol> <li>Reduce the number of features: <ul> <li>Manually select which features to keep.</li> <li>Use a model selection algorithm (studied later in the course).</li> </ul> </li> <li>Regularization: keep all the features, but reduce the magnitude of the parameters <script type="math/tex">\theta_j</script>.</li> </ol> <p>Regularization works well when we have a lot of slightly useful features.</p> <h2 id="cost-function-1">Cost Function</h2> <p>If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.</p> <p>Say we wanted to make the following function more quadratic:</p> <script type="math/tex; mode=display">\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4</script> <p>We’ll want to eliminate the influence of <script type="math/tex">\theta_3x^3</script> and <script type="math/tex">\theta_4x^4</script>. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:</p> <script type="math/tex; mode=display">min_\theta\ \dfrac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2</script> <p>We’ve added two extra terms at the end to inflate the cost of <script type="math/tex">\theta_3</script> and <script type="math/tex">\theta_4</script>. Now, in order for the cost function to get close to zero, we will have to reduce the values of <script type="math/tex">\theta_3</script> and <script type="math/tex">\theta_4</script> to near zero.
This will in turn greatly reduce the values of <script type="math/tex">\theta_3x^3</script> and <script type="math/tex">\theta_4x^4</script> in our hypothesis function.</p> <p>We could also regularize all of our theta parameters in a single summation:</p> <script type="math/tex; mode=display">min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]</script> <p>The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated. You can visualize the effect of regularization in this interactive plot: <a href="https://www.desmos.com/calculator/1hexc8ntqp">https://www.desmos.com/calculator/1hexc8ntqp</a></p> <p>Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.</p> <h2 id="regularized-linear-regression">Regularized Linear Regression</h2> <p>We can apply regularization to both linear regression and logistic regression.
We will approach linear regression first.</p> <h3 id="gradient-descent-1">Gradient Descent</h3> <p>We will modify our gradient descent function to separate out <script type="math/tex">\theta_0</script> from the rest of the parameters because we do not want to penalize <script type="math/tex">\theta_0</script>.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*} %]]></script> <p>The term <script type="math/tex">\frac{\lambda}{m}\theta_j</script> performs our regularization.</p> <p>With some manipulation our update rule can also be represented as:</p> <script type="math/tex; mode=display">\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}</script> <p>The first term in the above equation, <script type="math/tex">1 - \alpha\frac{\lambda}{m}</script> will always be less than 1. 
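<p>A quick numeric look at that first term, the shrinkage factor (the values of alpha, lambda, and m below are made up):</p>

```python
# Shrinkage factor (1 - alpha * lambda / m) from the regularized update rule.
alpha, lam, m = 0.1, 1.0, 100
shrink = 1 - alpha * lam / m
assert 0 < shrink < 1   # always slightly less than 1

theta_j = 5.0
theta_j_after = theta_j * shrink   # ignoring the data term, for illustration
assert theta_j_after < theta_j     # each update nudges theta_j toward zero
```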
Intuitively, you can see it as reducing the value of <script type="math/tex">\theta_j</script> by some amount on every update.</p> <p>Notice that the second term is now exactly the same as it was before.</p> <h3 id="normal-equation">Normal Equation</h3> <p>Now let’s approach regularization using the alternate method of the non-iterative normal equation.</p> <p>To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*} %]]></script> <p>L is a matrix with 0 at the top left and 1’s down the diagonal, with 0’s everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including <script type="math/tex">x_0</script>), multiplied by a single real number λ.</p> <p>Recall that if <script type="math/tex">m \leq n</script>, then <script type="math/tex">X^TX</script> is non-invertible. However, when we add the term <script type="math/tex">\lambda \cdot L</script>, then <script type="math/tex">X^TX + \lambda \cdot L</script> becomes invertible.</p> <h2 id="regularized-logistic-regression">Regularized Logistic Regression</h2> <p>We can regularize logistic regression in a similar way that we regularize linear regression.
Let’s start with the cost function.</p> <h3 id="cost-function-2">Cost Function</h3> <p>Recall that our cost function for logistic regression was:</p> <script type="math/tex; mode=display">J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)})) \large]</script> <p>We can regularize this equation by adding a term to the end:</p> <script type="math/tex; mode=display">J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2</script> <p>Note Well: The second sum, <script type="math/tex">\sum_{j=1}^n \theta_j^2</script> means to explicitly exclude the bias term, <script type="math/tex">\theta_0</script>. I.e., the θ vector is indexed from 0 to n (holding n+1 values, <script type="math/tex">\theta_0</script> through <script type="math/tex">\theta_n</script>), and this sum explicitly skips <script type="math/tex">\theta_0</script>, by running from 1 to n, skipping 0.</p> <h3 id="gradient-descent-2">Gradient Descent</h3> <p>Just like with linear regression, we will want to separately update <script type="math/tex">\theta_0</script> and the rest of the parameters because we do not want to regularize <script type="math/tex">\theta_0</script>.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}& \text{Repeat}\ \lbrace \newline& \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline& \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline& \rbrace\end{align*} %]]></script> <p>This is identical to the gradient descent function presented for linear regression.</p> <h2 id="initial-ones-feature-vector">Initial Ones Feature Vector</h2> <h3 id="constant-feature">Constant Feature</h3> <p>As it
turns out, it is crucial to add a constant feature to your pool of features before starting any training of your machine. Normally that feature is just a set of ones for all your training examples.</p> <p>Concretely, if X is your feature matrix then <script type="math/tex">X_0</script> is a vector of ones.</p> <p>Below are some insights to explain the reason for this constant feature. The first part draws some analogies from electrical engineering concepts; the second looks at understanding the ones vector by using a simple machine learning example.</p> <h3 id="electrical-engineering">Electrical Engineering</h3> <p>From electrical engineering, in particular signal processing, this can be explained as DC and AC.</p> <p>The initial feature vector X without the constant term captures the dynamics of your model. Those features record changes in your output y - in other words, changing some feature <script type="math/tex">X_i</script> where <script type="math/tex">i\not= 0</script> will produce a change in the output y. AC is normally made up of many components or harmonics; hence we also have many features (yet only one DC term).</p> <p>The constant feature represents the DC component. In control engineering this can also be seen as the steady state.</p> <p>Interestingly, removing the DC term is easily done by differentiating your signal - or simply taking a difference between consecutive points of a discrete signal (note that at this point the analogy is implying time-based signals, so it also makes sense for machine learning applications with a time basis - e.g. forecasting stock exchange trends).</p> <p>Another interesting note: if you were to play an AC+DC signal as well as an AC-only signal where both AC components are the same, they would sound exactly the same.
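The claim that differencing removes the DC component can be checked numerically; below is a pure-Python sketch on a made-up sine signal (illustrative only, not from the original notes):

```python
import math

# Sample a 5 Hz sine wave (the "AC" part) at 1 kHz, with and without a DC offset.
n, fs, dc = 1000, 1000.0, 3.0
ac = [math.sin(2 * math.pi * 5 * t / fs) for t in range(n)]
ac_dc = [s + dc for s in ac]

def diff(x):
    """First difference of a discrete signal: x[k+1] - x[k]."""
    return [b - a for a, b in zip(x, x[1:])]

# The constant offset cancels out: diff(AC + DC) matches diff(AC).
assert max(abs(u - v) for u, v in zip(diff(ac_dc), diff(ac))) < 1e-12
```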
That is because we only hear changes in signals and Δ(AC+DC)=Δ(AC).</p> <h3 id="housing-price-example">Housing price example</h3> <p>Suppose you design a machine which predicts the price of a house based on some features. In this case, what does the ones vector help with?</p> <p>Let’s assume a simple model whose features are directly proportional to the expected price, i.e. if a feature <script type="math/tex">X_i</script> increases then the expected price y will also increase. As an example we could have two features: the size of the house in [m2], and the number of rooms.</p> <p>When you train your machine you will start by prepending a ones vector <script type="math/tex">X_0</script>. You may then find after training that the weight for your initial feature of ones is some value <script type="math/tex">\theta_0</script>. As it turns out, when applying your hypothesis function <script type="math/tex">h_{\theta}(X)</script>, in the case of the initial feature you will just be multiplying by a constant (most probably <script type="math/tex">\theta_0</script> if you are not applying any other functions such as sigmoids). This constant is the DC term: a constant that doesn’t change.</p> <p>But what does it mean for this example? Well, let’s suppose that someone knows that you have a working model for housing prices. It turns out that for this example, if they ask you how much money they can expect if they sell the house, you can say that they need at least θ0 dollars (or rands) before you even use your learning machine. As with the above analogy, your constant θ0 is somewhat of a steady state where all your inputs are zeros. Concretely, this is the price of a house with no rooms which takes up no space.</p> <p>However, this explanation has some holes: if you have some features which decrease the price, e.g. age, then the DC term may not be an absolute minimum of the price.
This is because the age may make the price go even lower.</p> <p>Theoretically, if you were to train a machine without a ones vector, <script type="math/tex">f_{AC}(X)</script>, its output may not match the output of a machine which had a ones vector, <script type="math/tex">f_{DC}(X)</script>. However, <script type="math/tex">f_{AC}(X)</script> may have exactly the same trend as <script type="math/tex">f_{DC}(X)</script>, i.e. if you were to plot both machines’ outputs you would find that they look exactly the same except that one output seems to have been shifted by a constant. With reference to the housing price problem: suppose you make predictions on two houses <script type="math/tex">house_A</script> and <script type="math/tex">house_B</script> using both machines. It turns out that while the outputs from the two machines would differ, the difference between the predictions for <script type="math/tex">house_A</script> and <script type="math/tex">house_B</script> according to both machines could be exactly the same. Realistically, that means a machine trained without the ones vector, <script type="math/tex">f_{AC}</script>, could actually be very useful if you have just one benchmark point. This is because you can find out the missing constant by simply taking the difference between the machine’s prediction and an actual price - then when making predictions you simply add that constant to whatever output you get. That is: if <script type="math/tex">house_{benchmark}</script> is your benchmark, then the DC component is simply <script type="math/tex">price(house_{benchmark}) - f_{AC}(features(house_{benchmark}))</script>. A simpler and cruder way of putting it is that the DC component of your model represents the inherent bias of the model. The other features then create tension in order to move away from that bias position.</p> <p>Remark: this example was provided by Kholofelo Moyaba.</p> <h3 id="a-simpler-approach">A simpler approach</h3> <p>A “bias” feature is simply a way to move the “best fit” learned vector to better fit the data.
For example, consider a learning problem with a single feature <script type="math/tex">X_1</script>. The formula without the <script type="math/tex">X_0</script> feature is just <script type="math/tex">\theta_1 \cdot X_1 = y</script>. This is graphed as a line that always passes through the origin, with slope <script type="math/tex">\theta_1</script>. The <script type="math/tex">X_0</script> term allows the line to cross the y axis at a different point. This will almost always give a better fit; not all best-fit lines go through the origin (0,0), right?</p> <p>Remark: this was provided by Joe Cotton.</p> <h1 id="reference">Reference</h1> <ul> <li><a href="https://www.coursera.org/learn/machine-learning">Stanford Machine Learning by Andrew Ng</a></li> </ul>Yanqing WuChap 3 - Logistic Regression Now we are switching from regression problems to classification problems. Don’t be confused by the name “Logistic Regression”; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.Study Note of Machine Learning (I)2018-03-11T00:00:00+00:002018-03-11T00:00:00+00:00http://www.pwyqspace.com/study/2018/03/11/stanford-ml-note-one<h1 id="chap-1---introduction">Chap 1 - Introduction</h1> <h2 id="what-is-machine-learning">What is Machine Learning?</h2> <p>Definition by Arthur Samuel (1959): “Field of study that gives computers the ability to learn without being explicitly programmed.”<br /> Definition by Tom Mitchell (1998): “A computer program is said to <em>learn</em> from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”</p> <p>In general, any ML problem can be assigned to one of two broad classifications: supervised learning <em>or</em> unsupervised learning.</p> <h2 id="supervised-learning">Supervised Learning</h2> <ul> <li>Supervised learning is the ML task of learning a way to map input signals to output values by training a model on a set of
training examples where each example is a pair consisting of an input and a <strong>desired</strong> output value. <ul> <li>Alternatively, <em>we know the relationship between the input and output.</em></li> </ul> </li> <li><strong>Regression Problem</strong>: try to predict results within a <em>continuous</em> output (i.e., map input variables to some continuous function) <ul> <li>E.g. House Price Prediction</li> </ul> </li> <li><strong>Classification Problem</strong>: try to predict results in a <em>discrete</em> output (i.e., map input variables into discrete categories) <ul> <li>E.g. Cancer Categorization (malignant/benign tumor), Email spam/not spam.</li> </ul> </li> </ul> <h2 id="unsupervised-learning">Unsupervised Learning</h2> <ul> <li>Unsupervised learning is the ML task of inferring a function to describe hidden structure from “unlabeled” data. Examples of unsupervised learning tasks include clustering (where we try to discover underlying groupings of examples) and anomaly detection (where we try to infer whether some examples in a dataset do not conform to some expected pattern). <ul> <li>Alternatively, <em>we have little or no idea what our results should look like.</em></li> </ul> </li> <li>There is no feedback based on the prediction results with unsupervised learning. <ul> <li>E.g.: Organize computing clusters, Social network analysis, Market segmentation, Astronomical data analysis</li> </ul> </li> <li><strong>Clustering Problem Example</strong>: Automatically group a huge number of genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.</li> <li><strong>Non-clustering Problem Example</strong>: The “Cocktail Party Algorithm” allows you to find structure in a chaotic environment.
<ul> <li><a href="https://en.wikipedia.org/wiki/Source_separation">Source separation</a>: Cocktail party problem <strong><code class="highlighter-rouge">[W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');</code></strong></li> <li><a href="https://www.gnu.org/software/octave/">GNU Octave</a>, a great programming language for algorithm prototyping</li> </ul> </li> </ul> <h2 id="model-and-cost-function">Model and Cost Function</h2> <h3 id="model-representation">Model Representation</h3> <p>In <em>regression</em> problems, we take input variables and try to fit the output onto a continuous expected result function.<br /> Linear regression with one variable is also known as “univariate linear regression.”<br /> Univariate linear regression is used to predict a <strong>single output</strong> value y from a <strong>single input</strong> value x.</p> <h4 id="notation">Notation</h4> <ul> <li><script type="math/tex">m</script> = number of training examples</li> <li><script type="math/tex">x's</script> = “input” variable / features</li> <li><script type="math/tex">y's</script> = “output” variable / “target” variable</li> <li><script type="math/tex">(x, y)</script> = one training example</li> <li><script type="math/tex">(x^{(i)}, y^{(i)}) = i^{th}</script> training example</li> <li><script type="math/tex">\theta_i = i^{th}</script> parameter of model</li> </ul> <p>Hypothesis function: <script type="math/tex">\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x</script></p> <p>Note that this is like the equation of a straight line. We give <script type="math/tex">h_\theta(x)</script> values for <script type="math/tex">\theta_0</script> and <script type="math/tex">\theta_1</script> to get our estimated output <script type="math/tex">\hat{y}</script>.
In other words, we are trying to create a function called <script type="math/tex">h_\theta</script> that maps our input data (the <script type="math/tex">x's</script>) to the output data (the <script type="math/tex">y's</script>).</p> <!-- TODO: Insert example pic here --> <h2 id="parameter-learning">Parameter Learning</h2> <h3 id="cost-function">Cost Function</h3> <p>The accuracy of the hypothesis function can be measured by a <strong>cost function</strong>. This takes an “average” of all the results of the hypothesis with inputs from <script type="math/tex">x's</script> compared to the actual outputs <script type="math/tex">y's</script>.</p> <p><strong>Cost Function</strong>: <script type="math/tex">J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2</script></p> <p>This function is also known as the “squared error function”; see <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a>.</p> <h2 id="gradient-descent">Gradient Descent</h2> <p>Now that we have a hypothesis function and a cost function to measure it, we use <strong>Gradient Descent</strong> to estimate the parameters of the hypothesis function.</p> <p>Imagine that we graph our hypothesis function based on its fields <script type="math/tex">\theta_0</script> and <script type="math/tex">\theta_1</script> (actually we are graphing the cost function as a function of the parameter estimates). This can be kind of confusing; we are moving up to a higher level of abstraction.
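Concretely, the surface being described can be probed by evaluating J over a grid of (θ0, θ1) pairs; here is a pure-Python sketch with made-up data (Python rather than the course's Octave, and the numbers are invented for illustration):

```python
# Toy data generated from y = 2 + 0.5x, so the cost surface should
# bottom out near (theta0, theta1) = (2, 0.5).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 + 0.5 * x for x in xs]

def J(t0, t1):
    """Squared-error cost: J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Evaluate J over a coarse grid of parameter values and locate the minimum.
grid = [(t0 / 10, t1 / 10) for t0 in range(0, 41) for t1 in range(0, 21)]
best = min(grid, key=lambda p: J(*p))
assert best == (2.0, 0.5)   # the grid minimum sits at the generating parameters
```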
We are not graphing x and y themselves, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.</p> <p>We put <script type="math/tex">\theta_0</script> on the x axis and <script type="math/tex">\theta_1</script> on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters.</p> <p>We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.</p> <p>The way we do this is by taking the derivative of our cost function. The slope of the tangent line is the derivative at that point, and it gives us a direction to move in. We take steps down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter <script type="math/tex">\alpha</script>, which is called the learning rate.</p> <p>The gradient descent algorithm is:<br /> <script type="math/tex">% <![CDATA[ \begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \newline \rbrace& \end{align*} %]]></script></p> <p><img src="/assets/images/posts/Stanford-ML/week1-gradient-descent-3d.png" alt="alt_text" title="3d-gradient-descent" /></p> <h3 id="gradient-descent-for-linear-regression">Gradient Descent for Linear Regression</h3> <p>Algorithm:<br /> <script type="math/tex">% <![CDATA[ \begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*} %]]></script></p> <h4 id="facts">Facts</h4> <ul> <li><strong>“Batch” Gradient
Descent</strong>: Each step of gradient descent uses all the training examples.</li> <li>Gradient descent can converge even if <script type="math/tex">\alpha</script> is kept fixed. (But <script type="math/tex">\alpha</script> cannot be too large, or else it may fail to converge).</li> <li>For the specific choice of cost function <em>J</em> used in linear regression, there are no local optima (other than the global optimum).</li> </ul> <h1 id="chap-2---linear-regression-with-multiple-variables">Chap 2 - Linear Regression with Multiple Variables</h1> <p>Linear regression with multiple variables is also known as “multivariate linear regression”.</p> <p>We now introduce notation for equations where we can have any number of input variables.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the column vector of all the feature inputs of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \left| x^{(i)} \right| ; \text{(the number of features)} \end{align*} %]]></script> <p>Now define the multivariable form of the hypothesis function as follows, accommodating these multiple features:<br /> <script type="math/tex">h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n</script></p> <p>In order to develop intuition about this function, we can think about <script type="math/tex">\theta_0</script> as the basic price of a house, <script type="math/tex">\theta_1</script> as the price per square meter, <script type="math/tex">\theta_2</script> as the price per floor, etc.
<script type="math/tex">x_1</script> will be the number of square meters in the house, <script type="math/tex">x_2</script> the number of floors, etc.</p> <p>Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:<br /> <script type="math/tex">\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}</script></p> <p>Remark: Note that for convenience reasons in this course Mr. Ng assumes <script type="math/tex">x_{0}^{(i)} =1 \text{ for } (i\in { 1,\dots, m } )</script></p> <p>The training examples are stored in X row-wise, like so:<br /> <script type="math/tex">% <![CDATA[ \begin{align*}X = \begin{bmatrix}x^{(1)}_0 & x^{(1)}_1 \newline x^{(2)}_0 & x^{(2)}_1 \newline x^{(3)}_0 & x^{(3)}_1 \end{bmatrix}&,\theta = \begin{bmatrix}\theta_0 \newline \theta_1 \newline\end{bmatrix}\end{align*} %]]></script></p> <p>You can calculate the hypothesis as a column vector of size (m x 1) with: <script type="math/tex">h_\theta(X) = X \theta</script></p> <p><strong>For the rest of these notes, <script type="math/tex">X</script> will represent a matrix of training examples <script type="math/tex">x^{(i)}</script> stored row-wise.</strong></p> <h3 id="cost-function-1">Cost Function</h3> <p>For the parameter <script type="math/tex">\theta</script> (of type <script type="math/tex">\mathbb{R}^{(n+1)}</script> or <script type="math/tex">\mathbb{R}^{(n+1) \times 1}</script>),<br /> the cost function is: <script type="math/tex">J(\theta) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2</script></p> <p>The vectorized version: <script type="math/tex">J(\theta) = \dfrac {1}{2m} (X\theta - \vec{y})^{T} (X\theta - \vec{y})</script></p> <h3 id="gradient-descent-for-multiple-variables">Gradient Descent for Multiple Variables</h3> <p>The
gradient descent equation itself is generally the same form; we just have to repeat it for our ‘n’ features:<br /> <script type="math/tex">% <![CDATA[ \begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*} %]]></script></p> <p>In other words:<br /> <script type="math/tex">% <![CDATA[ \begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0..n}\newline \rbrace\end{align*} %]]></script></p> <h3 id="matrix-notation">Matrix Notation</h3> <p>The Gradient Descent rule can be expressed as:<br /> <script type="math/tex">\theta := \theta - \alpha \nabla J(\theta)</script></p> <p>where <script type="math/tex">\nabla J(\theta)</script> is a vector of the form:<br /> <script type="math/tex">\nabla J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_0} \newline \frac{\partial J(\theta)}{\partial \theta_1} \newline \vdots \newline \frac{\partial J(\theta)}{\partial \theta_n} \end{bmatrix}</script></p> <p>The j-th component of the gradient is the summation of the product of two terms:<br /> <script type="math/tex">% <![CDATA[ \begin{align*} \; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac{1}{m} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)} \newline \; & &=& \frac{1}{m} \sum\limits_{i=1}^{m} x_j^{(i)} \cdot \left(h_\theta(x^{(i)}) - y^{(i)} \right) \end{align*} %]]></script></p> <p>Sometimes, the summation of the product of two terms can be expressed as the 
product of two vectors.</p> <p>Here, <script type="math/tex">x_j^{(i)}</script> for i = 1, 2, … m, represents the m elements of the j-th column, <script type="math/tex">\vec{x_j}</script>, of the training set X.<br /> The other term <script type="math/tex">\left(h_\theta(x^{(i)}) - y^{(i)} \right)</script> is the vector of deviations between the predictions <script type="math/tex">h_\theta(x^{(i)})</script> and the true values <script type="math/tex">y^{(i)}</script>. Rewriting <script type="math/tex">\frac{\partial J(\theta)}{\partial \theta_j}</script>, we have:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*}\; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac1m \vec{x_j}^{T} (X\theta - \vec{y}) \newline\newline\newline\; &\nabla J(\theta) & = & \frac 1m X^{T} (X\theta - \vec{y}) \newline\end{align*} %]]></script> <p>The vectorized matrix notation: <script type="math/tex">\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})</script></p> <h2 id="feature-normalization">Feature Normalization</h2> <p>We can speed up gradient descent by having each of our input values in roughly the same range. This is because <script type="math/tex">\theta</script> will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.</p> <p>The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:</p> <p>−1 ≤ <script type="math/tex">x_i</script> ≤ 1 <em>or</em> −0.5 ≤ <script type="math/tex">x_i</script> ≤ 0.5</p> <p>These aren’t exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.</p> <p>Two techniques to help with this are <strong>feature scaling</strong> and <strong>mean normalization</strong>. Feature scaling involves dividing the input values by the range (i.e.
the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:</p> <script type="math/tex; mode=display">x_i := \dfrac{x_i - \mu_i}{s_i}</script> <p>Where <script type="math/tex">μ_i</script> is the average of all the values for feature i, and <script type="math/tex">s_i</script> is either the range of values (max - min) or the standard deviation.</p> <p>Note that dividing by the range and dividing by the standard deviation give different results. The quizzes in this course use the range; the programming exercises use the standard deviation.</p> <p>Example: if <script type="math/tex">x_i</script> represents housing prices with a range of 100 to 2000 and a mean value of 1000, then <script type="math/tex">x_i := \dfrac{price-1000}{1900}</script></p> <h3 id="gradient-descent-tips">Gradient Descent Tips</h3> <ul> <li><strong>Debugging gradient descent</strong>: Make a plot with the number of iterations on the x-axis, and plot the cost function <script type="math/tex">J(\theta)</script> over the number of iterations of gradient descent. If <script type="math/tex">J(\theta)</script> ever increases, then you probably need to decrease <script type="math/tex">\alpha</script>.</li> <li><strong>Automatic convergence test</strong>: Declare convergence if <script type="math/tex">J(\theta)</script> decreases by less than E in one iteration, where E is some small value such as <script type="math/tex">10^{-3}</script>. However, in practice it’s difficult to choose this threshold value.</li> </ul> <p>It has been proven that if the learning rate <script type="math/tex">\alpha</script> is sufficiently small, then <script type="math/tex">J(\theta)</script> will decrease on every iteration.
If <script type="math/tex">J(\theta)</script> does increase, Andrew Ng recommends decreasing <script type="math/tex">\alpha</script> by factors of about 3.</p> <h3 id="features-and-polynomial-regression">Features and Polynomial Regression</h3> <p>We can improve our features and the form of our hypothesis function in a couple of different ways.</p> <p>We can <strong>combine</strong> multiple features into one. For example, we can combine <script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> into a new feature <script type="math/tex">x_3</script> by taking <script type="math/tex">x_1⋅x_2</script>.</p> <p><strong>Polynomial Regression</strong>: Our hypothesis function need not be linear (a straight line) if that does not fit the data well.</p> <p>We can <strong>change the behavior or curve</strong> of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).</p> <p>For example, if our hypothesis function is <script type="math/tex">h_\theta(x) = \theta_0 + \theta_1 x_1</script> then we can create additional features based on <script type="math/tex">x_1</script>, to get the quadratic function <script type="math/tex">h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2</script> or the cubic function <script type="math/tex">h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3</script>. In the cubic version, we have created new features <script type="math/tex">x_2 = x_1^2</script> and <script type="math/tex">x_3 = x_1^3</script>.</p> <p>To make it a square root function, we could do: <script type="math/tex">h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}</script></p> <p>One important thing to keep in mind: if you choose your features this way, then feature scaling becomes very important. E.g.,
if <script type="math/tex">x_1</script> has range 1 - 1000 then the range of <script type="math/tex">x_1^2</script> becomes 1 - 1000000 and that of <script type="math/tex">x_1^3</script> becomes 1 - 1000000000.</p> <h2 id="normal-equation">Normal Equation</h2> <p>The “Normal Equation” is a method of finding the optimal theta <strong>without iteration</strong>:<br /> <script type="math/tex">\theta = (X^T X)^{-1}X^T y</script></p> <p>Remark: There is no need to do feature scaling with the normal equation. Mathematical proofs: <a href="https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)">link 1</a>, <a href="https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression">link 2</a>.</p> <p>The following is a comparison of gradient descent and the normal equation:</p> <table> <thead> <tr> <th style="text-align: center">Gradient Descent</th> <th style="text-align: center">Normal Equation</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">Need to choose <script type="math/tex">\alpha</script></td> <td style="text-align: center">No need to choose <script type="math/tex">\alpha</script></td> </tr> <tr> <td style="text-align: center">Need many iterations</td> <td style="text-align: center">No need to iterate</td> </tr> <tr> <td style="text-align: center"><script type="math/tex">O(kn^2)</script></td> <td style="text-align: center"><script type="math/tex">O(n^3)</script>, need to calculate inverse of <script type="math/tex">X^TX</script></td> </tr> <tr> <td style="text-align: center">Works well when n is large</td> <td style="text-align: center">Slow if n is large</td> </tr> </tbody> </table> <p>Remark: in practice, the normal equation starts becoming slow when n exceeds roughly 10000.</p> <h3 id="normal-equation-non-invertibility">Normal Equation Non-invertibility</h3> <p>When implementing the normal equation in Octave we want to use the ‘pinv’ function rather than ‘inv’, because <script type="math/tex">X^TX</script> may be <strong>non-invertible</strong>.
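A minimal sketch of the normal equation with a pseudoinverse, in NumPy rather than the course's Octave (the data and variable names here are made up for illustration):

```python
import numpy as np

# Toy design matrix: a ones column (x0) plus one feature, for m = 4 examples.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated from y = 1 + 2x

# theta = pinv(X'X) X'y; here X'X is invertible and the fit is exact.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
assert np.allclose(theta, [1.0, 2.0])
```

Because `pinv` computes the Moore-Penrose pseudoinverse, this line still returns a sensible θ even when XᵀX is singular, which is why it is preferred over `inv`.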
The common causes are:</p> <ul> <li>Redundant features, where two features are very closely related (i.e. they are linearly dependent)</li> <li>Too many features (e.g. <script type="math/tex">m \leq n</script>). In this case, delete some features or use “regularization” (to be explained in a later lesson).</li> </ul> <h1 id="reference">Reference</h1> <ul> <li><a href="https://www.coursera.org/learn/machine-learning">Stanford Machine Learning by Andrew Ng</a></li> </ul>Yanqing WuChap 1 - Introduction What is Machine Learning? Definition by Arthur Samuel (1959): “Field of study that gives computers the ability to learn without being explicitly programmed.” Definition by Tom Mitchell (1998): “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”Study Note of Scheme2018-02-28T00:00:00+00:002018-02-28T00:00:00+00:00http://www.pwyqspace.com/study/2018/02/28/scheme-study-note<p><strong>==&gt; On Studying Process &lt;==</strong></p> <h1 id="the-ten-commandments">The Ten Commandments</h1> <h2 id="the-first-commandment">The First Commandment</h2> <p>When recurring on a list of atoms, <em>lat</em>, ask two questions about it: (<em>null? lat</em>) and <strong>else</strong>. When recurring on a number, <em>n</em>, ask two questions about it: (<em>zero? n</em>) and <strong>else</strong>. When recurring on a list of S-expressions, <em>l</em>, ask three questions about it: (<em>null? l</em>), (<em>atom?</em> (<em>car l</em>)), and <strong>else</strong>.</p> <h2 id="the-second-commandment">The Second Commandment</h2> <p>Use <em>cons</em> to build lists.</p> <h2 id="the-third-commandment">The Third Commandment</h2> <p>When building a list, describe the first typical element, and then <em>cons</em> it onto the natural recursion.</p> <h2 id="the-fourth-commandment">The Fourth Commandment</h2> <p>Always change at least one argument while recurring.
When recurring on a list of atoms, <em>lat</em>, use (<em>cdr lat</em>). When recurring on a number, <em>n</em>, use (<em>sub1 n</em>). And when recurring on a list of S-expressions, <em>l</em>, use (<em>car l</em>) and (<em>cdr l</em>) if neither (<em>null? l</em>) nor (<em>atom?</em> (<em>car l</em>)) are true.</p> <p>It must be changed to be closer to termination. The changing argument must be tested in the termination condition:<br /> when using <em>cdr</em>, test termination with <em>null?</em> and<br /> when using <em>sub1</em>, test termination with <em>zero?</em>.</p> <h2 id="the-fifth-commandment">The Fifth Commandment</h2> <p>When building a value with +, always use 0 for the value of the terminating line, for adding 0 does not change the value of an addition.</p> <p>When building a value with ×, always use 1 for the value of the terminating line, for multiplying by 1 does not change the value of a multiplication.</p> <p>When building a value with <em>cons</em>, always consider () for the value of the terminating line.</p> <h2 id="the-sixth-commandment">The Sixth Commandment</h2> <p>Simplify only after the function is correct.</p> <h2 id="the-seventh-commandment">The Seventh Commandment</h2> <p>Recur on the <em>subparts</em> that are of the same nature:</p> <ul> <li>On the sublists of a list.</li> <li>On the subexpressions of an arithmetic expression.</li> </ul> <h2 id="the-eighth-commandment">The Eighth Commandment</h2> <p>Use help functions to abstract from representations.</p> <h2 id="the-ninth-commandment">The Ninth Commandment</h2> <p>Abstract common patterns with a new function.</p> <h2 id="the-tenth-commandment">The Tenth Commandment</h2> <p>Build functions to collect more than one value at a time.</p> <h1 id="the-five-rules">The Five Rules</h1> <h2 id="the-law-of-car">The Law of Car</h2> <p>The primitive <em>car</em> is defined only for non-empty lists.</p> <h2 id="the-law-of-cdr">The Law of Cdr</h2> <p>The primitive <em>cdr</em> is defined only
for non-empty lists. The <em>cdr</em> of any non-empty list is always another list.</p> <h2 id="the-law-of-cons">The Law of Cons</h2> <p>The primitive <em>cons</em> takes two arguments.<br /> The second argument to <em>cons</em> must be a list.<br /> The result is a list.</p> <h2 id="the-lwa-of-null">The Law of Null?</h2> <p>The primitive <em>null?</em> is defined only for lists.</p> <h2 id="the-law-of-eq">The Law of Eq?</h2> <p>The primitive <em>eq?</em> takes two arguments.<br /> Each must be a non-numeric atom.</p> <h1 id="chapter-1">Chapter 1</h1> <ol> <li>An <code class="highlighter-rouge">atom</code> is a string of characters containing neither a left “(” nor a right “)” parenthesis.</li> <li>A <code class="highlighter-rouge">list</code> is a collection of <code class="highlighter-rouge">atom</code>(s) enclosed by parentheses.</li> <li>All <code class="highlighter-rouge">atom</code>s are <code class="highlighter-rouge">S-expression</code>s.</li> <li>All <code class="highlighter-rouge">list</code>s are <code class="highlighter-rouge">S-expression</code>s.</li> </ol> <p><strong>e.g.</strong>: How many <code class="highlighter-rouge">S-expression</code>s are in the <code class="highlighter-rouge">list</code>:<br /> (((how) are) ((you) (doing so)) far)<br /> and what are they?</p> <p><strong>Ans</strong>: Three, ((how) are), ((you) (doing so)), and far.</p> <ol> <li> <p>A <code class="highlighter-rouge">null list</code> (or <code class="highlighter-rouge">empty list</code>) contains zero S-expressions enclosed by parentheses.
<code class="highlighter-rouge">()</code></p> </li> <li><em>car</em> is the first <code class="highlighter-rouge">S-expression</code> of the <code class="highlighter-rouge">list</code>.</li> <li>There is no <em>car</em> for an <code class="highlighter-rouge">atom</code>.</li> <li>There is no <em>car</em> for a <code class="highlighter-rouge">null list</code>.</li> </ol> <p>refer to <strong>The Law of Car</strong></p> <ol> <li>“(<em>car l</em>)” is another way to ask for “the <em>car</em> of the list <em>l</em>.”</li> <li>“cdr” is pronounced “could-er.”</li> </ol> <h1 id="installation">Installation</h1> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># on Arch Linux
$ sudo pacman -Sy mit-scheme
# on Ubuntu Linux
$ sudo apt-get install mit-scheme -y
</code></pre></div></div> <h1 id="reference">Reference</h1> <p><a href="https://mitpress.mit.edu/books/little-schemer">The Little Schemer</a></p>Yanqing Wu==&gt; On Studying Process &lt;==Git Cheat Sheet2018-02-27T00:00:00+00:002018-02-27T00:00:00+00:00http://www.pwyqspace.com/cheatsheet/2018/02/27/Git-Cheat-Sheet<style> table { width: 100% } table th:nth-child(1) { width: 40%; } table th:nth-child(2) { width: 60%; } </style> <h2 id="create-repositories">Create Repositories</h2> <p>Start a new repository or obtain one from an existing URL</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git init &lt;project-name&gt;</code></td> <td style="text-align: center">Initializes a new local repository with the specified name</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git clone &lt;url&gt;</code></td> <td style="text-align: center">Downloads a project and its entire version history</td> </tr> </tbody> </table> <h2 id="configure-tooling">Configure Tooling</h2> <p>Configure user information for all local
repositories</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git config --global user.name "&lt;name&gt;"</code></td> <td style="text-align: center">Sets the name attached to your commit transactions</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git config --global user.email "&lt;email address&gt;"</code></td> <td style="text-align: center">Sets the email attached to your commit transactions</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git config --global color.ui auto</code></td> <td style="text-align: center">Enables helpful colorization of command-line output</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git config core.editor "vim"</code></td> <td style="text-align: center">Sets Vim as the default text editor (e.g., for editing commit messages) (nano is too hard for me :))</td> </tr> </tbody> </table> <h2 id="make-changes">Make Changes</h2> <p>Review edits and craft a commit transaction</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git status</code></td> <td style="text-align: center">Lists all new or modified files to be committed</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git diff</code></td> <td style="text-align: center">Shows file differences not yet staged</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git add &lt;file&gt;</code></td> <td style="text-align: center">Snapshots the file in preparation for versioning</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git add -A</code> <em>or</em> <code class="highlighter-rouge">git add .</code></td> <td style="text-align:
center">Stages all modifications</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git diff --staged</code></td> <td style="text-align: center">Shows file differences between staging and the last file version</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git reset &lt;file&gt;</code></td> <td style="text-align: center">Unstages the file, but preserves its contents</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git commit -m "&lt;descriptive message&gt;"</code></td> <td style="text-align: center">Records file snapshots permanently in version history</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git commit --amend</code></td> <td style="text-align: center">Modifies the message of an unpushed commit; if already pushed, follow with <code class="highlighter-rouge">git push --force &lt;branch&gt;</code></td> </tr> </tbody> </table> <h2 id="group-changes">Group Changes</h2> <p>Name a series of commits and combine completed efforts</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git branch</code></td> <td style="text-align: center">Lists all local branches in the current repository</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git branch &lt;branch-name&gt;</code></td> <td style="text-align: center">Creates a new branch</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git checkout &lt;branch-name&gt;</code></td> <td style="text-align: center">Switches to the specified branch and updates the working directory</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git merge &lt;branch&gt;</code></td> <td style="text-align: center">Combines the specified branch’s history into the current branch</td> </tr> <tr>
<td style="text-align: center"><code class="highlighter-rouge">git branch -d &lt;branch-name&gt;</code></td> <td style="text-align: center">Deletes the specified branch locally</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git push origin :&lt;branch-name&gt;</code></td> <td style="text-align: center">Deletes the specified branch remotely</td> </tr> </tbody> </table> <h2 id="refactor-filenames">Refactor Filenames</h2> <p>Relocate and remove versioned files</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git rm &lt;file&gt;</code></td> <td style="text-align: center">Deletes the file from the working directory and stages the deletion</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git rm --cached &lt;file&gt;</code></td> <td style="text-align: center">Removes the file from version control but preserves the file locally</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git mv &lt;file-original&gt; &lt;file-renamed&gt;</code></td> <td style="text-align: center">Changes the file name and prepares it for commit</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git clean (-d) &lt;file&gt;</code></td> <td style="text-align: center">Removes untracked files from the working tree (locally); <code class="highlighter-rouge">-d</code> also removes untracked directories</td> </tr> </tbody> </table> <h2 id="suppress-tracking">Suppress Tracking</h2> <p>Exclude temporary files and paths</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">vim .gitignore</code></td> <td style="text-align: center">Creates a text file named <code class="highlighter-rouge">.gitignore</code> to
exclude specified files from versioning</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git ls-files --other --ignored --exclude-standard</code></td> <td style="text-align: center">Lists all ignored files in this project</td> </tr> </tbody> </table> <h2 id="save-fragments">Save Fragments</h2> <p>Shelve and restore incomplete changes</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git stash</code></td> <td style="text-align: center">Temporarily stores all modified tracked files</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git stash pop</code></td> <td style="text-align: center">Restores the most recently stashed files</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git stash list</code></td> <td style="text-align: center">Lists all stashed changesets</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git stash drop</code></td> <td style="text-align: center">Discards the most recently stashed changeset</td> </tr> </tbody> </table> <h2 id="review-history">Review History</h2> <p>Browse and inspect the evolution of project files</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git log (--graph --color)</code></td> <td style="text-align: center">Lists version history for the current branch</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git reflog</code></td> <td style="text-align: center">Shows reference logs: records of when the tips of branches and other references were updated locally</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git log --follow &lt;file&gt;</code></td> <td
style="text-align: center">Lists version history for a file, including renames</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git diff &lt;first-branch&gt;...&lt;second-branch&gt;</code></td> <td style="text-align: center">Shows content differences between two branches</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git show &lt;commit&gt;</code></td> <td style="text-align: center">Outputs metadata and content changes of the specified commit</td> </tr> </tbody> </table> <h2 id="redo-commits">Redo Commits</h2> <p>Erase mistakes and craft replacement history</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git reset &lt;commit&gt;</code></td> <td style="text-align: center">Undoes all commits after <code class="highlighter-rouge">&lt;commit&gt;</code>, preserving changes locally</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git reset --hard &lt;commit&gt;</code></td> <td style="text-align: center">Discards all history and changes back to the specified commit</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git reset --soft &lt;commit&gt;</code></td> <td style="text-align: center">Undoes a local commit but keeps the changes</td> </tr> </tbody> </table> <h2 id="synchronize-changes">Synchronize Changes</h2> <p>Register a repository bookmark and exchange version history</p> <table> <thead> <tr> <th style="text-align: center">command</th> <th style="text-align: center">description</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><code class="highlighter-rouge">git fetch &lt;bookmark&gt;</code></td> <td style="text-align: center">Downloads all history from the repository bookmark</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git merge
&lt;bookmark&gt;/&lt;branch&gt;</code></td> <td style="text-align: center">Combines the bookmark’s branch into the current local branch</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git push &lt;alias&gt; &lt;branch&gt;</code></td> <td style="text-align: center">Uploads all local branch commits to the remote</td> </tr> <tr> <td style="text-align: center"><code class="highlighter-rouge">git pull</code></td> <td style="text-align: center">Downloads bookmark history and incorporates changes</td> </tr> </tbody> </table> <h2 id="track-files">Track Files</h2> <ul> <li>Force tracking a target file: <code class="highlighter-rouge">git update-index --no-assume-unchanged &lt;file&gt;</code></li> <li>Force not tracking a target file: <code class="highlighter-rouge">git update-index --assume-unchanged &lt;file&gt;</code></li> </ul> <h2 id="tag-operation">Tag Operation</h2> <ul> <li>Based on <code class="highlighter-rouge">HEAD</code>: <code class="highlighter-rouge">git tag &lt;name&gt;</code> <ul> <li>Based on a specific commit: <code class="highlighter-rouge">git tag &lt;name&gt; &lt;commit&gt;</code></li> <li>Add tag info: <code class="highlighter-rouge">git tag -m &lt;message&gt; &lt;name&gt;</code></li> <li>Add a PGP-signed tag: <code class="highlighter-rouge">git tag -s &lt;name&gt;</code></li> </ul> </li> <li>List tags: <code class="highlighter-rouge">git tag</code></li> <li>Push a specific tag to a remote repo: <code class="highlighter-rouge">git push &lt;repo-name&gt; &lt;tag-name&gt;</code> <ul> <li>Push all tags to a remote repo: <code class="highlighter-rouge">git push &lt;repo-name&gt; --tags</code></li> </ul> </li> <li>Delete a specific tag in a specific repo: <code class="highlighter-rouge">git push &lt;repo-name&gt; :refs/tags/&lt;tag-name&gt;</code></li> </ul> <h2 id="module-operation">Module Operation</h2> <ul> <li>Add a submodule: <code class="highlighter-rouge">git submodule add -b &lt;branch&gt; --name &lt;name&gt; &lt;repo&gt; &lt;path&gt;</code></li> <li>Check
submodule status: <code class="highlighter-rouge">git submodule status</code></li> <li>Clone a project with submodules <ul> <li>method I: <code class="highlighter-rouge">git clone &lt;repo&gt; --recursive</code></li> <li> <p>method II:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> git clone &lt;repo&gt;
 git submodule update --init --recursive
</code></pre></div> </div> </li> </ul> </li> <li> <p>Delete a submodule:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> git submodule deinit &lt;path&gt;
 git rm --cached &lt;path&gt;
 rm -rf &lt;path&gt;
 [edit .gitmodules to remove the submodule entry]
</code></pre></div> </div> </li> <li>Execute commands in each submodule: <code class="highlighter-rouge">git submodule foreach &lt;command&gt;</code></li> <li>Update submodules: <code class="highlighter-rouge">git submodule update --recursive --remote</code></li> </ul> <h2 id="fun-facts">Fun Facts</h2> <ol> <li>In Git, <code class="highlighter-rouge">HEAD</code> represents the current version (i.e., the latest commit); the previous version is <code class="highlighter-rouge">HEAD^</code>, the one before that is <code class="highlighter-rouge">HEAD^^</code>, and the 100th previous version is <code class="highlighter-rouge">HEAD~100</code>.</li> <li>Many commands have a <code class="highlighter-rouge">-n</code> or <code class="highlighter-rouge">--dry-run</code> option.
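Both fun facts can be tried safely in a hypothetical throwaway repository; a minimal sketch (all file names, user details, and commit messages below are made up for illustration):

```shell
# Throwaway repo to illustrate HEAD^ and a --dry-run; nothing outside $repo is touched.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name tester
git config user.email tester@example.com
echo one > f.txt && git add f.txt && git commit -qm "first"
echo two > f.txt && git commit -qam "second"
git log -1 --format=%s HEAD^   # prints the previous commit's subject: first
touch junk.txt
git clean -n                   # dry run: reports "Would remove junk.txt" without deleting it
```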
When such an option is applied, the command does not actually run; instead, it prints what it would do, so the user can decide whether to proceed.</li> </ol> <h2 id="reference">Reference</h2> <ol> <li><a href="https://git-scm.com/">Git Official Website</a></li> <li><a href="http://uchuhimo.me/2017/04/18/git-cheat-sheet/">Git Cheat Sheet</a></li> </ol>Yanqing WuWelcoming Arch Linux to My Thinkpad X1C2018-02-03T00:00:00+00:002018-02-03T00:00:00+00:00http://www.pwyqspace.com/tech/2018/02/03/Arch-Linux-Install<p>My Ubuntu OS was blown up three days ago (ㄒoㄒ).</p> <p>Long story short, I was stuck in an infinite, hopeless login loop in Ubuntu, and there was no feasible workaround (despite two hours of intensive googling). Well, since I needed to re-install my operating system anyway, why not try something new.</p> <p>Though I’ve spent more than one year on Ubuntu Linux (daily use and at work), it still took me two nights + one day to configure everything in Arch.</p> <blockquote> <p>I would definitely NOT recommend that a total Linux newbie try <em>Arch Linux</em>. IT’S TOO “LIGHT”!</p> </blockquote> <h2 id="installation--configuration">Installation &amp; Configuration</h2> <p>Since there are plenty of decent and mature online tutorials, it would be pointless for me to reinvent the wheel.
Below I recommend the ones I followed.<br /> <strong>Notes</strong>:</p> <blockquote> <ol> <li>Make sure you understand how each command works (use <code class="highlighter-rouge">man &lt;cmd&gt;</code> or <code class="highlighter-rouge">&lt;cmd&gt; --help</code> to display a command’s description)</li> <li><em>Always</em> refer to the <a href="https://wiki.archlinux.org/" target="_blank">Arch Wiki</a> when you are confused</li> </ol> </blockquote> <h3 id="tutorial-in-english">Tutorial in English</h3> <ul> <li><a href="https://kozikow.com/2016/06/03/installing-and-configuring-arch-linux-on-thinkpad-x1-carbon/" target="_blank">Installing and configuring Arch Linux on Thinkpad X1 Carbon</a></li> </ul> <h3 id="tutorials-in-chinese">Tutorials in Chinese</h3> <ul> <li><a href="http://www.viseator.com/2017/05/17/arch_install/" target="_blank">以官方Wiki的方式安装ArchLinux</a></li> <li><a href="http://www.viseator.com/2017/05/19/arch_setup/" target="_blank">ArchLinux安装后的必须配置与图形界面安装教程</a></li> <li><a href="http://www.viseator.com/2017/07/02/arch_more/" target="_blank">ArchLinux你可能需要知道的操作与软件包推荐「持续更新」</a></li> <li><a href="http://www.bijishequ.com/detail/220866" target="_blank">配置和美化Arch Linux</a></li> </ul> <h3 id="everything-works-out-of-box">Everything works out of the box</h3> <p>My installation was really smooth, thanks to the above authors’ hard work on their posts.
I’d like to thank the Thinkpad as well, since it didn’t require any extra time to configure the hardware ^_^</p> <p>My arch-linux configurations:</p> <table> <thead> <tr> <th>Service Name</th> <th style="text-align: center">Type / Version</th> </tr> </thead> <tbody> <tr> <td>Display server</td> <td style="text-align: center">Xorg</td> </tr> <tr> <td>Desktop Environment</td> <td style="text-align: center">KDE Plasma 5</td> </tr> <tr> <td>Display Manager</td> <td style="text-align: center">SDDM</td> </tr> <tr> <td>Window Manager</td> <td style="text-align: center">KDE</td> </tr> <tr> <td>File Manager</td> <td style="text-align: center">Dolphin</td> </tr> <tr> <td>Shell</td> <td style="text-align: center">zsh + oh-my-zsh</td> </tr> <tr> <td>Terminal emulator</td> <td style="text-align: center">Konsole</td> </tr> <tr> <td>Widget toolkit</td> <td style="text-align: center">qt5-base</td> </tr> </tbody> </table> <h2 id="things-outside-the-tutorials">Things Outside the Tutorials</h2> <p>Here I’d like to talk about some things that are not covered in the aforementioned tutorials but are helpful to know.</p> <h3 id="reducing-reboot-waiting-time">Reducing reboot waiting time</h3> <p>At the early installation &amp; configuration stage, you need to reboot the system frequently; during this period you might see a message like <code class="highlighter-rouge">A stop job is running for session ... (1 min 30s)</code>. The system waits 90 seconds before continuing the reboot.
You can reduce the timeout in <code class="highlighter-rouge">/etc/systemd/system.conf</code> (here reduced to 30 s):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DefaultTimeoutStartSec=30s
DefaultTimeoutStopSec=30s
</code></pre></div></div> <p>Then reload the system manager with <code class="highlighter-rouge">systemctl daemon-reload</code>.</p> <h3 id="calling-terminal-using-keyboard-shortcut">Calling terminal using keyboard shortcut</h3> <p>In my configuration, the usual <code class="highlighter-rouge">ctrl+alt+t</code> shortcut to open a terminal didn’t work by default. If you are using the <code class="highlighter-rouge">KDE</code> desktop environment, this can be set in <code class="highlighter-rouge">Global Shortcuts</code>. First, press the <code class="highlighter-rouge">Super</code> (aka <code class="highlighter-rouge">Win</code>) key to show the menu. Then, just search for <code class="highlighter-rouge">shortcut</code>.</p> <p>One lesson I learnt from this is: <strong>first explore the available settings via the GUI</strong>, then play with the command line &amp; scripts.
In the next point, I will talk about my exhausting experience of merely remapping the caps-lock and ctrl keys via scripting, which cost me more than 3 hours…</p> <h3 id="remapping-caps-lock--ctrl">Remapping caps-lock &amp; ctrl</h3> <p>Actually, this can be set via the GUI if you’re using <code class="highlighter-rouge">KDE</code>, or with <code class="highlighter-rouge">gnome-tweak-tool</code> if you’re using <code class="highlighter-rouge">Gnome</code>.</p> <p>In KDE, <code class="highlighter-rouge">Keyboard and Hardware Layout -&gt; Advanced -&gt; Ctrl Position -&gt; Swap Ctrl and Caps Lock</code></p> <p>Before I knew this could be easily set via the GUI, I tried a trillion times to modify scripts/settings like <code class="highlighter-rouge">.Xmodmap</code>, <code class="highlighter-rouge">autostart-scripts</code>, <code class="highlighter-rouge">.xinitrc</code>, <code class="highlighter-rouge">setxkbmap</code>, and <code class="highlighter-rouge">/usr/share/sddm/scripts/Xsetup</code>, and rebooted a trillion+1 times. Every script either worked for the current session but failed to load after a reboot, or wasn’t auto-loaded/parsed due to low priority in <code class="highlighter-rouge">KDE Plasma</code>, or cost me a few extra seconds of waiting at login…</p> <p>Again, <strong>go through the settings available in the GUI first</strong>!
Just like reading the instructions before using a product (I know lots of people skip this, including me…).</p> <h3 id="changing-weird-font-on-google-chrome">Changing Weird Font on Google Chrome</h3> <p>I changed my web browser to Google Chrome, as I have a lot of important bookmarks and plug-ins on Chrome (the default web browser, <code class="highlighter-rouge">Konqueror</code>, is also way too “hacker” for me).</p> <p>Anyway, the font can be set in Chrome’s built-in settings.</p> <p><code class="highlighter-rouge">Open Chrome Browser -&gt; Settings -&gt; Advanced -&gt; Customized Fonts</code></p> <h3 id="hidpi-settings-for-high-resolution-screen">HiDPI Settings for High Resolution Screen</h3> <p>If you’re using a high-resolution screen, the first thing you will complain about is that everything is surprisingly tiny! The font size is around 1–2 mm.</p> <p>I copied the KDE settings below from the <a href="https://wiki.archlinux.org/index.php/HiDPI" target="_blank">Arch wiki</a>:</p> <blockquote> <p>KDE You can use KDE’s settings to fine tune font, icon, and widget scaling. This solution affects both Qt and Gtk+ applications.</p> <p>To adjust font, widget, and icon scaling together:</p> <ol> <li>System Settings → Display and Monitor → Display Configuration → Scale Display</li> <li>Drag the slider to the desired size</li> <li>Restart for the settings to take effect</li> </ol> <p>To adjust only font scaling:</p> <ol> <li>System Settings → Fonts</li> <li>Check “Force fonts DPI” and adjust the DPI level to the desired value. This setting should take effect immediately for newly started applications. You will have to logout and login for it to take effect on Plasma desktop.</li> </ol> <p>To adjust only icon scaling:</p> <ol> <li>System Settings → Icons → Advanced</li> <li>Choose the desired icon size for each category listed.
This should take effect immediately.</li> </ol> </blockquote> <p><strong>Update</strong>: you might find weird horizontal lines appearing and disappearing in Konsole after scaling; this is a <a href="https://bugs.kde.org/show_bug.cgi?id=373232" target="_blank">known bug</a>.</p> <h4 id="adjusting-google-chrome-dpi">Adjusting Google Chrome DPI</h4> <p>If you feel the Chrome toolbar is too small and want to scale it, use the following method (<a href="http://kernpanik.com/geekstuff/2015/05/20/chrome-change-default-hidpi-setting.html" target="_blank">source</a>):</p> <ol> <li>Open <code class="highlighter-rouge">/usr/share/applications/google-chrome</code></li> <li>Find <code class="highlighter-rouge">Exec=/usr/bin/google-chrome-stable %U</code></li> <li>Change it to <code class="highlighter-rouge">Exec=/usr/bin/google-chrome-stable --force-device-scale-factor=1 %U</code>. <ul> <li>Change the scale factor to your needs; floating-point values are accepted.</li> </ul> </li> </ol> <p>If the above does not work, use the following (<a href="https://bbs.archlinux.org/viewtopic.php?id=227131" target="_blank">source</a>):</p> <ol> <li><code class="highlighter-rouge">sudo touch /usr/bin/google-chrome</code> creates a file as a workaround;</li> <li><code class="highlighter-rouge">sudo chmod a+x /usr/bin/google-chrome</code> makes it executable;</li> <li><code class="highlighter-rouge">sudo vim /usr/bin/google-chrome</code> starts editing;</li> <li>add &amp; save (adjust the scale factor to your needs) <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span> google-chrome-stable <span class="nt">--force-device-scale-factor</span><span class="o">=</span>1 </code></pre></div> </div> </li> <li>start Google Chrome in the terminal using <code class="highlighter-rouge">google-chrome</code> <ul> <li>I find that if I start Chrome from the menu, it keeps the old settings. I assume this must be a path problem.
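The four steps above can also be scripted non-interactively; a sketch (writing to the current directory instead of <code class="highlighter-rouge">/usr/bin</code> so no sudo is needed, and adding <code class="highlighter-rouge">"$@"</code> so command-line arguments are forwarded, which the original script omits):

```shell
# Non-interactive version of the wrapper from the steps above.
# Target directory is the current one for illustration; the post uses /usr/bin.
cat > ./google-chrome <<'EOF'
#!/bin/bash
# Adjust the scale factor to your needs; floating-point values are accepted.
exec google-chrome-stable --force-device-scale-factor=1 "$@"
EOF
chmod a+x ./google-chrome   # make the wrapper executable
```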
Will try to find a workaround so that I don’t need to start Chrome from the terminal every time.</li> </ul> </li> </ol> <h3 id="setting-wallpaper">Setting Wallpaper</h3> <p><strong>Note</strong>: I only tested this with my configuration (as listed above). Many third-party applications can also achieve this (just google it!).</p> <p>For Main Screen:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. right-click anywhere on the main screen -&gt; Configure Desktop
2. choose a file
</code></pre></div></div> <p>For Login Screen:</p> <p><code class="highlighter-rouge">System Settings -&gt; Start up and Shutdown -&gt; Login Screen (SDDM) -&gt; Background</code></p> <h2 id="recommended-software-updating">Recommended Software (Updating)</h2> <p>I’m still exploring this part. QAQ</p> <p>A full list of applications can be found <a href="https://wiki.archlinux.org/index.php/list_of_applications" target="_blank">here (Arch Wiki)</a>.</p> <ul> <li>Network Manager: <code class="highlighter-rouge">NetworkManager</code></li> <li>Maths: <code class="highlighter-rouge">octave</code></li> <li>VPN clients: <code class="highlighter-rouge">OpenVPN</code></li> <li>Webcam: <code class="highlighter-rouge">guvcview</code> (works on the X1C; haven’t checked other machines)</li> <li>Office suites: <code class="highlighter-rouge">libreoffice-fresh</code></li> <li>Notepadqq: <a href="https://github.com/notepadqq/notepadqq" target="_blank">GitHub link</a> (I treat it as a complement to Vim)</li> </ul> <h2 id="random-thoughts">Random Thoughts</h2> <p>Yeahhh, finally done with the configuration and this post.
Let me first show something:<br /> <img src="/assets/images/posts/Arch-Linux-Install/login.JPG" alt="alt text" title="Login Page - Darling in the Franxx 02" /> <img src="/assets/images/posts/Arch-Linux-Install/login-2.JPG" alt="alt text" title="Login Page 2 - by Makoto Shinkai" /> <img src="/assets/images/posts/Arch-Linux-Install/info-center.png" alt="alt text" title="Info Center - Arch Linux" /></p> <p>I’ve had so many personal firsts this week:</p> <ul> <li>first time encountering an unfixable system bug in Ubuntu</li> <li>first time using a terminal in <code class="highlighter-rouge">tty</code> mode in Ubuntu (the entire machine had no GUI, and the terminal font size was like 2 mm)</li> <li>first time installing Arch from scratch!</li> <li>first time writing scripts for installation &amp; configuration</li> <li>…</li> </ul> <p>In addition, I’ve found myself becoming much more sensitive and picky about installing applications/dependencies/packages since I started using Arch. For instance, I once installed an undesired application. When I wanted to delete it later, I unconsciously wanted to compare all installed related dependencies one by one against the <code class="highlighter-rouge">pacman -Rcns xxx</code> list. Honestly, I felt quite uncomfortable until I was certain that every redundant dependency had been removed.
Oops, am I becoming a software mysophobe?</p> <div style="text-align: right"> At Markham, 3:44 PM </div>Yanqing WuMy Ubuntu OS was blown up three days ago (ㄒoㄒ).Booklists | 书单2018-01-01T00:00:00+00:002018-01-01T00:00:00+00:00http://www.pwyqspace.com/study/2018/01/01/Booklists<p>Documenting (including but not limited to) textbooks, novels, journals and (conference) papers that I am interested in and will read.<br /> 记录我感兴趣的和将要阅读的（包括但不限于）教科书、小说、刊物和(会议)论文。</p> <p>Status: <strong>Updating</strong> | <strong>长期更新</strong></p> <p>Principles | 收书原则:</p> <blockquote> <ol> <li>If a work was originally composed neither in English nor in Chinese, a translated version (in a language close to the author’s culture) will be documented. Otherwise, only the original version will be documented; link(s) to translated versions may be added depending on personal interest.<br /> (For instance, for non-English European authors, the English version will be documented; for non-Chinese Asian authors, the Chinese version will be documented)<br /> 若一本书的首发语言非中文或英文，则与作者文化相近的语言版本会被收录。除此之外，只记录首发语言版本。翻译版本的链接仅凭个人兴趣添加。<br /> （例如，对非英语国家的欧洲作家，英语版本会被记录；对非中文国家的亚洲作家，中文版本会被记录）</li> <li>Themes/topics and formats are unrestricted; anything worth reading will be documented here<br /> 题材和体裁不限；有价值的均会记录于此</li> <li>For brevity, Chinese and/or English will be used in <code class="highlighter-rouge">Brief Intro</code><br /> 为节省空间，简要介绍一栏用到中文 和/或者 英文</li> <li>To protect authors’ intellectual property, links are provided only for works published online<br /> 为保护作者知识产权，仅对网络公开的书籍提供链接</li> </ol> </blockquote> <head> <base target="_blank" /> <style> table { width: 100% } table th:nth-child(1) { width: 15%; } table th:nth-child(2) { width: 10%; } table th:nth-child(3) { width: 40%; } table th:nth-child(4) { width: 10%; } table th:nth-child(5) { width: 25%; } </style> </head> <h1 id="have-read">Have Read</h1> <table> <thead> <tr> <th style="text-align: center">Subject/Genre</th> <th style="text-align: center">Type</th>
<th style="text-align: center">Title &amp; 1st Author</th> <th style="text-align: center">Link(s)</th> <th style="text-align: center">Brief Intro/Note</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">AI</td> <td style="text-align: center">Paper</td> <td style="text-align: center"><strong>Concrete Problems in AI Safety</strong>; Dario Amodei et al</td> <td style="text-align: center"><a href="https://arxiv.org/abs/1606.06565">arXiv</a></td> <td style="text-align: center">讨论了由于劣质设计导致的无意且有害的AI行为问题</td> </tr> <tr> <td style="text-align: center">AI</td> <td style="text-align: center">Paper</td> <td style="text-align: center"><strong>Deep Reinforcement Learning: An Overview</strong>; Yuxi Li</td> <td style="text-align: center"><a href="https://arxiv.org/abs/1701.07274">arXiv</a></td> <td style="text-align: center">A comprehensive introduction of basic learning methods</td> </tr> </tbody> </table> <h1 id="reading">Reading</h1> <table> <thead> <tr> <th style="text-align: center">Subject/Genre</th> <th style="text-align: center">Type</th> <th style="text-align: center">Title &amp; 1st Author</th> <th style="text-align: center">Link(s)</th> <th style="text-align: center">Brief Intro/Note</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">散文</td> <td style="text-align: center">书</td> <td style="text-align: center"><strong>湘行散记</strong>; 沈从文</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Programming</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>The Little Schemer</strong>; Daniel Friedman</td> <td style="text-align: center"> </td> <td style="text-align: center">TLS; Recommend</td> </tr> </tbody> </table> <h1 id="to-read">To Read</h1> <table> <thead> <tr> <th style="text-align: center">Subject/Genre</th> <th style="text-align: center">Type</th> <th style="text-align: center">Title &amp; 1st Author</th> <th style="text-align: 
center">Link(s)</th> <th style="text-align: center">Brief Intro/Note</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">ML</td> <td style="text-align: center">Paper</td> <td style="text-align: center"><strong>Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution</strong>; Judea Pearl</td> <td style="text-align: center"><a href="https://arxiv.org/abs/1801.04016">arXiv</a></td> <td style="text-align: center">作者为图灵奖得主，探讨ML能否成为ASI的突破口</td> </tr> <tr> <td style="text-align: center">Social Engineering</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>The Art of Deception</strong>; Kevin Mitnick</td> <td style="text-align: center"><a href="http://sbisc.ut.ac.ir/wp-content/uploads/2015/10/mitnick.pdf">link</a></td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Society, Dystopian</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>Brave New World</strong>; Aldous Huxley</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Society</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>厚黑学</strong>; 李宗吾</td> <td style="text-align: center"> </td> <td style="text-align: center">成书与民国</td> </tr> <tr> <td style="text-align: center">Historical Fiction, Drama</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>The Kite Runner</strong>; Khaled Hosseini</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Historical</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>A Tale of Two Cities</strong>; Charles Dickens</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Nouveau roman</td> <td style="text-align: center">Novel</td> <td 
style="text-align: center"><strong>The Lover (L’Amant)</strong>; Marguerite Duras</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Realist, Erotic</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>Madame Bovary</strong>; Gustave Flaubert</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Realist</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>Anna Karenina (Анна Каренина)</strong>; Lev Tolstoy</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Hard science fiction</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>A Fire Upon the Deep</strong>; Vernor Vinge</td> <td style="text-align: center"> </td> <td style="text-align: center">中译：深渊上的火</td> </tr> <tr> <td style="text-align: center">浪漫</td> <td style="text-align: center">赋</td> <td style="text-align: center"><strong>洛神赋</strong>; 曹植</td> <td style="text-align: center"><a href="http://so.gushiwen.org/view_47894.aspx">link</a></td> <td style="text-align: center">背诵</td> </tr> <tr> <td style="text-align: center">OS</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>Operating System Concepts</strong>; Silberschatz</td> <td style="text-align: center"><a href="http://iips.icci.edu.iq/images/exam/Abraham-Silberschatz-Operating-System-Concepts---9th2012.12.pdf">link</a></td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Computer Network</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>Computer Networks</strong>; Andrew Tanenbaum</td> <td style="text-align: center"><a href="https://book.douban.com/subject/10510747/">douban</a></td> <td style="text-align: center"> </td> </tr> <tr> <td 
style="text-align: center">Programming</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>Structure and Interpretation of Computer Programs</strong>; Gerald Sussman</td> <td style="text-align: center"><a href="https://mitpress.mit.edu/sicp/full-text/book/book.html">link</a></td> <td style="text-align: center">SICP; Read Ch. 1-3; Advanced</td> </tr> <tr> <td style="text-align: center">Programming</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>The Art of Computer Programming</strong>; Donald Knuth</td> <td style="text-align: center"> </td> <td style="text-align: center">TAOCP</td> </tr> <tr> <td style="text-align: center">Calculus</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>微积分学教程（Курс дифференциального и интегрального исчисления）</strong>; ригорий Михайлович Фихтенгольц</td> <td style="text-align: center"> </td> <td style="text-align: center">全书分三卷</td> </tr> <tr> <td style="text-align: center">数分</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>Principles of mathematical analysis</strong>; Walter Rudin</td> <td style="text-align: center"><a href="https://notendur.hi.is/vae11/%C3%9Eekking/principles_of_mathematical_analysis_walter_rudin.pdf">link</a></td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Math</td> <td style="text-align: center">“Novel”</td> <td style="text-align: center"><strong>What Is Mathematics?</strong>; Herbert Robbins</td> <td style="text-align: center"> </td> <td style="text-align: center">中译：什么是数学；Introductory</td> </tr> <tr> <td style="text-align: center">Math</td> <td style="text-align: center">“Novel”</td> <td style="text-align: center"><strong>Mathematical Thought from Ancient to Modern Times</strong>; Morris Kline</td> <td style="text-align: center"> </td> <td style="text-align: center">中译：古今数学思想；Advanced</td> </tr> <tr> <td style="text-align: 
center">Philosophy</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>Discipline and Punish (Surveiller et punir)</strong>; Michel Foucault</td> <td style="text-align: center"> </td> <td style="text-align: center">中译：规训与惩罚</td> </tr> <tr> <td style="text-align: center">Fiction</td> <td style="text-align: center">Short Story</td> <td style="text-align: center"><strong>Boule de Suif</strong>; Guy de Maupassant</td> <td style="text-align: center"> </td> <td style="text-align: center">中译：羊脂球；作者：莫泊桑</td> </tr> <tr> <td style="text-align: center">Fiction</td> <td style="text-align: center">Short Story</td> <td style="text-align: center"><strong>Love of life (L’Amour de la vie)</strong>; Jack London</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Romance Fiction</td> <td style="text-align: center">Short Story</td> <td style="text-align: center"><strong>伊豆的舞女 (伊豆の踊子)</strong>; 川端康成</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Romance Fiction</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>古都 (古都)</strong>; 川端康成</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Romance Fiction</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>潮骚 (潮騒)</strong>; 三島由紀夫</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Romance Fiction</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>绝唱 (絶唱)</strong>; 大江賢次</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Romance Fiction</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>春琴抄 (春琴抄)</strong>; 谷崎潤一郎</td> 
<td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Linux</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>The Linux Programming Interface</strong>; Micheal Kirrisk</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Linux</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>Understanding the Linux Kernel</strong>; Daniel Bovet</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Consciousness</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>Gödel, Escher, Bach: an Eternal Golden Braid</strong>; Douglas Hofstadter</td> <td style="text-align: center"> </td> <td style="text-align: center">GEB</td> </tr> <tr> <td style="text-align: center">Hardware</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>计算机组成原理</strong>; 蒋本珊</td> <td style="text-align: center"> </td> <td style="text-align: center">清华出版</td> </tr> <tr> <td style="text-align: center">AI</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>统计学习方法</strong>; 李航</td> <td style="text-align: center"> </td> <td style="text-align: center">作者为诺亚方舟实验室原主任</td> </tr> <tr> <td style="text-align: center">War story</td> <td style="text-align: center">Book</td> <td style="text-align: center"><strong>For Whom the Bell Tolls</strong>; Ernest Hemingway</td> <td style="text-align: center"> </td> <td style="text-align: center"> </td> </tr> <tr> <td style="text-align: center">Fiction</td> <td style="text-align: center">Novel</td> <td style="text-align: center"><strong>Ulysses</strong>; James Joyce</td> <td style="text-align: center"> </td> <td style="text-align: center">中译：尤利西斯；作者：乔伊斯</td> </tr> </tbody> </table> <h1 id="paused">Paused</h1> <table> 
<thead> <tr> <th style="text-align: center">Subject/Genre</th> <th style="text-align: center">Type</th> <th style="text-align: center">Title &amp; 1st Author</th> <th style="text-align: center">Link(s)</th> <th style="text-align: center">Brief Intro/Note</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">ROS</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>ROS Robot Programming</strong>; YoonSeok Pyo et al</td> <td style="text-align: center"><a href="http://community.robotsource.org/t/download-the-ros-robot-programming-book-for-free/51">link</a></td> <td style="text-align: center">Using C++; Entry Level</td> </tr> <tr> <td style="text-align: center">ROS</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>Programming Robots with ROS</strong>; Morgan Quigley et al</td> <td style="text-align: center"><a href="http://marte.aslab.upm.es/redmine/files/dmsf/p_drone-testbed/170324115730_268_Quigley_-_Programming_Robots_with_ROS.pdf">link</a></td> <td style="text-align: center">More Advanced</td> </tr> <tr> <td style="text-align: center">Algorithm, Data Structure</td> <td style="text-align: center">Textbook</td> <td style="text-align: center"><strong>Introduction to Algorithms</strong>; Thomas Cormen</td> <td style="text-align: center"><a href="http://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf">link</a></td> <td style="text-align: center">CLRS</td> </tr> <tr> <td style="text-align: center">Robot</td> <td style="text-align: center">Paper</td> <td style="text-align: center"><strong>Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments</strong>; Chris Paxton et al</td> <td style="text-align: center"><a href="https://arxiv.org/abs/1703.07887">arXiv</a></td> <td style="text-align: center">by <a href="http://zoox.com/">ZOOX</a></td> </tr> </tbody> </table>Yanqing WuDocumenting 
(including but not limited to) textbooks, novels, journals and (conference) papers that I am interested in and will read. 记录我感兴趣的和将要阅读的（包括但不限于）教科书、小说、刊物和(会议)论文。Working Without Mouse or Trackpad On Linux2017-09-04T00:00:00+00:002017-09-04T00:00:00+00:00http://www.pwyqspace.com/tech/2017/09/04/Work-Without-Mouse-On-Linux<p>A significant and continuous improvement in my working efficiency has been observed since the day I started to work with only a keyboard. As this working style makes me feel so delighted (and has many advantages), I consider it a must to share my thoughts and some basic techniques on how to work without a mouse.</p> <blockquote> <p>Note:</p> <ol> <li>Some techniques in this post can also be applied to <strong>Windows</strong> and <strong>Mac OS</strong>.</li> <li><del>As I am still learning, I may update this post occasionally.</del></li> <li>While I am talking about Linux in this post, I mainly refer to the <code class="highlighter-rouge">Ubuntu</code> distribution.</li> </ol> </blockquote> <h2 id="benefits">Benefits</h2> <p>After you have picked up some techniques from this post, you will find the major benefits of working without a mouse are:</p> <ul> <li><strong>Saving time</strong> moving your hands, as they almost never leave the main area of the keyboard</li> <li><strong>Alleviating</strong> wrist and finger <strong>pain</strong> <sup>1</sup></li> <li>Improving your working <strong>efficiency</strong> and <strong>productivity</strong></li> <li>Enhancing your <strong>typing speed</strong> subconsciously</li> <li>It looks <strong>coooooooool</strong></li> </ul> <p>1: A study conducted by Foye et al. (2002) states that “repetitive or prolonged” use of a computer mouse may contribute to Tenosynovitis.
According to the AAOS (American Academy of Orthopaedic Surgeons), Tenosynovitis is the inflammation of the sheath that surrounds the tendon, which can cause chronic pain and “tenderness along the thumb side of the wrist”.</p> <h2 id="who-is-suitable-to-work-without-a-mouse">Who Is Suitable To Work Without A Mouse?</h2> <p>Not every type of work is suited to this working style, so choose your style wisely. For instance, if you are a UI developer, then go with IDEs and a mouse; those are your good friends.<br /> If you satisfy most of the following conditions, this post is going to boost your efficiency like a Saturn-V Rocket.</p> <ul> <li>You rarely do front-end development</li> <li>You rarely do graphic design / research</li> <li>You prefer working with the command line rather than a GUI (Graphical User Interface)</li> <li>You are excited about learning new technologies</li> <li>Coding/writing accounts for most of your working time</li> </ul> <h2 id="tools-that-help-you-leave-your-mouse">Tools That Help You Leave Your Mouse</h2> <p>Since you are already here, I assume you have a Linux OS on your machine (FYI: I was using Ubuntu 16.04 when I wrote this post). If you have not installed Linux, then this may be a good opportunity to get one.</p> <p>Some tools I will talk about:</p> <ul> <li>A mechanical keyboard (a <code class="highlighter-rouge">brown</code> or <code class="highlighter-rouge">blue</code> switch is recommended for long-time typing)</li> <li><a href="#loc_vimium">Vimium</a> (an extension for browsing, available on Chrome)</li> <li><a href="#loc_vim">Vim</a> (a highly-configurable text editor)</li> <li><a href="#loc_wasavi">wasavi</a> (an extension that changes a text-area element into a virtual vi editor, available on Chrome, Opera and Firefox)</li> </ul> <p>In addition to the tools mentioned above, I strongly encourage you to use all ten fingers for typing.
If you are struggling with it, there are a lot of great online resources that can help you practise your typing skills (search with keywords like: <code class="highlighter-rouge">typing practise</code> or <code class="highlighter-rouge">typing game</code>).</p> <p>Short story: I used to type with only three to five fingers. One day, before I decided to transfer to ECE (<del>Early Childhood Education</del>Electrical &amp; Computer Engineering), I realized that if I let typing be the obstacle in my study career, that would be too ridiculous. Therefore, I settled down and started practising typing like a kid who gets access to a computer for the first time. I began with single letters, then short prefixes, suffixes and phrases, until long sentences. Believe it or not, it was really tough for the first few days. My fingers were so dumb, as if I could not control them. After over a month of practising, my average wpm (words per minute) increased from 30 to 80.</p> <p>So give yourself some time to let ten-finger typing become one of your friends :)</p> <h3 id="common-shortcut-keys-on-general">Common Shortcut Keys On General</h3> <p>Note: the following commands are <strong>case-sensitive</strong>.</p> <ol> <li> <p>View the Keyboard Shortcuts manual</p> <blockquote> <p>Long press <code class="highlighter-rouge">Super</code>; <code class="highlighter-rouge">Super</code> is another name for the <code class="highlighter-rouge">Win</code> key on Linux.</p> </blockquote> <p><img src="/assets/images/posts/Work-Without-Mouse/linux_keyboard_shortcuts.jpg" alt="alt text" title="Ubuntu Keyboard Shortcut" /></p> </li> <li> <p>Select software from the left-side launcher</p> <blockquote> <p>Method-1: Long press <code class="highlighter-rouge">Super</code> until numbers show on the left-side launcher; press the corresponding number to open the software.<br /> Method-2: use <code class="highlighter-rouge">Super + Tab</code> to switch (like how you use <code class="highlighter-rouge">Alt + Tab</code> to
switch programs)<br /> Method-3: use <code class="highlighter-rouge">Super + Shift + number</code>.</p> </blockquote> </li> <li> <p>Maximize/minimize or snap windows</p> <p><code class="highlighter-rouge">Ctrl + Super + up</code>         - maximize (until full screen)<br /> <code class="highlighter-rouge">Ctrl + Super + down</code>    - minimize (to default / until it is minimized to the launcher)<br /> <code class="highlighter-rouge">Ctrl + Super + left</code>    - move the window to the left<br /> <code class="highlighter-rouge">Ctrl + Super + right</code> - move the window to the right</p> </li> <li> <p>Open &amp; close terminal</p> <p><code class="highlighter-rouge">Ctrl + Alt + t</code> - open a new terminal<br /> <code class="highlighter-rouge">Ctrl + Shift + t</code> - open a new terminal tab<br /> <code class="highlighter-rouge">Ctrl + Shift + w</code> - close current terminal tab</p> </li> <li> <p>Use <code class="highlighter-rouge">cd</code> to travel anywhere in the file system on your Linux</p> </li> <li> <p>Remap <code class="highlighter-rouge">CapsLock</code> to <code class="highlighter-rouge">Ctrl</code> to increase efficiency and to ease finger pain; a guide -&gt; <a href="https://askubuntu.com/questions/33774/how-do-i-remap-the-caps-lock-and-ctrl-keys" target="_blank">How To Remap</a> (on X11, running <code class="highlighter-rouge">setxkbmap -option ctrl:nocaps</code> in a terminal also works)</p> <p>If you are comfortable with the current <code class="highlighter-rouge">Ctrl</code> position, then stay with it.</p> </li> <li> <p>Switch software</p> <p><code class="highlighter-rouge">Alt + Tab</code></p> </li> <li> <p>Open window menu</p> <p><code class="highlighter-rouge">Alt + space</code></p> <blockquote> <p>For example, I can press <code class="highlighter-rouge">Alt + space</code> to see a drop-down menu, then type <code class="highlighter-rouge">c</code> to close the selected software.</p> </blockquote> </li> <li> <p>Access the top drop-down menu (file menu) for most software</p> <p><code class="highlighter-rouge">Alt + </code> corresponding letter</p> <blockquote> <p>When you
press <code class="highlighter-rouge">Alt</code>, there will be a underline of a letter for each option in the top menu.<br /> This also works with the file system.</p> </blockquote> </li> <li> <p>Open files in GUI</p> <blockquote> <p>Type the filename; a small box will then appear on the right bottom of your screen;<br /> Corresponding file will be highlighted;<br /> Hit <code class="highlighter-rouge">Enter</code> to open.</p> </blockquote> </li> </ol> <p><a name="loc_vimium"></a></p> <h3 id="vimium"><a href="https://chrome.google.com/webstore/detail/vimium/dbepggeogbaibhgnhhndojpepiihcmeb?hl=en" target="_blank">Vimium</a></h3> <p>With <code class="highlighter-rouge">Vimium</code>, you can enable keyboard shortcuts for navigation and control on any web-pages.</p> <h4 id="most-used-commands">Most used commands:</h4> <blockquote> <p><code class="highlighter-rouge">f</code> - open a link on current page<br /> <code class="highlighter-rouge">F</code> - open a link on a new page<br />          Alternatively, <code class="highlighter-rouge">f</code> + <code class="highlighter-rouge">Shift</code> + corresponding char<br /> <code class="highlighter-rouge">h</code> - scroll screen left<br /> <code class="highlighter-rouge">l</code> - scroll screen right<br /> <code class="highlighter-rouge">j</code> - scroll screen down<br /> <code class="highlighter-rouge">k</code> - scroll screen up<br /> <code class="highlighter-rouge">u</code> - scroll page up<br /> <code class="highlighter-rouge">d</code> - scroll page down</p> </blockquote> <blockquote> <p><code class="highlighter-rouge">gg</code> - move to the top of the page<br /> <code class="highlighter-rouge">G</code>    - move down to the bottom of the page</p> </blockquote> <blockquote> <p><code class="highlighter-rouge">Shift + h</code> - go back to the previous page<br /> <code class="highlighter-rouge">Shift + l</code> - go to the next page<br /> <code class="highlighter-rouge">Shift + j</code> - move to one left tab<br /> 
<code class="highlighter-rouge">Shift + k</code> - move to one right tab</p> </blockquote> <h4 id="detailed-cheat-sheet">Detailed Cheat Sheet:</h4> <p><img src="/assets/images/posts/Work-Without-Mouse/vimium-cheatsheet-big.png" alt="alt text" title="Vimium Cheat Sheet" /></p> <p><a name="loc_vim"></a></p> <h3 id="vim"><a href="https://vim.sourceforge.io/" target="_blank">Vim</a></h3> <p>In fact, <code class="highlighter-rouge">Vimium</code> is built in the spirit of <code class="highlighter-rouge">Vim</code>. Hence, you can find some commands in common.</p> <blockquote> <p>Then why introduced <code class="highlighter-rouge">Vim</code> after <code class="highlighter-rouge">Vimium</code>?</p> </blockquote> <p>Because <code class="highlighter-rouge">Vim</code> is really a tool for life-long learning, and moreover, I just used it for few weeks, the stuff I know is merely the tip of an iceberg.</p> <p>Through the process of finding information about Vim,<br /> I found somebody said he has been using Vim <strong>over 15 years and still learning</strong>;<br /> I found somebody has his vimrc (Vim configuration file) <strong>over 1500 lines</strong>;<br /> I also found a joke says: “A highly-configured <code class="highlighter-rouge">Vim</code> can reach <strong>one-third</strong> performance of <code class="highlighter-rouge">Visual Studio.</code>” :)</p> <h4 id="basic-commands-case-sensitive">Basic commands (case-sensitive):</h4> <ol> <li> <p>Move cursor</p> <p><code class="highlighter-rouge">h</code> - moving left for one unit<br /> <code class="highlighter-rouge">l</code> - moving right for one unit<br /> <code class="highlighter-rouge">j</code> - moving down for one line<br /> <code class="highlighter-rouge">k</code> - moving up for one line</p> </li> <li> <p>Move screen</p> <p><code class="highlighter-rouge">Shift + PageUP/PageDown</code></p> </li> <li> <p>Search word/phrases</p> <p><code class="highlighter-rouge">/</code> + the word you are searching</p> <p><code 
class="highlighter-rouge">*</code> - next occurrence of the word<br /> <code class="highlighter-rouge">#</code> - previous occurrence of the word</p> </li> <li> <p>Save &amp; quit</p> <p><code class="highlighter-rouge">:w!</code> - save<br /> <code class="highlighter-rouge">:q!</code> - quit<br /> <code class="highlighter-rouge">:wq!</code> - save &amp; quit<br /> <code class="highlighter-rouge">:w !sudo tee %</code> - top permission to save &amp; quit</p> <blockquote> <p><code class="highlighter-rouge">!</code> stands for overwrite</p> </blockquote> </li> <li> <p>Enter Visual mode</p> <p><code class="highlighter-rouge">V</code>, then move cursor to select line(s)</p> </li> <li> <p>Copy &amp; paste</p> <p><code class="highlighter-rouge">dd</code> - cut current line<br /> <code class="highlighter-rouge">p</code> - paste the line after the cursor<br /> <code class="highlighter-rouge">P</code> - paste the line before the cursor</p> </li> <li> <p>Move to the end of a line.</p> <p><code class="highlighter-rouge">$</code> - simply move to the end of the line<br /> <code class="highlighter-rouge">A</code> - appending at the end of the line. (i.e. 
this automatically enters <code class="highlighter-rouge">insert</code> mode)</p> </li> <li> <p>Delete</p> <p><code class="highlighter-rouge">x</code> - delete the single character under the cursor<br /> <code class="highlighter-rouge">dw</code> - delete from the cursor to the start of the next word<br /> <code class="highlighter-rouge">d$</code> - delete from the cursor to the end of the line</p> </li> </ol> <h4 id="detailed-cheat-sheet-1">Detailed Cheat Sheet</h4> <p><img src="/assets/images/posts/Work-Without-Mouse/vim-cheatsheet.png" alt="alt text" title="Vim Cheat Sheet" /></p> <p><a name="loc_wasavi"></a></p> <h3 id="wasavi"><a href="https://github.com/akahuku/wasavi" target="_blank">Wasavi</a></h3> <blockquote> <p>wasavi is a clone of the vi editor and extends a TEXT-AREA element.<br /> If you focus a TEXT-AREA element and press <code class="highlighter-rouge">Ctrl+Enter</code>, the TEXT-AREA turns into a <code class="highlighter-rouge">vi</code> editor.</p> </blockquote> <p>A browser extension written by Japanese developers. It has commands similar to those of the <code class="highlighter-rouge">Vim</code> text editor.</p> <p>P.S. Can anyone tell me what <code class="highlighter-rouge">wasavi</code> means in Japanese? I could not find an answer…</p> <h3 id="browser-chrome">Browser (Chrome)</h3> <p>Below are some shortcut keys that work with Chrome. I believe these are common to most browsers, but I have only tested them with Chrome.
To be safe, I suggest you try these commands on blank pages first to avoid potential losses.</p> <p>Additionally, these commands are still useful even when you have <code class="highlighter-rouge">Vimium</code> installed, because you may use <code class="highlighter-rouge">Incognito</code> mode sometimes (unless you enable extensions in <code class="highlighter-rouge">Incognito</code> mode).</p> <h4 id="useful-commands">Useful commands:</h4> <p><code class="highlighter-rouge">Ctrl + w</code> - close current tab<br /> <code class="highlighter-rouge">Ctrl + t</code> - open a new tab<br /> <code class="highlighter-rouge">Ctrl + n</code> - open a new window</p> <p><code class="highlighter-rouge">Ctrl + 9</code> - go to the last tab<br /> <code class="highlighter-rouge">Ctrl + 1</code> to <code class="highlighter-rouge">Ctrl + 8</code> - go to the corresponding tab (counting from the left)</p> <p><code class="highlighter-rouge">Ctrl + Shift + t</code> - reopen the last closed tab<br /> <code class="highlighter-rouge">Ctrl + Shift + n</code> - enter <code class="highlighter-rouge">Incognito</code> mode</p> <p><code class="highlighter-rouge">Alt + left</code> - go back to the previous page<br /> <code class="highlighter-rouge">Alt + right</code> - go forward to the next page</p> <h2 id="useful-links">Useful Links</h2> <p><code class="highlighter-rouge">Vim</code>, <code class="highlighter-rouge">Vimium</code> and <code class="highlighter-rouge">wasavi</code> are open-source software on GitHub; you can find them here:</p> <ul> <li><a href="https://github.com/vim" target="_blank">Vim - Github</a></li> <li><a href="https://github.com/philc/vimium" target="_blank">Vimium - Github</a></li> <li><a href="https://github.com/akahuku/wasavi" target="_blank">wasavi - Github</a></li> </ul> <h2 id="references">References</h2> <ol> <li>Foye, P. M., Cianca, J. C., &amp; Prather, H. (2002). Cumulative trauma disorders of the upper limb in computer users.
Archives of Physical Medicine and Rehabilitation, 83. doi:10.1053/apmr.2002.32144<br /> Retrieved from http://www.archives-pmr.org/article/S0003-9993(02)80005-3/pdf</li> <li>De Quervain’s Tendinosis. (n.d.). American Academy of Orthopaedic Surgeons. Retrieved from http://orthoinfo.aaos.org/topic.cfm?topic=A00007</li> </ol>Yanqing WuA significant and continuous improvement in my working efficiency has been observed since the day I started to work with only a keyboard. As this working style makes me feel so delighted (and has many advantages), I consider it a must to share my thoughts and some basic techniques on how to work without a mouse.