Slide 14: The assumption is that the box is featureless (no texture) but distinct from the background, and that we are looking only at the local window highlighted in the figure. That window looks exactly the same in both images, so nothing inside it reveals any motion. The only way to infer that the entire box is moving is through global reasoning.
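To make this concrete, here is a toy sketch in plain Python. The image size, the box position and the window coordinates are made up for illustration and are not taken from the slide; the only point is that a window lying fully inside a uniform box does not change when the box shifts.

```python
def make_frame(size, top, left, box):
    """Return a size x size image (list of rows): a uniform box of 1s on a background of 0s."""
    return [[1 if top <= r < top + box and left <= c < left + box else 0
             for c in range(size)]
            for r in range(size)]


def crop(img, top, left, height, width):
    """Cut a height x width window out of the image."""
    return [row[left:left + width] for row in img[top:top + height]]


# Two frames: the featureless box moves one pixel to the right.
frame_a = make_frame(8, top=2, left=1, box=4)
frame_b = make_frame(8, top=2, left=2, box=4)

# A local window that lies inside the box in both frames.
window_a = crop(frame_a, top=3, left=3, height=2, width=2)
window_b = crop(frame_b, top=3, left=3, height=2, width=2)

windows_identical = window_a == window_b   # local view: no visible change
frames_identical = frame_a == frame_b      # global view: the box has moved

print(windows_identical, frames_identical)  # True False
```

Looking only at the window gives no evidence of motion; comparing the whole frames (or at least a region that includes the box boundary) does.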
Slide 17: I assume you actually mean slides 19 and 20? These slides depict a standard way to overcome the problem from your previous question. Many industrial methods give up on tracking all pixels (dense motion field estimation) and instead focus on a small set of keypoints that stand out from the rest of the image. If you can match them across images, you can still perform the same reasoning as you would with a dense motion field, except in a sparse manner, which yields a sparse point cloud (slide 20) as your 3D outcome.
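As a rough illustration of the matching step, here is a sketch using NumPy. It assumes each keypoint already comes with a descriptor vector, and it uses mutual nearest-neighbour matching, which is one common choice and not necessarily what the slides use. The function name match_keypoints and the random descriptors are hypothetical.

```python
import numpy as np


def match_keypoints(desc_a, desc_b):
    """Mutual nearest-neighbour matching between two sets of descriptors.

    Returns a list of (index_in_a, index_in_b) pairs.
    """
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    best_b_for_a = dists.argmin(axis=1)
    best_a_for_b = dists.argmin(axis=0)
    # Keep a pair only if each keypoint is the other's nearest neighbour.
    return [(i, int(j)) for i, j in enumerate(best_b_for_a)
            if best_a_for_b[j] == i]


# Made-up data: image B sees the same 5 keypoints as image A,
# in a different order and with slightly perturbed descriptors.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(5, 8))
perm = np.array([2, 0, 4, 1, 3])
desc_b = desc_a[perm] + 0.01 * rng.normal(size=(5, 8))

matches = match_keypoints(desc_a, desc_b)
print(matches)
```

Each matched pair is one sparse correspondence; a set of such pairs across images is what the reconstruction then works with.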
Slide 21: You mean 23? M is the rotation/scaling component and t is the translation component of a general projection matrix P. The _i subscript indicates that it relates to the projection parameters of the i-th image.
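A small numeric sketch of that split, using NumPy. It assumes an affine camera, so M_i is 2x3 and t_i is a 2-vector; a full perspective camera would instead have a 3x3 M followed by a division by depth. All the numbers are made up.

```python
import numpy as np

# Hypothetical parameters of the i-th image (affine camera assumed).
M_i = np.array([[2.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])      # scales x and y by 2, ignores depth
t_i = np.array([10.0, 5.0])            # translation in the image plane

# The projection matrix is M_i and t_i placed side by side: P_i = [M_i | t_i].
P_i = np.hstack([M_i, t_i[:, None]])   # shape (2, 4)

X = np.array([1.0, 2.0, 3.0])          # a 3D point

x_from_parts = M_i @ X + t_i           # apply M, then add t
x_from_P = P_i @ np.append(X, 1.0)     # same thing in homogeneous coordinates

print(x_from_parts, x_from_P)
```

Both expressions give the same 2D point, which is why P can be discussed either as a single matrix or as the pair (M, t).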
Slide 25: (assuming 27): U, D, V are the matrices resulting from the singular value decomposition (SVD) of W. In case you're not familiar, SVD represents a general matrix A as a product U D V^T of three matrices, where U and V are unitary (orthogonal in the real case) and D is diagonal with non-negative entries. SVD has a lot of useful properties; in particular, it can be used as an algorithmic primitive for solving least squares problems, which is the case here. This is a bit outside the scope of what I can explain in a forum post, but if you need more explanation in person, please send me an e-mail.
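For a feel of how SVD serves as a least-squares primitive, here is a generic sketch using NumPy. The matrix A, the vector b and the random data are made up for illustration; this is not the W from the slides, just the general pattern of solving min ||A x - b|| through the decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up overdetermined system: 6 equations, 3 unknowns.
A = rng.normal(size=(6, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true

# Thin SVD: A = U @ diag(d) @ Vt, with d holding the non-negative singular values.
U, d, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(d) @ Vt, A)

# Least-squares solution via the pseudo-inverse: x = V diag(1/d) U^T b.
# (Dividing by d assumes A has full column rank, so no singular value is zero.)
x_svd = Vt.T @ ((U.T @ b) / d)

# Cross-check against NumPy's built-in least-squares solver.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]

print(x_svd)
```

Here b was generated without noise, so the recovered x_svd matches x_true up to rounding; with noisy data the same formula returns the minimiser of the squared error.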
Slide 30: Again assuming you mean 32 (sequential structure from motion): I'm not sure what you're asking about :(