Due to the imposed Lipschitzness of the reward (with Lipschitz constant $L$), the reward around each grid point changes at most linearly in the distance from that point. Moreover, since the grid points are spaced at a distance of $\epsilon$ apart, we can approximate the regret within each interval around a given grid point by the regret at the grid point itself, being off by at most $L\epsilon$ per round.
The only thing that remains is then to add the approximation error $L\epsilon T$ to the bound obtained in i), plug in $K = 1/\epsilon$, and optimize for $\epsilon$ (set the derivative w.r.t. $\epsilon$ equal to $0$).
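Spelled out, and assuming the bound from i) has the standard UCB-type form $O(\sqrt{KT\log T})$ for $K$ arms (adapt the constants to whatever i) actually gives), the combination reads:

$$
\mathrm{Regret}(T) \;\le\; \underbrace{O\!\big(\sqrt{KT\log T}\big)}_{\text{bound from i)}} \;+\; \underbrace{L\epsilon T}_{\text{approximation error}}, \qquad K = \lceil 1/\epsilon \rceil,
$$

which after plugging in $K \approx 1/\epsilon$ becomes the single-variable objective $O\!\big(\sqrt{T\log T/\epsilon}\big) + L\epsilon T$ to be minimized over $\epsilon \in (0,1)$.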
My first guess was actually correct. Applying the first-order optimality condition is permitted here because, asymptotically (for $T \to \infty$), the leading term $\sqrt{T\log T/\epsilon} + L\epsilon T$ (which is convex in $\epsilon$!) dominates the lower-order terms hidden in the big-O. Hence the objective function is (informally speaking) "asymptotically convex", and setting the first derivative to zero is sufficient and necessary for obtaining the (asymptotically) optimal choice of $\epsilon$.
Ignoring the big-O notation for a moment, the first derivative is equal to $\frac{d}{d\epsilon}\big(\sqrt{T\log T/\epsilon} + L\epsilon T\big) = -\tfrac{1}{2}\sqrt{T\log T}\,\epsilon^{-3/2} + LT$. Setting this to zero and solving for $\epsilon$, you obtain the optimal choice $\epsilon^* = \Theta\!\big((\log T / (L^2 T))^{1/3}\big)$ (as I explained above, the first-order optimality condition holds even though the function is, strictly speaking, not convex).
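As a quick sanity check on the calculus, here is a small Python sketch. It assumes the part-i) bound is $\sqrt{KT\log T}$ with $K = 1/\epsilon$ and all constants suppressed (an assumption for illustration), and compares the closed-form minimizer from the first-order condition against a brute-force grid search.

```python
import math

def objective(eps, T, L):
    """Regret bound with constants suppressed:
    sqrt(K * T * log T) with K = 1/eps, plus approximation error L * eps * T."""
    return math.sqrt(T * math.log(T) / eps) + L * eps * T

def analytic_eps(T, L):
    """Root of the derivative: -1/2 * sqrt(T log T) * eps^(-3/2) + L*T = 0,
    i.e. eps = (sqrt(T log T) / (2 L T))^(2/3)."""
    return (math.sqrt(T * math.log(T)) / (2 * L * T)) ** (2 / 3)

def grid_search_eps(T, L, n=200_000):
    """Brute-force minimizer of the objective over a fine grid of eps in (0, 1)."""
    best_eps, best_val = None, float("inf")
    for i in range(1, n):
        eps = i / n
        val = objective(eps, T, L)
        if val < best_val:
            best_eps, best_val = eps, val
    return best_eps

T, L = 10**6, 1.0
ea = analytic_eps(T, L)
eg = grid_search_eps(T, L)
print(ea, eg)  # the two minimizers should agree up to the grid resolution
```

Both minimizers land near $\epsilon \approx 0.015$ for $T = 10^6$, $L = 1$, consistent with the $\Theta((\log T / T)^{1/3})$ scaling derived above.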