Hello,
For the mark scheme of the 2018 exam, for question 1.2, you wrote that
H(X) ≤ log₂(33) < Lc(Huffman(X))
However, the second inequality is not always true. In this particular question, the "worst-case scenario" is when 4 letters appear thrice, and the other 29 each appear four times. Calculating the average length of our Huffman code, we get 5.046875, which is marginally greater than log₂(33). However, for any other distribution of letters, the average Huffman length is less than log₂(33). (This also means the answer to the question itself should be no and not yes, since an average length of 5.5 is impossible.)
What we do have are the two inequalities
H(X) ≤ log₂(n)
and
H(X) ≤ Lc(Huffman(X))
However, these two together do not imply the second inequality. Furthermore, whilst one might argue that when the letters are equally distributed we have H(X) = log₂(n), this can only occur when the number of letters in the word is a multiple of n. Otherwise, the second inequality is not necessarily true. For example, if n = 3, and the word is of length 4, we have only a few possible letter-frequency distributions ( (4,0,0), (3,1,0), (2,2,0) and (2,1,1) ) of which the "worst-case scenario" (maximal average-length) is the latter (so for example the word ADAM). For this word, we could encode the letters as follows:
- A = 0
- D = 10
- M = 11
which gives an average Huffman-length of 6/4 = 1.5. But this is inferior to log₂(3), and so for all possible words of length 4, including the worst case, the second inequality is incorrect. Indeed, if your question posed that the word length was 64 instead of 128, then the second inequality would also never hold in that case, as the longest possible Huffman-length occurs with a distribution of (1, 1, 2, 2, .... 2), which gives a length of 5.03125. But this is inferior to log₂(33) !
Regards,
Tanish Patil