PoS exercise part 1

by Valentin Oliver Loftsson

I was looking at the solution to the exercise and I have a question about part one:


The frequency distribution before lemmatization is generated like this:

fdist_before = nltk.FreqDist(word.lower() for (word, tag) in brown_tagged)

which yields 49815 distinct words in the corpus. That's fair enough.

Now, the distribution after lemmatization is generated like so:

fdist_after = nltk.FreqDist(lem.lemmatize(word, get_wordnet_pos(tag)).lower() for (word, tag) in brown_tagged)

which yields 41272 distinct lemmas. However, notice that the word passed into the lemmatization function is not lowercased; instead, the resulting lemma is lowercased. This means that capitalized words will be interpreted as proper nouns by the lemmatizer, even though in the corpus they might simply be at the start of a sentence. Shouldn't we rather have:

lem.lemmatize(word.lower(), get_wordnet_pos(tag)) for (word, tag) in brown_tagged

This reduces the number of distinct lemmas to 39358, i.e. a reduction of ~21% relative to the original 49815 distinct words (instead of ~17% with 41272).
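
For reference, here is a self-contained sketch of the comparison. The get_wordnet_pos below is only my guess at the solution's helper (mapping universal tags to WordNet POS constants and defaulting to noun), so the exact counts may differ slightly from the ones above:

import nltk
from nltk.corpus import brown, wordnet
from nltk.stem import WordNetLemmatizer

# Assumed setup (requires nltk.download() of 'brown', 'universal_tagset', 'wordnet'):
brown_tagged = brown.tagged_words(tagset='universal')
lem = WordNetLemmatizer()

def get_wordnet_pos(tag):
    # Hypothetical mapping from universal tags to WordNet POS; the exercise's own helper may differ.
    return {'NOUN': wordnet.NOUN, 'VERB': wordnet.VERB,
            'ADJ': wordnet.ADJ, 'ADV': wordnet.ADV}.get(tag, wordnet.NOUN)

# Before lemmatization: distinct lowercased word forms.
fdist_before = nltk.FreqDist(word.lower() for (word, tag) in brown_tagged)

# Solution as given: lemmatize the original form, then lowercase the lemma.
fdist_after = nltk.FreqDist(
    lem.lemmatize(word, get_wordnet_pos(tag)).lower()
    for (word, tag) in brown_tagged)

# Proposed variant: lowercase first, then lemmatize.
fdist_lower_first = nltk.FreqDist(
    lem.lemmatize(word.lower(), get_wordnet_pos(tag))
    for (word, tag) in brown_tagged)

print(len(fdist_before), len(fdist_after), len(fdist_lower_first))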


In reply to Valentin Oliver Loftsson

Re: PoS exercise part 1

by Jean-Cédric Chappelier
Very good question indeed!
(and you're right)

Actually, the corpus contains many proper nouns, so in the most accurate processing the upper case should indeed be taken into account.
However, to really cope with this, we would need either (or even both):
  • distinct common-noun/proper-noun tags, which we don't have in the "universal" tagset (see the quick check below);
  • or an end-of-sentence detector, which we did not introduce.
So, as a simplifying assumption, we decided not to take this upper-/lower-case distinction into account.
Once that decision is made, then, you're right: we should be consistent and stick to it, exactly as you suggest.
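
To illustrate the first point, here is a quick check (assuming NLTK's universal_tagset mapping resource is installed) showing that both the Brown common-noun tag NN and the proper-noun tag NP are collapsed onto the single universal tag NOUN:

from nltk.tag import map_tag

# Requires nltk.download('universal_tagset').
print(map_tag('en-brown', 'universal', 'NN'))   # NOUN (common noun)
print(map_tag('en-brown', 'universal', 'NP'))   # NOUN (proper noun)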

Thanks for pointing this out!