tf*idf for text representation

by Raphaël Attias -

Hello,

If I understood correctly, tf*idf is used to weight the coordinates of a text's representation when comparing two texts (with, for example, a cosine similarity measure). I was wondering whether tf*idf could also be used to transform the indexing set associated with a document into a vector? At first I don't see any issue with this, and I would say that

* Binary vector with tf*idf weighting scheme for cosine comparison

and

* tf*idf to transform the indexing set into a vector and no weighting scheme for cosine comparison

would provide the same comparison results.

In reply to Raphaël Attias

Re: tf*idf for text representation

by Jean-Cédric Chappelier -

I am not sure I understand what you mean, and it seems to me you are mixing up a few concepts (sorry if you are not):
what is the purpose of a ``weighting scheme'' if not to produce a vector?

There are two things to clearly understand:

  1. the steps of the whole process;
  2. the different weights (and where they could be applied).
Regarding the steps:
  • first you have a document as the input, which, one way or another, ends up as an ordered list of tokens
  • from these tokens you keep ``a few'': the ones which correspond to the indexing terms
  • you then represent each document as a vector, the size of which is the number of different indexing terms
  • the coordinates of that vector are provided by the ``weighting schemes'' (more details below)
  • some metric or similarity is then used to measure the similarity between documents, as the metric on the corresponding vector space (a short sketch follows this list).
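To make these steps concrete, here is a minimal Python sketch. The tokenizer, the hand-picked indexing set, and the raw-tf coordinates are all illustrative assumptions, not the only possible choices:

```python
import math

# Illustrative, hand-picked indexing set: in practice it comes from the
# whole collection, and it is fixed before any document is vectorized.
INDEX_TERMS = ["cat", "dog", "mat"]

def tokenize(document):
    # Toy tokenizer: lowercase and split on whitespace.
    return document.lower().split()

def to_vector(document):
    # Keep only the tokens that are indexing terms, then count them:
    # one coordinate per indexing term (here: raw term frequency).
    tokens = tokenize(document)
    return [float(tokens.count(term)) for term in INDEX_TERMS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

d1 = to_vector("the cat sat on the mat")  # [1.0, 0.0, 1.0]
d2 = to_vector("the dog sat on the mat")  # [0.0, 1.0, 1.0]
print(cosine(d1, d2))                     # 0.5
```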
Regarding the weightings:
there are generally three types of weighting (maybe not all present):
  • weights involving the indexing term and the document: typically tf;
  • weights involving the term only (and not the document): typically idf;
  • weights involving the document only (and not each term): typically the norm of the vector.
The first two HAVE TO be part of the computation of the vector coordinates (because they are related to the term, i.e. to the coordinate!). The last one can either be in the coordinates or in the metric: you could for instance consider having a dot-product similarity on a space of unit (normalized) vectors, or having a cosine similarity on non-normalized vectors; these two are equivalent (see the numeric check below).
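To make this equivalence concrete, here is a small numeric check (the vectors are made up for illustration): the cosine similarity of two raw vectors equals the plain dot product of their normalized versions.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u, v = [3.0, 0.0, 4.0], [1.0, 2.0, 2.0]

# Cosine similarity on the non-normalized vectors ...
cos_raw = dot(u, v) / (norm(u) * norm(v))

# ... equals the dot product once each vector is normalized (unit vectors).
u_unit = [x / norm(u) for x in u]
v_unit = [x / norm(v) for x in v]
print(cos_raw, dot(u_unit, v_unit))  # both print 0.7333...
```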

If you consider the coordinate-related weights to be part of "the metric", then it's not a metric/similarity anymore on the whole space (because this function changes for each document pair: it's not the same function anymore).

Makes sense?

[Edited: changed ``metric'' to ``similarity'' where more appropriate]
In reply to Jean-Cédric Chappelier

Re: tf*idf for text representation

by Raphaël Attias -

Thank you very much for the detailed answer; it is clearer now. I still have some questions.

First, you said "first you have a document as the input, which, one way or another, ends up as an ordered list of tokens; from these tokens you keep ``a few'': the ones which correspond to the indexing terms". But then each document will have a different set of indexing terms. How can two documents have their vector representations belong to the same vector space if their vectors have different dimensions/features? Shouldn't the vector space be document independent, but dependent on the collection of documents?

To remedy this problem, could we just set the dimension of the vector space to the number of different indexing terms across all documents? Then set the coordinates of a vector to the tf*idf of each feature, which would be 0 for indexing terms that do not appear in the document.

In summary, I do not understand why the feature vectors, as you said, should have the size of the number of indexing terms. We would then be comparing vectors (documents) of different dimensions.

In reply to Raphaël Attias

Re: tf*idf for text representation

by Jean-Cédric Chappelier -

The choice of the indexing set is done PRIOR TO indexing the documents;
otherwise you cannot do anything (e.g. what would the vector (1) represent [a document made of only one single word] if you changed the indexing terms for each document? I mean the documents "cat" and "dog" would both be the 1D vector (1), which does not make any sense).

The dimension of the vector space is then the size of the whole indexing set.
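As an illustration, here is a toy sketch with a hypothetical three-document collection: the indexing set is built once from the whole collection, and absent terms simply get coordinate 0. The tf*idf variant used, tf times log(N/df), is one common choice among several:

```python
import math

# Hypothetical toy collection; the indexing set is fixed from the WHOLE
# collection before any single document is turned into a vector.
docs = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
]
# Toy stop-word removal; every document is indexed against the same set.
index_set = sorted({t for d in docs for t in d.lower().split()
                    if t not in {"the", "on"}})

def tf_idf_vector(doc):
    tokens = doc.lower().split()
    vec = []
    for term in index_set:
        tf = tokens.count(term)                            # 0 if absent
        df = sum(term in d.lower().split() for d in docs)  # document frequency
        vec.append(tf * math.log(len(docs) / df))          # one common variant
    return vec

# All three vectors live in the same 5-dimensional space
# (one coordinate per term: cat, chased, dog, mat, sat).
for d in docs:
    print(tf_idf_vector(d))
```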

Does it answer your question?