Latent-Semantic-Algorithms

From ActiveMathWiki

Table of contents

Improving the ActiveMath search using latent semantic algorithms

In every good relationship, once you come to a point where you stop listening at what your partner says and start understanding what your partner means. ActiveMath wants to lead a good relationship with its users.

Though, the goal of this work is to significantly improve the search engines results in terms of finding out what field of knowledge it most likely points at and the possibility to ask natural-language questions. As long as they're not covering your relationship.

What is the idea behind?

Basically, it's all about similarity between the users questions and the information we can obtain from our content. In a way we're already doing that, but we're taking things one step further now, from literal analysis to semantic understanding.

How does that work?

To achive that, we try to find out what our user points at by a perticular term. Some clever people developed a method that works pretty well in this context. One flavor is called Latent Semantic Indexing, the next Latent Semantic Analysis.

They share a smart idea: the semantics of a term mustn't be defined explicity, but, if you have enough documents in your pocket, can be derived statistically from their contents.

Let's make that point clear: if you accindentally find a document in your pocket containing the words 'Ferrari', 'Porsche' and 'Lamborghini' you can be pretty sure it's the one to offer if someone asked you about prehistoric energy-inefficiency.

What is the difference between LSI and LSA then?

Simply, LSI refers to using the approach for indexing or information retrieval; LSA refers to all other applications.

TODO: psychology

Latent Semantic Analysis is a statistical, corpus-based language understanding technique that estimates similarities on a scale of -1 to 1 between the latent semantic structure of terms and texts [Deerwester et al., 1990]. The input to LSA is a set of corpora segmented into documents. These documents are typically paragraphs or sentences. Mathematical transformations create a large term-document matrix from the input.


Abstract of Cairns-Paper on LSI

The paper describes an LSI-based approach to retrieve information from a sufficiently large database of mathematical content, the database of Mizar articles. [Paul Cairns, UCL Interaction Centre University College London, London WC1E 7DP, UK, p.cairns@ucl.ac.uk]

The Mizar Database

The Mizar project started around 1973 as an attempt to reconstruct mathematical vernacular in a computer-oriented environment. [1] (http://mizar.org)

The Mizar Mathematical Library (version 4.50.934) includes 934 articles written by 179 authors and 42013 theorems, 7916 definitions, 723 schemes, 6844 registrations, 5853 symbols, 1903 keywords. [2] (http://merak.pb.bialystok.pl/)

Advantages of the LSI approach

Different search techniques have been used to extract the desired information from the database:

  • isomorphisms [Delahaye D.: Information Retrieval in Coq Proof Library using Type Isomorphisms. In T. Coquand et al. (eds), TYPES, LNCS 1956 (2000) 131-147]
  • metadata [Miller B. R., Youssef A.: Technical Aspects of the Digital Library of Mathematical Functions. Annals of Mathematics and Art. Intelligence 38 (2003)]
  • A combination of metadata and reasoning [Baumgartner, P., Furbach, U.: Automated Deduction Techniques for the Management of Personalized Documents. Annals of Mathematics and Art. Intelligence 38 (2003) 211-288]

With metadata, there is of course the possibility to make a search engine very effective but then there is the overhead of who must prepare the metadata. Authors of webpages are already very poor at adding metadata and there is no suggestion that authors of mathematics are likely to be any better.

Latent semantic indexing avoids these issues entirely as the semantics of the documents to which it is applied are never explictly represented. Instead, the semantics of terms are implicitly inferred from contexts in which they occur even if, as in the case of mathematics, the contexts are sometimes the definitions of the terms. [Cairns-Paper]

Prof. Weickum p. 41 (http://www.mpi-inf.mpg.de/units/ag5/teaching/ss05/is05/folien/kap2.pdf,) points out the following advantages

  • implicit thesaurus because of correlations of synonyms
  • implicit discrimination of the different meanings of polysems (through diffrent correlations)
  • better efficency on closed corpora, e.g. financial news, collections of patents, ...

and disadvantages

  • finding a good k (below)
  • big efforts on memory and computation for very large but sparse matrices
  • no convincing results for web-search-engines


The mathematics of LSI

Good source, too: http://infomap-nlp.sourceforge.net/doc/algorithm.html

Mathematically spoken, the goal of LSI is transforming document vectors from a high dimensional 'term space' to a 'topic space' using

  • correlations between terms ('web' and 'internet' are likely to appear together)
  • implicit differentiation of polysems which discriminate through their correlations with other terms ('Java' together with 'Library' differs from 'Java' together with 'Borneo')

On this fundament we can now derive the mathematical background.

Given:

  • m terms, n documents (and most commonly n >> m)
  • a m x n matrix A expressing the term-document between these
  • k with k << m

Wanted:

  • a preferably good - and similarity-keeping - map of the comlumn vectors into a k-dimensional vector space

As a next step we need the tools to achieve this.

Theorem 1: SVD

Any real-valued m x n-matrix can be separated into the form A = U x D x V^T with

  • a m x r-matrix U consisting of orthonormal column-vectors
  • r x r-diagonal-matrix D and
  • a n x r-matrix V with orthonormal column-vectors

This separation is called 'singular value decomposition' and is unique, if the elements of D are in a descending order.

Theorem 2: Contents of the SVD

In the SVD A = U x D x V^T the matrices are defined as follows:

  • D contains the singular values of A (which means the positive square roots of the eigenvalues of A^T x A )
  • the column vectors of U are the eigen vectors of A x A^T
  • the column vectors of V are the eigen vectors of A^T x A


Indexing and query analysis

  • The matrix DkVk^T corresponds to a 'topic-index' and is to be administrated in a adequate data structure
  • Additionally the 'term-topic'-map Uk has to be stored
  • A query q (a m x 1 column vector) in term-space is transformed to a query q' = Uk^T x q (a k x 1-column-vector) and then analyzed in topic-space (Vk) (using a adequate distance measure; e.g. scalar-product-similarity or cosine-distance)
  • a new document d (a m x 1-column-vector) is transformed into d' = Uk^T x d (a k x 1-column-vector) and added as a new column to the 'Index' Vk^T ('folding-in')

http://www.mpi-inf.mpg.de/units/ag5/teaching/ss05/is05/folien/kap2.pdf p2-29ff

Comments from Pauls Paper

[Fridolin Wild, Christina Stahl, Gerald Stermsek and Gustaf Neumann - PARAMETERS DRIVING EFFECTIVENESS OF AUTOMATED ESSAY SCORING WITH LSA]

Choice of Dimensionality

Among other factors the choice of dimensionality has a significant impact on the results of LSA. After calculating the singular value decomposition from the original term-document matrix, a reduced matrix is reconstructed using only the k-largest singular values. This aims at obtaining an approximation to the original vector space, which captures the most important structure but reduces noise and variability in word usage (Berry et al., 1995). For our tests we considered the following alternatives to determine the number of selected factors in the approximated vector space:

  • Percentage of cumulated singular values (share): Using a normalized vector of cumulated singular values we can sum up singular values until we reach a specific value. In our paper we suggest to use 50%, 40% and 30% of the cumulative summed up singular values.
  • Absolute value of cumulated singular values equals number of

documents (ndocs): The sum of the first k singular values factors equals the number of documents n used in the analysis.

  • Percentage of number of terms (1/30, 1/50): Alternatively the number of factors can be determined by a fraction of all terms indexed. Commonly used fractions are 1/30 or 1/50.
  • Fixed number of factors (magic 10): A less sophisticated and inflexible approach is to use a fixed number of factors, e.g. 10 factors. The number has to be determined depending on the text corpus.


Similarity Measures & Methods

Finally, we tested three similarity measures: Pearson-correlation, Spearman’s rho and Cosine. Although the cosine is the most commonly used measure for comparing LSA-vectors and usually works best for information retrieval (Landauer and Dumais, 1997) our past experiments returned the best result for Spearman’s rho. Additionally we investigated the different values for both, the correlation with the best matching golden essay (max) and the mean correlation of all three golden essays (mean) resulting in 2 x 3 = 6 different settings.



Some cool links

http://lsi.argreenhouse.com/lsi/papers/PSYCHREV96.html http://lsa.colorado.edu/

http://www.msci.memphis.edu/~wiemerhp/trg/lsa-followup.html http://www.cs.utk.edu/~lsi/ http://superbook.telcordia.com/~remde/lsi http://www-psych.nmsu.edu/~pfoltz/ http://www.people.vcu.edu/~efmartin2/Papers%20&Presentations/Assessment_Outcome_Coherence_Using_LSA_Scoring.pdf http://64.233.183.104/search?q=cache:6kKHppCNDhcJ:home.autotutor.org/zcai/719.pdf+lsa+algorithm&hl=de http://mizar.org http://www.mpi-inf.mpg.de/units/ag5/teaching/ss05/is05/material.html

TODO

Paul:

I found this in the Cairns-Paper on LSI:

This paper therefore treats formal mathematics as a representation of mathematical language in general. Mizar is particularly strong in providing a wide range of constructs for mathematical concepts that allow its formal proofs to be more like the less formal proofs found in common mathematical texts. In this sense, the MML is actually a reliable representation of the more usual, informal mathematical language.

  • our database is based on a different knowledge-representation, logic-based and not informal like Mizar. How to handle this?
    • Our database is much less formal than Mizar and even much more text oriented so what I understand in this statement rather means that Mizar is resembling text corpus so is probably applicable for the retrieval techniques. Paul
  • the queries seem to be very similar to natural-language (e.g. theorem Int Subset of carrier of TopStruct c= Subset of carrier of TopStruct). Is that what we want?
    • I don't think they are nor are they... you may ask Paul C. It's the proper of a nicely chosen example to have such look Paul
  • Phillips mediator is based on metadata on third party-websites. Why not tune it with LSI, as the disadvantages of metadata are obvious?
    • I don't get what it could be to tune LSI with the mediator... do you mean something such as metadata generation ? Sure... I definitely want to do this and use LSI... pressure is on you! Paul