Read about the updated version Sum+My here.
Automatic Summarization for the Greek language is the title of the undergraduate thesis I completed almost a year ago. Because of the obvious difficulties, I chose to use a PHP framework for the first time. I wanted to focus on the task at hand without worrying about the basics of a web platform. I could have gone with a CMS like Joomla or Drupal, but at the time I found Zend Framework 1.xx.x to be much better suited to the job because of its excellent documentation and online resources.
Automatic summarization is a broad subject in the field of natural language processing. There are two main methodologies you can follow: extraction and abstraction. The second is the harder one and involves machine learning techniques (AI). Basically, the machine has to learn how to produce summaries, pretty much like a human. The extraction model (shallow) is based on maths and doesn't really create sentences from scratch. All the sentences come directly from the original document without any alteration.
For an undergraduate thesis, working on the abstraction model is a little too much, and honestly a semester is not enough time, so I based the application on shallow methods. In an attempt not to simply reproduce the few sources out there, I tried to use and combine as many algorithms as possible. The system produces three scores for each sentence (terms, position and keywords), which can be combined with different weights to evaluate each sentence's overall score.
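To make the weighting idea concrete, here is a minimal sketch in Python. It is not the PHP code of the application. The function names, the default weights and the choice of a plain weighted sum are all assumptions made for illustration.

```python
def combine_scores(terms, position, keywords, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three per-sentence scores.

    The default weights are hypothetical; in the application the user
    picks them.
    """
    w_terms, w_position, w_keywords = weights
    return w_terms * terms + w_position * position + w_keywords * keywords


def top_sentences(scored, n):
    """Return the n best-scoring sentences, kept in document order.

    scored: list of (sentence, score) pairs in document order.
    """
    # Indices of the n highest scores.
    best = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)[:n]
    # Re-sort by position so the extract reads in the original order.
    return [scored[i][0] for i in sorted(best)]
```

Since extraction never rewrites sentences, the summary is simply the selected sentences in the order they appeared in the document.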
For the Terms score the user can choose between:
- TF-ISF (Term Frequency - Inverse Sentence Frequency)
- TF-IDF (Term Frequency - Inverse Document Frequency)
- TF-RIDF (Term Frequency - Residual Inverse Document Frequency)
Are you still here, reader? OK, stay with me, I will be quick. Basically, each method produces a score for each word, based on one of the following:
- On all the words inside the document
- On all the words inside a collection of documents
- On all the words inside a collection of documents, combined with a Poisson distribution model
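As an illustration of the first option, here is a minimal Python sketch of TF-ISF. It is not the application's PHP code. It assumes sentences arrive as lists of tokens that are already stemmed and stripped of stop words, and it assumes a sentence's score is the mean TF-ISF of its words, which is one common choice among several.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Score each tokenised sentence by the mean TF-ISF of its words.

    sentences: list of token lists, one per sentence of a single document.
    """
    n = len(sentences)
    # Sentence frequency: in how many sentences does each word appear?
    sf = Counter(word for sent in sentences for word in set(sent))
    scores = []
    for sent in sentences:
        if not sent:
            scores.append(0.0)
            continue
        tf = Counter(sent)
        # A word found in every sentence gets log(n / n) = 0, so it adds nothing.
        total = sum(count * math.log(n / sf[word]) for word, count in tf.items())
        scores.append(total / len(sent))
    return scores
```

TF-IDF follows the same pattern, with the frequencies counted over a collection of documents instead of the sentences of one document.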
For the Position score the user can choose between:
- Baxendale's research
- News article
Baxendale was a researcher who concluded that in 85% of paragraphs the topic sentence was the first one, and in 7% of paragraphs it was the last one. Thus, a naive but fairly accurate way to select a topic sentence is to choose one of these two. The News article algorithm basically scores sentences dynamically and clearly favours the first sentences of the first paragraphs.
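Both position methods can be sketched in a few lines of Python. This is only an illustration, not the application's PHP code. Using Baxendale's percentages directly as scores is an assumption, and the decay formula in `news_score` is a hypothetical stand-in that merely favours early sentences of early paragraphs.

```python
def baxendale_score(index, paragraph_length):
    """Position score inside one paragraph (index is 0-based).

    Assumption: the 85% / 7% figures are used directly as scores and
    every other sentence gets 0.
    """
    if index == 0:
        return 0.85
    if index == paragraph_length - 1:
        return 0.07
    return 0.0


def news_score(paragraph_index, sentence_index):
    """Hypothetical news-style score: highest for the first sentence of
    the first paragraph, decaying with both indices (0-based)."""
    return 1.0 / ((paragraph_index + 1) * (sentence_index + 1))
```

With these definitions, the opening sentence of a news article scores 1.0, the second sentence of the same paragraph 0.5, and the second sentence of the second paragraph 0.25.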
The Keywords score is actually a cheat: by providing keywords, the user lets the system score sentences that contain them higher than the others. This way the system finds the key sentences more easily, instead of guessing like the scoring methods above. Additionally, the user can set a words-per-sentence threshold so the system ignores sentences that are too long or too short. The system also uses stop word lists to ignore common terms, as well as a Greek language stemming algorithm to group words.
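A minimal Python sketch of these helpers follows. It is not the application's PHP code. The function names, the default thresholds and the "fraction of keywords matched" formula are assumptions, and stemming is left out because it depends on the Greek stemmer in use.

```python
def keyword_score(tokens, keywords):
    """Fraction of the user's keywords that appear in the sentence."""
    wanted = set(keywords)
    if not wanted:
        return 0.0
    return len(set(tokens) & wanted) / len(wanted)


def within_length(tokens, min_words=5, max_words=40):
    """True if the sentence length lies inside the (hypothetical) threshold."""
    return min_words <= len(tokens) <= max_words


def remove_stop_words(tokens, stop_words):
    """Drop common terms so they do not influence the scores."""
    return [t for t in tokens if t not in stop_words]
```

In a real pipeline the keywords would need to go through the same stemming as the sentence tokens, otherwise inflected forms of a keyword would not match.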
The application is hosted here: http://thesis.t3-design.com/, soon to be hosted on this domain. If you are not Greek you probably won't understand much, but hey, it was my first Zend application and I am really proud of it. Please take a look at the credits section; I couldn't have done it without them.
Time for some pictures (new files system in place ;p)