Read about the updated version Sum+My here.

Automatic Summarization for the Greek Language is the title of the undergraduate thesis I completed almost a year ago. Because of the obvious difficulties, I chose to use a PHP framework for the first time. I wanted to focus on the task at hand without worrying about the basics of a web platform. I could have gone with a CMS like Joomla or Drupal, but at the time I found Zend Framework 1.xx.x to be much better for the job because of its excellent documentation and online resources.


Automatic summarization is a broad subject in natural language processing. There are two main methodologies you can follow: extraction and abstraction. Abstraction is the harder of the two and involves machine learning techniques (AI): basically the machine has to learn how to produce summaries, pretty much like a human. The extraction model (shallow) is based on maths and doesn't really create sentences from scratch; all the sentences come directly from the original document without any alteration.

For an undergraduate thesis, working on the abstraction model is a little too much, and honestly a semester is not enough time, so I based the application on shallow methods. In an attempt not to just reproduce the few sources out there, I tried to use and combine as many algorithms as possible. For each sentence the system produces three scores (terms, position and keywords), which can be combined with different weights to evaluate the sentence's overall score.
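
To make the weighting idea concrete, here is a minimal Python sketch (not the PHP code from the thesis). It assumes each of the three scores is already normalised to the range 0..1. The function names, the dictionary keys and the default weights are all made up for illustration.

```python
def combine_scores(terms, position, keywords, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three per-sentence scores.

    The default weights are hypothetical; in the application the user picks them.
    """
    w_terms, w_position, w_keywords = weights
    return w_terms * terms + w_position * position + w_keywords * keywords


def rank_sentences(scored, weights=(0.5, 0.3, 0.2), top_n=2):
    """Pick the top_n sentences by combined score, kept in document order."""
    totals = [
        (index, combine_scores(s["terms"], s["position"], s["keywords"], weights))
        for index, s in enumerate(scored)
    ]
    best = sorted(totals, key=lambda pair: pair[1], reverse=True)[:top_n]
    # Sorting the winners by index restores their original order,
    # which is what an extractive summary needs.
    return [scored[index]["text"] for index, _ in sorted(best)]
```

Changing the weights changes which sentences win: with `weights=(0, 0, 1)` only the keywords score matters.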


For the Terms score the user can choose between:

  1. TF-ISF (Term Frequency - Inverse Sentence Frequency) 
  2. TF-IDF (Term Frequency - Inverse Document Frequency)
  3. TF-RIDF (Term Frequency - Residual Inverse Document Frequency)

Are you still here, reader? OK, stay with me, I will be quick. Basically every method produces a score for each word, based on one of the following:

  1. All the words inside the document
  2. All the words inside a collection of documents
  3. Option 2 plus a Poisson distribution model
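
As a rough illustration of options 1 and 3, here is a Python sketch using the common textbook formulations of TF-ISF and residual IDF. It is not necessarily the exact variant used in the thesis. Sentences are assumed to be already tokenized into lists of words, and every function and parameter name is hypothetical.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Score each tokenized sentence by its summed TF-ISF, divided by its length."""
    total = len(sentences)
    # Sentence frequency: in how many sentences does each word appear?
    sentence_freq = Counter(word for sent in sentences for word in set(sent))
    scores = []
    for sent in sentences:
        if not sent:
            scores.append(0.0)
            continue
        term_freq = Counter(sent)
        weight = sum(
            count * math.log(total / sentence_freq[word])
            for word, count in term_freq.items()
        )
        scores.append(weight / len(sent))
    return scores


def residual_idf(doc_freq, collection_freq, num_docs):
    """Observed IDF minus the IDF a Poisson model would predict.

    doc_freq: number of documents containing the word
    collection_freq: total occurrences of the word in the collection
    """
    observed = -math.log2(doc_freq / num_docs)
    expected = -math.log2(1.0 - math.exp(-collection_freq / num_docs))
    return observed - expected
```

A word that appears in every sentence gets an ISF of zero, so it contributes nothing. Residual IDF is high for words whose occurrences are concentrated in a few documents, which is typical of content words.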



For the Position score the user can choose between:

  1. Baxendale's research
  2. News article

Baxendale was a researcher who concluded that in 85% of paragraphs the topic sentence is the first one, and in 7% of paragraphs it is the last one. Thus, a naive but fairly accurate way to select a topic sentence is to choose one of these two. The News Articles algorithm scores sentences dynamically and clearly favours the first sentences of the first paragraphs.
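
Both position methods can be sketched in a few lines of Python. This is only an illustration under assumptions of mine. Using Baxendale's percentages directly as scores is one possible choice, and the decay formula in the news variant is invented to show the idea of "earlier is better". It is not the formula from the thesis.

```python
def baxendale_scores(paragraph_lengths, first=0.85, last=0.07):
    """Position scores per paragraph: favour the first sentence, then the last.

    paragraph_lengths: number of sentences in each paragraph.
    The 0.85 / 0.07 defaults reuse Baxendale's percentages (an assumption).
    """
    scores = []
    for length in paragraph_lengths:
        paragraph = [0.0] * length
        if length > 0:
            paragraph[0] = first
        if length > 1:
            paragraph[-1] = last
        scores.append(paragraph)
    return scores


def news_article_scores(paragraph_lengths):
    """Hypothetical decay: earlier paragraphs and earlier sentences score higher."""
    return [
        [1.0 / ((p + 1) * (s + 1)) for s in range(length)]
        for p, length in enumerate(paragraph_lengths)
    ]
```

For a document with two paragraphs of two sentences each, the news variant gives the very first sentence the top score and the last sentence of the second paragraph the lowest.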

The Keywords score is actually a cheat: by providing keywords, the user lets the system score sentences that contain them higher than others. This way the system finds the key sentences more easily instead of guessing like the scoring methods above. Additionally, the user can set a words-per-sentence threshold so the system ignores sentences that are too long or too short. The system also uses stop word lists to ignore common terms, as well as a Greek language stemming algorithm to group words.
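
A small Python sketch of the keyword score and the sentence filter follows. The names and the default thresholds are hypothetical, and the stemming step is left out, so in practice both the keywords and the sentence words would be stemmed before matching.

```python
def keyword_score(sentence_words, keywords):
    """Fraction of the user's keywords that occur in the sentence."""
    if not keywords:
        return 0.0
    present = set(sentence_words)
    return sum(1 for keyword in keywords if keyword in present) / len(keywords)


def filter_sentence(words, stop_words, min_words=5, max_words=40):
    """Apply the words-per-sentence threshold, then drop stop words.

    Returns None when the sentence is too short or too long to be considered.
    The default limits are illustrative only.
    """
    if not (min_words <= len(words) <= max_words):
        return None
    return [word for word in words if word not in stop_words]
```

For example, a sentence containing one of two supplied keywords gets a keyword score of 0.5, and a two-word sentence is skipped entirely when `min_words` is 3.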

The application is hosted here: http://thesis.t3-design.com/, soon to be hosted on this domain. If you are not Greek you probably won't understand much, but hey, it was my first Zend application and I am really proud of it. Please take a look at the credits section; I couldn't have done it without them.

You can download my paper here. Once again it's in Greek, but don't worry: if you want the gist of it, my thesis supervisors published a scientific paper based on my work, which you can find here.

Time for some pictures (new files system in place ;p)

 


Comments

  1. Chris #3   Chris | Sep 25, 2012 at 11:25

    I am pretty sure you are a bot and that urges me to finish the spam cleaner component for komposta, but anyway go ahead.

  2. san antonio apartments #2   san antonio apartments | Sep 16, 2012 at 10:32

    Do you mind if I quote a small number of your posts as long as I provide credit and sources back to your weblog: http://www.komposta.net/article/automatic-summarization-for-the-greek-language I'm also going to make sure to give you the appropriate anchor-text link using your webpage title: Komposta. Be sure to let me know if this is okay with you. Thank you

  3. Chris #1   Chris | Sep 15, 2012 at 01:54

    The whole system was designed so I can add more languages in the future. I have already found a stemmer for the English language; I only need a good collection of documents to add English. If you have something in mind please let me know. I am looking for a large collection of news articles in XML format.
