High school students in Greece are admitted to public universities based on the final exams of their senior year. They submit forms with their personal information and a selection of the schools they want to attend. Last year the Greek Ministry of Education decided to simplify the application procedure by making it web only. Two teams of programmers were formed: one managed the web platform for students who finished high school in Greece, and the other handled the platform for Greek or foreign students who finished high school abroad.
I was part of the second team, which consisted of only two people. I was in charge of building the web platforms, and Athanasios Rouskas acted as something like the project manager, bringing me all the specifications and dealing with other departments' delays and bureaucracy.
Below you can find screenshots from http://mixanografiko-eksoterikou.opengov.gr/ where the application is hosted.
The web platform was used again this year, with no updates. The second application, which needed minor changes to meet the state criteria for foreign students, was actually completed by Rouskas, who had never seen PHP code before, but he was so damn eager to learn, and learn he did. That application was also used this year, here: http://mixanografiko-alodapon.opengov.gr
I am actually very proud to see that both platforms are online this year again, 3 years and counting (2011-2013).
The application was built with Zend Framework/MySQL/jQuery. Below you can also find the sample documents the system creates with Zend_Pdf so that users can verify that everything went fine.
Read about the updated version Sum+My here.
Automatic Summarization for the Greek language is the title of the undergraduate thesis I completed almost a year ago. Because of the obvious difficulties, I chose to use a PHP framework for the first time. I wanted to focus on the task at hand without worrying about the basics of a web platform. I could have gone with a CMS like Joomla or Drupal, but at the time I found Zend Framework 1.xx.x to be so much better for the job because of its excellent documentation and online sources.
Automatic summarization is a broad subject within natural language processing. There are two main methodologies you can follow: extraction and abstraction. The second is the harder one and involves machine learning techniques (AI). Basically, the machine has to learn how to produce summaries, pretty much like a human. The extraction model (shallow) is based on maths and doesn't really create sentences from scratch: all the sentences come directly from the original document without any alteration.
For an undergraduate thesis, working on the abstraction model is a little too much, and honestly a semester is not enough time, so I based the application on shallow methods. In an attempt not to simply reproduce the few sources out there, I tried to use and combine as many algorithms as possible. For each sentence the system produces three scores (terms, position and keywords), which can be combined with different weights to evaluate each sentence.
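A minimal sketch of how three per-sentence scores could be combined with weights and then used to pick sentences, written in Python rather than the PHP the application uses. The function names, the normalisation by the sum of the weights and the "top n" selection rule are all assumptions made for illustration; they are not taken from the thesis code.

```python
def combined_score(terms, position, keywords, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three scores (assumed combination rule)."""
    w_terms, w_position, w_keywords = weights
    total = w_terms + w_position + w_keywords
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return (w_terms * terms + w_position * position + w_keywords * keywords) / total


def select_summary(sentences, scores, n):
    """Keep the n best-scoring sentences, in their original order.

    Extraction never rewrites a sentence, so the summary is just a
    subset of the input sentences.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]
```

Raising one weight simply makes that score count more, for example `weights=(2, 1, 1)` lets the terms score contribute half of the final value.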
For the Terms score the user can choose between:
- TF-ISF (Term Frequency - Inverse Sentence Frequency)
- TF-IDF (Term Frequency - Inverse Document Frequency)
- TF-RIDF (Term Frequency - Residual Inverse Document Frequency)
Are you still here, reader? OK, stay with me, I will be quick. Basically, each method produces a score for each word, based respectively:
- On all the words inside the document
- On all the words inside a collection of documents
- On the second option (a collection of documents) plus a Poisson distribution model
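To make the terms score a bit more concrete, here is a rough Python sketch of TF-ISF and of residual IDF. It assumes the common textbook definitions: ISF is log(N / sf), where N is the number of sentences and sf is the number of sentences containing the word, and residual IDF is the observed IDF minus the IDF a Poisson model would predict. Averaging the word scores over the sentence length is also an assumption; the exact formulas used in the thesis may differ.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Score each sentence (a list of tokens) by its average TF-ISF."""
    n = len(sentences)
    # Number of sentences in which each word appears at least once.
    sentence_freq = Counter()
    for tokens in sentences:
        sentence_freq.update(set(tokens))

    scores = []
    for tokens in sentences:
        if not tokens:
            scores.append(0.0)
            continue
        term_freq = Counter(tokens)
        total = sum(
            count * math.log(n / sentence_freq[word])
            for word, count in term_freq.items()
        )
        scores.append(total / len(tokens))
    return scores


def residual_idf(doc_freq, coll_freq, n_docs):
    """Observed IDF minus the IDF expected under a Poisson model.

    doc_freq:  number of documents containing the word
    coll_freq: total occurrences of the word in the collection
    """
    idf = -math.log2(doc_freq / n_docs)
    expected_idf = -math.log2(1.0 - math.exp(-coll_freq / n_docs))
    return idf - expected_idf
```

A word that appears in every sentence gets an ISF of zero, so it adds nothing to the score. Residual IDF is higher for words whose occurrences are concentrated in a few documents than for words spread evenly across the collection.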
For the Position score the user can choose between:
- Baxendale's research
- News article
Baxendale was a researcher who concluded that in 85% of the paragraphs the topic sentence came as the first one and in 7% of paragraphs the last sentence was the topic sentence. Thus, a naive but fairly accurate way to select a topic sentence would be to choose one of these two. The News Articles algorithm basically scores sentences dynamically and clearly favours the first sentences of the first paragraphs.
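The two position methods could be sketched in Python roughly like this. Using Baxendale's 85% and 7% directly as scores is an assumption, and the decay formula in the news variant is purely hypothetical: the text above only says that it favours the first sentences of the first paragraphs.

```python
def baxendale_scores(paragraph_len, first=0.85, last=0.07):
    """Position scores for the sentences of one paragraph."""
    scores = [0.0] * paragraph_len
    if paragraph_len == 0:
        return scores
    scores[-1] = last
    # Assigned second, so a one-sentence paragraph gets the "first" score.
    scores[0] = first
    return scores


def news_scores(paragraph_lengths):
    """Hypothetical news-style scores: earlier paragraphs and earlier
    sentences score higher, decaying as 1 / ((p + 1) * (s + 1))."""
    return [
        [1.0 / ((p + 1) * (s + 1)) for s in range(length)]
        for p, length in enumerate(paragraph_lengths)
    ]
```

With these definitions the very first sentence of the document always gets the top news score of 1.0, and everything after it scores lower.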
The Keywords score is actually a cheat: when the user provides keywords, the system can score sentences that contain them higher than the others. This way the system finds the key sentences more easily, instead of guessing like the scoring methods above. Additionally, the user can set a words-per-sentence threshold so that the system ignores sentences that are too long or too short. The system also uses stop word lists to ignore common terms, as well as a Greek language stemming algorithm to group words.
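A small Python sketch of the keywords score and the length threshold follows. The scoring rule (the fraction of the supplied keywords found in the sentence) is an assumption, and `stem` is only a placeholder parameter: by default it returns the word unchanged, and a real Greek stemmer would be plugged in there.

```python
def keyword_score(tokens, keywords, stop_words=frozenset(), stem=lambda word: word):
    """Fraction of the user's keywords that appear in the sentence."""
    terms = {
        stem(token.lower())
        for token in tokens
        if token.lower() not in stop_words
    }
    wanted = {stem(keyword.lower()) for keyword in keywords}
    if not wanted:
        return 0.0
    return len(terms & wanted) / len(wanted)


def within_length(tokens, min_words, max_words):
    """True when the sentence length falls inside the user's threshold."""
    return min_words <= len(tokens) <= max_words
```

Sentences that fail `within_length` would simply be skipped before any scoring takes place.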
The application is hosted here http://thesis.t3-design.com/, soon to be hosted on this domain. If you are not Greek you probably won't understand much, but hey, it was my first Zend application and I am really proud of it. Please take a look at the credits section too, I couldn't have done it without them.
You can download my paper here. Once again it's in Greek, but don't worry: if you want the gist of it, my supervisors published a scientific paper based on my work, and you can find it here.
Time for some pictures (new files system in place ;p)