Tuesday, June 21, 2005

How to extract data from an unstructured document?

Extracting data from an unstructured document is a challenging problem. It is particularly hard to extract data from a PDF document and bundle it with useful semantic information. My short paper "Ideas for extracting data from unstructured documents" presents some ideas on how elements on a page can be identified and classified.

This paper is the result of an internship at the Database and Artificial Intelligence institute of the Technical University of Vienna.