Retroweb: data extraction from the Internet

Developed as an activity of the Walloon Region project CETIC-CEIQS, Retroweb is a tool for data extraction from the Internet. Now that the Internet has become one of the main sources of information, this kind of tool is a must for any company.

Date: 16 June 2009

Expertise:

Data Science 

About project: CE-IQS 

The internet: a data source of huge value yet hard to use

The Internet can be considered an infinite source of information for both individuals and organizations. Nevertheless, the World Wide Web is inherently hard to use efficiently.

Notably because it is:

  • large: confronted with a precise query, we are often overwhelmed by information that we then have to filter, reorganise, and so on. Managing such an amount of information is time-consuming!
  • noisy: relative to the overall size of a web page, the relevant information is often a small fraction, because pages are flooded with advertisements or (more useful) navigation menus.
  • user-oriented: encoded in a semantically poor format (HTML), Web data are well suited to human reading in interactive sessions but cannot easily be processed automatically by software agents.
  • evolving: web data changes quickly; to cope with it efficiently, companies must take changes into account immediately.

Retroweb, in short...

With Retroweb you can quickly and visually create data-extraction programs. Executed periodically, these programs can feed your document management system or any internal corporate database.

Retroweb is well suited to search engines, technology intelligence, and migrating a website to a database or a content management system (CMS).

Extracting data from a forum with Retroweb

Any other way?

Retroweb is obviously not the only Internet data-extraction solution: many scientific projects and several well-known companies work on similar tools.

Several advantages make Retroweb different (and often better):

  • ease of use: no specific knowledge of HTML intricacies or complex mapping languages is needed to use Retroweb. The user directly interacts with Web documents in a user-friendly browser interface.
  • flexibility: only relevant data are analysed and made available to external agents, using customisable parameters.
  • robustness: extraction rules are generated from a sample of several web pages, so the resulting rules remain valid when the HTML code is modified.
  • interoperability: based on open standards and recommendations defined by the W3C (XML, XPath, XML Schema), Retroweb interoperates smoothly as an input or an output for external software agents (see the sketch after this list).
  • portability: Retroweb has been developed and tested on MS Windows and GNU/Linux.
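
Because the extraction rules are plain XPath expressions over pages parsed as XML, any XPath-aware tool can work with them. As a minimal sketch of this interoperability (not Retroweb's actual API; the page file and the rule below are hypothetical), here is a single rule applied with the standard Java javax.xml.xpath package:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathRuleDemo {
        public static void main(String[] args) throws Exception {
            // Parse a well-formed page; real HTML would first be tidied into XHTML.
            Document page = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("sample-page.xhtml"); // hypothetical local copy of a web page

            // A data-extraction rule is essentially an XPath expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            String rule = "//div[@class='price']/text()"; // hypothetical rule

            NodeList matches = (NodeList) xpath.evaluate(rule, page, XPathConstants.NODESET);
            for (int i = 0; i < matches.getLength(); i++) {
                System.out.println("extracted: " + matches.item(i).getNodeValue());
            }
        }
    }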

Just a few technical words

Retroweb is made of two complementary modules:

  • Retroweb-Browser is a graphical user interface dedicated to creating extraction rules;
  • Retroweb-Wrapper takes these rules as input and extracts the data into an interpreted, structured format. This process can be performed periodically or on demand.

Retroweb-Browser is a Java 6 application developed with the Eclipse RCP framework; Gecko (well known as Firefox's rendering engine) renders the web pages, and the extraction rules are based on XPath, a W3C recommendation.
Retroweb's architecture follows the Model-View-Controller (MVC) principles in order to reduce the amount of code written and to ease the development of new features.

Retroweb-Wrapper is a Java 6 application, well suited to batch processing on a server. It takes the data-extraction rules generated by Retroweb-Browser as input and produces a structured, interpreted XML data set.
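
Conceptually, a Wrapper run loops over the rule set and serialises the matches as XML. The following sketch illustrates that batch step with the standard Java 6 XPath API; the field names, the rules and the record format are hypothetical, not Retroweb's actual output:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class WrapperSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical rule set, as Retroweb-Browser might export it:
            Map<String, String> rules = new LinkedHashMap<String, String>();
            rules.put("title",  "//h1/text()");
            rules.put("author", "//span[@class='author']/text()");

            Document page = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse("sample-page.xhtml");
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Build one structured XML record from the extracted fields
            // (a real implementation would also escape the values).
            StringBuilder xml = new StringBuilder("<record>\n");
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                String value = xpath.evaluate(rule.getValue(), page);
                xml.append("  <").append(rule.getKey()).append(">")
                   .append(value)
                   .append("</").append(rule.getKey()).append(">\n");
            }
            xml.append("</record>");
            System.out.println(xml);
        }
    }

Run periodically, for instance from a scheduled job, such a program keeps a downstream database or document management system up to date with the source pages.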

Retroweb was successfully tested on MS Windows and Ubuntu GNU/Linux.

A roadmap for Retroweb

Retroweb is already an efficient data-extraction tool for the Internet. But it will evolve along with new technologies and demands from industry. Hence, we are already working on several research topics:

Semantic web interoperability
One of the challenges for the future Internet is its compatibility not only with human users (e.g. better usability) but also with software agents.
The Semantic Web tackles the latter using concepts and tools to enrich web data with tractable meaning. As a semantic annotation tool for the Internet, Retroweb clearly has a role to play.

Self-healing of data extraction rules
If the HTML code of a web page is deeply modified, a data-extraction rule may no longer be valid. It is then necessary to detect the failure during the extraction process and to repair the extraction rule automatically.
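
One simple detection strategy, shown below as a sketch and not as Retroweb's actual implementation, is to treat an empty match as a failure and to fall back to a deliberately relaxed variant of the rule:

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class SelfHealingSketch {
        // Hypothetical repair step: if the precise rule no longer matches,
        // retry with a looser rule that ignores the (possibly renamed) class attribute.
        static NodeList extract(Document page, String strictRule, String relaxedRule)
                throws Exception {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(strictRule, page, XPathConstants.NODESET);
            if (hits.getLength() == 0) {
                // Failure detected: the page structure has probably changed.
                hits = (NodeList) xpath.evaluate(relaxedRule, page, XPathConstants.NODESET);
            }
            return hits;
        }
    }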

Integrating Retroweb in a search engine architecture
Legacy search engines collect documents, extract their textual content and store it in an index, i.e. a compressed file mapping terms to the documents in which they appear. This indexing process is called "full-text" because it relies only on the documents' syntactic content. By contrast, Retroweb-Wrapper can index documents semantically, since it is able to use the meaning of the extracted data. Integrating Retroweb-Wrapper into a search engine would therefore give it a valuable advantage over legacy architectures (see the sketch below).
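
To make the difference concrete, a full-text index only records which terms occur in which documents, while an index built from Retroweb's output can record which fields hold which values. The toy structures below (hypothetical data, not an actual search-engine API) illustrate the contrast:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class IndexComparison {
        public static void main(String[] args) {
            // Full-text index: term -> documents containing it; meaning is lost.
            Map<String, Set<String>> fullText = new HashMap<String, Set<String>>();
            fullText.put("2009", new HashSet<String>(Arrays.asList("doc1", "doc7")));

            // Semantic index: field -> value -> documents, keeping the extracted structure.
            // "2009" as a publication year is now distinct from "2009" in, say, a price.
            Map<String, Map<String, Set<String>>> semantic =
                    new HashMap<String, Map<String, Set<String>>>();
            Map<String, Set<String>> years = new HashMap<String, Set<String>>();
            years.put("2009", new HashSet<String>(Arrays.asList("doc1")));
            semantic.put("publicationYear", years);

            System.out.println("full-text '2009': " + fullText.get("2009"));
            System.out.println("publicationYear=2009: "
                    + semantic.get("publicationYear").get("2009"));
        }
    }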