Material from NLPub

Parsing the News for Specific Facts


The objective is to simplify the process of finding target enterprises that meet a given set of criteria.

Scope of Work

  1. The application must monitor the defined set of news portals and harvest the raw data once a day. The list of portals can be found in Appendix 1 below. Once acquired, a page does not need to be reread, since published articles are not updated by the news portals.
  2. The application must parse only those articles that contain the data fields specified in ExcelSheetExample. Samples of news articles that contain the required data are given in Appendix 2.
  3. The application must extract the specified data from the selected news articles and post it to an Excel sheet. The list of required fields (facts) is provided in ExcelSheetExample.
  4. Elaborate acceptance criteria for the performed work: accuracy of selection of the proper news posts and accuracy of fact extraction.
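The acceptance criteria in item 4 could, for instance, be computed as precision/recall of post selection and field-level accuracy of extraction against a hand-labelled gold set. The sketch below is illustrative only; the metric definitions and all function names are assumptions, not part of the specification.

```python
def selection_metrics(selected, relevant):
    """Precision/recall of news-post selection against a gold set of relevant URLs."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)  # correctly selected posts
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

def extraction_accuracy(extracted, gold):
    """Share of gold fact fields reproduced exactly for one article."""
    if not gold:
        return 1.0
    hits = sum(1 for field, value in gold.items() if extracted.get(field) == value)
    return hits / len(gold)

if __name__ == "__main__":
    p, r = selection_metrics({"url1", "url2", "url3"}, {"url2", "url3", "url4"})
    print(round(p, 2), round(r, 2))  # 0.67 0.67
    acc = extraction_accuracy({"company": "ACME", "date": "2020-01-01"},
                              {"company": "ACME", "date": "2020-02-02"})
    print(acc)  # 0.5
```

Target threshold values for these metrics would still have to be negotiated as part of the acceptance criteria.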


  • Elaborate the architecture, taking into account that the final solution should
    • run on a single mid-range laptop computer
    • be executable under macOS and Windows 8-10
    • store its results, including intermediate ones, for traceability purposes
    • The choice of data storage is not limited by these requirements, but should be negotiated and agreed upon.
  • Only relevant posts should be parsed. Relevance is defined as follows: the news post contains all the facts expected in the Excel spreadsheet (see ExcelSheetExample). This task can be omitted if the solution can function without it.
  • Development of a preprocessor for HTML-to-text conversion, capable of removing banners, footers, headers and other insignificant excerpts of HTML and JavaScript, ideally leaving only the meaningful textual data for further fact extraction.
  • Development of a web crawler to harvest the news on a regular basis. Harvesting will most likely be carried out by scraping, since not all of the news portals provide RSS feeds.
  • Development of scripts to control the crawler: launch it once a day and check that it remains operational throughout the execution time.
  • Production of code to extract the required facts (see ExcelSheetExample) from the output of the preprocessor.
  • Implement an API for posting the parsing results to Excel sheets (CSV is an option).
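The preprocessing, relevance-filtering, extraction and export steps above could be wired together roughly as follows. This is a minimal sketch using only the Python standard library; the skipped-tag list, the fact patterns standing in for the fields of ExcelSheetExample, and all names are hypothetical assumptions, since the real field list and site layouts are defined elsewhere.

```python
import csv
import io
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude HTML-to-text preprocessor: drops <script>/<style>/<header>/<footer>/<nav>
    content. Removing site-specific banners would need per-portal rules."""
    SKIP = {"script", "style", "header", "footer", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Hypothetical fact patterns standing in for the ExcelSheetExample fields.
PATTERNS = {
    "company": re.compile(r'company\s+"([^"]+)"'),
    "amount": re.compile(r"invested\s+(\d+)\s+million"),
}

def extract_facts(text):
    """Pull each expected field out of the preprocessed text, if present."""
    facts = {}
    for field, rx in PATTERNS.items():
        m = rx.search(text)
        if m:
            facts[field] = m.group(1)
    return facts

def is_relevant(facts):
    """A post is relevant only if every expected field was found."""
    return set(facts) == set(PATTERNS)

def to_csv(rows):
    """Serialize extracted fact rows to CSV (the optional export format)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In a real implementation the regex-based `extract_facts` would likely be replaced by proper NLP-based fact extraction, and the CSV writer by an Excel export, but the data flow (raw HTML, plain text, fact dictionary, relevance check, tabular output) stays the same and each intermediate result can be stored for traceability.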

Location of Work

St. Petersburg, Russia.

Period of Performance

2 months

Delivery Stages

  • Alpha prototype: 3 weeks
  • Beta prototype: 2-3 weeks
  • Fixing bugs and preparing the final solution: 1-2 weeks

Special Requirements

Proprietary components may be used and purchased in the course of the project if properly justified. The solution must be independent of the operating system.