CIS2168 - Homework 11: Search on RSS Feeds (*)

Assignment given: November 23, 2010
Due Date: December 8 by 10pm

You have probably seen that most newspapers provide an RSS feed of their headlines; in fact, they provide a number of such feeds. Each feed is an XML document that needs to be analyzed to identify the news items it contains and, for each item, to determine its title, description, and link to the news story. "title", "description", and "link" are strings. Aggregators such as Bloglines allow you to subscribe to such feeds.

Our aim is to build a service that, given search keywords, retrieves the items that are "most relevant" to those keywords.

You are helped by three Java files that you will find in this folder. The HTMLTokenizer.java file is from Princeton and you will not use it directly. The ItemEntry.java file represents a news item. The file ItemParser.java is the one most useful to you. Given a URL, it gives you access to the feed's items through an iterator. The URL could be something like "http://timesofindia.indiatimes.com/rssfeedstopstories.cms" or the name of a file in your homework12s10 directory. My iterator in ItemParser.java seems to work for the news sources I have used, but it is a hack, not a serious parser. Use it as is for the news sources where it works.

You are given a text file feeds.txt that specifies the URLs of a number of news sources. Your tasks for homework 11 are:
  1. Collect the news items from all these sources.
  2. For each [caseless] word that occurs in the title or description of an item, record the item where it occurs, and the score of that item, computed as two times the number of occurrences of the word in the title plus the number of occurrences in the description. Exclude from the words you collect the 50 most common words you determined in a previous assignment.
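
The scoring rule in step 2 can be sketched as follows. This is only a minimal illustration, not the required design: the class name WordIndex, the Posting class, and the methods addItem and lookup are hypothetical names, and items are passed in as plain strings rather than through ItemEntry or ItemParser, whose methods are not described here. Words are taken to be runs of ASCII letters and digits, which is a simplification, and the stop-word set stands in for the 50 most common words from the earlier assignment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: maps each lowercase word to the items containing it.
class WordIndex {
    // One index entry: an item (identified by its link) and its score for one word.
    static final class Posting {
        final String link;
        final int score;
        Posting(String link, int score) { this.link = link; this.score = score; }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();
    private final Set<String> stopWords; // assumed: the 50 most common words

    WordIndex(Set<String> stopWords) { this.stopWords = stopWords; }

    // Counts caseless words; a word is a run of ASCII letters or digits (simplification).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Score of a word = 2 * (occurrences in title) + (occurrences in description).
    void addItem(String title, String description, String link) {
        Map<String, Integer> scores = new HashMap<>();
        countWords(title).forEach((w, n) -> scores.merge(w, 2 * n, Integer::sum));
        countWords(description).forEach((w, n) -> scores.merge(w, n, Integer::sum));
        scores.forEach((w, s) -> {
            if (!stopWords.contains(w)) {
                index.computeIfAbsent(w, k -> new ArrayList<>()).add(new Posting(link, s));
            }
        });
    }

    // Returns the postings for a word, or an empty list if the word was never seen.
    List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    public static void main(String[] args) {
        WordIndex idx = new WordIndex(new HashSet<>(Arrays.asList("the", "in")));
        idx.addItem("Rain, rain in the city", "Heavy RAIN floods the city.", "http://example.com/a");
        for (Posting p : idx.lookup("rain")) {
            System.out.println(p.link + " score=" + p.score);
        }
    }
}
```

In this example "rain" occurs twice in the title and once in the description, so its score for that item is 2*2 + 1 = 5. Descriptions in real feeds may contain HTML markup, which this sketch does not strip.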

By now we have collected all the news items from the specified sources, collected the links to all these news items, and determined, for each word occurring in these news items, an array list of the news items where it occurs, each with a rank score.

Next you will prompt the user to enter a search query consisting of words separated by spaces. You will collect the distinct news items where all these words occur, rank these news items by their relevance to the query, and display the top-ranked news items. [Two news items are considered equal if they have the same link; some may also consider two news items equal if they have the same title.]
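
One possible way to process a query is sketched below. The assignment does not fix the relevance measure, so this sketch assumes that an item's relevance is the sum of its per-word scores over the query words; that choice, the class name QuerySearch, and the method search are all hypothetical. For convenience the index is assumed to be a map from word to (link to score), so items with the same link are automatically treated as equal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical query processor over an index of the form word -> (link -> score).
class QuerySearch {
    // Returns the links of the topK items that contain every query word, best first.
    static List<String> search(Map<String, Map<String, Integer>> index, String query, int topK) {
        // Distinct caseless query words, so a repeated word is not counted twice.
        Set<String> words = new LinkedHashSet<>();
        for (String w : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty()) words.add(w);
        }

        Map<String, Integer> totals = null; // link -> summed score (assumed relevance)
        for (String word : words) {
            Map<String, Integer> postings = index.getOrDefault(word, Collections.emptyMap());
            if (totals == null) {
                totals = new HashMap<>(postings);
            } else {
                // Keep only items that also contain this word, then add its score.
                totals.keySet().retainAll(postings.keySet());
                for (Map.Entry<String, Integer> e : totals.entrySet()) {
                    e.setValue(e.getValue() + postings.get(e.getKey()));
                }
            }
        }
        if (totals == null) return new ArrayList<>(); // empty query

        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(totals.entrySet());
        // Higher total first; ties broken by link so the order is deterministic.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (int i = 0; i < ranked.size() && i < topK; i++) {
            result.add(ranked.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        Map<String, Integer> rain = new HashMap<>();
        rain.put("http://example.com/a", 5);
        rain.put("http://example.com/b", 1);
        Map<String, Integer> flood = new HashMap<>();
        flood.put("http://example.com/a", 1);
        flood.put("http://example.com/b", 6);
        index.put("rain", rain);
        index.put("flood", flood);
        System.out.println(search(index, "rain flood", 10));
    }
}
```

Other relevance measures, such as taking the minimum per-word score, would fit the same structure; only the line that combines scores would change.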

Special praise, but no extra points, to students who output these top news items as an HTML page viewable in a browser, thus allowing users to reach the news stories by clicking.
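
For the optional HTML output, a minimal sketch might look like the following. The class name HtmlReport and its methods are hypothetical, and each result is assumed to be available as a {title, link} pair of strings. Titles and links are escaped so that characters such as & and < do not break the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that renders ranked results as a clickable HTML list.
class HtmlReport {
    // Escapes the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Each element of items is a two-element array: {title, link}, best item first.
    static String render(String query, List<String[]> items) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html>\n<head><meta charset=\"UTF-8\"><title>Results for ")
            .append(escape(query)).append("</title></head>\n<body>\n<ol>\n");
        for (String[] item : items) {
            html.append("  <li><a href=\"").append(escape(item[1])).append("\">")
                .append(escape(item[0])).append("</a></li>\n");
        }
        html.append("</ol>\n</body>\n</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        List<String[]> items = new ArrayList<>();
        items.add(new String[] {"Rain & floods", "http://example.com/a?x=1&y=2"});
        System.out.print(render("rain", items));
    }
}
```

The returned string could then be saved to a file, for example with java.nio.file.Files.write, and opened in a browser.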

(*) This homework is derived from an old assignment at Stanford and uses code from a Princeton course.