trying to parse WordPress.com blogs (using XMLParser, found via http://wordpress.com/next/) into that schema (image by Workbench):
- Atom has the author’s uri, RSS2 doesn’t > use Atom
- at the moment tags and categories in feeds are indistinguishable
- only get 10 posts from a blog at a time, but the script runs incrementally
- comments are parsed independently
- post ids are sometimes pretty permalinks and sometimes contain “/?=”
- suffixes “/feed”, “/feed/atom”, “/?feed=atom” work only with permalinks > use permalinks instead of ids
- user’s uri <-> blog’s url mismatch:
  - uris often end with “/” – urls never
  - uris are often blank
  - uris often contain “www.” which is an error – urls never
- to find pingbacks, just check whether the comment author’s uri = the post’s url
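
A rough sketch of how the uri/url mismatch and the pingback check could be handled. It is not the actual script: the function names (normalize, same_resource, is_pingback) are made up for illustration, and it assumes every non-blank uri carries a scheme such as "http://". The scheme itself is dropped so that only host, path and query are compared.

```python
from urllib.parse import urlsplit


def normalize(uri):
    """Reduce a uri/url to a comparable form; return None if it is blank."""
    if not uri or not uri.strip():
        return None  # uris are often blank
    parts = urlsplit(uri.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # uris often contain "www.", urls never
    path = parts.path.rstrip("/")  # uris often end with "/", urls never
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def same_resource(a, b):
    """True if both values are non-blank and normalize to the same string."""
    na, nb = normalize(a), normalize(b)
    return na is not None and na == nb


def is_pingback(comment_author_uri, post_url):
    """A comment counts as a pingback if its author's uri is the post's url."""
    return same_resource(comment_author_uri, post_url)
```

With this, "http://www.example.wordpress.com/" and "http://example.wordpress.com" (hypothetical addresses) compare as equal, while a blank uri never matches anything.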
TO DO:
- parse blogrolls with XFN rel > author-author rel
- use WordNet::Similarity > concept-concept rel
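
For the blogroll item, a minimal sketch of pulling XFN relations out of blogroll HTML with Python's standard html.parser. This is only an idea of how it might look: the class and function names are hypothetical, it assumes the blogroll is already available as an HTML string, and it keeps only rel values from the XFN 1.1 vocabulary.

```python
from html.parser import HTMLParser

# rel values defined by XFN 1.1; anything else (e.g. "nofollow") is ignored
XFN_VALUES = {
    "friend", "acquaintance", "contact", "met", "co-worker", "colleague",
    "co-resident", "neighbor", "child", "parent", "sibling", "spouse",
    "kin", "muse", "crush", "date", "sweetheart", "me",
}


class BlogrollParser(HTMLParser):
    """Collect (href, [xfn values]) pairs from <a> tags that carry XFN rels."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        rel = attr.get("rel") or ""
        values = [v for v in rel.lower().split() if v in XFN_VALUES]
        if href and values:
            self.links.append((href, values))


def xfn_relations(html):
    """Return the XFN-annotated links found in a chunk of blogroll HTML."""
    parser = BlogrollParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned pair could then become one author-author relation, with the href matched against known blog urls.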
