parsin’ WordPress

By Marek Kopel

Trying to parse WordPress.com (with XMLParser, via http://wordpress.com/next/) into this schema (image made with Workbench):

[Image: DAC4WP SQL diagram]

  1. Atom feeds carry the author’s URI and RSS 2 feeds don’t, so use Atom.
  2. At the moment, tags and categories are indistinguishable in the feeds.
  3. Only 10 posts can be fetched from a blog at a time, but the script runs incrementally.
  4. Comments are parsed independently.
  5. Post ids are sometimes pretty permalinks and sometimes contain “/?=”.
  6. The suffixes “/feed”, “/feed/atom” and “/?feed=atom” work only with permalinks, so use permalinks instead of ids.
  7. User’s URI <-> blog’s URL mismatch:
    • URIs often end with “/”; URLs never do
    • URIs are often blank
    • URIs often contain “www.”, which is an error; URLs never do
  8. To find pingbacks, just check whether the comment author’s URI equals the post’s URL.
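
Points 1, 7 and 8 can be sketched roughly as follows. This is a minimal Python sketch, not the script actually used here: the function names (`author_uris`, `normalize`, `is_pingback`) are made up for illustration, and it assumes a pingback is a comment whose author URI matches the URL of some already-known post once trailing slashes and “www.” are ignored.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace


def author_uris(atom_xml):
    """Return the author URI of every entry in an Atom feed (None if missing)."""
    root = ET.fromstring(atom_xml)
    return [
        entry.findtext(f"{ATOM}author/{ATOM}uri")
        for entry in root.iter(f"{ATOM}entry")
    ]


def normalize(url):
    """Reduce a URI/URL to a comparable key; blank input gives None."""
    if not url or not url.strip():
        return None
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):       # point 7: stray "www." in URIs
        host = host[4:]
    path = parts.path.rstrip("/")     # point 7: trailing "/" in URIs
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def is_pingback(author_uri, post_urls):
    """Point 8 (assumed reading): author URI equals a known post URL."""
    known = {normalize(u) for u in post_urls}
    key = normalize(author_uri)
    return key is not None and key in known
```

Blank URIs normalize to `None` and never match anything, so an empty author URI is not mistaken for a pingback.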

TO DO:

  1. Parse blogrolls with XFN rel values to get author-author relations.
  2. Use WordNet::Similarity to get concept-concept relations.
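
For the first item, something like the sketch below might do. It is only an illustration in Python with the standard-library `html.parser`. The class name `XFNLinkParser` is hypothetical, the `rel` values are the ones defined by XFN 1.1, and it assumes the blogroll is plain HTML with `<a href=… rel=…>` links.

```python
from html.parser import HTMLParser

# Relationship values defined by XFN 1.1
XFN_VALUES = {
    "friend", "acquaintance", "contact", "met", "co-worker", "colleague",
    "co-resident", "neighbor", "child", "parent", "sibling", "spouse",
    "kin", "muse", "crush", "date", "sweetheart", "me",
}


class XFNLinkParser(HTMLParser):
    """Collect (href, [xfn rel values]) pairs from blogroll HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        # rel is a space-separated list; keep only the XFN values
        rels = set((attrs.get("rel") or "").lower().split()) & XFN_VALUES
        if href and rels:
            self.links.append((href, sorted(rels)))


parser = XFNLinkParser()
parser.feed('<a href="http://a.example/" rel="friend met">A</a>')
```

Each collected pair is one candidate author-author edge, from the blog owning the blogroll to the linked blog.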
