parsin’ WordPress

Trying to parse WordPress.com blogs (with XMLParser, blogs found via http://wordpress.com/next/) into this schema (diagram made with Workbench):

DAC4WP SQL Diagram

  1. Atom has the author's URI, RSS 2 doesn't > use Atom
  2. at the moment tags and categories are indistinguishable in the feeds
  3. only 10 posts can be fetched from a blog at a time, but the script runs incrementally
  4. comments are parsed independently
  5. post IDs are sometimes pretty permalinks and sometimes contain "/?="
  6. the suffixes "/feed", "/feed/atom" and "/?feed=atom" work only with permalinks > use permalinks instead of IDs
  7. user's URI <-> blog's URL mismatch:
    • URIs often end with "/", URLs never do
    • URIs are often blank
    • URIs often contain "www.", which is an error; URLs never do
  8. to find pingbacks, just check whether the comment author's URI equals the post's URL
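
The URI/URL mismatch in point 7 and the pingback check in point 8 could be handled roughly like this. This is only a Python sketch, not the script used here: the names `normalize_url` and `is_pingback` are made up for illustration, and stripping "www." and the trailing "/" before comparing is my assumption about how the two forms should be reconciled.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URI/URL to a comparable key; return None if it is blank.

    Assumption: scheme, a leading "www." and a trailing "/" carry no
    meaning when matching an author's URI against a blog or post URL.
    """
    if not url or not url.strip():
        return None  # blank URIs are common, so treat them as "unknown"
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # let urlsplit see a host even without a scheme
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        return None
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query  # keep query-style post addresses distinct
    return key


def is_pingback(comment_author_uri, post_url):
    """True when the comment author's URI points at the given post URL."""
    author = normalize_url(comment_author_uri)
    post = normalize_url(post_url)
    return author is not None and author == post
```

With this, `http://www.foo.wordpress.com/` and `http://foo.wordpress.com` compare as equal, while a blank URI never matches anything.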

TO DO:

  1. parse blogrolls with XFN rel attributes > author-author relations
  2. use WordNet::Similarity > concept-concept relations
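
For the first to-do item, a blogroll parser might look something like the sketch below. It uses only Python's standard `html.parser`; the class name `XFNLinkParser` and the helper `extract_xfn` are hypothetical, and it assumes the blogroll is plain HTML with `<a href=... rel=...>` links carrying XFN 1.1 values.

```python
from html.parser import HTMLParser

# XFN 1.1 relationship values; other rel values (e.g. "nofollow") are ignored.
XFN_VALUES = {
    "friend", "acquaintance", "contact", "met", "co-worker", "colleague",
    "co-resident", "neighbor", "child", "parent", "sibling", "spouse",
    "kin", "muse", "crush", "date", "sweetheart", "me",
}


class XFNLinkParser(HTMLParser):
    """Collect (href, [xfn values]) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        rel = attr.get("rel") or ""
        # rel is a space-separated list; keep only the XFN values
        values = sorted(set(rel.lower().split()) & XFN_VALUES)
        if href and values:
            self.links.append((href, values))


def extract_xfn(html):
    """Return the XFN-annotated links found in an HTML fragment."""
    parser = XFNLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned pair could then become one author-author edge, from the blog owner to the author behind the linked URL.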
