One to url them all…

WordPress announced sitemaps support, which I thought might be a chance for me to get all the posts – not just the last 10… so I wouldn’t have to do cyclic RSS parsing…

but no (here’s mine), it’s just a list of permalinks… no author, tag/category or summary info that goes along with Atom…

So then I thought the pretty permalinks + atom is the answer, i.e.: these are my first ten posts here:

and atom for another 10 posts (last 10 from November):

so the answer to my problem (i.e. atom for first 10 posts) would be:

…but it won’t work… why? ;(

still I can go through each day of the calendar, i.e.:

– hopefully you won’t get more than 10 posts a day…
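A minimal sketch of that calendar walk, in Python. The URL pattern (`/YYYY/MM/DD/feed/atom/` appended to the blog address) and the `example.wordpress.com` host are assumptions for illustration, not something confirmed above, so the pattern is a parameter you can swap out:

```python
from datetime import date, timedelta


def daily_feed_urls(blog_base, start, end,
                    pattern="{base}/{d:%Y/%m/%d}/feed/atom/"):
    """Yield one candidate feed URL per calendar day, start..end inclusive.

    The default pattern is an assumed day-archive feed layout; adjust it
    to whatever the target blog actually serves.
    """
    base = blog_base.rstrip("/")
    day = start
    while day <= end:
        yield pattern.format(base=base, d=day)
        day += timedelta(days=1)


# Hypothetical blog address, three days in June 2006.
urls = list(daily_feed_urls("http://example.wordpress.com",
                            date(2006, 6, 13), date(2006, 6, 15)))
for url in urls:
    print(url)
```

Each day costs one request per blog, and any day with more than 10 posts would still be cut off, so this only trades one limit for another.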

but now I can see I can’t compete with GGL and index all the

once I was moaning about getting only a few thousand crawled blogs using the Next link, when there are hundreds of thousands of WP blogs created each month

now I think I’ve reached critical mass, and by parsing only blogrolls (also the non-XFN ones) I got:

  • 2006.06.15: 25 996 WP blog URLs
  • a few days later: 55 689 WP blog URLs

and I only managed to parse half of them…

I can have the Atom feeds of 1000 WP blogs parsed in ~6 days… even when going parallel (say 5 sessions – my server can handle that 🙂), by the time I finish parsing the last thousand the first one is already outdated…
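To put a rough number on it, here is a back-of-the-envelope sketch in Python using the figures from this post (55 689 blog URLs, ~6 days per 1000 blogs, 5 sessions). It is not clear from my own numbers whether the ~6 days is per session or already the combined rate, so the sketch computes both cases; the function name and the linear-scaling assumption are mine:

```python
BLOGS_KNOWN = 55_689     # WP blog URLs collected from blogrolls
BLOGS_PER_BATCH = 1_000  # batch size the ~6 days figure refers to
DAYS_PER_BATCH = 6       # approximate time to parse one batch
SESSIONS = 5             # parallel sessions the server can handle


def full_pass_days(blogs, batch_size, days_per_batch, sessions=1):
    """Days for one full pass, assuming throughput scales linearly
    with the number of sessions (an optimistic assumption)."""
    return blogs / batch_size * days_per_batch / sessions


# Case 1: ~6 days per 1000 blogs is already the combined rate.
combined_days = full_pass_days(BLOGS_KNOWN, BLOGS_PER_BATCH, DAYS_PER_BATCH)

# Case 2: ~6 days per 1000 blogs is the rate of a single session.
per_session_days = full_pass_days(BLOGS_KNOWN, BLOGS_PER_BATCH,
                                  DAYS_PER_BATCH, SESSIONS)

print(f"combined rate:    ~{combined_days:.0f} days per full pass")
print(f"per-session rate: ~{per_session_days:.0f} days per full pass")
```

Either way a single pass takes on the order of months (roughly 334 or 67 days), far longer than the week I wanted to compare.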

I hoped to experiment comparing my search engine (based on Solr – more details soon) to GGL Blog Search in a given period of time (say a week)… now even that seems impossible… what to do? what to do? <panic>

