Archive for the ‘dataset’ Category

Re: Visualizing data straight from the DB

13 July 2009

dzieciou wrote:

I'm looking for some alternative to it, one that doesn't require writing a program.

and this is how I do it:
from various locations, so remotely on Ubuntu Server over VNC (those green windows are the default twm)

bar chart

mysql -uroot -p -e "SELECT y INTO OUTFILE '/tmp/y' LINES TERMINATED BY ',' FROM vis.dane"
//dump a column from MySQL to a file, separating the values with commas

sed 's/^/y=[/;s/$/];\n/' /tmp/y > o.m
//put y=[ at the start of the line and ];(newline) at the end, then save it as an octave (or matlab) script

echo "bar(y);" >> o.m
//append the command that draws the bar chart


run o
//in the terminal below, in a running octave session, I run that script

plotted chart

and this is how I plot a smoothed curve from those 20 values (so that it isn't just a plain polyline)

octave:2> x=1:20
octave:3> xx=1:.1:20
octave:4> yy=spline(x,y,xx)
octave:5> plot(x,y,"+",xx,yy,"-")

xx is the interval <1;20>, but in steps of 0.1 instead of 1
yy is the spline interpolation of the 20 y values onto that finer grid (roughly 200 values; 191 to be exact), and supposedly that is also how charts get smoothed in Calc or Excel
I plot the 20 values as crosses (“+”) and the interpolated values as a line (“-”)
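
For readers without octave at hand, the same idea can be sketched in plain Python. Note the assumptions: this is a natural cubic spline (second derivative zero at both ends), whereas octave's spline uses not-a-knot end conditions by default as far as I know, so the curve may differ slightly near the edges; the y values are made up for illustration.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1  # number of intervals between knots
    h = [xs[i + 1] - xs[i] for i in range(n)]

    # Right-hand side of the tridiagonal system for the quadratic coefficients c.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]

    # Forward sweep of the tridiagonal solver.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]

    # Back substitution gives the coefficients of each cubic piece.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Pick the interval containing x, clamped to the first/last piece.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


x = list(range(1, 21))                 # like x=1:20
y = [(i * 7) % 11 for i in x]          # hypothetical sample data, 20 values
xx = [1 + i / 10 for i in range(191)]  # like xx=1:.1:20
spline = natural_cubic_spline(x, y)
yy = [spline(v) for v in xx]           # like yy=spline(x,y,xx)
```

The smoothed curve still passes through every original point; only the stretch between points changes from a straight segment to a cubic piece.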

in practice I generate these scripts from PHP, and dump the charts straight to PNG with print
optionally I set xlabel, ylabel and title first
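
A rough sketch of what such a generator could look like, written here in Python rather than PHP. The function name build_octave_script and its parameters are made up for illustration; the emitted commands (bar, title, xlabel, ylabel, print with -dpng) are regular octave functions.

```python
def build_octave_script(values, png_path, title="", xlabel="", ylabel=""):
    """Return the text of an octave script that draws a bar chart and saves it as PNG."""
    lines = [
        "y=[" + ",".join(str(v) for v in values) + "];",
        "bar(y);",
    ]
    # Labels are optional; empty ones are simply left out of the script.
    for command, text in (("title", title), ("xlabel", xlabel), ("ylabel", ylabel)):
        if text:
            escaped = text.replace('"', '\\"')
            lines.append('%s("%s");' % (command, escaped))
    # print() with -dpng writes the current figure to a PNG file.
    lines.append('print("%s", "-dpng");' % png_path)
    return "\n".join(lines) + "\n"


script = build_octave_script([3, 1, 2], "o.png", title="demo")
print(script)
```

The returned text can be written to a file such as o.m and run the same way as the hand-made script.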

and in octave it's easy to plot in 3D (mesh), e.g. like this (getting the data is the harder part :) :

example mesh

XML is going… down?

8 July 2008

Google Open Source Blog: Protocol Buffers: Google’s Data Interchange Format

open, but binary… but not the way ODF and OOXML are… and not the first one (see 5th comment)…

but it’s GGL’s, ya know :)

One to url them all…

25 June 2008

WordPress announced sitemaps support, which I thought might be a chance for me to get all the posts – not just the last 10… so I wouldn’t have to do cyclic rss parsing…

but no (here’s mine), it’s just a list of permalinks… none of the author, tag/category or summary info that comes along with atom…
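
To make the difference concrete, here is a minimal sketch of what an atom entry carries beyond the bare permalink. The sample feed is hypothetical (example.wordpress.com is not a real blog of mine) and only the standard Atom elements are used.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace

# Hypothetical one-entry feed, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example blog</title>
  <entry>
    <id>http://example.wordpress.com/2006/11/05/hello/</id>
    <title>hello</title>
    <author><name>someone</name><uri>http://example.wordpress.com/</uri></author>
    <link rel="alternate" href="http://example.wordpress.com/2006/11/05/hello/"/>
    <category term="dataset"/>
    <category term="wordpress"/>
    <summary>first post</summary>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Extract id, title, author, author uri, category terms and summary per entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
            "author": entry.findtext(ATOM + "author/" + ATOM + "name", default=""),
            "author_uri": entry.findtext(ATOM + "author/" + ATOM + "uri", default=""),
            "terms": [c.get("term") for c in entry.findall(ATOM + "category")],
            "summary": entry.findtext(ATOM + "summary", default=""),
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
print(entries[0]["author_uri"], entries[0]["terms"])
```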

So then I thought the pretty permalinks + atom is the answer, i.e.: these are my first ten posts here:

http://marekopel.wordpress.com/2006/11/page/2

and atom for another 10 posts (last 10 from November):

http://marekopel.wordpress.com/2006/11/feed/atom

so the answer to my problem (i.e. atom for first 10 posts) would be:

http://marekopel.wordpress.com/2006/11/page/2/feed/atom

…but it won’t work… why? ;(

still I can go through each day of the calendar, i.e.:

http://marekopel.wordpress.com/2006/11/5/feed/atom

- hopefully you won’t get more than 10 posts a day…
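
The calendar walk can be sketched like this. The URL pattern simply mirrors the day example above (unpadded day number); whether single-digit months need padding is something I have not checked, and the function name is made up.

```python
import calendar


def daily_atom_urls(blog_url, year, month):
    """Build one atom feed URL per calendar day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    base = blog_url.rstrip("/")
    return [
        "%s/%d/%d/%d/feed/atom" % (base, year, month, day)
        for day in range(1, days_in_month + 1)
    ]


urls = daily_atom_urls("http://marekopel.wordpress.com", 2006, 11)
print(urls[4])
```

One request per day means up to 31 requests per blog per month, most of them for days with no posts at all.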

but now I can see I can’t compete with GGL and index all of wordpress.com

once I was moaning about getting only a few thousand crawled blogs using the Next link, when there are hundreds of thousands of WP blogs created each month

now I think I have reached critical mass, and by parsing only blogrolls (also the non-XFN ones) I got:

  • 2006.06.15: 25 996 WP blog URLs
  • a few days later: 55 689 WP blog URLs

and I only managed to parse half of them…

I can have the atom feeds of 1000 WP blogs parsed in ~6 days… even when going parallel (say 5 sessions – my server can handle that :) -> by the time I finish parsing the last thousand, the first one is already outdated…
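
A back-of-the-envelope check of how bad it is, assuming the ~6 days per 1000 blogs is already the total throughput with the parallel sessions included (that is my reading of the numbers above, not a measured rate):

```python
def days_for_full_pass(blog_count, blogs_per_batch=1000, days_per_batch=6):
    """Estimate how many days one pass over all known blogs would take."""
    return blog_count / blogs_per_batch * days_per_batch


full_pass = days_for_full_pass(55689)
print(round(full_pass))  # 334
```

So a single pass over the 55 689 known URLs would take most of a year, which is why a one-week comparison window is out of reach at this rate.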

I hoped to experiment comparing my search engine (based on Solr – more details soon) to GGL Blog Search in a given period of time (say a week)… now even that seems impossible… what to do? what to do? <panic>

“matchmaker matchmaker make me a match”

5 May 2008

“[...] so the blue balloons are men and the pink balloons are women… and the darker balloons are older people and the lighter balloons are younger people [...]”

and it’s… online dating clustering?

p.s. “[...] intelligence is the no1 turn on for people over all [...]” :)

new ‘NEXT’s

5 May 2008

after another 100 000 clicks on the WP ‘next’ link my spider harvested only 6 798 – 6 055 = 743 new blogs
- Are they new blogs?
- Nope.

- Did they have little activity recently?
- I don’t think so.

- Are they spamblogs?
- Not really.

- Why then?
- ???

WordPress analysis – next?

16 April 2008

The schema evolved to:

After 2 iterations (100 000 each) with a week interval my spider following the http://wordpress.com/next/ link found only 1733 blogs. But when I made the spider crawl the found blogs’ blogrolls it found another 4322 blogs (in wordpress.com only!). Why? Does the next link show only the active blogs or the rest are just spamblogs? We’ll find out soon (I hope :) .

Some preliminary analysis results:

When do we blog (nr of posts)?

Day         Posts   Share
Thursday     3568   16.64%
Friday       3433   16.01%
Monday       3358   15.66%
Tuesday      2948   13.75%
Wednesday    2831   13.20%
Sunday       2766   12.90%
Saturday     2539   11.84%
Total       21443  100.00%
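
A breakdown like this can be computed from the post dates alone. The sketch below assumes the dates are available as ISO strings (the input format and the function name are my choice for illustration, not what the spider actually stores).

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_shares(iso_dates):
    """Count posts per weekday and return (day, count, percent) rows, busiest first."""
    counts = Counter(DAY_NAMES[date.fromisoformat(d).weekday()] for d in iso_dates)
    total = sum(counts.values())
    return [(day, n, round(100.0 * n / total, 2)) for day, n in counts.most_common()]


# Hypothetical input: four post dates.
rows = weekday_shares(["2008-04-16", "2008-04-08", "2008-05-05", "2008-07-08"])
for day, n, share in rows:
    print(day, n, share)
```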

parsin’ WordPress

8 April 2008

trying to parse (XMLParser) WordPress.com (via http://wordpress.com/next/) into that schema (img by Workbench):

DAC4WP SQL Diagram

  1. atom has author’s uri – rss2 doesn’t > use atom
  2. at the moment tags and categories in feeds are indistinguishable
  3. only get 10 posts from a blog at a time, but the script runs incrementally
  4. comments are parsed independently
  5. post ids sometimes are pretty permalinks and sometimes contain “/?=”
  6. suffixes “/feed”, “/feed/atom”, “/?feed=atom” work only with permalinks > use permalinks instead of ids
  7. user’s uri <-> blog’s url mismatch:
    • uris often end with “/” – urls never
    • uris are often blank
    • uris often contain “www.” which is an error – urls never
  8. to find pingbacks just check if comment author’s uri = post’s url
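
Points 7 and 8 can be sketched together: normalize both sides before comparing, then treat a comment as a pingback when its author's uri matches a known post url. The function names and the example addresses are hypothetical, and the input is assumed to include the scheme (http://).

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a blog url or author uri to a comparable host + path form."""
    if not url or not url.strip():
        return ""  # uris are often blank
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # "www." in a uri is an error, urls never have it
    return host + parts.path.rstrip("/")  # uris often end with "/", urls never


def same_blog(author_uri, blog_url):
    """True when a non-blank author uri points at the given blog url."""
    key = normalize(author_uri)
    return key != "" and key == normalize(blog_url)


def is_pingback(comment_author_uri, known_post_urls):
    """True when the comment author's uri equals the url of a known post."""
    key = normalize(comment_author_uri)
    return key != "" and key in {normalize(u) for u in known_post_urls}


posts = ["http://example.wordpress.com/2008/04/08/some-post"]
print(same_blog("http://www.example.wordpress.com/", "http://example.wordpress.com"))
print(is_pingback("http://example.wordpress.com/2008/04/08/some-post/", posts))
```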

TO DO:

  1. parse blogrolls with XFN rel > author-author rel
  2. use WordNet::Similarity > concept-concept rel
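
For the first TO DO item, a minimal sketch of pulling links and their XFN rel values out of a page. It collects every link on the page, because how a blogroll is marked up varies between themes; the sample HTML and the class name are made up.

```python
from html.parser import HTMLParser


class BlogrollParser(HTMLParser):
    """Collect (href, rel values) pairs for every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # XFN puts space-separated values such as "friend met" in rel;
            # links without rel (the non-XFN ones) get an empty list.
            rel = (attributes.get("rel") or "").split()
            self.links.append((href, rel))


# Hypothetical blogroll fragment.
SAMPLE_HTML = """
<ul>
  <li><a href="http://a.example.wordpress.com" rel="friend met">A</a></li>
  <li><a href="http://b.example.wordpress.com">B</a></li>
</ul>
"""

parser = BlogrollParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Each (href, rel) pair with a non-empty rel list is a candidate author-author relation; the rest are still useful as newly discovered blog URLs.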

visualcomplexity

4 April 2008

visualcomplexity.com | A visual exploration on mapping complex networks via ze’s page

nice hub

some already seen (liveplasma), some not (silobreaker)

gotta get time to explore more :)

Many Eyes: Social networks and more

24 January 2008

Many Eyes: Social networks and more

just had a glance @ meta-social part of the Many Eyes… great thing… a new era for spreadsheet diagrams…

CMS in SW

23 April 2007

during a seminar at SSI I came across a very nice site describing CMSes and more… what's more interesting, every system is described in RDF, and the site itself is a global CMS (or a Wiki?)… will this come in handy for me?

another one, which nicely lays out CMS feature comparisons and user ratings, is CMS Matrix

and to complete the set, the well-known OS CMS with installations for test-driving