Hi there!
Because some of you asked how I grab and thumbnail whole websites (here’s an example, and I wrote about it in this post), here is a brief HOWTO.
Imagine you have a Linux system without graphical support. How do you render complex graphical content and take a screenshot of it? As it turns out, grabbing websites on such a system is quite simple.
Prerequisites:
- a Linux operating system (Debian is fine)
- khtml2png (I used khtml2png_2.7.6_i386.deb from here)
- a running X server (Xvfb does it for me)
- kdelibs4c2a
- libkonq4
This is it!
The trick now is: on a system working as a server, you usually don’t want a running X server. So I just installed Xvfb, which is a “virtual framebuffer ‘fake’ X server”. It runs in the background, and khtml2png uses its display.
First, install Xvfb and several libs:
apt-get install xvfb kdelibs4c2a libkonq4
Hit ‘y’ to resolve the dependencies!
Now, get khtml2png from http://sourceforge.net/projects/khtml2png/ and install it:
dpkg -i khtml2png_2.7.6_i386.deb
Then, start your ‘fake’ X server:
/usr/bin/Xvfb :2 -screen 0 1920x1200x24
Of course, you may adjust the resolution to your needs. But remember the display number (:2) you set for Xvfb.
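If you want the ‘fake’ X server to come up at boot and stay in the background, a boot-script fragment along these lines does it (display number and resolution match the command above; the log path is just an example):

```shell
# e.g. in /etc/rc.local: start the virtual framebuffer on display :2
# and put it in the background (the log path is an example)
/usr/bin/Xvfb :2 -screen 0 1920x1200x24 >/var/log/xvfb.log 2>&1 &
```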
And finally, you may use khtml2png to fetch any website you like:
/usr/bin/khtml2png2 --display :2 --width 1024 --height 768 http://www.thomasgericke.de/ /tmp/website.png
Don’t worry about the fact that the package is named khtml2png while the binary is called khtml2png2. It’s okay!
I have a little magical wrapper around all that which gets URLs out of a database and performs some checks. Images are saved with wget and converted to PNG; websites are fetched with khtml2png. Both are saved and thumbnailed on the fly with PHP.
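My wrapper isn’t published, but a minimal sketch of the grab-and-thumbnail loop could look like this in shell. Assumptions that differ from my real setup: the URLs come from a plain text file instead of a database, thumbnailing is done with ImageMagick’s `convert` instead of PHP, and `khtml2png2` is found via `$PATH`; all file names are examples.

```shell
#!/bin/sh
# Sketch: grab each URL from a list file and create a thumbnail next to it.
# The list file, the /tmp paths and the 200px thumbnail width are examples.
grab_all() {
    while read -r url; do
        # derive a file name from the URL (non-alphanumerics become '_')
        name=$(printf '%s' "$url" | sed 's/[^A-Za-z0-9]/_/g')
        # full-size grab with khtml2png2 on the Xvfb display
        khtml2png2 --display :2 --width 1024 --height 768 \
            "$url" "/tmp/$name.png" || continue
        # 200px-wide thumbnail (convert stands in for my PHP code here)
        convert "/tmp/$name.png" -thumbnail 200x "/tmp/${name}_thumb.png"
    done < "$1"
}
```

Called as `grab_all urls.txt`, e.g. from cron or a small daemon.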
I call khtml2png via cron like this:
/usr/bin/khtml2png2 --display :2 \
  --width 1024 \
  --height 768 \
  --time 42 \
  --disable-js \
  --disable-java \
  --disable-plugins \
  --disable-redirect \
  --disable-popupkiller \
  http://www.thomasgericke.de/ \
  /tmp/website.png
My script is started every minute and checks if new URLs have to be fetched. It also checks if existing PNGs are older than 24 hours and, if so, the URL will be fetched and the PNG overwritten.
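The 24-hour check is easy to do with `find`’s `-mmin` test; here is a sketch (the function name and paths are illustrative, not taken from my production script):

```shell
#!/bin/sh
# Return success (0) if the PNG is missing or older than 24 hours,
# i.e. the URL should be fetched again and the PNG overwritten.
needs_refresh() {
    png="$1"
    # -mmin +1440 matches files modified more than 24*60 minutes ago
    [ ! -f "$png" ] || [ -n "$(find "$png" -mmin +1440)" ]
}
```

A cron entry such as `* * * * * /usr/local/bin/grab-sites.sh` (the script name is an example) then runs the check every minute.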
Just let me know, if you have any further questions.
Hi there Thomas. Yours is one of only a handful of articles I could find on using khtml2png. I have it installed and running for most URLs. However, for some web pages I encounter a DOM::DOMException from khtml2png2.
/usr/local/bin/khtml2png2 --width 1024 --height 768 --time 42 --disable-js --disable-java --disable-plugins --disable-redirect --disable-popupkiller http://www.google.com/ google.png
terminate called after throwing an instance of 'DOM::DOMException'
KCrash: Application 'khtml2png2' crashing...
I have tried this command with different parameters. The same exception is always thrown. yahoo.com also results in the same exception. I wonder if you’ve encountered this exception before, and what solutions you were able to find?
Thanks for your attention.
@Johann:
I’m having the same issue. Since I run a daemon that produces thumbnails almost on the fly, I restart all the required processes once in a while.
I have not found a final solution yet.