HTTrack

HTTrack is a free (GPL) and easy-to-use offline browser utility.

Basically, it allows you to download the contents of an Internet site to a local directory. It recursively builds a complete set of directories, getting HTML, images, and other files from the server and stashing them on your computer.  The result is a static HTML snapshot of the original site, even if it was built using some database-centered, dynamic page tool.

I find it great for archiving copies of my sites before making major changes, or shutting them down.

Using HTTrack

There are versions of HTTrack for multiple OS environments.  The one I use is for a standard Linux system.  I have configured it to run from a script as a CRON task.  The script reads a series of files that list small collections of web sites.  It only processes one site at a time, to prevent overloading remote sites that are on shared servers. It stashes each collection in a designated directory on my local server for local backup and browsing.

One of the nice features of the -%L option is that HTTrack automatically builds an index of the site collections in the target folder.

httrack -%U apache -F "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" \
     -%L LinkList-01 -O /home/Mirror/Mirror-01 --update
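A cron-driven script of the kind described above might look roughly like this. It is only a sketch, not my actual script: the LIST_DIR and MIRROR_ROOT locations, the HTTRACK override, and the LinkList-NN to Mirror-NN naming rule are assumptions for illustration. The httrack options are the same ones used in the command above.

```shell
#!/bin/sh
# Sketch of a cron-driven mirror script.
# LIST_DIR and MIRROR_ROOT are hypothetical locations; adjust to taste.
# HTTRACK can be overridden (for example with "echo") for a dry run.
LIST_DIR="${LIST_DIR:-/home/Mirror/lists}"
MIRROR_ROOT="${MIRROR_ROOT:-/home/Mirror}"
HTTRACK="${HTTRACK:-httrack}"
USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"

mirror_collections() {
    for list in "$LIST_DIR"/LinkList-*; do
        # Skip the unexpanded pattern when no list files exist.
        [ -f "$list" ] || continue
        # Assumed naming rule: LinkList-01 is mirrored into Mirror-01.
        suffix="${list##*LinkList-}"
        # Run in the foreground so only one httrack is active at a time.
        "$HTTRACK" -%U apache -F "$USER_AGENT" \
            -%L "$list" -O "$MIRROR_ROOT/Mirror-$suffix" --update
    done
}
```

Calling mirror_collections at the end of the script runs the collections one after another. Setting HTTRACK=echo first prints the commands instead of running them, which is a cheap way to check the paths before handing the script to cron.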

The file list (LinkList-01) is a simple list of targeted sites, one per line.   I found that WordPress sites seem to like to be listed as a full URL, “http://sob.boatswain.us/”, while my MediaWiki sites won’t work with that and need to be listed without the scheme and trailing slash, simply as “sysadm.equoria.com”.
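As a small illustration, a link list with both styles of entry could be created like this. The two entries are the examples from the paragraph above; putting them in the same file is just an assumption for the sketch.

```shell
# Write a two-line link list, one target site per line.
cat > LinkList-01 <<'EOF'
http://sob.boatswain.us/
sysadm.equoria.com
EOF
```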

The user agent (-F) is explained in the next section.

User Agent 403 Rejections

Many sites appear to have a problem with HTTrack's default User Agent identification.

Like a good boy, HTTrack identifies itself when it connects, and immediately gets rejected.

Using wget as a testing tool, you can see that it is the HTTrack User Agent that triggers the forbidden message.

[root@neptune temp]# ls -l
total 0
[root@neptune temp]#
[root@neptune temp]# wget -U "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" www.garg.com 2>&1 | egrep HTTP
HTTP request sent, awaiting response... 403 Forbidden
[root@neptune temp]# ls -l
total 0
[root@neptune temp]#
[root@neptune temp]# wget www.garg.com 2>&1
[root@neptune temp]# ls -l
total 4
-rw-r--r--. 1 root root 3288 Mar 24 09:55 index.html
[root@neptune temp]#
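The same check can be wrapped in a small helper that prints only the final HTTP status code for a given user agent. This is a sketch: the function name http_status is made up for illustration, and it assumes wget prints the response headers in its usual "  HTTP/1.1 403 Forbidden" form when run with -S.

```shell
#!/bin/sh
# http_status USER_AGENT URL
# Fetch URL with the given user agent, discard the body, and print the
# status code of the last response (after any redirects).
http_status() {
    wget -S -O /dev/null -U "$1" "$2" 2>&1 |
        awk '/^ *HTTP\/[0-9.]+ / { code = $2 } END { print code }'
}
```

Running it once with the HTTrack user agent string and once with a browser string against the same site shows whether the user agent alone is what triggers the rejection:

http_status "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" www.garg.com
http_status "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" www.garg.com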

The rejection comes from the security software on the server. The problem is that it simply does not have HTTrack registered in its database of approved agents.

Use the -F option in httrack to change the user agent message.

F  user-agent field (-F "user-agent name") (--user-agent)

In the example above, I used:

-F "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"

The user agent text “Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1” came from the Firefox pages on User Agent String.Com.

For additional information, open the HTTrack Users Guide and scroll down to the section on Browser Options.

