Online Policy Group Software Utilities
The Online Policy Group has developed a set of software utilities to assist in ongoing research about policy issues such as online access, privacy, digital defamation, and the digital divide.

The software utilities package includes the following utilities. Before you can run any of them, you must complete the set up, download and installation, and configuration procedures:

RandomPageFinder
To obtain a list of random sites from the Internet.

SiteBlockMapper
To test if web sites are blocked by Internet blocking software.

WebMasher
To capture a list of web addresses from the results of a given search on a search engine.

WebGrab
To capture a corpus of text from web pages identified by a list of web addresses.

Please email bug reports, corrections, and requests for enhancements to utilities@onlinepolicy.org. Our volunteers will handle them on a time-available basis.

The source code for this software is distributed freely as a matter of full disclosure regarding the research techniques of the Online Policy Group and to encourage other organizations to develop open source software utilities for conducting similar research.

Set Up for the Software Utilities

Since these software utilities run in a Java runtime environment, you must have the Java 1.2 or 1.3 runtime environment installed on your machine before you execute the software.

To determine if the Java 1.2 or 1.3 runtime environment is already installed on your computer:

  1. Search for the jre directory on your system by popping up the Start menu, choosing the Search item, then the For Files or Folders item.
  2. At the search text input area, type jre and click on the Search button.
  3. If you do not find the jre directory, skip to the last step in this procedure. If you do find the jre directory, change directories into
    C:\Program Files\JavaSoft\Jre\1.2\bin or
    C:\Program Files\JavaSoft\Jre\1.3\bin
    and look for a file named java.exe .
  4. If it is there or in another similar subdirectory of a jre directory somewhere else on your machine, then the Java 1.2 or 1.3 runtime environment is most likely already installed on your machine.
  5. If the Java 1.2 or 1.3 runtime environment is not installed on your machine, you can obtain it for free at http://java.sun.com/j2se.
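
If a Java runtime of any version is already available, a small program can also report which runtime it is. The following is a minimal sketch, not part of onlinepolicy.jar. The class name JavaVersionCheck is hypothetical, and the code reads the standard java.version and java.home system properties.

```java
// Hypothetical helper, not part of onlinepolicy.jar.
class JavaVersionCheck {
    // Returns true when the version string begins with "1.2" or "1.3",
    // the two releases named in the procedure above.
    static boolean isNamedRelease(String version) {
        return version.startsWith("1.2") || version.startsWith("1.3");
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        System.out.println("java.home    = " + System.getProperty("java.home"));
        if (!isNamedRelease(version)) {
            System.out.println("Note: this is not a 1.2 or 1.3 runtime.");
        }
    }
}
```

The java.home value shows the directory of the runtime that is executing the program, which can help locate the bin subdirectory containing java.exe.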

Make sure that you have specified the directory containing java.exe in the command pathname so that you may run the utilities. For each operating system, there is a separate procedure for specifying the command pathname.

On Windows, you can edit your command pathname to include the directory containing java.exe by editing the AUTOEXEC.BAT file as follows:

  1. On Windows machines, start up the MS-DOS Prompt by selecting the Start menu, pulling right to the Programs menu (and possibly also to the Accessories sub-menu), then selecting the MS-DOS Prompt menu item.
  2. Change directories to the C:\ directory by typing to the MS-DOS prompt:
    cd C:\
  3. Copy the AUTOEXEC.BAT file to a backup file called AUTOEXEC.BAK by typing the command:
    copy AUTOEXEC.BAT AUTOEXEC.BAK
  4. Start up the Notepad or WordPad text editing application and open the AUTOEXEC.BAT file.
  5. If there is a line that starts with SET PATH then add a semi-colon (;) followed by the MS-DOS pathname representing the Java 1.2 or 1.3 directory, to the end of that line, for example:
    ;C:\PROGRA~1\JAVASOFT\JRE\1.2\BIN
  6. If there is no line that starts with SET PATH, add a line at the end of the AUTOEXEC.BAT file that appends the MS-DOS name of the bin subdirectory containing the java command, in the directory where the JRE is located on your computer, for example:
    SET PATH=%PATH%;C:\PROGRA~1\JAVASOFT\JRE\1.2\BIN
  7. Save the modified AUTOEXEC.BAT file.
  8. Reboot your computer.
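
After the reboot, one way to confirm the result is to list the directories in the command pathname and look for the java command in each. The sketch below is only an illustration under stated assumptions: the class name PathCheck is hypothetical, and System.getenv requires Java 5 or later (it is not usable on the 1.2 or 1.3 runtimes), so the check would have to be run with a newer runtime.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of onlinepolicy.jar.
class PathCheck {
    // Splits a PATH-style string on the separator and drops empty entries.
    static List<String> entries(String path, String separator) {
        List<String> result = new ArrayList<String>();
        if (path == null) {
            return result;
        }
        for (String dir : path.split(Pattern.quote(separator))) {
            if (dir.trim().length() > 0) {
                result.add(dir.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The command is java.exe on Windows and java elsewhere.
        String exe = File.separatorChar == '\\' ? "java.exe" : "java";
        boolean found = false;
        for (String dir : entries(System.getenv("PATH"), File.pathSeparator)) {
            if (new File(dir, exe).isFile()) {
                System.out.println("Found " + exe + " in " + dir);
                found = true;
            }
        }
        if (!found) {
            System.out.println(exe + " was not found in any PATH directory.");
        }
    }
}
```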

[Need directions for other platforms here.]

Download and Install the Software Utilities

Once you have the Java 1.2 or 1.3 runtime environment installed on your computer, you can download the software utilities file called onlinepolicy.jar by following this procedure:

  1. On Windows machines, start up the MS-DOS Prompt by selecting the Start menu, pulling right to the Programs menu (and possibly also to the Accessories sub-menu), then selecting the MS-DOS Prompt menu item.
  2. Create a directory called C:\opg by typing the following command at the MS-DOS prompt:
    mkdir C:\opg
  3. Using your favorite web browser, go to the web page where the onlinepolicy.jar file is stored at http://www.onlinepolicy.org/research/ospa/utilities/utilities.htm (this page!).
  4. Click on onlinepolicy.jar.zip to save the file archive of onlinepolicy.jar on your computer. Be sure to save the file into the C:\opg directory.

    If your web browser software does not permit you to save the file archive to disk when you click on the link above, then:

    • On Windows, try clicking the right mouse over the link and choose the "Save Link As..." or "Save Target As..." menu item to save the file to disk.
    • On Mac, click and hold down the mouse over the link and choose the "Save Link As..." or "Save Target As..." menu item to save the file to disk.

    If your web browser software does not automatically unzip the file archive to the onlinepolicy.jar file, then use an unzip software program such as WinZip to do so.

Configure the Software Utilities

CLASSPATH is a system environment variable required to run the OPG software utilities.

Set your CLASSPATH variable as follows:

  1. On Windows machines, start up the MS-DOS Prompt by selecting the Start menu, pulling right to the Programs menu (and possibly also to the Accessories sub-menu), then selecting the MS-DOS Prompt menu item.
  2. Type in the command(s) appropriate for your computer's operating system and command shell listed in the table below:
    Operating System        Shell      Command
    Windows 95/98/NT/2000   DOS        set CLASSPATH=.;onlinepolicy.jar
    Unix                    bash, ksh  export CLASSPATH=".:onlinepolicy.jar"
    Unix                    sh         CLASSPATH=".:onlinepolicy.jar"
                                       export CLASSPATH
    Unix                    csh        setenv CLASSPATH ".:onlinepolicy.jar"
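
To confirm that the setting took effect, a short program can print the class path that the Java runtime actually sees. This is a minimal sketch, assuming a Java 5 or later runtime. The class name ClasspathCheck is hypothetical and is not one of the OPG utilities. When no class path is given on the java command line, the standard java.class.path property reflects the CLASSPATH variable.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper, not part of onlinepolicy.jar.
class ClasspathCheck {
    // Returns true if any entry of the class path string names the given file.
    // The separator is ";" on Windows and ":" on Unix.
    static boolean hasEntry(String classPath, String separator, String fileName) {
        if (classPath == null) {
            return false;
        }
        for (String entry : classPath.split(Pattern.quote(separator))) {
            if (new File(entry).getName().equals(fileName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String cp = System.getProperty("java.class.path");
        System.out.println("java.class.path = " + cp);
        if (hasEntry(cp, File.pathSeparator, "onlinepolicy.jar")) {
            System.out.println("onlinepolicy.jar is on the class path.");
        } else {
            System.out.println("onlinepolicy.jar is NOT on the class path.");
        }
    }
}
```

Because the class path entries in the table are relative, the check is only meaningful when run from the directory that holds onlinepolicy.jar.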

Run the Software Utilities

You can run any of the software utilities from a command line which varies according to the computer and operating system you are using.

On Windows machines, if you have not already started up MS-DOS, start up the MS-DOS Prompt as described in the Configure the Software Utilities section.

Before attempting to run any of the utilities, change directories to the directory where the software utilities are located by typing:

On Windows—
cd C:\opg

[Instructions for Mac and Unix machines should go here.]

Specific directions for running each of the software utilities are listed below.

RandomPageFinder

The RandomPageFinder utility finds n random sites, where n is specified by the user. It constructs a random IP address and then tests whether a web page exists at that address. The test is a simple connection to the IP address followed by a search for a title (the title tag inside the HTML head).

RandomPageFinder continues until the desired number of pages is found. Because most random addresses do not host a web page, it may take hours or even days to find a handful of pages.
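
The approach described above can be sketched as follows. This is an illustration only, not the RandomPageFinder source: the class name RandomPageSketch, the timeouts, and the default limit of 5 attempts are assumptions, and the code needs a Java 7 or later runtime. It does not skip private or reserved address ranges.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the RandomPageFinder source.
class RandomPageSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Builds a dotted-quad address from four random octets in 0..255.
    static String randomIp(Random rng) {
        return rng.nextInt(256) + "." + rng.nextInt(256) + "."
             + rng.nextInt(256) + "." + rng.nextInt(256);
    }

    // Returns the trimmed text of the first title element, or null if none.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Connects to http://<ip>/ and returns the page title, or null on failure.
    static String probe(String ip) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://" + ip + "/").toURL().openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "ISO-8859-1"))) {
                StringBuilder page = new StringBuilder();
                char[] buf = new char[4096];
                int n;
                // Read at most about 64K characters; the title is near the top.
                while (page.length() < 65536 && (n = in.read(buf)) != -1) {
                    page.append(buf, 0, n);
                }
                return extractTitle(page.toString());
            }
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        int wanted = 2;
        // Kept small so the sketch ends quickly; a real run needs far more.
        int maxAttempts = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        int found = 0;
        Random rng = new Random();
        for (int attempt = 0; attempt < maxAttempts && found < wanted; attempt++) {
            String ip = randomIp(rng);
            String title = probe(ip);
            if (title != null) {
                found++;
                System.out.println(ip + "\t" + title);
            }
        }
        System.out.println("Found " + found + " of " + wanted + " pages.");
    }
}
```

The attempt limit plays the same role as a cap on the total number of tries: the loop gives up even if the wanted number of pages has not been found.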

RandomPageFinder Example:

Here is an example that generates two random sites into a report file called random.0103231719.txt and logs to a file called random.0103231719.log.txt:

java RandomPageFinder -count 2 -report random.0103231719.txt > random.0103231719.log.txt

To stop execution of the utility before it has completed, you can type Ctrl-C on most operating systems.

Note: On Unix, you can run the utility in the "background", so you can run multiple commands simultaneously. If you are running the bash shell, you can logout and leave the commands running on the computer until you have a chance to come back and retrieve the results.

RandomPageFinder Usage:

java RandomPageFinder -count n [-report fn] [-max n]

    Argument     Meaning
    count n      Total number of pages to be generated.
    report fn    The name of a report file. This is an optional argument.
    max n        The total number of attempts. That is, give up after n even if 'count' number of pages has not been found.
    test         Runs it over a set of built-in, known IP addresses. This confirms the connection is working and the application is reporting correctly.

SiteBlockMapper

This program reads a list of web addresses (URLs) and determines if each one is reachable. For each web address that is reachable, SiteBlockMapper follows all the links on the page pointed at by the URL to determine if they are reachable. This process can be repeated for a specified depth.

The objective is to determine if a given web site is available while an Internet filter is active.

The scanner ignores all Javascript code. On some pages, links are embedded in Javascript as pull down menus or combo boxes. These links are not followed.

Occasionally, SiteBlockMapper finds a page which confuses the scanner (i.e. the scanner chokes on some obscure HTML). In those cases, the message 'Scan Error' occurs for that web page.

SiteBlockMapper scans each web page only once. However, the SiteBlockMapper output may repeat a given web address since the same web address may be encountered from multiple links within a site.

SiteBlockMapper checks links that point to pages on other sites, but it never follows links from those pages onto sites that were not specified in the original site list.
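
The traversal rules described above (a depth limit, one scan per page, and no following of links onto unlisted sites) can be sketched as a breadth-first walk. This is an illustration under assumptions, not the SiteBlockMapper source: the class DepthLimitedCrawl and the PageSource interface are hypothetical, and the network layer is replaced by an in-memory map so the logic can be tried without a connection. It needs Java 8 or later.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the SiteBlockMapper source.
class DepthLimitedCrawl {
    // Stand-in for the network layer: returns the links found on a page,
    // or null if the page is unreachable.
    interface PageSource {
        List<String> linksOn(String url);
    }

    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? "" : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    // Returns a map from each address checked to whether it was reachable.
    static Map<String, Boolean> crawl(List<String> seeds, int maxDepth, PageSource source) {
        Set<String> listedHosts = new HashSet<>();
        Map<String, Integer> depth = new HashMap<>();   // also marks pages as seen
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            listedHosts.add(hostOf(seed));
            if (!depth.containsKey(seed)) {
                depth.put(seed, 0);
                queue.add(seed);
            }
        }
        Map<String, Boolean> result = new LinkedHashMap<>();
        while (!queue.isEmpty()) {
            String url = queue.remove();
            List<String> links = source.linksOn(url);   // each page is scanned once
            result.put(url, links != null);
            boolean onListedSite = listedHosts.contains(hostOf(url));
            if (links == null || depth.get(url) >= maxDepth || !onListedSite) {
                continue;   // unreachable, too deep, or on an unlisted site
            }
            for (String link : links) {
                if (!depth.containsKey(link)) {
                    depth.put(link, depth.get(url) + 1);
                    queue.add(link);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://a.example/", Arrays.asList("http://a.example/p", "http://b.example/"));
        web.put("http://a.example/p", Arrays.asList("http://a.example/gone"));
        web.put("http://b.example/", Arrays.asList("http://b.example/x"));
        Map<String, Boolean> report = crawl(Arrays.asList("http://a.example/"), 2, web::get);
        for (Map.Entry<String, Boolean> e : report.entrySet()) {
            System.out.println((e.getValue() ? "OK      " : "BLOCKED ") + e.getKey());
        }
    }
}
```

In this sketch an off-site page is tested for reachability but its own links are never queued, which is one reading of the rule above.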

SiteBlockMapper Example:

Here is an example that runs SiteBlockMapper on a list of sites in a file called siteList.txt, following links two levels deep:

java SiteBlockMapper -maxdepth 2 siteList.txt

SiteBlockMapper Usage:

java SiteBlockMapper -maxdepth n [-debug i] siteList.txt

    Argument       Meaning
    maxdepth n     The number of levels of links to follow. Any value over 2 is likely to explode into a large set of pages.
    debug i        Turns on debug messages for node 'i' (each page is displayed with a node number). This is used for diagnosing problem pages.
    siteList.txt   A list of URLs, one per line. Each URL must be fully specified (i.e. http://...).

WebMasher

WebMasher is a simple web browser that allows you to select a web page and then obtain a list of the web addresses for all the links on that page. You can use WebMasher to browse keyword searches on a search engine and collect a list of the web pages included in the search engine results.

Once you have collected the list of web addresses and edited it, you can save the list into a file. This same file format can be read by WebGrab.
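
The core step, collecting the web addresses of all links on a page, can be sketched as below. This is a rough illustration and not the WebMasher source: the class name LinkListSketch is hypothetical, and a regular expression is used in place of a real HTML parser, so unusual markup will be missed. Relative links are resolved against the address of the page. It needs Java 5 or later.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the WebMasher source.
class LinkListSketch {
    // Matches the quoted href value of an anchor tag.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the absolute addresses of the links in html, in page order,
    // with duplicates removed.
    static List<String> extractLinks(String baseUrl, String html) {
        Set<String> seen = new LinkedHashSet<String>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                seen.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        String html = "<a href=\"results?page=2\">Next</a> "
                    + "<a href='http://example.org/'>Hit</a>";
        for (String link : extractLinks("http://search.example/find", html)) {
            System.out.println(link);
        }
    }
}
```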

WebMasher Usage:

java WebMasher

WebGrab

This program reads the file produced by the WebMasher (which is a list of web addresses) and loads the HTML code from each web address onto the local disk. WebGrab generates a unique filename for each HTML file based on the web address for the file.

java WebGrab webmashfile

    Argument      Meaning
    webmashfile   The name of the file saved by WebMasher
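
One way to derive a file name from a web address and save the page is sketched below. The naming scheme is an assumption made for illustration; the scheme WebGrab actually uses may differ. The class name GrabSketch is hypothetical, the list file is assumed to hold one address per line, and the code needs Java 8 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch, not the WebGrab source.
class GrabSketch {
    // Drops the scheme, replaces every character outside [A-Za-z0-9.-] with
    // an underscore, and appends a hash of the full address so that two
    // addresses that clean up to the same text are unlikely to collide.
    static String fileNameFor(String url) {
        String cleaned = url.replaceFirst("^[A-Za-z][A-Za-z0-9+.-]*://", "")
                            .replaceAll("[^A-Za-z0-9.-]", "_");
        return cleaned + "-" + Integer.toHexString(url.hashCode()) + ".html";
    }

    // Downloads one address into dir and returns the path of the saved file.
    static Path grab(String url, Path dir) throws IOException {
        Path target = dir.resolve(fileNameFor(url));
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java GrabSketch listfile");
            return;
        }
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue;
            }
            try {
                System.out.println(url + " -> " + grab(url, Paths.get(".")));
            } catch (IOException | IllegalArgumentException e) {
                System.out.println(url + " failed: " + e);
            }
        }
    }
}
```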

Source Code

The source code for the Online Policy Group software utilities is included in the onlinepolicy.jar file, and the source code is accessible by unzipping the onlinepolicy.jar file.
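
Because a jar file uses the zip format, its contents can also be listed from Java without unzipping it first. The sketch below prints the entries that look like source files. It assumes the source files carry the .java extension, which this page does not state, and the class name JarSourceLister is hypothetical. It needs Java 7 or later.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of onlinepolicy.jar.
class JarSourceLister {
    // Assumption: source files inside the archive end in ".java".
    static boolean isSourceEntry(String name) {
        return name.endsWith(".java");
    }

    public static void main(String[] args) throws IOException {
        String jarPath = args.length > 0 ? args[0] : "onlinepolicy.jar";
        if (!new File(jarPath).isFile()) {
            System.out.println("Not found: " + jarPath);
            return;
        }
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isSourceEntry(name)) {
                    System.out.println(name);
                }
            }
        }
    }
}
```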

As a condition of use of the source code, please send any code improvements to opgcode@onlinepolicy.org.

Copyright ©2000-2001
Online Policy Group, Inc.
All rights reserved.