The Webalizer - A web server log file analysis tool
Copyright 1997-2013 by Bradford L. Barrett

Distributed under the GNU GPL. See the files "COPYING" and
"Copyright" supplied with the distribution for additional info.

What is The Webalizer?
----------------------

The Webalizer is a web server log file analysis program which produces
usage statistics in HTML format for viewing with a browser. The results
are presented in both columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage statistics are
presented, along with the ability to display usage by site, URL, referrer,
user agent (browser), search string, entry/exit page, username and country
(some information is only available if supported and present in the log
files being processed). Processed data may also be exported into most
database and spreadsheet programs that support tab delimited data formats.

The Webalizer supports CLF (common log format) log files, as well as
Combined log formats as defined by NCSA and others, and variations
of these which it attempts to handle intelligently. In addition, The
Webalizer supports wu-ftpd xferlog (FTP) formatted logs, squid proxy logs
and W3C extended format logs.

Gzip compressed logs may be used as input directly. Any log filename
that ends with a '.gz' extension will be assumed to be in gzip format and
uncompressed on the fly as it is being read. The Webalizer now also has
the ability to handle BZip2 compressed logs, if enabled at compile time.
Similar to gzipped logs, any log filename that ends with a '.bz2' will be
assumed to be in bzip2 format and uncompressed on the fly as it is being
read.

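The extension-based handling of compressed logs can be pictured with a
short Python sketch. This is not The Webalizer's own code (the program
is written in C); the function name 'open_log' and the choice of the
latin-1 encoding are assumptions made for illustration only.

```python
import bz2
import gzip


def open_log(path):
    """Open a log file as text, choosing a decompressor by extension.

    Hypothetical helper: '.gz' means gzip, '.bz2' means bzip2, anything
    else is read as plain text. latin-1 is assumed so that any byte
    value decodes without error.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")
```

A caller would then iterate over 'open_log("access_log.gz")' line by
line, exactly as it would for an uncompressed file.
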
For sites that do not enable hostname lookups (DNS resolution) on their
web servers (and have only IP addresses in their logs), The Webalizer
provides its own internal DNS lookup capability as well as geolocation
services (GeoDB). The optional GeoIP library from MaxMind Inc. is also
supported and may be used instead of the native GeoDB database.

A utility program, "The Webalizer (DNS) Cache file Manager", or 'wcmgr',
is also provided, which allows the creation and manipulation of the DNS
cache files used and produced by The Webalizer. See the file DNS.README
for additional information regarding DNS support.

This documentation applies to The Webalizer Version 2.23

Running the Webalizer
---------------------

The Webalizer was designed to be run from a Unix command line prompt or
as a cron job. There are several command line options which will modify
the results it produces, and configuration files can be used as well.
The format of the command line is:

        webalizer [options ...] [log-file]

Where 'options' can be one or more of the supported command line
switches described below. 'log-file' is the name of the log file
to process (see below for more detailed information). If a dash
("-") is specified for the log-file name, STDIN will be used.


Once executed, the general flow of the program follows:

 o  A default configuration file is scanned for. A file named
    'webalizer.conf' is searched for in the current directory, and if
    found, its configuration data is parsed. If the file is not
    present in the current directory, the file '/etc/webalizer.conf'
    is searched for and, if found, is used instead.

 o  Any command line arguments given to the program are parsed. This
    may include the specification of a configuration file, which is
    processed at the time it is encountered.

 o  If a log file was specified, it is opened and made ready for
    processing. If no log file was given, or the filename '-' is
    specified on the command line, STDIN is used for input.

 o  If an output directory was specified, the program does a 'chdir' to
    that directory in preparation for generating output. If no output
    directory was given, the current directory is used.

 o  If a non-zero number of DNS child processes was specified, they
    will be started, and the specified log file will be processed,
    either creating or updating the specified DNS cache file.

 o  If no hostname was given, the program attempts to get the hostname
    using a uname system call. If that fails, 'localhost' is used.

 o  A history file is searched for. This file keeps previous month
    totals used on the main index.html page. The default file is
    named 'webalizer.hist' and is kept in the specified output
    directory, but the name may be changed using the "HistoryName"
    configuration file keyword.

 o  If incremental processing was specified, a data file is searched for
    and loaded if found, containing the 'internal state' data of the
    program at the end of a previous run. The default file is named
    'webalizer.current' and is kept in the specified output directory,
    but the name may be changed using the "IncrementalName"
    configuration file keyword.

 o  Main processing begins on the log file. If the log spans multiple
    months, a separate HTML document is created for each month.

 o  After main processing, the main 'index.html' page is created, which
    has totals by month and links to each month's HTML document.

 o  A new history file is saved to disk, which includes totals generated
    by The Webalizer during the current run.

 o  If incremental processing was specified, a data file is written that
    contains the 'internal state' data at the end of this run.


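The default configuration file lookup in the first step of the flow can
be sketched in Python as follows. This is only an illustration of the
search order described above, not the program's actual implementation;
the function name and its parameters are hypothetical.

```python
import os


def find_default_config(cwd=".", etc_dir="/etc", name="webalizer.conf"):
    """Return the first default configuration file that exists, or None.

    Mirrors the documented order: the current directory is checked
    first, then /etc. The parameters exist only so the sketch can be
    tried against scratch directories.
    """
    for directory in (cwd, etc_dir):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A configuration file named on the command line is a separate matter: as
noted above, it is processed at the time it is encountered while the
arguments are parsed.
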
Incremental Processing
----------------------

Version 1.2x of The Webalizer added incremental run capability. Simply
put, this allows processing large log files by breaking them up into
smaller pieces, and processing these pieces instead. What this means
in real terms is that you can now rotate your log files as often as you
want, and still be able to produce monthly usage statistics without the
loss of any detail. This is accomplished by saving and restoring all
relevant internal data to a disk file between runs. Doing so allows the
program to 'start where it left off' so to speak, and allows the
preservation of detail from one run to the next.

Some special precautions need to be taken when using the incremental
run capability of The Webalizer. Configuration options should not be
changed between runs, as that could cause corruption of the internal
stored data. For example, changing the MangleAgents level will cause
different representations of user agents to be stored, producing invalid
results in the user agents section of the report. If you need to change
configuration options, do it at the end of the month after normal
processing of the previous month and before processing the current month.
You may also want to delete the 'webalizer.current' file (or whatever
name was specified using the "IncrementalName" configuration option).

The Webalizer also attempts to prevent data duplication by keeping
track of the timestamp of the last record processed. This timestamp
is then compared to current records being processed, and any records
that were logged before that timestamp are ignored. This, in
theory, should allow you to re-process logs that have already been
processed, or process logs that contain a mix of processed/not yet
processed records, and not produce duplication of statistics. The
only time this may break is if you have duplicate timestamps in two
separate log files: any records in the second log file that have
the same timestamp as the last record in the previous log file processed
will be discarded as if they had already been processed. There are
lots of ways to prevent this, however; for example, stopping the web
server before rotating logs will prevent this situation. This setup
also requires that you always process logs in chronological order,
otherwise data loss will occur as a result of the timestamp comparison.


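The timestamp comparison described above can be illustrated with a
small Python sketch. It is a simplified model under stated assumptions,
not the program's real code: records are taken to be (timestamp, line)
pairs with timestamps in seconds, and the function name is made up.

```python
def drop_already_processed(records, last_seen):
    """Keep only records newer than the last record of the previous run.

    records   -- list of (timestamp, line) pairs in chronological order
    last_seen -- timestamp saved by the previous run, or None on a
                 first run

    Records stamped at or before last_seen are skipped, which is why a
    record in a new log that shares the exact timestamp of the previous
    log's final record gets discarded.
    """
    kept = []
    for timestamp, line in records:
        if last_seen is not None and timestamp <= last_seen:
            continue
        kept.append((timestamp, line))
    newest = kept[-1][0] if kept else last_seen
    return kept, newest
```

The second return value is what a later run would use as its own
'last_seen', which is also why logs have to be fed in chronological
order.
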
Output Produced
---------------

The Webalizer produces several reports (html) and graphics for each
month processed. In addition, a summary page is generated for the
current and previous months (up to 12), a history file is created
and, if incremental mode is used, the current month's processed data
is saved. The exact location and names of these files can be changed
using configuration files and command line options. The files produced
(default names) are:

  index.html               - Main summary page (extension may be changed)
  usage.png                - Yearly graph displayed on the main index page
  usage_YYYYMM.html        - Monthly summary page (extension may be changed)
  usage_YYYYMM.png         - Monthly usage graph for specified month/year
  daily_usage_YYYYMM.png   - Daily usage graph for specified month/year
  hourly_usage_YYYYMM.png  - Hourly usage graph for specified month/year
  site_YYYYMM.html         - All sites listing (if enabled)
  url_YYYYMM.html          - All urls listing (if enabled)
  ref_YYYYMM.html          - All referrers listing (if enabled)
  agent_YYYYMM.html        - All user agents listing (if enabled)
  search_YYYYMM.html       - All search strings listing (if enabled)
  webalizer.hist           - Previous month history (may be changed)
  webalizer.current        - Incremental Data (may be changed)
  site_YYYYMM.tab          - tab delimited sites file
  url_YYYYMM.tab           - tab delimited urls file
  ref_YYYYMM.tab           - tab delimited referrers file
  agent_YYYYMM.tab         - tab delimited user agents file
  user_YYYYMM.tab          - tab delimited usernames file
  search_YYYYMM.tab        - tab delimited search string file

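Because the '.tab' files listed above are plain tab delimited text, they
can be read by most scripting languages as well as by spreadsheets. The
Python sketch below shows one way to do so. The column layout used here
(hits, kbytes, url) and the sample rows are assumptions invented for the
example; this document does not describe the actual columns.

```python
import csv
import io

# Made-up sample data. The real column layout of the .tab files is not
# described in this document, so "hits, kbytes, url" is an assumption.
SAMPLE = "120\t5120\t/index.html\n45\t300\t/about.html\n"


def read_tab_file(handle):
    """Return the rows of a tab delimited file as lists of strings."""
    return list(csv.reader(handle, delimiter="\t"))


rows = read_tab_file(io.StringIO(SAMPLE))
# Pick the row with the largest value in the (assumed) hits column.
busiest = max(rows, key=lambda row: int(row[0]))
print(busiest[2])
```

With a real file, 'io.StringIO(SAMPLE)' would be replaced by an open
file handle for something like 'url_YYYYMM.tab'.
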
The yearly (index) report shows statistics for a 12 month period, and
links to each month. The monthly report has detailed statistics for
that month with additional links to any URLs and referrers found.
The various totals shown are explained below.

Hits

Any request made to the server which is logged is considered a 'hit'.
The requests can be for anything... html pages, graphic images, audio
files, CGI scripts, etc... Each valid line in the server log is
counted as a hit. This number represents the total number of requests
that were made to the server during the specified report period.

Files

Some requests made to the server require that the server then send
something back to the requesting client, such as an HTML page or graphic
image. When this happens, it is considered a 'file' and the files
total is incremented. The relationship between 'hits' and 'files' can
be thought of as 'incoming requests' and 'outgoing responses'.

Pages

Pages are, well, pages! Generally, any HTML document, or anything
that generates an HTML document, would be considered a page. This
does not include the other stuff that goes into a document, such as
graphic images, audio clips, etc... This number represents the number
of 'pages' requested only, and does not include the other 'stuff' that
is in the page. What actually constitutes a 'page' can vary from
server to server. The default action is to treat anything with the
extension '.htm', '.html' or '.cgi' as a page. A lot of sites will
probably define other extensions, such as '.phtml', '.php3' and '.pl'
as pages as well. Some people consider this number as the number of
'pure' hits... I'm not sure if I totally agree with that viewpoint.
Some other programs (and people :) refer to this as 'Pageviews'.

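The extension test that decides whether a request counts as a page can
be approximated in a few lines of Python. This is a rough sketch, not
The Webalizer's own logic: the function name is hypothetical, the
patterns 'htm*' and 'cgi' are the defaults named under the -P option,
and treating a URL with no extension as a page is an assumption
inferred from the wording of the -O option.

```python
import fnmatch
import posixpath

# Default page types as given for the -P option (web logs).
DEFAULT_PAGE_TYPES = ("htm*", "cgi")


def is_page(url, page_types=DEFAULT_PAGE_TYPES):
    """Return True if the URL's extension matches one of the page types."""
    path = url.split("?", 1)[0]              # ignore any query string
    ext = posixpath.splitext(path)[1].lstrip(".")
    if not ext:
        # Assumption: URLs without an extension (such as "/docs/")
        # are counted as pages.
        return True
    return any(fnmatch.fnmatchcase(ext, pattern) for pattern in page_types)
```

A site that serves PHP would pass a longer tuple, for example
'is_page(url, ("htm*", "cgi", "php3"))'.
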
Sites

Each request made to the server comes from a unique 'site', which can
be referenced by a name or ultimately, an IP address. The 'sites'
number shows how many unique IP addresses made requests to the server
during the reporting time period. This DOES NOT mean the number of
unique individual users (real people) that visited, which is impossible
to determine using just logs and the HTTP protocol (however, this
number might be about as close as you will get).

Visits

Whenever a request is made to the server from a given IP address
(site), the amount of time since a previous request by the address
is calculated (if any). If the time difference is greater than a
pre-configured 'visit timeout' value (or the address has never made a
request before), it is considered a 'new visit', and this total is
incremented (both for the site, and the IP address). The default
timeout value is 30 minutes (can be changed), so if a user visits your
site at 1:00 in the afternoon, and then returns at 3:00, two visits
would be registered.

Note: in the 'Top Sites' table, the visits total should be discounted
on 'Grouped' records, and thought of as the "Minimum number of visits"
that came from that grouping instead.

Note: Visits only occur on PageType requests, that is, for any request
whose URL is one of the 'page' types defined with the PageType and
PagePrefix options, and not excluded by the OmitPage option. Due to the
limitations of the HTTP protocol, log rotations and other factors, this
number should not be taken as absolutely accurate; rather, it should be
considered a pretty close "guess".

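The visit calculation lends itself to a short worked example. The
Python sketch below is a simplified model of the rule described above,
not the program's real code; the function name is made up, timestamps
are plain seconds, and the caller is assumed to pass in page requests
only, in chronological order.

```python
def count_visits(page_requests, timeout=1800):
    """Count visits per host from (host, timestamp) pairs.

    A request starts a new visit when the host has not been seen
    before, or when more than 'timeout' seconds have passed since that
    host's previous request. 1800 seconds is the documented 30 minute
    default.
    """
    last_seen = {}
    visits = {}
    for host, when in page_requests:
        previous = last_seen.get(host)
        if previous is None or when - previous > timeout:
            visits[host] = visits.get(host, 0) + 1
        last_seen[host] = when
    return visits


# One host requesting pages at 1:00 pm, 1:10 pm and again at 3:00 pm.
requests = [
    ("10.0.0.1", 13 * 3600),
    ("10.0.0.1", 13 * 3600 + 600),
    ("10.0.0.1", 15 * 3600),
]
print(count_visits(requests))
```

The 1:10 request falls inside the timeout and belongs to the first
visit; the 3:00 request starts a second one, matching the example in
the text.
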
KBytes

The KBytes (kilobytes) value shows the amount of data, in KB, that
was sent out by the server during the specified reporting period. This
value is generated directly from the log file, so it is up to the
web server to produce accurate numbers in the logs (some web servers
do stupid things when it comes to reporting the number of bytes). In
general, this should be a fairly accurate representation of the amount
of outgoing traffic the server had, regardless of the web server's
reporting quirks.

Note: A kilobyte is 1024 bytes, not 1000 :)

Top Entry and Exit Pages

The Top Entry and Exit tables give a rough estimate of what URLs
are used to enter your site, and what the last pages viewed are.
Because of limitations in the HTTP protocol, log rotations, etc...
these numbers should be considered a good "rough guess" of the actual
numbers, but they will give a good indication of the overall trend in
where users come into, and exit, your site.


Command Line Options
--------------------

The Webalizer supports many different configuration options that will
alter the way the program behaves and generates output. Most of these
can be specified on the command line, while some can only be specified
in a configuration file. The command line options are listed below,
with references to the corresponding configuration file keywords.

--------------------------------------------------------------------------

General Options
---------------

-h          Display all available command line options and exit program.

-v          Be Verbose. This will cause the program to print additional
            information at run time. It is the same as specifying
            "Quiet no", "ReallyQuiet no" and "Debug yes" config options.

-V          Display the program version and exit. Additional program
            specific information will be displayed if 'verbose' mode is
            also used (e.g. '-vV'), which can be useful when submitting
            bug reports.

-d          Display additional 'debugging' information for errors and
            warnings produced during processing. This normally would
            not be used except to determine why you are getting all those
            errors and want to see the actual data. Normally The
            Webalizer will just tell you it found an error, not the
            actual data. This option will display the data as well.
            Config file keyword: Debug

-F          Specify the log file type to process. Normally, The
            Webalizer expects to find a valid CLF or Combined format
            web server log file. This option allows you to process
            wu-ftpd xferlogs, squid and W3C formatted web logs as well.
            Values can be either 'clf', 'ftp', 'squid' or 'w3c' with
            'clf' being the default. Only the first character needs
            to be specified (eg: -Fs will process a squid log).
            Config file keyword: LogType

-f          Fold out of sequence log records back into analysis, by
            treating them as if they were the same date/time as the
            last good record. Normally, out of sequence log records
            are ignored. If you run apache, don't worry about this.
            Config file keyword: FoldSeqErr

-i          Ignore history file. USE WITH CAUTION. This causes The
            Webalizer to ignore any existing history file produced from
            previous runs and generate its output from scratch. The
            effect will be as if The Webalizer were being run for the
            first time, and any previous statistics on the main
            index.html (yearly) web page will be lost (although the
            HTML documents, if any, will not be deleted).
            Config file keyword: IgnoreHist

-b          Ignore incremental data file. USE WITH CAUTION. This causes
            The Webalizer to ignore any existing incremental (state) data
            file produced by previous runs. By ignoring the incremental
            data file, all previous processing for the current month will
            be lost, and those logs must be re-processed.
            Config file keyword: IgnoreState

-p          Preserve state (incremental processing). This allows the
            processing of partial logs in increments. At the end of
            the program, all relevant internal data is saved, so that
            it may be restored the next time the program is run. This
            allows sites that must rotate their logs more than once a
            month to still be able to use The Webalizer, and not worry
            about having to gather and feed an entire month's logs to
            the program at the end of the month. See the section on
            "Incremental Processing" above for additional information.
            The default is to not perform incremental processing. Use
            this command line option to enable the feature.
            Config file keyword: Incremental

-q          Quiet mode. Normally, The Webalizer will produce various
            messages while it runs letting you know what it's doing.
            This option will suppress those messages. It should be
            noted that this WILL NOT suppress errors and warnings, which
            are output to STDERR.
            Config file keyword: Quiet

-Q          ReallyQuiet mode. This allows suppression of _all_ messages
            generated by The Webalizer, including warnings and errors.
            Useful when The Webalizer is run as a cron job.
            Config file keyword: ReallyQuiet

-T          Display timing information. The Webalizer keeps track of the
            time it begins and ends processing, and normally displays the
            total processing time at the end of each run. If quiet mode
            (-q or 'Quiet yes' in configuration file) is specified, this
            information is not displayed. This option forces the display
            of timing totals if quiet mode has been specified, otherwise
            it is redundant and will have no effect.
            Config file keyword: TimeMe

-c file     This option specifies a configuration file to use. Configuration
            files allow greater control over how The Webalizer behaves, and
            there are several ways to use them. As of version 0.98, The
            Webalizer searches for a default configuration file in the
            current directory named "webalizer.conf", and if not found,
            will search in the /etc/ directory for a file of the same name.
            In addition, you may specify a configuration file to use with
            this command line option.

-n name     This option specifies the hostname for the reports generated.
            The hostname is used in the title of all reports, and is also
            prepended to URLs in the reports. This allows The Webalizer
            to be run on log files for 'virtual' web servers or web servers
            that are different than the machine the reports are located on,
            and still allows clicking on the URLs to go to the proper
            location. If a hostname is not specified, either on the
            command line or in a configuration file, The Webalizer attempts
            to determine the hostname using a 'uname' system call. If this
            fails, "localhost" will be used as the hostname.
            Config file keyword: HostName

-o dir      This option specifies the output directory for the reports.
            If not specified here or in a configuration file, the current
            directory will be used for output.
            Config file keyword: OutputDir

-x name     This option allows the generated pages to have an extension
            other than '.html', which is the default. Do not include the
            leading period ('.') when you specify the extension.
            Config file keyword: HTMLExtension

-P name     Specify the file extensions for 'pages'. Pages (sometimes
            called 'PageViews') are normally html documents and CGI
            scripts that display the whole page, not just parts of it.
            Some systems will need to define a few more, such as 'phtml',
            'php3' or 'pl' in order to have them counted as well. The
            default is 'htm*' and 'cgi' for web logs and 'txt' for ftp.
            Config file keyword: PageType

-O name     Specify URLs which are not counted as 'pages'. Requests
            matching one of these URLs will not be counted as a page, even
            if they have an extension matching one of the PageTypes defined
            above or have no extension at all.
            Config file keyword: OmitPage

-t name     This option specifies the title string for all reports. This
            string is used, in conjunction with the hostname (if not blank)
            to produce the actual title. If not specified, the default of
            "Usage Statistics for" will be used.
            Config file keyword: ReportTitle

-Y          Suppress Country graph. Normally, The Webalizer produces
            country statistics in both Graph and Columnar forms. This
            option will suppress the Country Graph from being generated.
            Config file keyword: CountryGraph

-G          Suppress hourly graph. Normally, The Webalizer produces
            hourly statistics in both Graph and Columnar forms. This
            option will suppress the Hourly Graph only from being generated.
            Config file keyword: HourlyGraph

-H          Suppress Hourly statistics. Normally, The Webalizer produces
            hourly statistics in both Graph and Columnar forms. This
            option will suppress the Hourly Statistics table only from
            being generated.
            Config file keyword: HourlyStats

-K num      Specify how many months should be displayed in the main index
            (yearly summary) table. Default is 12 months. Can be set to
            anything between 12 and 120 months (1 to 10 years).
            Config file keyword: IndexMonths

-k num      Specify how many months should be displayed in the main index
            (yearly summary) graph. Default is 12 months. Can be set to
            anything between 12 and 72 months (1 to 6 years).
            Config file keyword: GraphMonths

-L          Disable Graph Legends. The color coded legends displayed on
            the in-line graphs can be disabled with this option. The
            default is to display the legends.
            Config file keyword: GraphLegend

-l num      Graph Lines. Specify the number of background reference
            lines displayed on the in-line graphics produced. The default
            is 2 lines, but the value can range anywhere from zero ('0')
            for no lines, up to 20 lines (looks funny!).
            Config file keyword: GraphLines

-P name     Page type. This is the extension of files you consider to
            be pages for Pages calculations (sometimes called 'pageviews').
            The default is 'htm*' and 'cgi' (plus whatever HTMLExtension
            you specified if it is different). Don't use a period!

-m num      Specify a 'visit timeout'. Visits are calculated by looking at
            the time difference between the current and last request made
            by a specific host. If the difference is greater than the
            visit timeout value, the request is considered a new visit.
            This value is specified in number of seconds. The default
            is 30 minutes (1800).
            Config file keyword: VisitTimeout

-M num      Mangle user agent names. Normally, The Webalizer will keep
            track of the user agent field verbatim. Unfortunately, there are
            a ton of different names that user agents go by, and the field
            also reports other items such as machine type and OS used. For
            example, Netscape 4.03 running on Windows 95 will report a
            different string than Netscape 4.03 running on Windows NT, so even
            though they are the same browser type, they will be considered
            as two totally different browsers by The Webalizer. For that
            matter, Netscape 4.0 running on Windows NT will report different
            names if one is run on an Alpha and the other on an Intel
            processor! Internet Exploder is even worse, as it reports itself
            as if it were Netscape and you have to search the given string a
            little deeper to discover that it is really MSIE! This option
            will cause The Webalizer to 'mangle' the user agent field,
            attempting to consolidate generic browser types. There are 6
            levels that can be specified, each producing different levels
            of detail. Level 5 displays only the browser name (MSIE or
            Mozilla) and the major version number. Level 4 will also
            display the minor version number (single decimal place).
            Level 3 will display the minor version number to two decimal
            places. Level 2 will add any sub-level designation (such as
            Mozilla/3.01Gold or MSIE 3.0b). Level 1 will also attempt to
            add the system type. The default Level 0 will disable name
            mangling and leave the user agent field unmodified, producing
            the greatest amount of detail.
            Configuration file keyword: MangleAgents

-g num      This option allows you to specify the level of domain name
            grouping to be performed. The numeric value represents the
            level of grouping, and can be thought of as the 'number of
            dots' to be displayed. The default value of 0 disables any
            domain name grouping.
            Configuration file keyword: GroupDomains

-D name     This allows the specification of a DNS Cache file name. This
            filename MUST be specified if you have dns lookups enabled
            (using the -N command line switch or DNSChildren configuration
            keyword). The filename is relative to the default output
            directory if an absolute path is not specified (ie: starts
            with a leading '/'). This option is only available if DNS
            support was enabled at compile time, otherwise an 'Invalid
            Keyword' error will be generated. See the DNS.README file
            for additional information regarding DNS lookups.
            Configuration file keyword: DNSCache

-N num      Number of DNS child processes to use for reverse DNS lookups.
            If specified, a DNSCache name MUST be specified also. If you
            do not wish a DNS cache file to be generated, specify a value
            of zero ('0') to disable it. This does not prevent using an
            existing cache file, only the generation of one at run time.
            See the DNS.README file for additional information.
            Configuration file keyword: DNSChildren

-j          Enable native GeoDB geolocation services.
            Configuration file keyword: GeoDB

-J name     Specify an alternate GeoDB database filename to use. This
            shouldn't normally be needed. If used, the filename 'name'
            is relative to the output directory being used unless an
            absolute path is specified (ie: starts with a leading '/').
            Configuration file keyword: GeoDBDatabase

-w          Enable GeoIP support if it is available.
            Configuration file keyword: GeoIP

-W name     Specify an alternate GeoIP database filename to use. This
            shouldn't normally be needed. If used, the filename 'name'
            is relative to the specified output directory unless an
            absolute name is given (ie: starts with a leading '/').
            Configuration file keyword: GeoIPDatabase

-z name     Specify location of the country flag graphics and enable
            their display in the top country table. The directory name
            is relative to the output directory unless an absolute path
            is specified (ie: starts with a leading '/').
            Configuration file keyword: FlagDir

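The 'number of dots' idea behind the -g (GroupDomains) option above can
be made concrete with a small Python sketch. This is one plausible
reading of that description, not the program's actual grouping code:
the function name is hypothetical, and leaving numeric (unresolved) IP
addresses untouched is an assumption.

```python
def group_domain(hostname, level):
    """Reduce a hostname to a domain showing 'level' dots.

    Level 0 disables grouping. Level 1 keeps two labels (one dot),
    level 2 keeps three labels (two dots), and so on.
    """
    if level <= 0:
        return hostname
    labels = hostname.split(".")
    if labels[-1].isdigit():
        # Assumption: a name ending in digits is an unresolved IPv4
        # address and is left as it is.
        return hostname
    return ".".join(labels[-(level + 1):])
```

Under this reading, 'www.sub.example.com' becomes 'example.com' at
level 1 and 'sub.example.com' at level 2.
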
Hide Options
------------

The following options take a string argument to use as a comparison
for matching. Except for the IndexAlias option, the string argument
can be plain text, or plain text that either starts or ends with the
wildcard character '*'.

For Example:

Given the string "yourmama/was/here", the arguments "was", "*here" and
"your*" will all produce a match.


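The matching rule just described can be written out as a few lines of
Python. This is a minimal sketch of the rule as worded above, not code
taken from The Webalizer; the function name is made up, and the
comparison is assumed to be case sensitive.

```python
def hide_match(pattern, text):
    """Return True if 'pattern' matches 'text' under the rule above.

    A leading '*' anchors the rest of the pattern to the end of the
    text, a trailing '*' anchors it to the start, and a pattern with
    no wildcard matches anywhere in the text.
    """
    if pattern.startswith("*"):
        return text.endswith(pattern[1:])
    if pattern.endswith("*"):
        return text.startswith(pattern[:-1])
    return pattern in text


target = "yourmama/was/here"
print([hide_match(p, target) for p in ("was", "*here", "your*", "was*")])
```

The first three patterns are the ones from the example and all match;
the fourth does not, because "was*" only matches text that starts with
"was".
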
-a name     This option allows hiding of user agents (browsers) from the
            "Top User Agents" table in the report. This option really
            isn't too useful as there are a zillion different names that
            current browsers go by, depending where they were obtained,
            however you might have some particular user agents that hit
            your site a lot that you would like to exclude from the list.
            You must have a web server that includes user agents in its
            log files for this option to be of any use. In addition, it
            is also useless if you disable the user agent table in the
            report (see the -A command line option or "TopAgents"
            configuration file keyword). You can specify as many of these
            as you want on the command line. The wildcard character '*'
            can be used either in front of or at the end of the string.
            (ie: Mozilla/4.0* would match anything that starts with the
            string "Mozilla/4.0").
            Config file keyword: HideAgent

-r name     This option allows hiding of referrers from the "Top Referrer"
            table in the report. Referrers are URLs, either on your own
            local site or a remote site, that referred the user to a URL
            on your web server. This option is normally used to hide
            your own server from the table, as your own pages are usually
            the top referrers to your own pages (well, you get the idea).
            You must have a web server that includes referrer information
            in the log files for this option to be of any use. In addition,
            it is also useless if you disable the referrers table in the
            report (see the -R command line option or "TopReferrers"
            configuration file keyword). You can specify as many of these
            as you like on the command line.
            Config file keyword: HideReferrer

-s name     This option allows hiding of sites from the "Top Sites" table
            in the report. Normally, you will only want to hide your own
            domain name from the report, as it usually is one of the top
            sites to visit your web server. This option is of no use if
            you disable the top sites table in the report (see the -S
            command line option or "TopSites" configuration file option).
            Config file keyword: HideSite

-X          This causes all individual sites to be hidden, which results
            in only grouped sites being displayed on the report.
            Config file keyword: HideAllSites

-u name     This option allows hiding of URLs from the "Top URLs" table
            in the report. Normally, this option is used to hide images,
            audio files and other objects your web server dishes out that
            would otherwise clutter up the table. This option is of no
            use if you disable the top URLs table in the report (see the
            -U command line option or "TopURLs" configuration file keyword).
            Config file keyword: HideURL

-I name     This option allows you to specify additional index.html aliases.
            The Webalizer usually strips the string 'index.*' from URLs
            before processing (unless disabled using the 'DefaultIndex'
            config option), which has the effect of turning a URL such
            as /somedir/index.html into just /somedir/ which is really the
            same URL and should be treated as such. This option allows you
            to specify _additional_ strings that are to be treated the same
            way. Use with care, improper use could cause unexpected results.
            For example, if you specify the alias string of 'home', a URL
            such as /somedir/homepages/brad/home.html would be converted
            into just /somedir/ which probably isn't what was intended.
            This option is useful if your web server uses a default
            index page other than the standard 'index.html' or 'index.htm',
            such as 'home.html' or 'homepage.html'. The string specified
            is searched for _anywhere_ in the URL, so "home.htm" would
            turn both "/somedir/home.htm" and "/somedir/home.html" into
            just "/somedir/". Wildcards are _not_ allowed on this one.
            Config file keyword: IndexAlias

Table Size Options
------------------

-e num      This option specifies the number of entries to display in the
            "Top Entry Pages" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopEntry

-E num      This option specifies the number of entries to display in the
            "Top Exit Pages" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopExit

-A num      This option specifies the number of entries to display in the
            "Top User Agents" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopAgents

-C num      This option specifies the number of entries to display in the
            "Top Countries" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopCountries

-R num      This option specifies the number of entries to display in the
            "Top Referrers" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopReferrers

-S num      This option specifies the number of entries to display in the
            "Top Sites" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopSites

-U num      This option specifies the number of entries to display in the
            "Top URLs" table. To disable the table, use a value of
            zero (0).
            Config file keyword: TopURLs

--------------------------------------------------------------------------


CONFIGURATION FILES
-------------------

The Webalizer allows configuration files to be used in order to simplify
life for all. There are several ways that configuration files are accessed
by The Webalizer. When The Webalizer first executes, it looks for a
default configuration file named "webalizer.conf" in the current directory,
and if not found there, will look for "/etc/webalizer.conf". In addition,
configuration files may be specified on the command line with the '-c'
option. There are lots of different ways you can combine the use of
configuration files and command line options to produce various results.
The Webalizer always looks for and reads configuration options from a
default configuration file before doing anything else. Because of this,
you can override options found in the default file by use of additional
configuration files specified on the command line or command line options
themselves. If you specify a configuration file on the command line, you
can override options in it by additional command line options which follow.
For example, most users will likely want to create the default file
/etc/webalizer.conf and place options in it to specify the hostname, log
file, table options, etc. At the end of the month when a different log
file is to be used (the end of month log), you can run The Webalizer as
usual, but put the different filename on the end of the command line, which
will override the log file specified in the configuration file. It should
be noted that you cannot override some configuration file options by the
use of command line arguments. For example, if you specify "Quiet yes" in
a configuration file, you cannot override this with a command line argument,
as the command line option only _enables_ the feature (-q option).

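The override order described above can be pictured as a series of
"later source wins" merges. The Python sketch below is only an illustration
of that idea: the function name, the dictionaries and the sample values are
made up for the example, and it does not model The Webalizer's real option
handling (in particular, it ignores enable-only switches such as -q).

```python
def effective_options(default_conf, cli_conf=None, cli_options=None):
    """Merge option sources in the order described above; later sources win."""
    merged = dict(default_conf)        # default configuration file is read first
    merged.update(cli_conf or {})      # then a file named with -c (if any)
    merged.update(cli_options or {})   # then options given on the command line
    return merged


# Hypothetical values: the default file names the regular log, and the
# end-of-month log is named on the command line instead.
defaults = {
    "HostName": "www.example.com",
    "LogFile": "/var/log/access_log",
    "TopSites": "30",
}
month_end = {"LogFile": "/var/log/access_log.1"}

result = effective_options(defaults, cli_options=month_end)
print(result["LogFile"])    # the command line value replaces the default
print(result["HostName"])   # untouched options keep their configured value
```
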
The configuration files are standard ASCII text files that may be created
or edited using any standard editor. Blank lines and lines that begin
with a pound sign ('#') are ignored. Any other lines are considered to
be configuration lines, and have the form "Keyword Value", where the
'Keyword' is one of the currently available configuration keywords defined
below, and 'Value' is the value to assign to that particular option. Any
text found after the keyword up to the end of the line is considered the
keyword's value, so you should not include anything after the actual value
on the line that is not actually part of the value being assigned. The
file "sample.conf" provided with the distribution contains lots of useful
documentation and examples as well. It should be noted that you do not
have to use any configuration files at all, in which case, default values
will be used (which should be sufficient for most sites).

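The line format described above is simple enough to read with a few lines
of code. The following Python sketch is a hypothetical reader written for
illustration (it is not The Webalizer's parser); the sample values are made
up, and tolerating leading whitespace before a keyword or '#' is an
assumption of the sketch. It also shows why nothing should follow the value
on a line: everything after the keyword is kept as the value.

```python
def parse_config(text):
    """Return (keyword, value) pairs from configuration file text."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                       # blank lines and comments are ignored
        parts = line.split(None, 1)        # keyword, then the rest of the line
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((keyword, value))     # a list, since keywords may repeat
    return pairs


sample = """\
# Sample configuration (values are made up)
LogFile     /var/log/httpd/access_log
HostName    www.example.com

HideURL     *.gif
HideURL     *.jpg
OutputDir   /var/www/usage # this text becomes part of the value
"""

pairs = parse_config(sample)
for keyword, value in pairs:
    print(keyword, "=>", value)
```
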
--------------------------------------------------------------------------

General Configuration Keywords
------------------------------

LogFile         This defines the log file to use. It should be a fully qualified
                name (ie: contain the path), but relative names will work as
                well. If not specified, the logfile defaults to STDIN.

LogType         This specifies the log file type being used. Normally, The
                Webalizer processes web logs in either CLF or Combined format.
                You may also process wu-ftpd xferlog formatted logs, squid
                proxy logs or W3C formatted web logs by setting the appropriate
                type using this keyword. Values may be either 'clf', 'ftp',
                'squid' or 'w3c'. Ensure that you specify the proper file type,
                otherwise you will be presented with a long stream of 'invalid
                record' messages when The Webalizer is run ;)
                Command line argument: -F

OutputDir       This defines the output directory to use for the reports. If
                it is not specified, the current directory is used.
                Command line argument: -o

HistoryName     Allows specification of a history path/filename if desired.
                The default is to use the file named 'webalizer.hist', kept
                in the normal output directory (OutputDir above). Any name
                specified is relative to the normal output directory unless
                an absolute path name is given (ie: starts with a '/').

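The "relative to the output directory unless absolute" rule used by
HistoryName (and worded the same way for several other filename keywords
in this section) can be sketched as follows. This is an illustrative Python
helper with a made-up name and made-up paths, not code from The Webalizer.

```python
import posixpath


def resolve_in_output_dir(output_dir, name):
    """Resolve a filename against the output directory unless it is absolute."""
    if name.startswith("/"):
        return name                          # absolute names are used as given
    return posixpath.join(output_dir, name)  # otherwise relative to OutputDir


print(resolve_in_output_dir("/var/www/usage", "webalizer.hist"))
print(resolve_in_output_dir("/var/www/usage", "/var/lib/webalizer/site.hist"))
```
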
ReportTitle     This specifies the title to use for the generated reports.
                It is used in conjunction with the hostname (unless blank)
                to produce the final report titles. If not defined, the
                default of "Usage Statistics for" is used.
                Command line argument: -t

HostName        This defines the hostname. The hostname is used in the
                report title as well as being prepended to URLs in the
                "Top URLs" table. This allows The Webalizer to be run
                on "virtual" web servers, or servers that do not reside
                on the local machine, and allows clicking on the URL to
                go to the right place. If not specified, The Webalizer
                attempts to get the hostname via a 'uname' system call,
                and if that fails, will default to "localhost".
                Command line argument: -n

UseHTTPS        Causes the links in the 'Top URLs' table to use 'https://'
                instead of the default 'http://' prefix. Not much use if
                you run a mix of secure/insecure servers on your machine.
                Only useful if you run the analysis on a secure server's
                logs, and want the links in the table to work properly.

HTAccess        Enables the creation of a default .htaccess file in the
                output directory. If enabled, the file will be created
                (with a single "DirectoryIndex" directive), unless one
                already exists. The default is 'no', which disables the
                creation of any .htaccess files.

Quiet           This allows you to enable or disable informational messages
                while The Webalizer is running. The values for this keyword
                can be either 'yes' or 'no'. Using "Quiet yes" will suppress
                these messages, while "Quiet no" will enable them. The default
                is 'no' if not specified, which will allow The Webalizer
                to display informational messages. It should be noted that
                this option has no effect on Warning or Error messages that
                may be generated, as they go to STDERR.
                Command line argument: -q

ReallyQuiet     This allows all generated output to be suppressed, including
                warning and error messages. The values for this keyword
                can be either 'yes' or 'no', with 'no' being the default.
                Command line argument: -Q

TimeMe          This allows you to display timing information regardless of
                any "quiet mode" specified. Useful only if you did in fact
                tell The Webalizer to be quiet either by using the -q command
                line option or the "Quiet" keyword, otherwise timing stats
                are normally displayed anyway. Values may be either 'yes'
                or 'no', with the default being 'no'.
                Command line argument: -T

GMTTime         This keyword allows timestamps to be displayed in GMT (UTC)
                time instead of local time. Normally, The Webalizer will
                display timestamps in the time zone of the local machine
                (ie: PST or EDT). Values may be either 'yes' or 'no'.
                Default is 'no'.

Debug           This tells The Webalizer to display additional information
                when it encounters Warnings or Errors. Normally, The
                Webalizer will just tell you it found a bad record or
                field. This option will enable the display of the actual
                data that produced the Warning or Error as well. Useful
                only if you start getting lots of Warnings or Errors and
                want to determine the cause. Values may be either 'yes'
                or 'no', with the default being 'no'.
                Command line argument: -d

IgnoreHist      This suppresses the reading of a history file. USE WITH
                EXTREME CAUTION as the history file is how The Webalizer
                keeps track of previous months. The effect of this option
                is as if The Webalizer was being run for the very first
                time, and any previous data is discarded. Values may be
                either 'yes' or 'no', with the default being 'no'.
                Command line argument: -i

IgnoreState     This suppresses the reading of an existing incremental
                data file. USE WITH EXTREME CAUTION! By ignoring an
                existing incremental data file, all previous processing
                for the current month will be lost, and those logs must
                be re-processed. Values may be 'yes' or 'no', with the
                default being 'no'.
                Command line argument: -b

FoldSeqErr      Allows log records that are out of sequence to be folded
                back into the analysis, by treating them as if they had
                the same date/time as the last good record. Normally,
                out of sequence log records are simply ignored. If you
                run Apache, don't worry about this.

VisitTimeout    Set the 'visit timeout' value. Visits are determined by
                looking at the time difference between the current and last
                request made by a specific site. If the difference in time
                is greater than the visit timeout value, the request is
                considered a new visit. The value is in number of seconds,
                and defaults to 30 minutes (1800).
                Command line argument: -m

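The visit rule described for VisitTimeout can be sketched in a few lines.
The Python below is a simplified model for illustration only: the function
name and the sample data are made up, records are assumed to arrive in time
order, and any other criteria the program may apply when counting visits
are left out.

```python
VISIT_TIMEOUT = 1800  # seconds; the documented default of 30 minutes


def count_visits(requests, timeout=VISIT_TIMEOUT):
    """Count visits from (site, unix_timestamp) pairs sorted by time."""
    last_seen = {}
    visits = 0
    for site, stamp in requests:
        previous = last_seen.get(site)
        # A first request, or a gap greater than the timeout, starts a visit.
        if previous is None or stamp - previous > timeout:
            visits += 1
        last_seen[site] = stamp
    return visits


requests = [
    ("10.0.0.1", 0),      # first visit from 10.0.0.1
    ("10.0.0.1", 600),    # same visit (gap of 600 seconds)
    ("10.0.0.2", 900),    # first visit from 10.0.0.2
    ("10.0.0.1", 2401),   # gap of 1801 seconds, so a new visit
    ("10.0.0.1", 2500),   # same visit again
]
print(count_visits(requests))
```
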
PageType        Allows you to define the 'page' type extension. Normally,
                people consider HTML and CGI scripts as 'pages'. This
                option allows you to specify what extensions you consider
                a page. Default is 'htm*' and 'cgi' for web logs, and
                'txt' for ftp logs.
                Command line argument: -P

PagePrefix      Allows all requests with a specified prefix to be considered
                as 'pages'. Use this if you want everything under /documents
                to be treated as pages, no matter what the extension is. Also
                useful if you have cgi-scripts with PATH_INFO.

OmitPage        Allows specified URLs to not be counted as pages under any
                circumstance, even if they match a PageType extension or
                a PagePrefix as defined above.

GraphLegend     Enable/disable the display of color coded legends on the
                produced graphs. Default is 'yes', to display them.
                Command line argument: -L

GraphLines      Specify the number of background reference lines to display
                on produced graphs. The default is 2. To disable the use
                of background lines, use zero ('0').
                Command line argument: -l

IndexMonths     Specify the number of months to display in the main index
                (yearly summary) table. Default is 12 months. Can be set
                to anything between 12 and 120 months (1 to 10 years).
                Command line argument: -K

YearHeaders     Enable/disable the display of year headers in the main index
                (yearly summary) table. If enabled, year headers will be
                shown when the table is displaying more than 16 months worth
                of data. Values can be 'yes' or 'no'. Default is 'yes'.

GraphMonths     Specify the number of months to display in the main index
                (yearly summary) graph. Default is 12 months. Can be set
                to anything between 12 and 72 months (1 to 6 years).
                Command line argument: -k

CountryGraph    This keyword is used to either enable or disable the creation
                and display of the Country Usage graph. Values may be either
                'yes' or 'no', with the default being 'yes'.
                Command line argument: -Y

CountryFlags    Enables or disables the display of flags in the top country
                table. If enabled, the default directory 'flags' directly
                under the output directory will be used unless a different
                path is specified with the 'FlagDir' option below.
                Command line argument: -zflags

FlagDir         Specifies the location of flag graphics. If not specified,
                the default is in the 'flags' directory directly under the
                output directory being used for the reports. If specified,
                the display of flags will be enabled by default.
                Command line argument: -z

DailyGraph      This keyword is used to either enable or disable the creation
                and display of the Daily Usage graph. Values may be either
                'yes' or 'no', with the default being 'yes'.

DailyStats      This keyword is used to either enable or disable the creation
                and display of the Daily Usage statistics table. Values may
                be either 'yes' or 'no', with the default being 'yes'.

HourlyGraph     This keyword is used to either enable or disable the creation
                and display of the Hourly Usage graph. Values may be either
                'yes' or 'no', with the default being 'yes'.
                Command line argument: -G

HourlyStats     This keyword is used to either enable or disable the creation
                and display of the Hourly Usage statistics table. Values may
                be either 'yes' or 'no', with the default being 'yes'.
                Command line argument: -H

IndexAlias      This allows additional 'index.html' aliases to be defined.
                Normally, The Webalizer scans for and strips the string
                "index." from URLs before processing them (unless disabled
                using the DefaultIndex config option below). This turns a
                URL such as /somedir/index.html into just /somedir/ which
                is really the same URL. This keyword allows _additional_
                names to be treated in the same fashion for sites that use
                different default names, such as "home.html". The string
                is scanned for anywhere in the URL, so care should be used
                if and when you define additional aliases. For example,
                if you were to use an alias such as 'home', the URL
                /somedir/homepages/brad/home.html would be turned into just
                /somedir/ which probably isn't the intended result. Instead,
                you should have specified 'home.htm' which would correctly
                turn the URL into /somedir/homepages/brad/ as intended.
                It should also be noted that specified aliases are scanned
                for in EVERY log record. A large number of aliases will
                noticeably degrade performance, as each record has to be
                scanned for every alias defined. You don't have to specify
                'index.' as it is always the default (unless disabled with
                the config option "DefaultIndex" described below).
                Command line argument: -I

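The alias examples above can be reproduced with a small model of the
stripping step. This Python sketch assumes the simplest reading of the
description: the URL is cut at the first place an alias string occurs.
The function name is made up, and the real program may apply further
checks that are not modelled here.

```python
def strip_index(url, aliases=(), default_index=True):
    """Cut the URL at the first occurrence of 'index.' or of any alias."""
    names = (["index."] if default_index else []) + list(aliases)
    for name in names:
        pos = url.find(name)       # the string is looked for anywhere in the URL
        if pos != -1:
            return url[:pos]
    return url


url = "/somedir/homepages/brad/home.html"
print(strip_index("/somedir/index.html"))        # default 'index.' handling
print(strip_index(url, aliases=["home"]))        # too short: cuts at 'homepages'
print(strip_index(url, aliases=["home.htm"]))    # cuts at the file name only
```

The second call shows the pitfall described above: the alias 'home' also
matches the directory name 'homepages', so the URL is cut too early.
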
DefaultIndex    This option is used to enable/disable the use of "index." as
                a default index name to be stripped from the end of a URL.
                Most sites should not need to use this option, however some
                may find it useful, particularly those whose default index
                file name is something different, or those sites that use
                'index.php' or similar URLs to generate dynamic content.
                This option does not affect any of the names that may be
                defined using the IndexAlias option, and those names will
                still function as described. Values may be 'yes' or 'no',
                with 'yes' being the default.

MangleAgents    The MangleAgents keyword specifies the level of user agent
                name mangling, if any. There are 6 levels that may be specified,
                each producing a different level of detail displayed. Level 5
                displays only the browser name (MSIE or Mozilla) and the major
                version number. Level 4 adds the minor version (single
                decimal place). Level 3 adds the minor version to two decimal
                places. Level 2 will also add any sub-level designation
                (such as Mozilla/3.01Gold or MSIE 3.0b). Level 1 will also
                attempt to add the system type. The default level 0 will
                leave the user agent field unmodified and produces the
                greatest amount of detail.
                Command line argument: -M

SearchEngine    This keyword allows specification of search engines and
                their query strings. Search strings are obtained from
                the referrer field in the record, and in order to work
                properly, The Webalizer needs to know what query strings
                different search engines use. The SearchEngine keyword
                allows you to specify the search engine and the query
                string to parse the search string from. The line is
                formatted as: "SearchEngine engine-string query-string"
                where 'engine-string' is a substring for matching the
                search engine with, such as "yahoo.com" or "altavista".
                The 'query-string' is the unique query string that is added
                to the URL for the search engine, such as "search=" or
                "MT=" with the actual search strings appended to the
                end. There is no command line option for this keyword.

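How an engine-string and query-string pair could be applied to a referrer
is sketched below in Python. This is a rough illustration, not the
program's own code: the function name, the engine list and the referrer
URLs are made up, the plain substring search for the query string is a
simplification, and the result is lowercased to mirror the default
case-insensitive handling of search strings.

```python
from urllib.parse import unquote_plus

# Hypothetical (engine-string, query-string) pairs for the example.
ENGINES = [("yahoo.com", "search="), ("altavista", "q=")]


def search_string(referrer, engines):
    """Return the decoded search string from a referrer, or None."""
    for engine, query in engines:
        if engine not in referrer:
            continue                         # not this search engine
        start = referrer.find(query)
        if start == -1:
            continue                         # engine matched, no query found
        start += len(query)
        end = referrer.find("&", start)      # the value ends at the next '&'
        raw = referrer[start:] if end == -1 else referrer[start:end]
        return unquote_plus(raw).lower()
    return None


ref = "http://www.yahoo.com/find?lang=en&search=Web+Log+Analysis&page=2"
print(search_string(ref, ENGINES))
```
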
SearchCaseI     The SearchCaseI option specifies if search strings should
                be lowercased (case insensitive) or not. Since most
                search engines use case insensitive searches (ie: a
                search for "Hello" is the same as "HELLO" or "hello"),
                converting to lowercase (the default) will improve keyword
                accuracy. If desired, case sensitivity can be forced with
                this option. The value can be 'yes' or 'no', with 'yes'
                (case insensitive) being the default.

Incremental     This allows incremental processing to be enabled or disabled.
                Incremental processing allows processing partial logs without
                the loss of detail data from previous runs in the same month.
                This feature saves the 'internal state' of the program so that
                it may be restored in following runs. See the section above
                titled "Incremental Processing" for additional information.
                The value may be 'yes' or 'no', with the default being 'no'.
                Command line argument: -p

IncrementalName
                Allows specification of the incremental data filename if
                desired. Normally, the file named "webalizer.current" is
                used, kept in the standard output directory. If specified,
                filenames are relative to the standard output directory,
                unless an absolute name is given (ie: starts with '/').

StripCGI        Determines if CGI variables should be stripped from the
                end of URLs or not. Normally, these variables are removed
                from URLs to improve accuracy, however some sites may wish
                to keep them preserved (particularly on highly dynamic
                sites). Values may be either 'yes' or 'no', with 'yes'
                being the default.

TrimSquidURL    Allows squid log URLs to be reduced in granularity by
                truncating them after a specified number of '/' path
                separators after the http:// portion. A value of 1 will
                cause all URLs to be summarized by domain only. The
                default value is zero (0), which leaves URLs unmodified.

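The truncation described for TrimSquidURL can be illustrated with the
following Python sketch. It is a model of the description only: the
function name and sample URL are made up, and keeping the final '/' in
the result is an assumption of the sketch.

```python
def trim_squid_url(url, depth):
    """Keep the URL up to the depth-th '/' after the scheme portion."""
    if depth <= 0:
        return url                       # zero leaves URLs unmodified
    scheme_end = url.find("://")
    pos = scheme_end + 3 if scheme_end != -1 else 0
    for _ in range(depth):
        slash = url.find("/", pos)
        if slash == -1:
            return url                   # fewer separators than requested
        pos = slash + 1
    return url[:pos]


sample_url = "http://www.example.com/a/b/c.html"
print(trim_squid_url(sample_url, 0))   # unmodified
print(trim_squid_url(sample_url, 1))   # summarized by domain only
print(trim_squid_url(sample_url, 2))   # domain plus first path component
```
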
DNSCache        Specifies the DNS cache filename. This name is relative
                to the default output directory unless an absolute name
                is given (ie: starts with '/'). See the DNS.README file
                for additional information.
                Command line argument: -D

DNSChildren     The number of DNS children processes to run in order to
                create/update the DNS cache file. If specified, the DNS
                cache filename must also be specified (see above). Use
                a value of zero ('0') to disable. See the DNS.README
                file for additional information.
                Command line argument: -N

CacheIPs        Specifies if unresolved addresses should also be cached
                in the DNS database. If enabled, unresolved IP addresses
                will be stored along with resolved addresses. This may
                be useful on some sites that have lots of unresolved IPs
                visiting so they are not looked up each time the program
                is run. Values may be 'yes' or 'no'. Default is 'no'.

CacheTTL        Specifies the Time To Live (TTL) value for cached DNS
                entries in days. Default value is 7 (1 week). Can be
                any value between 1 and 100.

GeoDB           Controls the use of the native GeoDB geolocation services
                provided by The Webalizer. Values may be 'yes' or 'no'
                with 'no' being the default.
                Command line argument: -j

GeoDBDatabase   Specifies an alternate GeoDB database filename to use.
                This is relative to the output directory being used unless
                an absolute path is given (ie: starts with a '/').
                Command line argument: -J

GeoIP           Controls the use of GeoIP geolocation services. If The
                Webalizer was compiled with GeoIP support, it is used by
                default. Values may be 'yes' or 'no'. Default is 'yes'.
                Command line argument: -w

GeoIPDatabase   Specifies an alternate GeoIP database filename to use.
                This name is relative to the default output directory
                unless an absolute name is given (ie: starts with '/').
                Command line argument: -W


Top Table Keywords
------------------

TopAgents       This allows you to specify how many "Top" user agents are
                displayed in the "Top User Agents" table. The default
                is 15. If you do not want to display user agent statistics,
                specify a value of zero (0). The display of user agents
                will only work if your web server includes this information
                in its log file (ie: a combined log format file).
                Command line argument: -A

AllAgents       Will cause a separate HTML page to be generated for all
                normally visible User Agents. A link will be added to
                the bottom of the "Top User Agents" table if enabled.
                Value can be either 'yes' or 'no', with 'no' being the
                default.

TopCountries    This allows you to specify how many "Top" countries are
                displayed in the "Top Countries" table. The default is
                30. If you want to disable the countries table, specify
                a value of zero (0).
                Command line argument: -C

TopReferrers    This allows you to specify how many "Top" referrers are
                displayed in the "Top Referrers" table. The default is
                30. If you want to disable the referrers table, specify
                a value of zero (0). The display of referrer information
                will only work if your web server includes this information
                in its log file (ie: a combined log format file).
                Command line argument: -R

AllReferrers    Will cause a separate HTML page to be generated for all
                normally visible Referrers. A link will be added to the
                "Top Referrers" table if enabled. Value can be either
                'yes' or 'no', with 'no' being the default.

TopSites        This allows you to specify how many "Top" sites are
                displayed in the "Top Sites" table. The default is 30.
                If you want to disable the sites table, specify a value
                of zero (0).
                Command line argument: -S

TopKSites       Identical to TopSites, except for the 'by KByte' table.
                Default is 10. No command line switch for this one.

AllSites        Will cause a separate HTML page to be generated for all
                normally visible Sites. A link will be added to the
                bottom of the "Top Sites" table if enabled. Value can
                be either 'yes' or 'no', with 'no' being the default.

TopURLs         This allows you to specify how many "Top" URLs are
                displayed in the "Top URLs" table. The default is 30.
                If you want to disable the URLs table, specify a value
                of zero (0).
                Command line argument: -U

TopKURLs        Identical to TopURLs, except for the 'by KByte' table.
                Default is 10. No command line switch for this one.

AllURLs         Will cause a separate HTML page to be generated for all
                normally visible URLs. A link will be added to the
                bottom of the "Top URLs" table if enabled. Value can
                be either 'yes' or 'no', with 'no' being the default.

TopEntry        Allows you to specify how many "Top Entry Pages" are
                displayed in the table. The default is 10. If you
                want to disable the table, specify a value of zero (0).
                Command line argument: -e

TopExit         Allows you to specify how many "Top Exit Pages" are
                displayed in the table. The default is 10. If you
                want to disable the table, specify a value of zero (0).
                Command line argument: -E

TopSearch       Allows you to specify how many "Top Search Strings" are
                displayed in the table. The default is 20. If you
                want to disable the table, specify a value of zero (0).
                Only works if using combined log format (ie: contains
                referrer information).

TopUsers        This allows you to specify how many "Top" usernames are
                displayed in the "Top Usernames" table. Usernames are
                only available if you use http authentication on your
                web server, or when processing wu-ftpd xferlogs. The
                default value is 20. If you want to disable the Username
                table, specify a value of zero (0).

AllUsers        Will cause a separate HTML page to be generated for all
                normally visible usernames. A link will be added to the
                bottom of the "Top Usernames" table if enabled. Value
                can be either 'yes' or 'no', with 'no' being the default.

AllSearchStr    Will cause a separate HTML page to be generated for all
                normally visible Search Strings. A link will be added
                to the bottom of the "Top Search Strings" table if
                enabled. Value can be either 'yes' or 'no', with 'no'
                being the default.


Hide Object Keywords
--------------------

These keywords allow you to hide user agents, referrers, sites, URLs
and usernames from the various "Top" tables. The values for these keywords
are the same as those used in their command line counterparts. You
can specify as many of these as you want without limit. Refer to the
section above on "Command Line Options" for a description of the string
formatting used as the value. Values cannot exceed 80 characters in
length.

HideAgent       This allows specified user agents to be hidden from the
                "Top User Agents" table. Not very useful, since there
                are a zillion different names that browsers go by today,
                but could be useful if there is a particular user agent
                (ie: robots, spiders, real-audio, etc.) that hits your
                site frequently enough to make it into the top user agent
                listing. This keyword is useless if 1) your log file does
                not provide user agent information or 2) you disable the
                user agent table.
                Command line argument: -a

HideReferrer    This allows you to hide specified referrers from the
                "Top Referrers" table. Normally, you would only specify
                your own web server to be hidden, as it is usually the
                top generator of references to your own pages. Of course,
                this keyword is useless if 1) your log file does not include
                referrer information or 2) you disable the top referrers
                table.
                Command line argument: -r

HideSite        This allows you to hide specified sites from the "Top
                Sites" table. Normally, you would only specify your own
                web server or other local machines to be hidden, as they
                are usually the highest hitters of your web site, especially
                if you have their browsers' home page pointing to it.
                Command line argument: -s

HideAllSites    This allows hiding all individual sites from the display,
                which can be useful when a lot of groupings are being
                used (since grouped records cannot be hidden). It is
                particularly useful in conjunction with the GroupDomains
                feature, but can be useful in other situations as well.
                Value can be either 'yes' or 'no', with 'no' the default.
                Command line argument: -X

HideURL         This allows you to hide URLs from the "Top URLs" table.
                Normally, this is used to hide items such as graphic files,
                audio files or other 'non-html' files that are transferred
                to the visiting user.
                Command line argument: -u

HideUser        This allows you to hide Usernames from the "Top Usernames"
                table. Usernames are only available if you use http based
                authentication on your web server.


Group Object Keywords
---------------------

The Group* keywords allow object grouping based on Site, URL, Referrer,
User Agent and Usernames. Combined with the Hide* keywords, you can
customize exactly what will be displayed in the 'Top' tables. For example,
to only display totals for a particular directory, use a GroupURL and
HideURL with the same value (ie: '/help/*'). Group processing is only
done after the individual record has been fully processed, so name mangling
and site total updates have already been performed. Because of this, groups
are not counted in the main site total (as that would cause duplication).
Groups can be displayed in bold and shaded as well. Grouped records are
not, by default, hidden from the report. This allows you to display a
grouped total, while still being able to see the individual records, even
if they are part of the group. If you want to hide the detail records,
follow the Group* directive with a Hide* one using the same value. There
are no command line switches for these keywords. The Group* keywords also
accept an optional label to be displayed instead of the actual value used.
This label should be separated from the value by at least one whitespace
character, such as a space or tab character. If the match string contains
whitespace (spaces or tabs), the string should be quoted, using either
single or double quotes. See the sample configuration file for examples.

GroupReferrer   Allows grouping Referrers. Can be handy for some of the
                major search engines that have multiple host names a
                referral could come from.

GroupURL        This keyword allows grouping URLs. Useful for grouping
                complete directory trees.

GroupSite       This keyword allows grouping Sites. Most used for
                grouping top level domains and unresolved IP addresses
                for local dial-ups, etc.

GroupAgent      Groups User Agents. A handy example of how you could use
                this one is to use "Mozilla" and "MSIE" as the values for
                GroupAgent and HideAgent keywords. Make sure you put the
                "MSIE" one first.

GroupDomains    Allows automatic grouping of domains. The numeric value
                represents the level of grouping, and can be thought of
                as 'the number of dots' to display. A 1 will display
                second level domains only (xxx.xxx), a 2 will display
                third level domains (xxx.xxx.xxx), etc. The default
                value of 0 disables any domain grouping.
                Command line argument: -g

GroupUser Allows grouping of usernames. Combined with a group
|
|
name, this can be handy for displaying statistics on
|
|
a particular group of users without displaying their
|
|
real usernames.
|
|
|
|
GroupShading Allows shading of table rows for groups. Value can be
|
|
'yes' or 'no', with the default being 'yes'.
|
|
|
|
GroupHighlight Allows bolding of table rows for groups. Value can be
|
|
'yes' or 'no', with the default being 'yes'.
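
As an illustration of the Group* keywords described above, the sketch
below writes a small configuration fragment that groups a directory tree
under a label and then hides the detail records. The file name
'groups.conf', the match strings and the labels are examples only, not
values taken from any particular installation.

```shell
#!/bin/sh
# Write an example fragment; groups.conf is a hypothetical file name.
cat > groups.conf <<'EOF'
# Group everything under /cgi-bin/ and show it as "CGI Scripts"
GroupURL        /cgi-bin/*      CGI Scripts
# Hide the individual records that are now covered by the group
HideURL         /cgi-bin/*
# "MSIE" comes before "Mozilla", since MSIE agent strings also contain "Mozilla"
GroupAgent      MSIE            Internet Explorer
HideAgent       MSIE
GroupAgent      Mozilla         Mozilla-compatible
HideAgent       Mozilla
# Show shaded, bold group rows (these are the defaults)
GroupShading    yes
GroupHighlight  yes
EOF
```

The fragment can then be merged into your own configuration file. Leave
out the Hide* lines if you want the detail records to stay visible next
to the group totals.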
|
|
|
|
|
|
Ignore/Include Object Keywords
|
|
------------------------------
|
|
|
|
These keywords allow you to completely ignore log records when generating
|
|
statistics, or to force their inclusion regardless of ignore criteria.
|
|
Records can be ignored or included based on site, URL, user agent, referrer
|
|
and username. Be aware that by choosing to ignore records, the accuracy of
|
|
the generated statistics becomes skewed, making it impossible to produce
|
|
an accurate representation of load on the web server. These keywords
|
|
behave identically to the Hide* keywords above, where the value can have
|
|
a leading or trailing wildcard '*'. These keywords, like the Hide* ones,
|
|
have an absolute limit of 80 characters for their values. These keywords
|
|
do not have any command line switch counterparts, so they may only be
|
|
specified in a configuration file. It should also be pointed out that
|
|
using the Ignore/Include combination to selectively exclude an entire
|
|
site while including a particular 'chunk' is _extremely_ inefficient,
|
|
and should be avoided. Try grep'ing the records into a separate file
|
|
and processing that file instead.
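
A minimal sketch of that grep approach follows. The sample records, the
file names and the '/docs/' pattern are made up for the example; only
grep itself and the final webalizer call on the extracted file are the
point.

```shell
#!/bin/sh
# Build a tiny sample log so the example is self-contained.
cat > access_log.sample <<'EOF'
10.0.0.1 - - [01/Jan/2013:00:00:01 +0000] "GET /docs/index.html HTTP/1.0" 200 1024
10.0.0.2 - - [01/Jan/2013:00:00:02 +0000] "GET /images/logo.png HTTP/1.0" 200 2048
10.0.0.3 - - [01/Jan/2013:00:00:03 +0000] "GET /docs/faq.html HTTP/1.0" 200 512
EOF

# Keep only the requests for the /docs/ tree.
grep '"GET /docs/' access_log.sample > docs_only.log

# Then analyze just the extracted records, for example:
#   webalizer docs_only.log
wc -l < docs_only.log
```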
|
|
|
|
IgnoreSite This allows specified sites to be completely ignored from
|
|
the generated statistics.
|
|
|
|
IgnoreURL This allows specified URLs to be completely ignored from
|
|
the generated statistics. One use for this keyword would
|
|
be to ignore all hits to a 'temporary' directory where
|
|
development work is being done, but is not accessible to
|
|
the outside world.
|
|
|
|
IgnoreReferrer This allows records to be ignored based on the referrer
|
|
field.
|
|
|
|
IgnoreAgent This allows specified User Agent records to be completely
|
|
ignored from the statistics. May be useful if you really
|
|
don't want to see all those hits from MSIE :)
|
|
|
|
IgnoreUser This allows specified username records to be completely
|
|
ignored from the statistics. Usernames can only be used
|
|
if you use http authentication on your server.
|
|
|
|
IncludeSite Force the record to be processed based on hostname. This
|
|
takes precedence over the Ignore* keywords.
|
|
|
|
IncludeURL Force the record to be processed based on URL. This takes
|
|
precedence over the Ignore* keywords.
|
|
|
|
IncludeReferrer Force the record to be processed based on referrer.
|
|
This takes precedence over the Ignore* keywords.
|
|
|
|
IncludeAgent Force the record to be processed based on user agent.
|
|
This takes precedence over the Ignore* keywords.
|
|
|
|
IncludeUser Force the record to be processed based on username.
|
|
Usernames are only available if you use http based
|
|
authentication on your server. This takes precedence over
|
|
the Ignore* keywords.
|
|
|
|
|
|
Dump Object Keywords
|
|
--------------------
|
|
|
|
The Dump* Keywords allow text files to be generated that can then be used
|
|
for import into most database, spreadsheet and other external programs.
|
|
The file is a standard tab delimited text file, meaning that each column
|
|
is separated by a tab (0x09) character. A header record may be included
|
|
if required, using the 'DumpHeader' keyword. Since these files contain
|
|
all records that have been processed, including normally hidden records,
|
|
an alternate location for the files can be specified using the 'DumpPath'
|
|
keyword, otherwise they will be located in the default output directory.
|
|
|
|
DumpPath Specifies an alternate location for the dump files. The
|
|
default output location will be used otherwise. The value
|
|
is the path portion to use, and normally should be an
|
|
absolute path (ie: has a leading '/' character), however
|
|
relative path names can be used as well, and will be
|
|
relative to the output directory location.
|
|
|
|
DumpExtension Allows the dump filename extensions to be specified. The
|
|
default extension is "tab", but it may be changed with
|
|
this option.
|
|
|
|
DumpHeader Allows a header record to be written as the first record
|
|
of the file. Value can be either 'yes' or 'no', with
|
|
the default being 'no'.
|
|
|
|
DumpSites Dump tab delimited sites file. Value can be either 'yes'
|
|
or 'no', with the default being 'no'. The filename used
|
|
is site_YYYYMM.tab (YYYY=year, MM=month).
|
|
|
|
DumpURLs Dump tab delimited url file. Value can be either 'yes' or
|
|
'no', with the default being 'no'. The filename used is
|
|
url_YYYYMM.tab (YYYY=year, MM=month).
|
|
|
|
DumpReferrers Dump tab delimited referrer file. Value can be either
|
|
'yes' or 'no', with the default being 'no'. Filename
|
|
used is ref_YYYYMM.tab (YYYY=year, MM=month). Referrer
|
|
information is only available if present in the log
|
|
file (ie: combined web server log).
|
|
|
|
DumpAgents Dump tab delimited user agent file. Value can be either
|
|
'yes' or 'no', with the default being 'no'. Filename
|
|
used is agent_YYYYMM.tab (YYYY=year, MM=month). User
|
|
agent information is only available if present in the
|
|
log file (ie: combined web server log).
|
|
|
|
DumpUsers Dump tab delimited username file. Value can be either
|
|
'yes' or 'no', with the default being 'no'. Filename
|
|
used is user_YYYYMM.tab (YYYY=year, MM=month). The
|
|
username data is only available if processing a wu-ftpd
|
|
xferlog or http authentication is used on the web server
|
|
and that information is present in the log.
|
|
|
|
DumpSearchStr Dump tab delimited search string file. Value can be
|
|
either 'yes' or 'no', with the default being 'no'.
|
|
Filename used is search_YYYYMM.tab (YYYY=year, MM=month).
|
|
The search string data is only available if referrer
|
|
information is present in the log being processed and
|
|
recognized search engines were found and processed.
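
Because the dump files are plain tab delimited text, they can also be
processed with ordinary shell tools. The sketch below builds a stand-in
for a url_YYYYMM.tab file and prints the URL with the most hits. The
column order used here (Hits, kBytes, URL) and the sample values are
assumptions for the example; check the header record of your own dump
files before relying on it.

```shell
#!/bin/sh
# Sample data standing in for a URL dump written with a header record.
# The column order (Hits, kBytes, URL) is an assumption for this example.
printf 'Hits\tkBytes\tURL\n'      >  url_201301.tab
printf '120\t340\t/index.html\n'  >> url_201301.tab
printf '45\t900\t/download.zip\n' >> url_201301.tab
printf '300\t75\t/style.css\n'    >> url_201301.tab

# Skip the header record, sort numerically on the first column (descending)
# and print the third column of the top record.
tail -n +2 url_201301.tab | sort -t "$(printf '\t')" -k1,1nr | head -n 1 | cut -f3
```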
|
|
|
|
|
|
|
|
HTML Generation Keywords
|
|
------------------------
|
|
|
|
These keywords allow you to customize the HTML code that The Webalizer
|
|
produces, such as adding a corporate logo or links to other web pages.
|
|
You can specify as many of these keywords as you like, and they will be
|
|
used in the order that they are found in the file. Values cannot exceed
|
|
80 characters in length, so you may have to break long lines up into two
|
|
or more lines. Except where noted below, there are no command line
counterparts to these keywords.
|
|
|
|
HTMLExtension Allows generated pages to use something other than the
|
|
default 'html' extension for the filenames. Do not
|
|
include the leading period ('.') when you specify the
|
|
extension.
|
|
Command line argument: -x
|
|
|
|
HTMLPre Allows code to be inserted at the very beginning of the
|
|
HTML files. Defaults to the standard HTML 3.2 DOCTYPE
|
|
record. Be careful not to include any HTML here, as it
|
|
is inserted _before_ the <HTML> tag in the file. Use it
|
|
for server-side scripting capabilities, such as php3, to
|
|
insert scripting files and other directives.
|
|
|
|
HTMLHead Allows you to insert HTML code between the <HEAD></HEAD>
|
|
block. There is no default. Useful for adding scripts
|
|
to the HTML page, such as Javascript or php3, or even
|
|
just for adding a few META tags to the document.
|
|
|
|
HTMLBody This keyword defines HTML code to be placed immediately
|
|
after the <HEAD> section of the report, just before the
|
|
title and "summary period/generated on" lines. If used,
|
|
the first HTMLBody line MUST include a <BODY> tag. Put
|
|
whatever else you want in subsequent lines, but keep in
|
|
mind the placement of this code in relation to the title
|
|
and other aspects of the web page. Some typical uses
|
|
are to change the page colors and possibly add a corporate
|
|
logo (graphic) in the top right. If not specified, a
|
|
default <BODY> tag is used that defines page color, text
|
|
color and link colors (see "sample.conf" file for example).
|
|
|
|
HTMLPost This keyword defines HTML code that is placed after the
|
|
title and "summary period/generated on" lines, just before
|
|
the initial horizontal rule <HR> tag. Normally this keyword
|
|
isn't needed, but is provided in case you included a large
|
|
graphic or some other weird formatting tag in the HTMLBody
|
|
section that needs to be cleaned up or terminated before the
|
|
main report section.
|
|
|
|
HTMLTail This keyword defines HTML code that is placed at the bottom
|
|
right side of the report. It is inserted in a <TABLE> section
|
|
between table data <TD>..</TD> tags, and is top and right
|
|
aligned within the table. Normally this keyword is used to
|
|
provide a link back to your home page or insert a small
|
|
graphic at the bottom right of the page.
|
|
|
|
HTMLEnd This allows insertion of closing code, at the very end of
|
|
the page. The default is to put the closing </BODY> and
|
|
</HTML> tags. If specified, you _must_ specify these tags
|
|
yourself.
|
|
|
|
LinkReferrer This specifies if the referrers listed in the top referrer
|
|
table should be displayed as plain text, or as a link to the
|
|
referrer. Values can be either 'yes' or 'no', with 'no'
|
|
being the default.
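
The following sketch shows how several of these keywords might be
combined. The file name 'html.conf', the image name and the
www.example.com link are placeholders. Note that the first HTMLBody line
carries the <BODY> tag and that each value stays under 80 characters.

```shell
#!/bin/sh
# html.conf is a hypothetical file name; markup and URL are placeholders.
cat > html.conf <<'EOF'
HTMLHead <META NAME="author" CONTENT="The Webalizer">
HTMLBody <BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">
HTMLBody <IMG SRC="logo.png" ALT="Company logo" ALIGN="right">
HTMLTail <A HREF="http://www.example.com/">Back to home page</A>
HTMLEnd  </BODY></HTML>
LinkReferrer yes
EOF
```

Since HTMLEnd is given here, the closing </BODY> and </HTML> tags are
supplied explicitly, as required.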
|
|
|
|
|
|
Graph Color Commands
|
|
--------------------
|
|
|
|
These keywords allow altering the colors used in the various graphs
|
|
produced by the Webalizer. The value is specified as a standard HTML
|
|
RGB hexadecimal color string, without the leading '#' character. The
|
|
value is case insensitive. If not specified, the default color shown
|
|
will be used.
|
|
|
|
ColorHit Color used for 'Hits'. Default is '00805C' (green)
|
|
|
|
ColorFile Color used for 'Files'. Default is '0040FF' (blue)
|
|
|
|
ColorSite Color used for 'Sites'. Default is 'FF8000' (orange)
|
|
|
|
ColorKbyte Color used for 'KBytes'. Default is 'FF0000' (red)
|
|
|
|
ColorPage Color used for 'Pages'. Default is '00E0FF' (cyan)
|
|
|
|
ColorVisit Color used for 'Visits'. Default is 'FFFF00' (yellow)
|
|
|
|
ColorMisc Color used for miscellaneous titles in various 'Top'
|
|
tables (not graphs). Default is '00E0FF' (cyan)
|
|
|
|
PieColor1 Pie Chart color #1. Default is '800080' (purple)
|
|
|
|
PieColor2 Pie Chart color #2. Default is '80FFC0' (lt. green)
|
|
|
|
PieColor3 Pie Chart color #3. Default is 'FF00FF' (lt. purple)
|
|
|
|
PieColor4 Pie Chart color #4. Default is 'FFC080' (tan)
|
|
|
|
|
|
--------------------------------------------------------------------------
|
|
|
|
|
|
Notes on Web Log Files
|
|
----------------------
|
|
|
|
The Webalizer supports CLF log formats, which should work for just
|
|
about everyone. If you want User Agent or Referrer information, you
|
|
need to make sure your web server supplies this information in its
|
|
log file, and in a format that the Webalizer can understand. While
|
|
The Webalizer will try to handle many of the subtle variations in
|
|
log formats, some will not work at all. Most web servers output
|
|
CLF format logs by default. For Apache, in order to produce the
|
|
proper log format, add the following to the httpd.conf file:
|
|
|
|
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\""
|
|
|
|
This instructs the Apache web server to produce a 'combined' log
|
|
that includes the referrer and user agent information on the end of
|
|
each record, enclosed in quotes (This is the standard recommended
|
|
by both Apache and NCSA). Netscape and other web servers have
|
|
similar capabilities to alter their log formats. (note: the above
|
|
works for Apache servers up to V1.2. V1.3 and higher have additional
|
|
ways to specify log formats... refer to included documentation).
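
A quick way to see whether a log already carries the referrer and user
agent fields is to count the quoted fields in each record. This is only
a rough check, shown with made-up sample records. It assumes the fields
themselves contain no embedded double quotes.

```shell
#!/bin/sh
# Two sample records: the first in combined format, the second in plain CLF.
cat > format_check.sample <<'EOF'
10.0.0.1 - - [01/Jan/2013:00:00:01 +0000] "GET / HTTP/1.0" 200 1024 "http://www.example.com/" "Mozilla/4.0"
10.0.0.2 - - [01/Jan/2013:00:00:02 +0000] "GET /a.html HTTP/1.0" 200 2048
EOF

# Splitting on double quotes: a combined record has the request, referrer
# and user agent quoted (7 pieces), a plain CLF record only the request (3).
awk -F'"' '{ print (NF >= 7 ? "combined" : "clf") }' format_check.sample
```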
|
|
|
|
Notes on FTP Log Files
|
|
----------------------
|
|
|
|
The Webalizer supports ftp logs produced by wu-ftpd, proftpd and others,
|
|
as a standard 'xferlog'. To process an ftp log, you must either use the
|
|
-Ff command line option or have "LogType ftp" in your configuration file.
|
|
It is recommended that you create a separate configuration file for ftp
|
|
analysis, since the values used for your web server will most likely not
|
|
be suited for ftp log analysis (ie: page types, hostname, etc.. should
|
|
be different).
|
|
|
|
Because of the difference in web and ftp logs, there are a few limitations:
|
|
|
|
o Because there is no concept of a 'response code' in the ftp world, response
|
|
codes are restricted to either 200 (OK) or 206 (Partial Content), based
|
|
on the completion status found in xferlog (for wu-ftpd, 'i'=incomplete
|
|
and will generate a 206, 'c'=complete and will generate a 200). If your
|
|
ftp server doesn't supply the completion status, all requests will be
|
|
assigned a response code of 200. This allows the usage graph to display
|
|
all transfer requests (hits), and how many of those completed successfully
|
|
(files - ie: 200 response codes).
|
|
|
|
o Page totals won't accurately reflect reality, since there isn't really
|
|
the concept of a 'page' in regards to ftp services. I have found that
|
|
setting the PageType value to "README", "FIRST", etc... seems to work
|
|
fairly well however, and will give a pretty good indication of how
|
|
many 'non-binary' files were requested. Of course, the content of your
|
|
ftp site will be different, so your results may vary.
|
|
|
|
o Visit totals also won't accurately reflect reality, since visits are
|
|
triggered on PageType requests (see above). What you usually wind up
|
|
with is visits=sites in most cases.
|
|
|
|
o Entry/Exit pages will not be calculated for ftp logs.
|
|
|
|
o For obvious reasons, referrers and user agents are not supported.
|
|
|
|
o You _cannot_ analyze both web and ftp logs at the same time; they must
|
|
be done separately in different runs.
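
The response code mapping described in the first item above can be
pictured with a short awk sketch. It is an illustration of the rule, not
The Webalizer's own code. It assumes the wu-ftpd xferlog layout, with
the completion status as the last field and file names without embedded
spaces; the sample records are made up.

```shell
#!/bin/sh
# Two made-up xferlog records: one complete ('c'), one incomplete ('i').
cat > xferlog.sample <<'EOF'
Mon Jan  7 10:15:32 2013 12 host1.example.com 102400 /pub/file.tar.gz b _ o a user@example.com ftp 0 * c
Mon Jan  7 10:16:05 2013 3 host2.example.com 2048 /pub/README a _ o a user@example.com ftp 0 * i
EOF

# The last field is the completion status: 'i' maps to 206, anything else
# (including a missing status) is treated as 200 in this sketch.
awk '{ code = ($NF == "i") ? 206 : 200; print $9, code }' xferlog.sample
```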
|
|
|
|
|
|
Notes on Referrers
|
|
------------------
|
|
|
|
Referrers are weird critters... They take many shapes and forms, which makes
|
|
them much harder to analyze than a typical URL, which at least has some
|
|
standardization. What is contained in the referrer field of your log
|
|
files varies depending on many factors, such as what site did the referral,
|
|
what type of system it comes from and how the actual referral was generated.
|
|
Why is this? Well, because a user can get to your site in many ways... They
|
|
may have your site bookmarked in their browser, they may simply type your
|
|
site's URL into their browser, they could have clicked on a link on some
|
|
remote web page or they may have found your site from one of the many search
|
|
engines and site indexes found on the web. The Webalizer attempts to deal
|
|
with all this variation in an intelligent way by doing certain things to
|
|
the referrer string which makes it easier to analyze. Of course, if your
|
|
web server doesn't provide referrer information, you probably don't really
|
|
care and are asking yourself why you are reading this section...
|
|
|
|
Most referrers will take the form of "http://somesite.com/somepage.html",
|
|
which is what you will get if the user clicks on a link somewhere on the
|
|
web in order to get to your site. Some will be a variation of this, and
|
|
look something like "file:/some/such/sillyname", which is a reference from
|
|
an HTML document on the user's local machine. Several variations of this can
|
|
be used, depending on what type of system the user has, if he/she is on
|
|
a local network, the type of network, etc... To complicate things even
|
|
more, dynamic HTML documents and HTML documents that are generated by
|
|
CGI scripts or external programs produce lots of extra information which
|
|
is tacked on to the end of the referrer string in an almost infinite number
|
|
of ways. If the user just typed your URL into their browser or clicked on
|
|
a bookmark, there won't be any information in the referrer field, and it will
|
|
take the form "-".
|
|
|
|
In order to handle all these variations, The Webalizer parses the referrer
|
|
field in a certain way. First, if the referrer string begins with "http",
|
|
it assumes it is a normal referral and converts the "http://" and following
|
|
hostname to lowercase in order to simplify hiding if desired. For example,
|
|
the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become
|
|
"http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that only the
|
|
"http://" and hostname are converted to lower case... The rest of the
|
|
referrer field is left alone. This follows standard convention, as the
|
|
actual method (HTTP) and hostname are always case insensitive, while the
|
|
document name portion is case sensitive.
|
|
|
|
Referrers that came from search engines, dynamic HTML documents, CGI
|
|
scripts and other external programs usually tack on additional information
|
|
that was used to create the page. A common example of this can be found
|
|
in referrals that come from search engines and site indexes common on the
|
|
web. Sometimes, these referrer URLs can be several hundred characters
|
|
long and include all the information that the user typed in to search for
|
|
your site. The Webalizer deals with this type of referrer by stripping
|
|
off all the query information, which starts with a question mark '?'.
|
|
The Referrer "http://search.yahoo.com/search?p=usa%26global%26link" will
|
|
be converted to just "http://search.yahoo.com/search".
|
|
|
|
When a user comes to your site by using one of their bookmarks or by
|
|
typing in your URL directly into their browser, the referrer field is
|
|
blank, and looks like "-". Most sites will get more of these referrals
|
|
than any other type. The Webalizer converts this type of referral into
|
|
the string "- (Direct Request)". This is done in order to make it easier
|
|
to hide via a command line option or configuration file option. This is
|
|
because the character "-" is a valid character elsewhere in a referrer
|
|
field, and if not turned into something unique, could not be hidden without
|
|
possibly hiding other referrers that shouldn't be.
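
The three transformations described in this section (lowercasing the
"http://" and hostname, stripping the query information, and renaming
the blank referrer) can be sketched as a small shell function. The
function name 'normalize_referrer' is made up for this example, and the
code only mimics the behavior described above; it is not taken from The
Webalizer's source and ignores special cases such as search string
processing.

```shell
#!/bin/sh
# normalize_referrer is a hypothetical helper name used for illustration.
normalize_referrer() {
    printf '%s\n' "$1" | awk '
    # A blank referrer is logged as "-"
    $0 == "-" { print "- (Direct Request)"; next }
    {
        ref = $0
        # Strip the query information, which starts at the first "?"
        q = index(ref, "?")
        if (q > 0) ref = substr(ref, 1, q - 1)
        # Lowercase only the method and hostname of http referrers
        if (tolower(substr(ref, 1, 4)) == "http") {
            p = index(ref, "://")
            if (p > 0) {
                rest = substr(ref, p + 3)
                s = index(rest, "/")
                if (s > 0) { host = substr(rest, 1, s - 1); path = substr(rest, s) }
                else       { host = rest; path = "" }
                ref = tolower(substr(ref, 1, p + 2)) tolower(host) path
            }
        }
        print ref
    }'
}

normalize_referrer 'HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html'
normalize_referrer 'http://search.yahoo.com/search?p=usa%26global%26link'
normalize_referrer '-'
```

The three calls use the examples given in the text above, so the output
can be compared directly with the results described there.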
|
|
|
|
|
|
Notes on Character Escaping
|
|
---------------------------
|
|
|
|
The HTTP protocol defines certain ways that URLs can look and behave. To
|
|
some extent, referrer fields follow most of the same conventions. Character
|
|
escaping is a technique by which non-printable or other non-ASCII (and even
|
|
some ASCII) characters can be used in a URL. This is done by placing the
|
|
Hexadecimal value of the character in the URL, preceded by a percent sign '%'.
|
|
Since Hex values are made up of ASCII characters, any character can be
|
|
escaped to ensure only printable ASCII characters are present in the URL.
|
|
Some systems take this concept to the extreme and escape all sorts of stuff,
|
|
even characters that don't need to be escaped. To deal with this, The
|
|
Webalizer will un-escape URLs and referrers before processing them. For
|
|
example, the URL "/www.webalizer.org/%7Efoo/bar.html" is the same URL as
|
|
"/www.webalizer.org/~foo/bar.html", a very common form of a URL to access
|
|
users' web pages. If the URLs were not un-escaped, they would be treated as
|
|
two separate documents, even though they are really one and the same.
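
To make the un-escaping step concrete, here is a small sketch that
replaces each '%' followed by two hex digits with the character it
stands for. The function name 'unescape_url' is made up for the example,
and the code is an illustration of the technique rather than The
Webalizer's own routine. It is only meant for ASCII values such as the
%7E shown above.

```shell
#!/bin/sh
# unescape_url is a hypothetical helper name used for illustration.
unescape_url() {
    printf '%s\n' "$1" | awk '
    # Value of a single hex digit (0-15)
    function hexval(c) { return index("0123456789abcdef", tolower(c)) - 1 }
    {
        out = ""
        s = $0
        # Replace each %XX sequence, working from left to right
        while (match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            code = hexval(substr(s, RSTART + 1, 1)) * 16 + hexval(substr(s, RSTART + 2, 1))
            out = out substr(s, 1, RSTART - 1) sprintf("%c", code)
            s = substr(s, RSTART + RLENGTH)
        }
        print out s
    }'
}

unescape_url '/www.webalizer.org/%7Efoo/bar.html'
```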
|
|
|
|
|
|
Search String Analysis
|
|
----------------------
|
|
|
|
The Webalizer will do a minimal analysis on referrer strings that
|
|
it finds, looking for well known search string patterns. Most of
|
|
the major search engines are supported, such as Yahoo!, Altavista,
|
|
Lycos, etc... Unfortunately, search engines are always changing
|
|
their internal/CGI query formats, new search engines are coming on
|
|
line every day, and the ability to detect _all_ search strings is
|
|
nearly impossible. However, it should be accurate enough to give
|
|
a good indication of what users were searching for when they stumbled
|
|
across your site. Note: as of version 1.31, search engines can now
|
|
be specified within a configuration file. See the sample.conf file
|
|
for examples of how to specify additional search engines.
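
As a rough illustration, a search engine entry pairs a host name pattern
with the query variable that carries the search terms. The entries and
the file name 'search.conf' below are examples only, and the query
variable names may not match what a given search engine uses today. The
sed line merely shows what extracting such a variable from a referrer
looks like; it is not The Webalizer's own parser.

```shell
#!/bin/sh
# search.conf is a hypothetical file name; the entries are examples only.
cat > search.conf <<'EOF'
SearchEngine    yahoo.com       p=
SearchEngine    altavista.com   q=
SearchEngine    lycos.com       query=
EOF

# What such an entry matches: pull the value of 'p=' out of a referrer.
ref='http://search.yahoo.com/search?p=webalizer&n=10'
printf '%s\n' "$ref" | sed -n 's/.*[?&]p=\([^&]*\).*/\1/p'
```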
|
|
|
|
|
|
|
|
Notes on Visits/Entry/Exit Figures
|
|
----------------------------------
|
|
|
|
The majority of data analyzed and reported on by The Webalizer is
|
|
as accurate and correct as possible based on the input log file.
|
|
However, due to the limitation of the HTTP protocol, the use of
|
|
firewalls, proxy servers, multi-user systems, the rotation of your
|
|
log files, and a myriad of other conditions, some of these numbers
|
|
cannot be calculated with absolute accuracy. In particular,
|
|
Visits, Entry Pages and Exit Pages are subject to random errors
|
|
due to the above and other conditions. The reason for this is
|
|
twofold, 1) Log files are finite in size and time interval, and
|
|
2) There is no way to distinguish multiple individual users apart
|
|
given only an IP address. Because log files are finite, they have
|
|
a beginning and ending, which can be represented as a fixed time
|
|
period. There is no way of knowing what happened previous to this
|
|
time period, nor is it possible to predict future events based on
|
|
it. Also, because it is impossible to distinguish individual users
|
|
apart, multiple users that have the same IP address all appear to
|
|
be a single user, and are treated as such. This is most common where
|
|
corporate users sit behind a proxy/firewall to the outside world,
|
|
and all requests appear to come from the same location (the address
|
|
of the proxy/firewall itself). Dynamic IP assignment (used with
|
|
dial-up Internet accounts) also presents a problem, since the same
|
|
user will appear to come from multiple places.
|
|
|
|
For example, suppose two users visit your server from XYZ company,
|
|
which has their network connected to the Internet by a proxy server
|
|
'fw.xyz.com'. All requests from the network look as though they
|
|
originated from 'fw.xyz.com', even though they were really initiated
|
|
from two separate users on different PCs. The Webalizer would
|
|
see these requests as from the same location, and would record only
|
|
1 visit, when in reality, there were two. Because entry and exit
|
|
pages are calculated in conjunction with visits, this situation
|
|
would also only record 1 entry and 1 exit page, when in reality,
|
|
there should be 2.
|
|
|
|
As another example, say a single user at XYZ company is surfing
|
|
around your website. They arrive at 11:52pm the last day of
|
|
the month, and continue surfing until 12:30am, which is now a
|
|
new day (in a new month). Since a common practice is to rotate
|
|
(save then clear) the server logs at the end of the month, you
|
|
now have the user's visit logged in two different files (current
|
|
and previous months). Because of this (and the fact that the
|
|
Webalizer clears history between months), the first page the
|
|
user requests after midnight will be counted as an entry page.
|
|
This is unavoidable, since it is the first request seen from that
|
|
particular IP address in the new month.
|
|
|
|
For the most part, the numbers shown for visits, entry and exit
|
|
pages are pretty good 'guesses', even though they may not be 100%
|
|
accurate. They do provide a good indication of overall trends,
|
|
and shouldn't be far enough off from the real numbers to matter much.
|
|
You should probably consider them as the 'minimum' amount possible,
|
|
since the actual (real) values should always be equal to or greater
|
|
in all cases.
|
|
|
|
|
|
Exporting Webalizer Data
|
|
------------------------
|
|
|
|
The Webalizer now has the ability to dump all object tables to tab
|
|
delimited ASCII text files, which can then be imported into most
|
|
popular database and spreadsheet programs. The files are not normally
|
|
produced, as on some sites they could become quite large, and are only
|
|
enabled by the use of the Dump* configuration keywords. The filename
|
|
extensions default to '.tab', but may be changed using the
|
|
'DumpExtension' keyword. Since this data contains all items, even
|
|
those normally hidden, it may not be desirable to have them located
|
|
in the output directory where they may be visible to normal web users.
|
|
For this reason, the 'DumpPath' configuration keyword is available,
|
|
and allows the placement of these files somewhere outside the normal
|
|
web server document tree. An optional 'header' record may be written
|
|
to these files as well, and is useful when the data is to be imported
|
|
into a spreadsheet; databases will not normally need the header. If
|
|
enabled, the header is simply the column names as the first record of
|
|
the file, tab separated.
|
|
|
|
|
|
Log files and The Webalizer
|
|
---------------------------
|
|
|
|
Most sites will choose to have The Webalizer run from cron at specified
|
|
intervals. Care should be taken to ensure that data is not lost as a
|
|
result of log file rotations. A suggested practice is to rotate your
|
|
web server logs at the end of each month as close to midnight as possible,
|
|
then have The Webalizer process the 'end of month' log file before running
|
|
statistics on the new, current log. On our systems, a shell script called
|
|
'rotate_logs' is run at midnight, the end of each month. This script file
|
|
looks like:
|
|
|
|
------------------------- file: rotate_logs ------------------------------
|
|
#!/bin/sh

# halt the server
kill `cat /var/lib/httpd/logs/httpd.pid`

# define backup names
OLD_ACCESS_LOG=/var/lib/httpd/logs/old/access_log.`date +%y%m%d-%H%M%S`
OLD_ERROR_LOG=/var/lib/httpd/logs/old/error_log.`date +%y%m%d-%H%M%S`

# make end of month copy for analyzer
cp /var/lib/httpd/logs/access_log /var/lib/httpd/logs/access_log.backup

# move files to archive directory
mv /var/lib/httpd/logs/access_log "$OLD_ACCESS_LOG"
mv /var/lib/httpd/logs/error_log "$OLD_ERROR_LOG"

# restart web server
/usr/sbin/httpd

# compress the archived files
/bin/gzip "$OLD_ACCESS_LOG"
/bin/gzip "$OLD_ERROR_LOG"
|
|
------------------------- end of file ------------------------------------
|
|
|
|
This script first stops the web server using a 'kill' command. Apache
|
|
keeps the PID of the server in the file httpd.pid, so we use it as the
|
|
argument for the kill. Next, it defines some names for the backup files,
|
|
which are basically the name of the files with the date and time appended
|
|
to the end of them. It then makes a copy of the log file, appended with
|
|
'.backup' in the log directory, moves the current log files to an archive
|
|
directory (/var/lib/httpd/logs/old) and restarts the server. This setup
|
|
allows the web server to be down for the minimum amount of time needed,
|
|
which is important for busy sites. If you don't want to stop the server,
|
|
you can remove the initial 'kill' command, and replace the '/usr/sbin/httpd'
|
|
line with a "kill -1 `cat /var/lib/httpd/logs/httpd.pid`" command instead.
|
|
On most web servers, this will cause a restart of the server and create
|
|
the new log files in the process...
|
|
|
|
At this point, we have made copies of the previous month's logs, the web
|
|
server is going about its business as usual, and we have all the time in
|
|
the world to do any other additional processing we want. The last two
|
|
lines of the script compress the archived logs using the GNU zip program
|
|
(gzip). Remember, we still have a copy of the log which we can now run
|
|
The Webalizer on without having to do any further processing.
|
|
|
|
Next, we define two crontab entries. The first runs the above 'rotate_logs'
|
|
script at midnight at the end of the month. The second runs The Webalizer
|
|
on the '.backup' log file created above at 5 minutes after midnight. This
|
|
gives other end of month processing jobs a chance to run so we don't bog
|
|
the system down too much. If you have lots of end of month stuff going on,
|
|
you can change the timing to suit your needs. The crontab entries look
|
|
something like:
|
|
|
|
------------------------- crontab entries --------------------------------
|
|
# Rotate web server logs and run monthly analysis
0 0 1 * * /usr/local/adm/rotate_logs
5 0 1 * * /usr/bin/webalizer -Q /var/lib/httpd/logs/access_log.backup
|
|
------------------------- end of crontab ---------------------------------
|
|
|
|
As you can see, the log rotations occur at midnight, and the analysis
|
|
is done at 5 minutes after. Once you verify that The Webalizer ran
|
|
successfully, the access_log.backup file can be deleted as it isn't
|
|
needed any more. If you need to re-run the analysis, you still have
|
|
the compressed archive copy that the shell script created. In order
|
|
for the above analysis to work properly, you should have already
|
|
created an /etc/webalizer.conf configuration file suitable for your
|
|
site, or otherwise specify configuration options or a configuration
|
|
file on the crontab command line above.
|
|
|
|
If you want The Webalizer to be run more often than once a month, you
|
|
can specify additional crontab entries to do this as well. Care should
|
|
be taken however to ensure that The Webalizer is not running when the
|
|
end of month processing above occurs, or unpredictable results may
|
|
happen (such as an inability to rotate the logs due to a file lock).
|
|
The easiest way is to run it on the half hour with a crontab entry like:
|
|
|
|
30 * * * * /usr/bin/webalizer
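
If the run times cannot be kept apart reliably, one possible safeguard
is a simple lock shared by the rotation script and the periodic runs.
The sketch below is one way to do that, not something The Webalizer
provides itself. The function name 'run_locked' and the lock location
are made up for the example.

```shell
#!/bin/sh
# LOCKDIR is a hypothetical path. mkdir is used because it either creates
# the directory or fails, so two jobs cannot both acquire the lock.
LOCKDIR="${TMPDIR:-/tmp}/webalizer.lock"

run_locked() {
    if mkdir "$LOCKDIR" 2>/dev/null; then
        # We hold the lock: run the given command, then release the lock.
        "$@"
        status=$?
        rmdir "$LOCKDIR"
        return $status
    fi
    echo "another run holds $LOCKDIR, skipping" >&2
    return 1
}

# Example use inside a wrapper script started from cron (paths assumed):
#   run_locked /usr/bin/webalizer
#   run_locked /usr/local/adm/rotate_logs
```

A job that is killed while holding the lock leaves the directory behind,
so it may need to be removed by hand before the next run.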
|
|
|
|
|
|
Reverse DNS Lookups
|
|
-------------------
|
|
|
|
The Webalizer fully supports both IPv4 and IPv6 DNS lookups, and
|
|
maintains a cache of those lookups to reduce processing the same
|
|
addresses in subsequent runs. The cache file can be created at
|
|
run-time, or may be created before running the webalizer using either
|
|
the stand alone 'webazolver' program, or The Webalizer (DNS) Cache
|
|
file Manager program 'wcmgr'. In order to perform reverse lookups,
|
|
a DNS Cache file must be specified, either on the command line or in
|
|
a configuration file. In order to create/update the cache file at
|
|
run-time, the number of DNS Children must also be specified, and can
|
|
be anything between 1 and 100. This specifies the number of child
|
|
processes to be forked, each of which will perform network DNS
|
|
queries in order to look up the addresses and update the cache.
|
|
Cached entries that are older than a specified TTL (time to live)
|
|
will be expired, and if encountered again in a log, will be looked
|
|
up at that time in order to 'freshen' them (verify the name is still
|
|
the same and update its timestamp). The default TTL is 7 days, but it
|
|
may be set to anything between 1 and 100 days. Using the 'wcmgr'
|
|
program, entries may also be marked as 'permanent', in which case
|
|
they will persist (with an infinite TTL) in the cache until manually
|
|
removed. See the file DNS.README for additional information.
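
A configuration fragment for run-time lookups might look like the sketch
below. The keyword names are assumed here to be DNSCache, DNSChildren
and CacheTTL; check the configuration keyword descriptions and the
DNS.README file for the names and defaults that apply to your version.
The file names 'dns.conf' and 'dns_cache.db' are placeholders.

```shell
#!/bin/sh
# dns.conf and dns_cache.db are hypothetical names; the keyword names are
# assumptions, see the lead-in above.
cat > dns.conf <<'EOF'
# Cache file used to store the lookup results
DNSCache     dns_cache.db
# Number of child processes performing lookups (1 to 100)
DNSChildren  10
# Days before a cached entry expires (1 to 100, default 7)
CacheTTL     7
EOF
```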
|
|
|
|
|
|
Geolocation Lookups
|
|
-------------------
|
|
|
|
The Webalizer has the ability to perform geolocation lookups on IP
|
|
addresses using either its own internal GeoDB database or optionally
|
|
the GeoIP database from MaxMind, Inc. (www.maxmind.com). If used,
|
|
unresolved addresses will be searched for in the database and their
|
|
country of origin will be returned if found. This actually produces
|
|
more accurate Country information than DNS lookups, since the DNS
|
|
address space has additional gTLDs that do not necessarily map to
|
|
a specific country (such as '.net' and '.com'). It is possible to
|
|
use both DNS lookups and geolocation lookups at the same time, which
|
|
will cause any addresses that could not be resolved using DNS lookups
|
|
to then be looked up in the database, greatly reducing the number of
|
|
'Unknown/Unresolved' entries in the generated reports. The native
|
|
GeoDB geolocation database provided by The Webalizer fully supports
|
|
IPv4 and IPv6 lookups, is updated regularly, and is the preferred
|
|
geolocation method for use with The Webalizer. The most current
|
|
version of the database can be obtained from our ftp site.
|
|
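The combined use of DNS and geolocation lookups can be pictured with
the following Python sketch. It is a simplification under stated
assumptions: the function name 'country_for' is hypothetical, and
plain dictionaries stand in for the DNS cache and the geolocation
database (a real database maps address ranges, not single addresses).

```python
def country_for(addr, dns_names, geo_db):
    """Pick a country/TLD label for addr: DNS name first, geolocation second."""
    name = dns_names.get(addr)
    if name is not None:
        # Resolved by DNS: the label is taken from the top level domain,
        # which may be a generic one such as 'com' or 'net'.
        return name.rsplit(".", 1)[-1].lower()
    # Unresolved by DNS: fall back to the geolocation database.
    return geo_db.get(addr, "unknown")
```

Only addresses found in neither source end up as 'unknown' here,
which is why using both methods together reduces the number of
unresolved entries.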
|
|
|
|
Language Support
----------------
|
|
|
|
Version 1.0x of The Webalizer added language support. This
support is only provided at compile time in the form of an
include file containing all the strings used by The Webalizer.
The source distribution contains all language files that were
available at the time, with English being the default as
that is the only human language I speak fluently, and me
Espanol es muy malo. Several people have already indicated
the desire to do translations into various languages, and as
I receive the language files, I will make them available via
ftp at ftp://ftp.mrunix.net/pub/webalizer/lang. Unless there
happens to be a binary distribution in the language you need,
you will need to grab the source distribution and compile the
program yourself. See the file INSTALL that comes in the source
distribution for information on how to use a language other than
English.
|
|
|
|
It should also be noted that the GD graphics library, used to
produce the in-line graphics in the output HTML, doesn't
support extended character sets, so if you are translating
the language file, you will no doubt encounter this problem.
|
|
|
|
New: You can now specify the language to use when you are building
the program from source, using the configure script. Just add
--with-language=language_name, where 'language_name' is the
name of a valid language file in the /lang/ directory. For
example, --with-language=french will build using French as
the default language. You should consult the INSTALL file
for additional information on building the program from source.
|
|
|
|
|
|
Known Issues
------------
|
|
|
|
o Memory Usage. The Webalizer makes liberal use of memory for internal
  data structures during analysis. Lack of real physical memory will
  noticeably degrade performance by doing lots of swapping between memory
  and disk. One user who had a rather large log file noticed that The
  Webalizer took over 7 hours to run with only 16 Meg of memory. Once
  memory was increased, the time was reduced to a few minutes.
|
|
|
|
|
|
o Performance. The Hide*, Group*, Ignore*, Include* and IndexAlias
  configuration options can cause a performance decrease if lots of
  them are used. The reason for this is that every log record must
  be scanned for each item in each list. For example, if you are
  Hiding 20 objects, Grouping 20 more, and Ignoring 5, each record
  is scanned, at most, 46 times (20+20+5 + an IndexAlias scan).
  On really large log files, this can have a profound impact. It
  is recommended that you use as few of these configuration options
  as you can, as it will greatly improve performance.
|
|
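The arithmetic in the Performance item can be reproduced with a small
Python sketch. It is not the matching code used by The Webalizer; the
function names and the simple substring test are assumptions chosen
only to show why the cost grows with the number of list entries.

```python
def worst_case_scans(list_sizes, index_alias_scan=1):
    """Upper bound on scans per log record: one per list item plus the IndexAlias scan."""
    return sum(list_sizes.values()) + index_alias_scan

def scan_record(url, patterns):
    """Linear scan of one list; returns (matched, number of comparisons made)."""
    comparisons = 0
    for pattern in patterns:
        comparisons += 1
        if pattern in url:
            # Stop at the first match, so a hit can cost fewer comparisons.
            return True, comparisons
    return False, comparisons
```

For example, worst_case_scans({"Hide": 20, "Group": 20, "Ignore": 5})
gives 46, the figure quoted above. A record that matches nothing pays
the full cost of every list, which is the common case on large logs.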
|
|
|
|
Final Notes
-----------
|
|
|
|
A lot of time and effort went into making The Webalizer and ensuring that
the results are as accurate as possible. If you find any abnormalities or
inconsistent results, bugs, errors, omissions or anything else that doesn't
look right, please let me know so I can investigate the problem or correct
the error. This goes for the minimal documentation as well. Suggestions
for future versions are also welcome and appreciated.
|