Software and Computer Systems Company, LLC
Overview
This document is intended to illustrate how to use Scannerz
for Mac OS X to perform system, hard drive, and SSD testing to
identify problems that will range from the simple and
straightforward to those that are much more difficult to
isolate. Many people think Scannerz is a drive testing
application, probably because the vast majority of problems Scannerz
detects are related directly to surface defects on hard drive
platters or bad blocks in an SSD. However, Scannerz is
not simply a hard drive testing tool; it is more properly described as
fault detection software. Surface scan problems are simply one
of the faults Scannerz is capable of detecting.
Scannerz uses the progress of a surface scan over media
(the surface of a hard drive or the blocks in an SSD) as a
reference to help isolate problems with a system. Media related
problems are always repeatable until corrected. For example, if
a bad sector exists on a hard drive starting at the byte
location 34,359,738,368 with respect to the start of the drive,
it will remain at that exact same location unless corrected. If
on the other hand, Scannerz detects problems (or faults)
that occur inconsistently with respect to the progress of the
scan, then they usually lie somewhere else in the system. Other
products on the market often miss faults completely, or in some
cases misidentify them as media problems when such problems don't
exist.
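The distinction between repeatable and inconsistent faults can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not how Scannerz is implemented; the function name `classify_faults` and the input format (one set of fault byte offsets per scan) are assumptions made for the example.

```python
from collections import Counter


def classify_faults(scans):
    """Split fault offsets into repeatable and inconsistent sets.

    `scans` is a list of sets. Each set holds the byte offsets at which
    one complete scan reported a fault (hypothetical input format).
    """
    counts = Counter(offset for scan in scans for offset in scan)
    # A fault seen at the same offset in every scan behaves like a media problem.
    repeatable = {off for off, n in counts.items() if n == len(scans)}
    # Anything seen only in some scans does not track scan progress.
    inconsistent = set(counts) - repeatable
    return repeatable, inconsistent


# Three made-up scans of the same drive.
scans = [
    {34_359_738_368, 1_048_576},
    {34_359_738_368},
    {34_359_738_368, 73_400_320},
]
repeatable, inconsistent = classify_faults(scans)
print(sorted(repeatable))    # offsets present in every scan
print(sorted(inconsistent))  # offsets present in only some scans
```

Offsets that land in the first set point at the media; offsets in the second set suggest the fault lies elsewhere in the system.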
Starting with Version 1.7,
Scannerz includes a mode known as Diagnostics
Mode. With Diagnostics Mode the user will be
able to do the following:
Scannerz is not intended to simply tell you whether a
drive or SSD has problems; it has been designed and packaged to
help users find the root cause of problems. As you'll see during
the rest of the document below, many problems may manifest
themselves with symptoms similar to drive problems and yet have
nothing to do with the actual drive itself.
The Testing Process
The common way to use Scannerz is to perform a Normal
Mode test on a drive or system, and if Scannerz
has flagged some problems or areas of concern, use Diagnostics
Mode to evaluate them. In some cases, such as excessive
data corruption or system lock ups, Diagnostics Mode may
be used directly without the need for a Normal Mode
test.
For reference, the following screen shots show Scannerz
Normal Mode and Diagnostics Mode interfaces.
A surface scan test underway in Normal Mode
Scannerz in Diagnostics Mode performing tests on a weak sector
The Logging window may be brought up in any test to provide
details about tests. In this screen capture, Scannerz
is in Diagnostics Mode and evaluating a drive with
some obvious problems.
Diagnostics Mode tests may be configured to analyze errors and irregularities detected in previous tests, perform interface tests, and perform memory and system bus tests.
A Note About Using a Phoenix Boot Volume for Testing
Scannerz includes a product named Phoenix, which
can create what's called a Phoenix Boot Volume and perform
volume cloning. We strongly recommend creating a Phoenix Boot
Volume on a secondary volume, or creating one on a 32GB (or
larger) USB Flash drive. In the creation of the boot volume, all
SCSC products will be transferred, as will the core operating
system. Third party applications and user folders will not be
copied into a Phoenix Boot Volume, but Phoenix can clone entire
systems as well, if needed. This volume may become invaluable in
the future if your system ever experiences a crash rendering the
original boot media unusable. Note that some older PowerPC based
systems cannot easily boot from a USB device.
Sources of Problems with a System and How Scannerz
can Isolate Them
Performance and functional problems with a system can often be
traced to one of the following:
The list above is not a list of every possible problem on a
system, but rather a list of the most likely problems one may
encounter.
Bad sectors/blocks on a hard drive or SSD will be flagged
during a Normal Mode test using Scannerz, and
confirmed using Diagnostics Mode. The symptoms of the
problem(s) will depend on how active the faulty region of the
media is. If the problematic area is in the boot code of the
drive, the system may fail to boot. If it's in an application
file, the file may fail to load. Both hard drives and SSDs are
capable of remapping bad regions to "spare" regions if they
exist.
Weak sectors should generally only occur on a hard drive.
A weak sector is a damaged, but readable sector. It typically
takes a fairly long time (often seconds) for the drive to read
such a sector. A weak sector in a hard drive will be identified
as an irregularity in a Normal Mode test, and confirmed
as a weak sector in Diagnostics Mode. Symptoms are long
periods of spinning beach balls any time the sector is
encountered by the system. A weak sector can be every bit as
problematic as a bad sector.
Intermittent connections may be detected in Normal
Mode tests as errors, irregularities, or both. An
intermittent connection is typically found in an I/O cable but
may be caused by faulty connectors as well, and even cracked or
marginal logic board traces. Unlike bad or weak sectors, these
problems never correlate to the progress of the surface scan
with any degree of consistency. When Diagnostics Mode is
used to evaluate data from a Normal Mode scan that
contains this type of problem, it will flag the problems as
being potential system faults. Putting the system into
Diagnostics Mode and performing prolonged interface tests
on the unit will likely expose the problem, since probable system
faults may be registered by Scannerz during interface
testing.
Data corruption occurs when data being transferred
between a drive or SSD and the system is altered in transit. The symptoms
will be files filled with garbage and, often, the need to
repair the media using Disk Utility's "Repair Disk" mode to
correct the inevitable indexing problems. This type of problem
may or may not be detected in Normal Mode testing,
depending on the cause. It will be detected in Diagnostics
Mode testing and registered as an interface error. This
problem will most likely occur in external drives that are
either under-powered or have failing stages in their conversion
of data between a hard drive and an external interface. This is
a critical error, especially if the drive is a backup drive.
Memory defects and system bus problems are two totally
different things, but they are both evaluated in Diagnostics
Mode using the Memory Test option. If a system has memory
problems such as bad memory, incompatible memory, or poorly
seated memory, Scannerz Diagnostics Mode will show this
as a memory error. These may or may not show up in Normal
Mode tests as intermittent faults. System bus problems
will likely show up as intermittent irregularities or errors in
a surface scan, with the exception that they will occur during
all tests on all devices. This will be because the faults are on
the logic board, not a device such as an external or internal
drive.
System timeouts, drive timeouts, and prolonged head parking
events can be caused by a drive or the system. Timeouts will be
detectable and identified in Scannerz Diagnostics Mode
because such events typically have no correlation to surface
scan progress but they occur with roughly identical durations,
such as +/- a few tenths of a second. If the timing event occurs
only during tests on a specific drive, then the drive is to
blame. If the problems occur regardless of what drive is being
tested, they are likely caused by a logic board problem. The most
likely cause of such a logic board problem is poorly seated or
loose heat sinks on the logic board. Some low power drives
designed as backup drives may exhibit this behavior by design -
that's apparently just the way they work.
Lack of memory and lack of free drive space cannot be
detected by Scannerz, but they can be detected by a tool
included with Scannerz named Performance Probe.
Lack of free drive space is the more critical of the two because
the system will be unable to swap memory to and from the drive.
Aside from causing excessive bottlenecks, in extreme cases, this
may cause the system to shut down or lock up. A lack of memory
is most often caused by too many applications running at a time,
or by a system that simply doesn't have enough memory to adequately
run even the core operating system. If there's too little
memory, you will likely experience very slow loading of
applications, long delays in application execution, excessive
swapping, very high CPU utilization, and large changes in the
size of the swap files.
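As a rough illustration of the free drive space concern, the short Python sketch below reports how much of a volume is free, using the standard library call `shutil.disk_usage`. It is not part of Scannerz or Performance Probe, and the 10% warning threshold is an arbitrary assumption chosen for the example, not a figure from this document.

```python
import shutil


def free_space_fraction(path="/"):
    """Return the fraction of the volume holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


# Hypothetical threshold: warn when less than 10% of the boot volume is free.
LOW_SPACE_THRESHOLD = 0.10

fraction = free_space_fraction("/")
if fraction < LOW_SPACE_THRESHOLD:
    print(f"Low free space: {fraction:.1%} of the volume is free")
else:
    print(f"Free space looks adequate: {fraction:.1%} of the volume is free")
```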
Excessive MDS indexing can be notorious for slowing a
system down. MDS, which stands for metadata server, is used by
both Spotlight and Time Machine to index drives on the system.
Scannerz, once again, can't monitor it, but it does have a provision to
keep the MDS process from running while a test is going on. Performance
Probe, which is included in the Scannerz package,
will likely indicate high CPU and I/O utilization when indexing is excessive. We offer
another product named SpotOff which can be used to control MDS
indexing, and a free MDS monitoring tool named Spot-O-Meter,
which may be obtained HERE.
Software problems cannot be checked with Scannerz,
but if bad kernel extensions are present,
they can not only skew some of Scannerz's test
results with a fair number of false irregularities, but may also
bottleneck the system, cause slow boot ups, and possibly cause system
crashes. By instruction, you're supposed to stop all
applications from running while Scannerz is performing a
test, however there may be things going on that you're unaware
of. For this reason, Scannerz also includes an
application named FSE or FSE-Lite (depending on the package)
that may be able to expose operations going on in the
background, particularly those generating excessive file system
activity. Performance Probe may also be of use dealing with such
a problem as it will indicate what seems to be inexplicable
system loading.
How to Test With Scannerz
As stated previously, the normal way to use Scannerz is
to perform a Normal Mode test, end to end on the drive
or volume you wish to evaluate, and proceed with Diagnostics
Mode tests if Normal Mode tests found errors or
irregularities. Many people check their systems periodically
using Normal Mode simply to confirm that their system
and drive are in good working order. Even in the event there are
no problems, it may be wise to run tests on both the memory and
interface in Diagnostics Mode simply to confirm that
everything is OK. Diagnostics Mode tests of the
interface and memory may also be needed if you're having erratic
system problems, but a Normal Mode test made no
indication of problems. The rest of this document will focus on
problems and how to identify and isolate them.
Diagnostics Mode has three test options which are
illustrated in the configuration dialog of Scannerz
above. The dialog allows the user to select three different
types of tests. The Analyze Errors and Irregularities option
will access test data from a Normal Mode scan and
evaluate it. It has an optional parameter to perform an
historical analysis (or not), with an historical analysis
evaluating all data acquired during testing since the original
tests were performed on a given device (recommended). If this
option is “Off” then only the data from the last Normal Mode
test will be evaluated. The Perform Interface Tests option will
exercise the entire interface between the media and the hosting
system. The Perform Memory Tests option will evaluate the system
memory and system bus for possible problems. The interface and
memory tests do not require data from a Normal Mode
scan, since they are intended to be used when problems have
already been detected.
The Analyze Errors and Irregularities Option
When the option to analyze errors and irregularities is
selected, the following will be identified if present:
In reality, if problems are found, the majority of them will
likely be repeatable, directly associated with the progress of
the scan on the hard drive or SSD, and manifest themselves as
bad blocks/sectors, weak sectors, or a combination of both.
Dealing with and possibly correcting them is detailed in the
users manual for Scannerz.
If Scannerz identifies possible timeouts, it will be
necessary to determine if they're being caused by the system or
the drive. This can usually be accomplished by using another
scan target, such as a USB flash drive or another, different
external drive as the target and then performing interface tests
on that drive for a fairly long period of time (for example,
increment the interface test counter to a fairly high value like
1000.) If the timeout is being caused by the system, they will
continue to occur on each and every drive tested. If they are
associated with the original drive, they will only occur when Scannerz
is testing that particular drive. If the cause of the timeout is
the system, either there are some very intrusive and
dysfunctional kernel extensions in the system, or the logic
board has problems. Drive timeouts may be caused by controller
resets, firmware bugs, overly aggressive head parking, or
(believe it or not) apparently by design on some low power drives.
Timeouts never correlate to the surface scan progress.
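The reasoning for attributing timeouts to the system or to a drive can be written as a small decision function. The sketch below is illustrative only; `locate_timeout_source` and its input format (a mapping from a drive label to whether timeouts were seen while testing it) are hypothetical names, not part of Scannerz.

```python
def locate_timeout_source(timeouts_seen, original_drive):
    """Guess whether timeouts come from the system or from one drive.

    `timeouts_seen` maps a drive label to True if timeouts occurred while
    that drive was the scan target (hypothetical input format).
    """
    if len(timeouts_seen) < 2:
        # One target alone cannot separate the drive from the system.
        return "inconclusive"
    affected = {name for name, seen in timeouts_seen.items() if seen}
    if affected == set(timeouts_seen):
        return "system"   # timeouts follow every drive tested
    if affected == {original_drive}:
        return "drive"    # timeouts follow only the original drive
    return "inconclusive"


print(locate_timeout_source({"internal": True, "usb_flash": True}, "internal"))
print(locate_timeout_source({"internal": True, "usb_flash": False}, "internal"))
```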
Abnormally long irregularities with inconsistent durations will
be called out in Diagnostics Mode as a potential
problem. The presence of such irregularities typically indicates
an intermittent connection of some sort. They will not correlate
to the progress of the surface scan, indicating the media on the
drive or SSD is not the problem. These are not timeouts, because
timeouts will always have relatively consistent durations. An
example of such an event might be irregularities detected with
durations of 10.33 seconds, 5.21 seconds, and 8.91 seconds.
Intermittent problems of this nature typically vary widely in
duration and at random with respect to the surface scan
progress. Problems of this nature may be evaluated and possibly
isolated with interface tests and a technique known as "path
isolation," which is described in a later section of
this document.
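The difference between timeouts (relatively consistent durations) and intermittent faults (widely varying durations) can be illustrated with a minimal Python sketch. The function name `classify_durations` and the tolerance of 0.3 seconds are assumptions for the example; this document only says timeouts agree to within a few tenths of a second, and Scannerz's actual criteria are not stated here.

```python
from statistics import median


def classify_durations(durations, tolerance=0.3):
    """Label a non-empty list of irregularity durations, in seconds.

    Durations that all sit within `tolerance` seconds of their median are
    treated as timeout-like; anything else as intermittent-like.
    """
    mid = median(durations)
    if all(abs(d - mid) <= tolerance for d in durations):
        return "consistent durations (timeout-like)"
    return "inconsistent durations (intermittent-like)"


# The durations quoted in the text above vary widely.
print(classify_durations([10.33, 5.21, 8.91]))
# Made-up durations that agree to within a few tenths of a second.
print(classify_durations([7.9, 8.1, 8.0, 8.2]))
```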
The Perform Interface Tests Option
This is the primary option used to evaluate intermittent
(erratic) problems with systems as well as to identify possible
corruption between the system and the media. The evaluation of
erratic intermittent errors and/or irregularities was described
in the preceding section and will be detailed in more depth
later in this document (see the section on path isolation). This leaves data
corruption, which is an extremely serious problem.
If interface tests are performed on a volume and interface
errors are found, it indicates that the data being transferred
between the system and the storage medium cannot be trusted.
This is particularly important if the drive exhibiting the
problem is a backup drive.
To illustrate this type of problem, suppose you saved a file to
a hard drive or SSD that contained the sentence "My dog has
fleas." If you re-read the file from the drive or SSD and what
you get back is "M*&dog~has fleas ", that is data corruption.
Clearly the data sent to the storage device and what was
received are not consistent.
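The idea of comparing what was sent with what comes back can be sketched as a simple write and read-back check in Python. This is an illustration of the concept only, not the Scannerz interface test: the function name `roundtrip_ok` is hypothetical, and because the operating system may satisfy the read from its cache, a check like this is far less rigorous than a dedicated test.

```python
import hashlib
import os
import tempfile


def roundtrip_ok(directory, size=1 << 20):
    """Write random data to a temporary file in `directory`, read it back,
    and report whether the SHA-256 digests match."""
    payload = os.urandom(size)
    expected = hashlib.sha256(payload).hexdigest()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to the device
        with open(path, "rb") as handle:
            actual = hashlib.sha256(handle.read()).hexdigest()
    finally:
        os.remove(path)
    return actual == expected


# Point `directory` at a folder on the volume you want to exercise.
print(roundtrip_ok(tempfile.gettempdir()))
```

A result of False would mean the data read back differs from the data written, which is the situation described above.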
When tests are run using Scannerz in this mode and this
type of error is detected, it will increment the field
"Interface Errors" (see the figure titled "Scannerz in Diagnostics
Mode performing tests on a weak sector" above to see the
field.) Even a single instance of this error should be taken
seriously. This type of problem will eventually cause indexing
problems which will become evident by notices that the drive
needs to be repaired with Disk Utility. Eventually, the drive
may be rendered "read only" or may even be marked as unusable by
the system. If this was a backup drive (the one that's supposed
to be reliable) this is obviously a very serious problem.
The Perform Memory Tests Option
This option differs from all other test options in Scannerz
in that it doesn't utilize I/O between the system and a drive.
This option is designed primarily to expose system faults,
load the CPU, memory, and system bus, and verify memory
contents. During this test it is not uncommon for the system to
run at higher than normal temperatures, and cooling fans may
kick on or increase their speed.
If an error is detected in this test, the field in the user
interface named "Memory Errors" will increment. If errors are
consistent, as in they repeat each time an iteration of the test
is performed, it indicates a likely problem with memory. For
example, if you get three errors every time an iteration of the
memory test is performed, it indicates that the
memory itself has a problem. If the memory errors are erratic,
appearing occasionally but inconsistently, it implies that
either the logic board has problems or something connected
directly to the logic board, such as an AirPort card, RAM,
Bluetooth card, keyboard, or trackpad (to name a few), may
be poorly seated, malfunctioning, or improperly connected. Do
not assume the logic board is dead and needs to be thrown out
without first investigating all possibilities.
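The interpretation of memory test results described above can be summarized in a short Python sketch. The function name `interpret_memory_errors` and its input (the "Memory Errors" count recorded after each test iteration) are hypothetical. Treating only identical non-zero counts as "consistent" is a simplifying assumption for the example.

```python
def interpret_memory_errors(errors_per_iteration):
    """Interpret a list of per-iteration memory error counts."""
    if not any(errors_per_iteration):
        return "no errors observed"
    same_count = len(set(errors_per_iteration)) == 1
    if same_count and errors_per_iteration[0] > 0:
        # The same errors repeat on every iteration.
        return "consistent: the memory itself is suspect"
    # Errors appear occasionally or in varying numbers.
    return "erratic: check the logic board and attached components"


print(interpret_memory_errors([3, 3, 3]))
print(interpret_memory_errors([0, 2, 0, 0, 1]))
```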
Using Path Isolation to Identify System Problems
Using the Scannerz Diagnostics Mode interface testing
option, isolating intermittent and erratic problems can be
greatly simplified using a technique known as path isolation.
Intermittent and erratic problems are often difficult to trace
and can cause side effects nearly, if not exactly, identical to bad
sectors or blocks on media. To qualify for this type of
evaluation, the problems detected should be
inconsistent errors during surface scan tests or irregularities
detected with durations greater than 3 seconds.
The most likely causes of intermittent errors and/or
irregularities, in order of likelihood, are the following:
From the list above, items 5 and 6 should become evident using
the memory/system bus testing option previously described, and
won't be dealt with in this section.
More obscure, but possible causes of intermittent errors and/or
irregularities are the following:
From the list above, items 1, 4, and 5 should become evident
using the memory/system bus testing option previously described.
A path is said to be isolated when inconsistent errors and/or
irregularities are isolated to a single path.
Note the following important points:
A. Many older Macs use a USB “hub” controller chip, and
if problems exist with connections between this chip and the
system’s I/O controller, it’s likely problems will show up on
all USB ports and devices connected to it. This is actually a
logic board problem. You may also encounter problems with other
devices connected to this chip, such as the keyboard or
trackpad. If possible, obtain a block diagram of your logic
board to see if it fits into this category.
B. If there’s a problem related to the internal hard
drive support circuitry, such as a cable, and the internal drive
is being used as the boot drive, errors and/or irregularities
will likely show up on all tests of all I/O ports. The best way
to verify this is to use another, external boot drive, such as a
Phoenix Boot Volume, and launch tests using the internal hard
drive as the target. If the problems end up being isolated to
the internal hard drive, then the path should be considered
isolated to that path. Internal hard drive cable problems,
especially on laptops, should be considered the most likely
cause of such problems.
C. Mac Pros, Power Macs, and some MacBook Pros and
aluminum PowerBooks use I/O cards that host several I/O ports.
If there’s a fault in the cable connecting the I/O card to the
logic board, errors and irregularities will likely show up on
tests of ports associated with these cards, but not on any other
ports in the system.
D. Power supply problems are rare on laptops, but may be
more likely in desktop units assembled and sold between 2002 and
2010. This is because of a “capacitor plague” that existed in
this time frame. The market was apparently flooded with poor
quality capacitors that would lose their filtering capacity and
allow spikes into the circuitry of a system. In some cases such
spikes or transients may cause chips on the board to reset, or
see data as invalid and enter a cycle of retries. Eventually the
capacitors will fail completely causing the unit to malfunction.
Laptops typically don’t use capacitors of this type because
they’re too large to fit on the logic board. Problems of this
nature will appear very erratic and system wide.
E. If, during the process of path isolation, all problems
are pointing at the logic board as the source of the problems,
you should not assume the logic board is bad. Poorly seated or
loose connections on the logic board may be the cause of the
problems. Common problems are loose or improperly seated RAM
modules and AirPort cards.
F. During the process of path isolation, particularly on
externally connected devices, remember that the cable itself is
a possible source of problems. For example, if you’re having USB
problems, and you test each and every USB port using a device
with a faulty USB cable, you might assume your problems fit into
those described in item A above, when in fact the cable is
causing the problem. USB ports can be tested with a USB flash
drive as well as a hard drive, although their slower response
may introduce a few more false irregularities.
G. True logic board faults are most likely to exist in
iBooks, Titanium PowerBooks, Aluminum PowerBooks, plastic
MacBooks, and MacBook Pros without machined aluminum housings.
These systems are susceptible to logic board flexing which can
in turn create cracks in logic board traces. This doesn’t mean
problems can’t occur on other systems; it’s just not as likely.
Path isolation is performed as follows:
1. A Normal Mode scan is performed on a device, and
errors and/or significant irregularities have been detected.
These types of results will not be confirmed as weak blocks or
sectors during Diagnostics Mode re-evaluation of the
Normal Mode tests. Diagnostics Mode will likely
log messages using one of the following formats:
2. Select Diagnostics Mode and a volume on the
drive where the problems were encountered. Set the increment for
testing fairly high (such as 1000). Select only the option to
perform interface testing. Click on the
"Start Diagnostics..." button.
3. If the unit is external, while the test is running, see if
moving the cable around, particularly near the junction between
the cable and its connectors, can induce faults. If the drive is
internal and you can open the system up and gain access to the drive
cables, the same can be done using a non-conductive probe. You
will be looking for Scannerz Diagnostics Mode to detect
faults, which will yield messages similar to the following:
When messages similar to those above occur, you will know that
your system is experiencing system faults. If the problems can't
be correlated specifically to a cable, the cause is possibly a cracked
trace on the logic board, on the logic board connectors, or
inside a drive housing if it's an external drive.
4. Boot from a source other than the one used in step 1, using
a completely different type of port. Pay close attention
to points A., B., and C. above and make sure that the alternate
boot source you’re using is not a shared port from the same I/O
card or ports feeding from the same interface circuit.
For example, if the original test was done using an internal
drive with a SATA interface, use another boot volume such as a
Phoenix Boot Volume using a USB interface. If the original test
was done using a USB based Phoenix Boot Volume, then boot from
the internal SATA drive you normally use to boot the system.
Launch Scannerz from that, and re-perform the Diagnostics
Mode tests as identified in step 2 above to test, at a
minimum, both the alternate boot source and the original boot
source. It’s highly recommended that you perform tests of this
nature on as many I/O ports as possible, preferably all I/O
ports. The idea is to make sure the problem is isolated to a
single data path.
5. The test results should fall into one of the following
categories:
Resolving Problems with Inconsistent Errors and/or
Irregularities
If the results indicate that the errors and/or irregularities
are present in all tests, it implies there’s something wrong
with the logic board, or something connected to the logic board.
We recommend starting with the easiest things, such as
re-seating the RAM, before moving on to more involved work.
In some rare (very rare) circumstances an
auxiliary device, such as a printer may be causing problems, and
it might be wise to see if problems go away if other devices are
removed.
If this doesn’t provide positive results, the unit will need to
be opened up, and all internal items reseated and inspected for
damage. If the unit uses an internal supply with large
capacitors, the supply should be checked for capacitor bloating
and signs of other failure. You may very well need to replace
the logic board if this type of condition exists.
If the problems are isolated to a single path, you will need to
determine the exact cause of the problems. In most cases, it
will be either a cable, connector, or poorly seated cable in the
path.
It is not uncommon for the connectors on the logic board leading
to external I/O devices, such as USB and FireWire ports, to
develop cracks at the junction between the logic board and the
connector if subjected to lateral impact or strain. The
connector plugging into these ports can effectively act almost
like a lever which can, in a sense, “amplify” the amount of
strain being placed on the actual connectors. All tests,
regardless of the device tested on this port will exhibit the
exact same intermittent behavior. The only solution in this case
is to either replace the logic board or not use the port.
Similar problems can exist on external drive enclosures, and
generally the only solution will be to replace the interface
board in the housing.
Any cables found to be defective, whether internal or external,
will need to be replaced. External cables tend to malfunction
near the connector ends. Always check the seating of cables and
in the case of external cables, check the inside of the
connectors for possible contamination by a foreign substance.
If the problems exist on a group of I/O ports, the problem may
be cable or logic board related. If you have a unit as described
in item C. above, then the most likely culprit will be the cable
connecting the I/O board to the logic board. Attempt to reseat
the cable first to see if the problems clear up. If they
persist, attempt replacing the cable with a known good one. If
this fails, then there’s irreparable damage either to the I/O
card or to the logic board path that connects to the
cable. The options will be to replace the faulty components or
simply not use the ports associated with the bad path.
If you’re using an older unit that uses a USB controller/hub
chip as described in A. above, this is logic board damage and
the only option will be to not use the ports or replace the
logic board. However, confirm that the problem exists on a host
of devices. For example, if you were to test 2 USB ports with
the same device and cable, it’s quite possible the device or
cable could be defective, thus leading you to think that all USB
ports are bad when in fact it’s the device being used in
testing.
If the problems can’t be replicated, there are several
possibilities. The first step is to ensure that the test is being
conducted properly. Scannerz requires that no
applications other than the core operating system be running. If
this condition hasn’t been met, the tests should be considered
invalid.
If the test was conducted properly, use Activity Monitor and FSE
or FSE-Lite to confirm there are no other, hidden applications
running. It may be necessary to open the log files for the
system and see if there are any tell-tale signs of
malfunctioning applications, start up items, or faulty kernel
extensions. It may be helpful to reboot the system in safe mode
to see if the problems clear up.
Finally, if there’s a problem that’s just beginning to surface,
problems may only show up once in a while. If the problem is due
to a true fault, it will eventually get worse, not better (they
never get better!) We would recommend monitoring the system and
paying attention to see if these currently rare events can be
correlated to a specific device or activity. When problems
associated with actual faults in the system are in their initial
stages of development, they may be difficult to isolate and
frustrating to deal with.
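The outcomes discussed in this section (faults on every path, on a single path, on a group of ports, or not reproducible at all) can be summarized as a small decision sketch in Python. The function name `interpret_path_results`, the path labels, and the input format are hypothetical, and the returned labels only paraphrase the cases described above.

```python
def interpret_path_results(faults_by_path):
    """Summarize interface test results gathered during path isolation.

    `faults_by_path` maps a data path label (for example "internal SATA"
    or "USB port 1") to True if faults were seen on that path.
    """
    if len(faults_by_path) < 2:
        raise ValueError("test at least two different paths")
    faulty = sorted(path for path, seen in faults_by_path.items() if seen)
    if not faulty:
        return "not replicated", faulty
    if len(faulty) == len(faults_by_path):
        # Suspect the logic board or something connected to it.
        return "all paths", faulty
    if len(faulty) == 1:
        # Suspect a cable, a connector, or cable seating on that path.
        return "single path", faulty
    # Suspect a shared I/O card, its cable, or a shared controller.
    return "group of paths", faulty


results = {"internal SATA": False, "USB port 1": True, "USB port 2": True}
print(interpret_path_results(results))
```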
Sub Isolation of a Problem Associated with a Specific Path
If the problems have been traced to a specific path, the actual
source of the problem should be isolated. In the vast majority
of cases, cables, failing connectors, or poorly seated
connectors will likely be the cause. Much of this is nothing
more than common sense and basic logic, but it may take a little
thinking to isolate the actual cause of the problems.
To isolate this problem, first visually inspect the cables and
their connectors for any signs of damage or contamination and
repair, replace, or attempt to clean as needed. If there are no
obvious visual signs of problems, attempt reseating the cables
to see if the problem clears up. If this doesn’t work, replace
the cable with one that’s known to be in good working order.
If none of these attempts clear up the problem, you need to
start to consider the possibility that the logic board, an
external enclosure (if the path is to an external drive), or
possibly the drive itself has problems. The most likely culprits
will be the connectors on the logic board, or those on the
interface connector of an internal or external hard drive. The
only way to really evaluate this is to swap the external or
internal unit with a known good, working unit. If the problems
continue, the logic board is to blame, otherwise the internal or
external drive is to blame.
If the problems are traced to the logic board, the unit can be
run from an external drive, so it isn’t necessarily the end of
the unit. Be advised that if problems exist with an external
drive, the drive inside that unit may be in perfect working
order if the problems are associated with the drive enclosure
instead of the drive itself.
Purchasing Scannerz
To purchase one of the Scannerz packages, click on
the Buy Scannerz Now button below.
Prices:
$21.95 For Scannerz Lite
$39.95 For Scannerz with FSE-Lite,
Performance Probe, and Phoenix
$49.95 For Scannerz with FSE,
Performance Probe, and Phoenix
Scannerz, Scannerz Lite, FSE, FSE-Lite, Performance
Probe 2, Phoenix, SpotOff, and Spot-O-Meter are Mac OS X
universal binaries and support both Intel and PowerPC G4 and G5
based systems using Mac OS X versions 10.5 (Leopard), 10.6 (Snow
Leopard), 10.7 (Lion), 10.8 (Mountain Lion), 10.9 (Mavericks),
and 10.10 (Yosemite). PowerPC based systems must use Mac OS X
10.5.
Supported Intel based systems include all variants of the
MacBook, MacBook Air, MacBook Pro, iMac, Mac Pro, and Mac Mini.
Supported PowerPC based systems must be running Mac OS X 10.5
(Leopard) and include the iBook, Power Mac, eMac,
iMac, Mac Mini, and PowerBook G4 Series.