| ■■■Gnutella: To the Bandwidth Barrier and Beyond
November 6, 2000■■■
|
■Overview■
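In this report, Clip2 DSS describes the evolution and current state of Gnutella, a peer-to-peer file-sharing network, based on a large body of data gathered over more than five months. Significantly, we find that since growth pushed the network's traffic regularly past the bandwidth of dial-up modems in August 2000, the network has neither scaled smoothly nor collapsed catastrophically. Instead, it has survived in a fragmented state made up of numerous continuously evolving responsive segments, the largest of which typically contains several hundred hosts. We estimate that the number of unique Gnutella users per day is currently at least 10,000 and may be as high as 30,000. We suggest that for the Gnutella network to grow beyond its present state, further technical innovation and widespread adoption of that innovation are required.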
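The fraction of hosts offering files for download has swung between 15% and 40%, with a substantial slide in September followed by a rebound in October (Translator's note: the September slide coincided with the period when Napster users poured into Gnutella in the wake of the Napster lawsuit). Most hosts are online for minutes to hours; only a minority (30% in early August) stay connected for a day or longer. Users mainly query the network for audio, video, image, and program files, and automated indexing schemes regularly account for a few percent of query traffic.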
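In a large-scale survey, one in three network hosts was found outside the US-centric top-level domains, with Germany and Japan dominating the non-US population. At the second-level-domain level, Internet service providers were the leading sources of hosts in the COM and NET domains, with the @Home and Road Runner broadband services especially prominent. Led by the Massachusetts Institute of Technology and Virginia Tech, Gnutella hosts were found at over 550 institutions in the EDU domain.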
|
|
|
|
■Contents
by Frequently Asked Question■
|
|
■Introduction■
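Gnutella's history can be divided into two eras, "pre-barrier" and "post-barrier", where the terms refer to traffic levels on the public Gnutella network relative to the bandwidth of dial-up modems, and the dividing line between the eras falls in August 2000. In this article, Clip2 DSS presents previously unpublished data and provides additional information on the state of the pre-barrier network. We describe the barrier transition, which we first reported on September 8 (http://dss.clip2.com/dss_barrier.html). In addition, we shed light on the poorly understood state of the post-barrier network and survey the locations of network hosts. Finally, we draw an analogy to depict the current state of the network and to outline future scenarios for its evolution.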
|
|
|
|
■Pre-Barrier Gnutella■
Clip2 DSS began conducting systematic studies of the Gnutella network in June 2000. Most Gnutella servents provide a "host count" feature in which they report to the user
a dynamically updated count of the hosts they have identified
on the network. During the period from June through mid-August,
servents connected to the public Gnutella network for at least
a matter of minutes typically reported host counts on the order
of 1,000 to 4,000 hosts. Clip2 DSS's Gnutella network crawler
regularly found 1,000 to 8,000 hosts online during the course
of a sub-1-hour full-network traversal. Were these hosts connected
in a tight concentration or in a loosely knit and far-flung
configuration? How many connections did a typical host have?
We were able to answer these questions using the network graphs
generated by our crawler.
The "concentration" of
the network is a particularly interesting question because Gnutella
servents typically issue queries with a "TTL" of 7, meaning
a user's query will travel up to 7 hosts away on the Gnutella
network from the originating computer. There is always a (not
necessarily unique) shortest path between any two hosts on the
network, and the longest such path is known in mathematical
graph theory as the "diameter" of the network. Clearly, if the
diameter were greater than (7*2+1)=15, there would be hosts
from which a query could be launched and not reach the entire
network before expiring. We saw such high-diameter networks
in early July. Below, we show data for a crawl of 1,959 hosts
on July 7. Here, we plot on a semi-log scale the number of pairs
of hosts that had a given shortest-path distance (or "separation")
between them. The diameter of the network discovered by this
crawl was 22, indicating some regions were not in communication
with others (assuming messages used the common TTL value). We
also note that most pairs of hosts were separated by 7 hops.
In the latter days of the pre-barrier period, the network diameter
fell to smaller values, typically 8 or 9. Below, we show network
diameters for crawls made between July 25 and August 18. Note
that while the period of larger diameters around July 28 corresponds
to the "Napster
Flood", a larger diameter network is not necessarily a direct
result of more hosts connecting to the network. A network with
few hosts can have a large diameter and vice versa; the diameter
is purely a function of how the hosts are connected, not the
number of hosts. The smaller diameters are significant in that
they imply any host on the network could easily reach all other
hosts using TTL=7.
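The separation and diameter measurements described above can be illustrated with a breadth-first search over a crawl graph. The following is a minimal sketch, not the Clip2 DSS crawler's actual analysis code: it assumes the crawl is available as an undirected adjacency map, and the function names and the 10-host chain are hypothetical.

```python
from collections import Counter, deque


def bfs_distances(graph, source):
    """Return hop counts from source to every host reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def separation_histogram(graph):
    """Count unordered host pairs at each shortest-path separation."""
    hist = Counter()
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if source < target:  # count each unordered pair once
                hist[d] += 1
    return hist


def diameter(graph):
    """Longest shortest path among connected pairs of hosts."""
    return max(separation_histogram(graph))


def reach_fraction(graph, source, ttl=7):
    """Fraction of the other hosts that lie within ttl hops of source."""
    dist = bfs_distances(graph, source)
    within = sum(1 for d in dist.values() if 0 < d <= ttl)
    return within / (len(graph) - 1)


# Hypothetical 10-host chain: h0 - h1 - ... - h9
chain = {f"h{i}": set() for i in range(10)}
for i in range(9):
    chain[f"h{i}"].add(f"h{i + 1}")
    chain[f"h{i + 1}"].add(f"h{i}")

print(diameter(chain))               # diameter of the chain
print(reach_fraction(chain, "h0"))   # a host at the end of the chain
print(reach_fraction(chain, "h4"))   # a host near the middle
```

In this toy chain the diameter is 9, so a TTL=7 query from the middle covers every host while one from an end does not. Reach therefore depends on where a host sits as well as on the diameter.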
Further examining host connectivity,
we found a host was most likely to have a single connection,
and hosts with higher numbers of connections were increasingly
uncommon. In more technical terms, we saw roughly power-law
degree distributions, where "power-law" means the number of
hosts having a given degree varied as a power of the degree,
and "degree" is shorthand for the total number of incoming and
outgoing connections a given host has open. In addition to host
separations, the degree distribution is another means of quantitatively
assessing network connectivity, and we present below the degree
distribution based on a 1,813-host crawl made on July 7. One
particularly interesting coincidence is that other researchers
(e.g., Broder
et al. 1999) have found power-law degree distributions for
graphs in which the nodes are Web pages and the connections
are Web links. In the plot, we compare measured data with a
least-squares best fit of slope (power-law index) = -2.3.
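A power-law index of this kind can be estimated by fitting a straight line to the degree histogram in log-log space. The sketch below is one common way to do that and is not necessarily the procedure used for the plot; the function names and the synthetic histogram are assumptions for illustration.

```python
import math
from collections import Counter


def degree_histogram(graph):
    """Map degree -> number of hosts having that many connections."""
    return Counter(len(neighbors) for neighbors in graph.values())


def power_law_index(hist):
    """Least-squares slope of log10(count) against log10(degree)."""
    points = [
        (math.log10(k), math.log10(n))
        for k, n in hist.items()
        if k > 0 and n > 0
    ]
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic histogram that follows count = 1000 * degree**-2.3 exactly
synthetic = {k: 1000 * k ** -2.3 for k in range(1, 21)}
slope = power_law_index(synthetic)
print(round(slope, 3))
```

On real crawl data the high-degree tail is sparse and noisy, so the fitted slope is sensitive to which degrees are included in the fit.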
As the
above data show, the structure of the Gnutella network is in
a continuous state of flux. Hosts come and go as quickly as
users open and close Gnutella applications, and connections
between hosts may only last for seconds. We found that half
of an initial host population persisted after five hours, and
that approximately 30% of the initial host population was stable
on the timescale of 24 hours. We determined this result by comparing
host populations found in a succession of crawls to an initial
reference crawl and by repeating this analysis for multiple
reference crawls; the plot below illustrates our findings.
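The persistence analysis amounts to intersecting each later crawl's host set with the reference crawl's host set. A minimal sketch follows, assuming each crawl is reduced to a set of host addresses; the function name and the addresses are made up for illustration.

```python
def persistence(reference, later_crawls):
    """Fraction of the reference hosts still present in each later crawl.

    later_crawls is a list of (hours_after_reference, hosts) pairs.
    """
    reference = set(reference)
    return [
        (hours, len(reference & set(hosts)) / len(reference))
        for hours, hosts in later_crawls
    ]


# Hypothetical crawl results
reference = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
later = [
    (5, {"10.0.0.1", "10.0.0.2", "10.0.0.9"}),
    (24, {"10.0.0.1", "10.0.0.8"}),
]
curve = persistence(reference, later)
print(curve)
```

Repeating this for several reference crawls and averaging the curves smooths out time-of-day effects in any single reference.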
|
|
■Gnutella Hits the Wall■
As we first reported in "Bandwidth
Barriers to Gnutella Network Scalability" (September 8),
average Gnutella network traffic began to regularly exceed
the throughput capacity of dial-up modems in August. Hosts
connected to the Internet via dial-up modems ceased to be
able to effectively participate as peers on the Gnutella network.
These hosts essentially became dead-ends, resulting in a widespread
fragmentation of the Gnutella network into effectively disconnected
components comprised of hosts with higher-speed Internet connections.
The animation below illustrates the evolution of the network
as traffic passed the dial-up barrier.

Data gathered by Clip2 DSS clearly
illustrate the effective loss of dial-up hosts from the responsive
portion of the Gnutella network. In one long-term experiment,
we regularly visited hosts and issued probe-like Gnutella
"ping" messages of TTL=2 to discover their neighbors. As shown
below, unique hosts sending Gnutella "pong" messages in response
to these small-TTL pings declined substantially and permanently
in mid-to-late August.
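To see why a TTL=2 ping reveals only a host and its immediate neighbors, it helps to look at how the TTL is consumed as a message is relayed. The sketch below assumes the 23-byte descriptor header layout of the Gnutella v0.4 protocol as we understand it (16-byte ID, payload type, TTL, hops, little-endian payload length); the helper names are hypothetical and no network traffic is sent.

```python
import os
import struct

# Assumed v0.4 header layout: descriptor ID, payload type, TTL, hops, length
HEADER = struct.Struct("<16sBBBI")
PING, PONG = 0x00, 0x01


def make_ping(ttl=2):
    """Build a ping header with a random descriptor ID and no payload."""
    return HEADER.pack(os.urandom(16), PING, ttl, 0, 0)


def forward(header_bytes):
    """Return the header as a host would relay it, or None if it expires."""
    guid, kind, ttl, hops, length = HEADER.unpack(header_bytes)
    if ttl <= 1:
        return None  # TTL would reach 0, so the message is not relayed
    return HEADER.pack(guid, kind, ttl - 1, hops + 1, length)


ping = make_ping(ttl=2)
relayed = forward(ping)        # what the probed host passes to its neighbors
expired = forward(relayed)     # neighbors do not relay any further
print(len(ping), HEADER.unpack(relayed)[2:4], expired)
```

The probed host and each of its neighbors answer with a pong, but the ping dies at the neighbors, which keeps the probe's traffic cost small.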
As noted earlier, most Gnutella servents display
a host count based on the number of pongs received since the
application began running. As the barrier was passed, numerous
postings
on public user forums noted servents were displaying lower-than-usual
host counts, often only in the tens or hundreds. Many users
interpreted the decrease in host counts as meaning the total
number of hosts on the network had decreased. However, as
the above data show, the change in network responsiveness
was rather abrupt. Such an abrupt transition is much more
plausibly explained by the reaching of a technical barrier
than a mass change in user behavior. Had users departed the
network, we would have expected to see a decrease in the usage
rate of the Clip2 DSS host list service; instead, we saw an
unabated rise in usage. In addition, even though responses
in the ping experiment had dropped, this experiment continued
to find ten to twenty times the number of hosts that would
be reported in a servent's host counter over a similar time
interval. In sum, we found no evidence supporting a user exodus
and multiple indications that the host population remained
sizable but fragmented. The total number of hosts online can
be substantially larger than the number reported in a servent's
host counter, since a servent only sees out as far as the
boundary of the responsive region to which the servent is
connected.
The preceding discussion raises
the question of the source of the traffic that caused the
network to reach the dial-up modem bandwidth barrier. Was
it the result of continued growth in the number of Gnutella
users, or was it the result of the introduction of programmatic
sources of traffic, such as machine-generated spam? The question
is difficult to answer due to a lack of comprehensive data.
Clip2 DSS reported
on one anomalous form of traffic seen on the network in early
September that may have existed for some time prior. Since
that report, we have regularly observed other forms of apparently
automated messages on the network, including repeated series
such as {a.mp3, b.mp3, c.mp3, ...} that appear to be network-indexing
attempts. While this is an unresolved question, it is moot
in a sense, because with continued growth in the user base,
user-generated network traffic would have eventually reached
the barrier level of its own accord.
|
|
■Gnutella Beyond the Barrier■
What is the state of the post-barrier
Gnutella network? Among other sources, we can find some clues
in data from the Clip2 DSS-operated "gnutellahosts.com"
host list service, which publishes IP addresses of live Gnutella
hosts. Approximately 10% of users access this list by visiting
the Clip2 DSS home page; the remainder retrieve addresses
by connecting their servents to a special-purpose Gnutella
server operated by Clip2 DSS at gnutellahosts.com, port 6346.
After responding to the incoming Gnutella servent with multiple
IP addresses, the gnutellahosts.com server disconnects, and
the servent proceeds to connect directly to the hosts at the
provided addresses. As noted in the previous section, we have
not observed any decrease in the traffic to gnutellahosts.com
due to Gnutella having hit the barrier. On the contrary, traffic
has continued to grow in the post-barrier period. Below, we
show the long-term (3-month) straight-line trend in gnutellahosts.com
usage.
How
many users are there on the post-barrier network? How are they
connected? From the number of callers to gnutellahosts.com and
our probing experiments, we found no evidence of a sudden population
collapse as the barrier was passed. Instead, we found evidence
that the population had fragmented into multiple dynamically
changing responsive and unresponsive segments. The sum of all
data sources leads us to estimate that the total number of daily
users of Gnutella numbers between 10,000 and 30,000, where the
lower bound is a much better approximation than the upper bound.
Below, we show the numbers of hosts in the largest responsive
network segments we were able to identify over a range of post-barrier
dates.
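Fragmentation into responsive segments can be modeled as finding connected components after removing the hosts that no longer relay traffic. This is an illustrative sketch under that simplifying assumption, not the method Clip2 DSS used to identify segments; the host names and helper functions are hypothetical.

```python
def build_graph(links):
    """Build an undirected adjacency map from (host, host) pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def without_hosts(graph, dead):
    """Drop unresponsive hosts (for example, saturated dial-up peers)."""
    return {h: nbrs - dead for h, nbrs in graph.items() if h not in dead}


def segments(graph):
    """Connected components of the host graph, largest first."""
    seen = set()
    found = []
    for start in graph:
        if start in seen:
            continue
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        seen |= component
        found.append(component)
    return sorted(found, key=len, reverse=True)


# Hypothetical network in which host "m" is on a saturated dial-up link
network = build_graph([("a", "b"), ("b", "c"), ("c", "m"), ("m", "d"), ("d", "e")])
before = [len(s) for s in segments(network)]
after = [len(s) for s in segments(without_hosts(network, {"m"}))]
print(before, after)
```

A servent sitting in the smaller segment would report a host count bounded by that segment's size, even though the total population is unchanged.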

Since the fragmentation occurred,
it has been a matter of chance whether or not a user manages
to find and remain connected to a responsive segment. Typically,
the gnutellahosts.com host list service has provided addresses
of hosts in the largest identifiable responsive segment, although
this region is a moving target. In order to track it, Clip2
DSS has refined its crawling strategy in recent weeks and
regularly crawls the network on the timescale of every 15
minutes.
On the post-barrier network, dial-up modem
users cannot effectively participate as peers throughout the
network. What can be done to alleviate this situation? One
solution is to connect these users to high-speed proxies that
handle network traffic on their behalf. This is an underlying
concept of the Clip2 Reflector(TM),
a special-purpose Gnutella server, and the network architecture
that results is illustrated below:
Reflectors are programmed to maintain
connectivity to the most responsive segment of the network
by calling gnutellahosts.com (by default). Dial-up users singly
connect to Reflectors that in turn maintain multiple outgoing
network connections on their behalf. A list of running public-access
Reflectors can be found on the Clip2 DSS home page.
How different is user behavior in the post-barrier
period relative to the pre-barrier period? One measure of
behavior is the fraction of hosts serving a non-zero number
of files. In a 24-hour period in early August, during the
late pre-barrier era, researchers at Xerox PARC found 30%
of hosts made available one or more files for download (Adar
& Huberman 2000). Using a different methodology, Clip2
DSS independently measured the "serving fraction" before,
during, and since the period of the PARC study. Our results
confirm theirs during the period of their study. In the final
days of the pre-barrier era, we observed an increase in the
serving fraction to a maximum in excess of 40%. However, as
the network evolved into the post-barrier period, we saw a
substantial decrease in the serving fraction, down to a low
of less than 15%. Notably, since early October we have observed
a general rise in the serving fraction back to near pre-barrier
levels. Below, we plot the serving fraction over time.
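For concreteness, the serving fraction is the share of observed hosts that advertise at least one file. The sketch below assumes a per-host count of shared files is already available from some measurement; how Clip2 DSS or PARC obtained those counts is not specified here, and the sample data are invented.

```python
def serving_fraction(files_shared_by_host):
    """Share of hosts advertising one or more files for download."""
    hosts = len(files_shared_by_host)
    serving = sum(1 for count in files_shared_by_host.values() if count > 0)
    return serving / hosts


# Hypothetical sample: 10 hosts, 3 of which share files
sample = {f"host{i}": 0 for i in range(10)}
sample.update({"host0": 120, "host3": 4, "host7": 1})
print(serving_fraction(sample))
```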
What are users searching for in the post-barrier
period? Clip2 DSS analyzed three query stream samples of varying
sizes taken on three different dates. Applying a subjective
analysis to categorize 2,000 queries heard on September 19,
we found the following breakdown:
Notes on these categories:
The "gibberish" category includes queries consisting of non-alphanumeric
characters; among other sources, we have seen such queries
generated by poorly programmed clients that do not properly
read and forward Gnutella query messages. The "automated indexing"
category includes queries that appeared to be programmatically
generated with the intent of indexing network content, such
as "a.mp3", "b.mp3", etc. "File extension only" queries are
just that, containing extensions such as "mp3" or "mpg" (possibly
with accompanying periods, asterisks, or both) but no other
content. "Song and artist+song" counts queries either containing
only a song title or a song title along with an artist name.
"Artist" counts queries containing just an artist name.
By objectively analyzing only those
queries containing a file extension anywhere in the query
string, we are able to analyze much larger data sets. We found
from 20% to 40% of queries in three separate stream samples
of 2,000, 30,000, and 150,000 queries contained a popular
file extension somewhere in the query string. Below, among
the set of queries that contained an extension, we show the
frequencies of specific types of extensions.
Note the programmatically generated queries of the form "a.mp3",
"b.mp3", etc. were a recurring feature in every sample, and
the samples were spread over a 21-day period. These queries
likely originated from a single source and amounted to a few
percent of total network query traffic.
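The objective part of this analysis, tallying file extensions and flagging single-character indexing queries, lends itself to a short script. The following is a sketch only: the extension list, the regular expressions, and the sample queries are assumptions for illustration and do not reproduce the actual classification rules.

```python
import re
from collections import Counter

# Illustrative list; the set of "popular" extensions is an assumption
EXTENSIONS = ("mp3", "mpg", "avi", "jpg", "exe", "zip")
EXT_RE = re.compile(r"\b(" + "|".join(EXTENSIONS) + r")\b", re.IGNORECASE)
# Matches queries such as "a.mp3" or "b.mp3"
INDEXING_RE = re.compile(r"^[a-z0-9]\.mp3$", re.IGNORECASE)


def extension_counts(queries):
    """Count the first recognized extension in each query string."""
    counts = Counter()
    for query in queries:
        match = EXT_RE.search(query)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def looks_automated(query):
    """True for single-character indexing queries of the form 'a.mp3'."""
    return bool(INDEXING_RE.match(query.strip()))


# Hypothetical query stream sample
stream = ["a.mp3", "b.mp3", "madonna music mp3", "*.mpg", "holiday photos", "setup.EXE"]
counts = extension_counts(stream)
with_extension = sum(counts.values()) / len(stream)
automated = [q for q in stream if looks_automated(q)]
print(counts, with_extension, automated)
```

Because this pass needs no human judgment, it scales to the larger samples in a way the subjective categorization does not.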
|
|
■Global Gnutella■
We complete our survey of the Gnutella
network by examining the access points of Gnutella hosts.
Where are Gnutella hosts? To arrive at an answer, we utilized
3.3 million non-unique IP addresses gathered continuously
by Clip2 DSS between July 27 and November 3, spanning both
Gnutella eras. Of these addresses, 1.3 million (39%) were
resolvable to non-numeric hostnames, and our reports below
are on this resolvable subset. The populations we report below
should be interpreted in probabilistic terms; the data have
not been de-duplicated, so that the relative populations represent
relative probabilities of host discovery within a given domain.
We divided top-level domains into US-Centric
(COM, NET, EDU, MIL, GOV, US) and Non-US-Centric (all others)
categories, although we note a number of non-US-based organizations
operate domains in the US-Centric set. Our first finding is
that Gnutella is a truly international phenomenon, with one
out of three hosts located on a non-US-centric domain.
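The US-Centric versus Non-US-Centric split can be computed directly from resolved hostnames. This is a minimal sketch that uses the top-level domain set listed above; the hostnames are invented, and since duplicates are deliberately kept, the result is a relative probability rather than a count of unique hosts.

```python
from collections import Counter

# Top-level domains treated as US-Centric, as listed in the text
US_CENTRIC = {"com", "net", "edu", "mil", "gov", "us"}


def top_level(hostname):
    """Return the lowercase top-level domain of a resolved hostname."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def non_us_share(hostnames):
    """Share of hostname observations outside the US-Centric set."""
    tlds = Counter(top_level(h) for h in hostnames)
    total = sum(tlds.values())
    outside = sum(n for tld, n in tlds.items() if tld not in US_CENTRIC)
    return outside / total


# Hypothetical resolved hostnames (duplicates would be kept, not removed)
resolved = [
    "a.example.com",
    "b.example.com",
    "c.example.net",
    "d.example.edu",
    "e.example.de",
    "f.example.jp",
]
share = non_us_share(resolved)
print(share)
```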
Among Non-US-Centric domains, 95% of hosts
were found in just 17 country-specific top-level domains.
European domains comprised 67% and the Asia-Pacific domains
comprised 20%.
Among US-Centric hosts, COM, NET, and EDU
domains predictably dominated, although non-zero numbers of
hosts were found on each of ORG, US, GOV, and MIL domains
(in ratios 19:8:2:1, respectively).

In the case of the COM, NET, and EDU domains,
we dug deeper to examine populations at the level of second-level
domain names. The popular second-level COM domains were primarily
broadband Internet service providers. The ISP @Home accounted
for half of all Gnutella hosts in the COM domain, and Road
Runner trailed in second place at nearly one quarter of hosts.
Gnutella hosts with resolvable host names in the COM domain
were therefore strongly concentrated in a small number of
second-level domains, with 87% of hosts residing in the top
25. Among the top 25 were six second-level domains either
representing non-ISP companies or organizations whose nature
we could not determine. In total, Gnutella hosts were found
on 1750 unique second-level domains within the COM domain.
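Concentration figures such as "87% of hosts residing in the top 25" can be derived by grouping hostnames by second-level domain and summing the largest groups. The sketch below assumes hostnames with a simple two-label registrable domain; the domain names are hypothetical placeholders, not the ISPs in the survey.

```python
from collections import Counter


def second_level(hostname):
    """Return the last two labels, e.g. 'isp-a.com' for 'x.y.isp-a.com'."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def top_n_share(hostnames, n):
    """Share of hostname observations falling in the n largest domains."""
    counts = Counter(second_level(h) for h in hostnames)
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(n)) / total


# Hypothetical COM hostnames: 4 on isp-a, 2 on isp-b, 1 each on two others
com_hosts = (
    [f"cust{i}.region.isp-a.com" for i in range(4)]
    + [f"dsl{i}.isp-b.com" for i in range(2)]
    + ["www.corp-c.com", "gw.corp-d.com"]
)
print(top_n_share(com_hosts, 2))
```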
The popular second-level NET domains were
exclusively Internet service providers, both broadband and
dial-up. The distribution among NET domains was less concentrated
than among COM domains, with the leader, Road Runner, claiming
less than 10% of Gnutella hosts on the NET domain. Note the
second-place second-level NET domain was a German company,
illustrating that while the NET domain is US-centric, it is
not US-exclusive. Over 2000 unique second-level domain names
were represented in the Gnutella host population.
The distribution of hosts among EDU domains
was even less concentrated than among NET domains. The Massachusetts
Institute of Technology led the pack at 3.8% by a sizable
margin. Virginia Tech made a strong showing at slightly less
than 3%, and the distribution declined smoothly from 3rd-place
University of Southern California through the remainder of
the list. In all, Gnutella hosts were discovered on over 550
second-level domains within the EDU domain.
In summary, major conclusions
are that (1) the Gnutella network is an international phenomenon
led by the US, Germany, and Japan; (2) substantial populations
of hosts in COM and NET domains are on ISP second-level domains;
and (3) hosts are widely distributed among EDU domains.
|
|
■Gnutella Tomorrow■
The Gnutella
network is analogous to a continuous global rave: an informal,
decentralized, unregulated gathering without a permanent location.
Network hosts, like rave attendees, come and go unpredictably,
connecting and disconnecting as fast as ravers switch dance
partners. In the pre-barrier era, the Gnutella rave could
accommodate all comers. The effect of the network hitting
the dial-up modem barrier is analogous to a rave venue reaching
capacity, with many would-be revelers being crammed shoulder-to-shoulder
and left unable to dance, and still more spilling out the
doors. Small regions within the crowd, corresponding to responsive
segments of the Gnutella network, remain sufficiently open
to enable movement. These pockets form and vanish, grow and
shrink, and merge and split on a variety of timescales. In
the post-barrier era, the Gnutella rave remains more popular
than the typical raver, pressed in among the crowd, might
realize. The crowding results in the potential of the gathering
being released in isolated bursts rather than in a continuous
widespread discharge. While Gnutella has not scaled, we call
attention to the fact that it has also not collapsed. Like
many simple decentralized systems, it has remained remarkably
robust in the face of technical adversity. In the present
post-barrier period, Gnutella exists in an intermediate state
between scaling and collapsing.
In the opinion of Clip2 DSS, this
situation is likely to persist until either (1) the user
population collapses and traffic falls to pre-barrier levels
or (2) an "organizing principle" for connectivity that enables
scaling takes root. In the former case, the improved performance
that would result could potentially drive alienated users
to return to the network, driving resurgence in traffic, another
barrier crossing, and repetition of the entire cycle. In the
latter case, examples of organizing connectivity principles
include (1) dial-up users regularly connecting to broadband
Reflectors
and (2) widespread adoption of servents with sophisticated
and consistently implemented connection management rules.
However, to be widely and rapidly successful, any organizing
principle must require no user action or change in behavior
that is not immediately and powerfully rewarded, and it must
not involve a change to the protocol that breaks the considerable
installed application base. In the post-barrier era, there
have been various initiatives to create new and smaller networks
- spin-off raves - enabled by user adjustment of the "Gnutella
handshake" mechanism in servents that support this feature.
However, because the underlying technology is no different,
if traffic on these networks were to grow sufficiently large,
they would be subject to bandwidth barriers as well. These
attempts to re-create pre-barrier conditions do so at the
cost of sacrificing the relatively large user base on the
main network and do not directly address the problem. To move
beyond its present state, Gnutella awaits widely adopted technical
innovation.
© 2000 Ian Hall-Beyer. All Rights Reserved.
<manuka@nerdherd.net>
|
|
|