LiveLab Web Usage Dataset
Author: Rice Efficient Computing Group, Rice University
Date:   April, 2012

=======================
 Copyright and License
=======================

1. We grant you a nonexclusive, nontransferable license to use the data for
   commercial, educational, and/or research purposes only. You agree to not
   redistribute the data without written permission from us. You agree to not
   attempt to reveal the identity of a participant.
2. You agree to acknowledge the source of the data, i.e., the LiveLab and Tempo
   projects, by citing the following papers in your publication or product:
   - Clayton Shepard, Ahmad Rahmati, Chad Tossell, Lin Zhong, and Phillip
     Kortum, "LiveLab: measuring wireless networks and smartphone users in the
     field", in ACM SIGMETRICS Perform. Eval. Rev., vol. 38, no. 3, December
     2010.
   - Zhen Wang, Felix Xiaozhu Lin, Lin Zhong, and Mansoor Chishtie, "How far
     can client-only solutions go for mobile browser speed?" in Int. World Wide
     Web Conf. (WWW), April 2012.
3. We provide no warranty whatsoever on any aspect of the data. Use at your own
   risk.

========
 README
========

We do not provide any support for the use of this data other than this README
file, nor will we provide further information about the data or how it was
collected beyond this README and the two papers cited in the license agreement
above.

The LiveLab web usage dataset contains 24 .csv files. Each file holds one
user's web usage over one year. Each file has 4 columns:
- uid: The user's unique ID.
- time: The Unix timestamp of the date/time the user accessed the website
  (seconds since Thu, Jan. 1, 1970 00:00:00 GMT).
- count: The number of times the user accessed this URL (further described
  below).
- hashedurl: The hashed URL (further described below).
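
As an illustration of the column layout above, here is a minimal Python sketch
for reading one file. It ASSUMES the columns appear in the order listed (uid,
time, count, hashedurl), that the file is comma-separated, and that a header
row may or may not be present; none of this is guaranteed by this README. The
function name read_usage() and the sample values are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def read_usage(fileobj):
    """Parse rows of (uid, time, count, hashedurl) from an open text file.

    Assumption: column order follows the README list. Rows whose time or
    count field is not numeric (e.g. a header row) are skipped.
    """
    records = []
    for row in csv.reader(fileobj):
        if len(row) < 4:
            continue  # blank or malformed line
        uid, time, count = (field.strip() for field in row[:3])
        # Re-join in case an unquoted URL happened to contain commas.
        hashedurl = ",".join(row[3:]).strip()
        try:
            seconds = float(time)
            visits = int(count)
        except ValueError:
            continue  # not a data row (for example, a header line)
        records.append({
            "uid": uid,
            # Unix seconds -> timezone-aware UTC datetime
            "time": datetime.fromtimestamp(seconds, tz=timezone.utc),
            "count": visits,
            "hashedurl": hashedurl,
        })
    return records


# Hypothetical two-line sample standing in for a real file.
sample = io.StringIO(
    "uid,time,count,hashedurl\n"
    "A01,1000,4,http://www.hash3.com/hash4\n"
)
records = read_usage(sample)
```

With a real file, the same function could be called on the object returned by
open(path, newline=""), where path is whichever .csv file you are reading.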


Understanding the "count":
The web history was collected nightly from Safari's history file on the iPhone
3GS. That file stores only the timestamp of the most recent visit to a specific
URL, but increments the "count" each time the URL is visited. Notably, this
count is reset automatically for older entries, and can also be reset by the
user. Because of this, our logger may capture the same entry multiple times
between visits, or capture a lower count after a reset.

Therefore, a heuristic is needed to determine the correct visit count. For
example, each of the following sequences indicates 3 total visits to
google.com/sub1/ AFTER TIME = 1000:

Time, #count, URL
1000, 4, google.com/sub1/
2000, 1, google.com/sub1/
4000, 3, google.com/sub1/

Time, #count, URL
1000, 4, google.com/sub1/
2000, 1, google.com/sub1/
3000, 2, google.com/sub1/
4000, 3, google.com/sub1/

Time, #count, URL
1000, 4, google.com/sub1/
4000, 3, google.com/sub1/

Time, #count, URL
1000, 4, google.com/sub1/
2000, 1, google.com/sub1/
3000, 2, google.com/sub1/
4000, 1, google.com/sub1/
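
One possible heuristic that reproduces all four examples above is sketched
below in Python. It is an assumption, not necessarily the method used in the
papers: when the count grows, the difference is added; when the timestamp
advances but the count does not grow, a reset is assumed and the new count is
added as is; rows with an unchanged timestamp are treated as re-captures that
add no visits. The function name visits_after_baseline() is hypothetical.

```python
def visits_after_baseline(rows):
    """Estimate visits to one URL after its earliest logged entry.

    rows: iterable of (time, count) pairs for a single hashed URL.
    The earliest row is the baseline and contributes no visits itself.
    """
    ordered = sorted(rows, key=lambda r: r[0])  # stable sort by timestamp
    if not ordered:
        return 0
    prev_time, prev_count = ordered[0]
    total = 0
    for time, count in ordered[1:]:
        if time == prev_time:
            # Same last-visit timestamp: a re-capture, so no new visit.
            # Keep the latest count in case it was reset in between.
            prev_count = count
            continue
        if count > prev_count:
            total += count - prev_count  # counter kept growing
        else:
            total += count  # assumed reset: counter restarted from zero
        prev_time, prev_count = time, count
    return total


# The four example sequences from this README, as (time, count) pairs.
example_1 = [(1000, 4), (2000, 1), (4000, 3)]
example_2 = [(1000, 4), (2000, 1), (3000, 2), (4000, 3)]
example_3 = [(1000, 4), (4000, 3)]
example_4 = [(1000, 4), (2000, 1), (3000, 2), (4000, 1)]
```

Note that a reset followed by enough visits to exceed the previous count cannot
be distinguished from ordinary growth, so this sketch gives a lower bound in
that case.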


Understanding the "hashedurl":
The URL is hashed to protect participants' privacy. A URL can be parsed into
six fields: scheme, netloc, path, params, query, and fragment. The field
"netloc" has up to four sub-fields: username, password, host, and port. (ref:
urlparse and URI scheme) Different rules are applied to different fields:

- scheme: not hashed
- netloc:
  - username: one hash value
  - password: one hash value
  - host: one hash value per subdomain
  - port: not hashed
- path: one hash value per sub-path
- params: one hash value
- query: one hash value
- fragment: one hash value

Some general domain information for the field "host" is not hashed so that a
user's browsing behavior can be better understood from the data. In particular,
the following subdomains are not hashed:
- top-level domains (ref: List of Internet top-level domains)
- second-level domains if they are one of these: "ac", "co", "com", "net",
  "org", "edu", "gov", "sch", "gc"
- bottom-level domains if they are one of these: "www", "login", "m", "mobile",
  "iphone", "touch"

Here is an example of how a URL is hashed, assuming hash() is the hash
function. In this example, hash() returns "hash#", where "#" is a digit from 1
to 9.

url:
http://username:password@www.example.com:80/over/there/index.dtb;parameters?type=animal&name=narwhal#nose

hashed url:
http://hash(username):hash(password)@www.hash(example).com:80/hash(over)/hash(there)/hash(index.dtb);hash(parameters)?hash(type=animal&name=narwhal)#hash(nose)

It looks like this in the .csv file:
http://hash1:hash2@www.hash3.com:80/hash4/hash5/hash6;hash7?hash8#hash9
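
The rules above can be sketched in Python with urllib.parse.urlparse. This is
an illustration only, not the code used to build the dataset. The real hash
function is not disclosed, so make_counter_hash() is a hypothetical stand-in
that numbers distinct inputs as hash1, hash2, and so on. The sketch also
ASSUMES that the top-level domain is the last label of the host and that a
"bottom-level" domain is its first label; it does not consult the full list
of top-level domains and does not treat IP-address hosts specially.

```python
from urllib.parse import urlparse

# Labels left unhashed, copied from the lists in this README.
SECOND_LEVEL = {"ac", "co", "com", "net", "org", "edu", "gov", "sch", "gc"}
BOTTOM_LEVEL = {"www", "login", "m", "mobile", "iphone", "touch"}


def make_counter_hash():
    """Return a stand-in hash(): the n-th distinct input maps to 'hash<n>'."""
    seen = {}

    def fake_hash(text):
        if text not in seen:
            seen[text] = "hash%d" % (len(seen) + 1)
        return seen[text]

    return fake_hash


def hash_host(host, hash_fn):
    """Hash each host label except the ones the README leaves readable."""
    labels = host.split(".")
    last = len(labels) - 1
    out = []
    for i, label in enumerate(labels):
        if i == last:
            out.append(label)  # top-level domain (assumed: last label)
        elif i == last - 1 and label in SECOND_LEVEL:
            out.append(label)  # listed second-level domain
        elif i == 0 and label in BOTTOM_LEVEL:
            out.append(label)  # listed bottom-level domain (assumed: first)
        else:
            out.append(hash_fn(label))
    return ".".join(out)


def hash_url(url, hash_fn):
    """Apply the per-field rules: scheme and port stay, the rest is hashed."""
    p = urlparse(url)
    netloc = ""
    if p.username is not None:
        netloc += hash_fn(p.username)
        if p.password is not None:
            netloc += ":" + hash_fn(p.password)
        netloc += "@"
    netloc += hash_host(p.hostname or "", hash_fn)
    if p.port is not None:
        netloc += ":%d" % p.port
    # One hash per sub-path; empty segments keep the slashes in place.
    path = "/".join(hash_fn(seg) if seg else "" for seg in p.path.split("/"))
    out = p.scheme + "://" + netloc + path
    if p.params:
        out += ";" + hash_fn(p.params)
    if p.query:
        out += "?" + hash_fn(p.query)
    if p.fragment:
        out += "#" + hash_fn(p.fragment)
    return out


url = ("http://username:password@www.example.com:80/over/there/index.dtb"
       ";parameters?type=animal&name=narwhal#nose")
hashed = hash_url(url, make_counter_hash())
```

For the example URL, hashed equals the .csv form shown above. Because scheme,
port, and the listed domain labels stay readable, urlparse can also be applied
directly to a hashedurl value, for instance to group entries by hashed host.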
