pubmed.ncbi.nlm.nih.gov

Search strategies of Wikipedia readers - PubMed

  • Sun Jan 01 2017

Search strategies of Wikipedia readers

Giovanna Chiara Rodi et al. PLoS One. 2017.

Abstract

The quest for information is one of the most common activities of human beings. Despite the impressive progress of search engines, finding the needed piece of information can still be very tough, as can acquiring specific competences and knowledge by shaping and following the proper learning paths. Indeed, the need to find sensible paths in information networks is one of the biggest challenges of our societies and, to effectively address it, it is important to investigate the strategies adopted by human users to cope with the cognitive bottleneck of finding their way in a growing sea of information. Here we focus on the case of Wikipedia and investigate a recently released dataset about users' clicks on the English Wikipedia, namely the English Wikipedia Clickstream. We perform a semantically charged analysis to uncover the general patterns followed by information seekers in the multi-dimensional space of Wikipedia topics/categories. We discover the existence of well-defined strategies in which users tend to start from very general, i.e., semantically broad, pages and progressively narrow down the scope of their navigation, while keeping a growing semantic coherence. This is unlike strategies associated with tasks with predefined search goals, namely the case of the Wikispeedia game. In this case users first move from the 'particular' to the 'universal' before focusing down again on the required target. The clear picture offered here represents a very important stepping stone towards a better design of information networks and recommendation strategies, as well as the construction of radically new learning paths.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Datasets under consideration.

In (A) we illustrate the English Wikipedia Clickstream dataset. The 9 different external sources plus the MainPage are illustrated with the fraction of flux outgoing from them. The paths we considered in our analysis start from one of the 9 sources and proceed as random walks over the Wikipedia articles, according to the transition counts provided by the dataset. (B) Two examples of paths followed by players of the Wikispeedia game, whose task was to navigate on a reduced version of Wikipedia from a given starting page to a given target page (from House to Electric_Field in the example).
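
As a rough illustration of the path generation described in (A), the sketch below draws a random walk in which each next page is chosen with probability proportional to its transition count. The nested-dictionary layout, the function name `simulate_path`, the toy counts, and both stopping rules (dead end or maximum length) are assumptions made for this example; the caption does not specify them.

```python
import random


def simulate_path(counts, source, max_len, rng):
    """Random walk starting at `source`.

    counts: {page: {next_page: transition_count}} (hypothetical layout).
    Each next page is drawn with probability proportional to its count.
    The walk stops at a page with no outgoing counts or once it holds
    `max_len` pages; both stopping rules are assumptions.
    """
    path = [source]
    while len(path) < max_len:
        successors = counts.get(path[-1])
        if not successors:
            break
        pages = list(successors)
        weights = [successors[page] for page in pages]
        path.append(rng.choices(pages, weights=weights, k=1)[0])
    return path


# Toy transition counts (made-up numbers, not taken from the dataset).
toy_counts = {
    "google": {"Physics": 30, "Isaac_Newton": 70},
    "Physics": {"Isaac_Newton": 10},
    "Isaac_Newton": {"Gravity": 5},
}
path = simulate_path(toy_counts, "google", max_len=5, rng=random.Random(0))
print(path)
```

Passing an explicit `random.Random` instance keeps the walk reproducible, which is convenient when many paths are generated per source.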

Fig 2. Example illustrating the construction of the topical vector for the Isaac Newton article.

For the Isaac Newton page one first considers the list of parent categories (panel A). For each category, one identifies the most-representative topics (panel B), selecting the ones for which the depth of the category in the category tree is minimal. For each page, we consider the whole list of most-representative topics and corresponding depths (panel C). For instance, the category copernican_revolution has the smallest depth (equal to 3) in the tree of the topic SCIENCE. The vector representation of the coordinates of the main topics is now obtained by weighting each topic with the inverse of the minimal depth computed above (panel D). For instance, the topic SCIENCE appears in the topical vector with weight 1/2.
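
The construction in panels A to D can be sketched as follows. This is one possible reading of the caption, not the authors' code: it assumes that when a topic is most representative for several categories, the smallest of those depths is used. The input layout, the function name `topical_vector`, the category `hypothetical_category`, the topics HISTORY and PEOPLE, and all depths except the depth 3 of copernican_revolution under SCIENCE are made up for illustration.

```python
def topical_vector(category_depths):
    """Build a topical vector for one page.

    category_depths: {category: {topic: depth of the category in that
    topic's tree}} for the page's parent categories (hypothetical layout).
    Returns {topic: weight} with weight = 1 / minimal depth.
    """
    best_depth = {}
    for depths in category_depths.values():
        if not depths:
            continue
        min_depth = min(depths.values())
        for topic, depth in depths.items():
            # Keep only the most-representative topic(s) of this category.
            if depth == min_depth:
                best_depth[topic] = min(depth, best_depth.get(topic, depth))
    return {topic: 1.0 / depth for topic, depth in best_depth.items()}


# Toy input: only the SCIENCE depth of copernican_revolution comes from
# the caption; every other entry is invented for the example.
toy_categories = {
    "copernican_revolution": {"SCIENCE": 3, "HISTORY": 4},
    "hypothetical_category": {"SCIENCE": 2, "PEOPLE": 2},
}
vector = topical_vector(toy_categories)
print(vector)
```

Under this reading, a weight of 1/2 for SCIENCE is compatible with copernican_revolution sitting at depth 3, provided some other parent category reaches SCIENCE at depth 2.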

Fig 3. Distributions of page norms (left) and entropies (right).

The distributions are computed over the set of all pages for which a vector representation was derived. For both norm and entropy, the boxes report example pages that illustrate the meaning of extreme values.

Fig 4. Paths generated from the external source google: averages.

The $10^7$ paths simulated with google as source were split by length. For each fixed length $l$, we computed the averages of the following quantities over all the nodes (pairs) at $k$ steps (jumps) to the end: (A) the average norm $\overline{\lVert w_k^l \rVert}$, (B) the entropy $\overline{S(w_k^l)}$, (C) the distance and (E) the similarity between all the pairs of nodes consecutively visited along each path, respectively $\overline{d(w_k^l, w_{k-1}^l)}$ and $\overline{\mathrm{sim}(w_k^l, w_{k-1}^l)}$, (D) the distance and (F) the similarity between every node visited and the ending node along each path, i.e. $\overline{d(w_k^l, w_0^l)}$ and $\overline{\mathrm{sim}(w_k^l, w_0^l)}$. The error bars display the standard errors of the means. Each color refers to a path length, from 3 (blue) to 9 (light green).
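
A minimal sketch of the averaging step for a single-node observable: paths are grouped by length, and for every number of steps k to the end the observable is averaged and its standard error of the mean is computed. The caption does not say which norm is used or whether length counts nodes or jumps; the Euclidean norm and a length equal to the number of nodes are assumptions here, as are the function names and the toy vectors.

```python
import math
import statistics
from collections import defaultdict


def euclidean_norm(vector):
    """Euclidean norm of a vector (the choice of norm is an assumption)."""
    return math.sqrt(sum(x * x for x in vector))


def step_averages(paths, observable):
    """Average an observable per path length and per step to the end.

    paths: list of paths, each a list of topical vectors.
    Returns {(length, k): (mean, standard_error)}, where length is the
    number of nodes in the path and k = 0 is the ending node.
    """
    samples = defaultdict(list)
    for path in paths:
        length = len(path)
        for index, vector in enumerate(path):
            k = length - 1 - index
            samples[(length, k)].append(observable(vector))
    result = {}
    for key, values in samples.items():
        mean = statistics.mean(values)
        if len(values) > 1:
            sem = statistics.stdev(values) / math.sqrt(len(values))
        else:
            sem = 0.0  # a single sample carries no spread estimate
        result[key] = (mean, sem)
    return result


# Two toy paths of three nodes each (made-up vectors).
toy_paths = [
    [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)],
    [(0.0, 5.0), (1.0, 0.0), (0.0, 4.0)],
]
averages = step_averages(toy_paths, euclidean_norm)
print(averages)
```

Pairwise quantities such as the distance between consecutive nodes could be handled the same way by passing pairs instead of single vectors to the observable.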

Fig 5. Rescaled averages over the simulated paths.

In this panel we report the same data as in Fig 4 (left column) after rescaling. The walk lengths are normalized to 1. The corresponding per-step averages of the different measures (A)-(F) are rescaled by the mean value of the same measures evaluated over the whole set of nodes belonging to paths with the same length. The averages used to rescale the data are displayed in Fig D in S1 File. The central and right columns report similarly processed data, which refer respectively to a semantically uncorrelated model based on the google paths and to the Wikispeedia paths. Each color refers to a path length, from 3 (blue) to 9 (light green). The standard errors of the means are reported.
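
The rescaling can be sketched for one path length as below. The function name `rescale_profile`, the input layout, and the mapping of step positions onto the interval [0, 1] are assumptions; the caption only states that walk lengths are normalized to 1 and that per-step averages are divided by the mean over all nodes of paths with the same length.

```python
def rescale_profile(value_paths):
    """Rescale the per-step averages of one observable.

    value_paths: paths of equal length (at least 2 nodes), each given as
    a list of observable values, one per visited node.
    Returns a list of (x, y) pairs: x is the position along the walk
    mapped to [0, 1], y is the per-step mean divided by the mean over
    all nodes. Assumes the overall mean is non-zero.
    """
    length = len(value_paths[0])
    if length < 2 or any(len(path) != length for path in value_paths):
        raise ValueError("paths must share one length of at least 2 nodes")
    all_values = [value for path in value_paths for value in path]
    overall_mean = sum(all_values) / len(all_values)
    profile = []
    for index in range(length):
        step_mean = sum(path[index] for path in value_paths) / len(value_paths)
        profile.append((index / (length - 1), step_mean / overall_mean))
    return profile


# Toy observable values along two paths of three nodes (made-up numbers).
toy_values = [[4.0, 2.0, 0.0], [2.0, 2.0, 2.0]]
profile = rescale_profile(toy_values)
print(profile)
```

After this step, curves for different path lengths share the same horizontal range and fluctuate around 1, which makes them directly comparable.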

Fig 6. Similarity scores between sources.

For the two observables norm (left panel) and entropy (right panel), we report the matrix of similarity scores between all the sources and Wikispeedia. The score is defined by Eq (6). For each pair of sources, the unrescaled average values of the observable are considered (as in Fig 4). Then, for each path length between 4 and 9, the Spearman correlation coefficient is computed between the averaged values of the observable. The final score is obtained by averaging over all the lengths.
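
The score described in this caption can be sketched with the standard library alone. Eq (6) is not reproduced on this page, so the code follows the caption's wording only: a Spearman correlation per path length, averaged over lengths 4 to 9. The function names, the input layout, and the toy profiles are assumptions. Spearman's coefficient is computed here as the Pearson correlation of average ranks, which is undefined for constant profiles.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def similarity_score(profiles_a, profiles_b, lengths=range(4, 10)):
    """Mean Spearman correlation over path lengths.

    profiles_*: {length: [averaged observable at each step]}.
    """
    correlations = [spearman(profiles_a[n], profiles_b[n]) for n in lengths]
    return sum(correlations) / len(correlations)


# Toy profiles (made up): two increasing trends and one decreasing trend.
rising = {n: [float(k) for k in range(n)] for n in range(4, 10)}
rising_fast = {n: [float(k * k) for k in range(n)] for n in range(4, 10)}
falling = {n: [float(-k) for k in range(n)] for n in range(4, 10)}
print(similarity_score(rising, rising_fast), similarity_score(rising, falling))
```

Because the coefficient depends only on ranks, two sources with the same monotonic trend score close to 1 even when the magnitudes of their averages differ.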




Grants and funding

The authors acknowledge support from the KREYON project funded by the John Templeton Foundation under contract n. 51663. The sponsors had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
