A template for the hubs-and-authorities algorithm, HITS (hyperlink-induced topic search).
At each step in this notebook, the recommended functions for implementing the MapReduce data-flow pattern are listed.
import sys
from pyspark import SparkConf, SparkContext
# the number of iterations
NUM_ITERATIONS = 40
# input files
big_file = 'graph-full.txt'
small_file = 'graph-small.txt'
# prevent the "cannot run multiple SparkContexts at once" error
sc = SparkContext.getOrCreate()
sc.stop()
conf = SparkConf()
sc = SparkContext(conf=conf)
# Load big/small data file
data = sc.textFile(small_file).map(lambda line: line.split('\t')).map(lambda fields: (fields[0], fields[1]))
print("%d lines" % data.count())
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank.
.cache()
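For instance, the loaded edge RDD can be cached before the iterative loop, since it is reused in every iteration (a minimal sketch using the data RDD defined above):
data = data.cache()  # keep the (source, destination) pairs in memory across repeated actions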
L = None   # TODO: key-value pairs for the L matrix
LT = None  # TODO: key-value pairs for the transpose of the L matrix
h = None   # TODO: initial hubbiness vector
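One way these placeholders could be filled in, assuming each input line is a tab-separated (source, destination) edge and that duplicate edges should count only once (a hedged sketch; adjust to the actual file format and assignment rules):
# L as (source, destination) pairs: L[i][j] = 1 iff there is an edge i -> j
L = data.distinct().cache()
# Transpose of L: (destination, source) pairs
LT = L.map(lambda edge: (edge[1], edge[0])).cache()
# Every node that appears as a source or a destination
nodes = L.keys().union(L.values()).distinct()
# Initial hubbiness vector: h[i] = 1 for every node i
h = nodes.map(lambda n: (n, 1.0))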
Repeat until convergence (or use fixed number of iterations):
If I have an RDD of key-value pairs and I want to get only the value part, what is the most efficient way of doing it?
Answer: yourRDD.values()
join(otherDataset): when called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
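A small illustration of join() and values() with made-up data (the node names and scores below are hypothetical, not taken from the graph files):
hubs = sc.parallelize([('n1', 1.0), ('n2', 0.5)])     # (node, hubbiness)
edges = sc.parallelize([('n1', 'n2'), ('n1', 'n3')])  # (source, destination)
joined = edges.join(hubs)    # [('n1', ('n2', 1.0)), ('n1', ('n3', 1.0))]
contribs = joined.values()   # [('n2', 1.0), ('n3', 1.0)], extracted with .values()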
for _ in range(NUM_ITERATIONS):
    a = None      # TODO: compute a = L^T h
    a_max = None  # TODO: obtain the maximum value in a
    a = None      # TODO: scale a so the largest component is 1
    h = None      # TODO: compute h = L a
    h_max = None  # TODO: obtain the maximum value in h
    h = None      # TODO: scale h so the largest component is 1
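One possible way to fill in the loop body, assuming L, LT, and h are the key-value RDDs sketched above (a hedged data-flow sketch, not the only valid one; nodes without in-links or out-links would drop out of the vectors and would need to be reinserted if the graph contains any):
for _ in range(NUM_ITERATIONS):
    # a = L^T h: each edge i -> j contributes h[i] to node j, then sum per node
    a = L.join(h).values().reduceByKey(lambda x, y: x + y)
    a_max = a.values().max()               # maximum value in a
    a = a.mapValues(lambda v: v / a_max)   # scale a so the largest component is 1
    # h = L a: each edge i -> j contributes a[j] to node i, then sum per node
    h = LT.join(a).values().reduceByKey(lambda x, y: x + y)
    h_max = h.values().max()               # maximum value in h
    h = h.mapValues(lambda v: v / h_max)   # scale h so the largest component is 1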
Tips:
.cache()
# TODO: implement four for loops and print out the results
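For example, the final vectors could be reported with four loops like the following (a hedged sketch; printing 5 nodes per list is an assumption, not taken from the assignment text):
k = 5  # number of nodes to print per list (assumed value)
for node, score in h.sortBy(lambda kv: -kv[1]).take(k):
    print('top hubbiness     %s\t%f' % (node, score))
for node, score in h.sortBy(lambda kv: kv[1]).take(k):
    print('bottom hubbiness  %s\t%f' % (node, score))
for node, score in a.sortBy(lambda kv: -kv[1]).take(k):
    print('top authority     %s\t%f' % (node, score))
for node, score in a.sortBy(lambda kv: kv[1]).take(k):
    print('bottom authority  %s\t%f' % (node, score))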
sc.stop()