# A Fast and Better Hybrid Recommender System Based on Spark


## Abstract

With the rapid development of information technology, recommender systems have become critical tools for relieving information overload. As an important branch, weighted hybrid recommender systems are widely used on e-commerce sites, social networks and video websites such as Amazon, Facebook and Netflix. In practice, developers typically set a weight for each recommendation algorithm by repeating experiments until accuracy improves. Although this method can improve accuracy, it depends heavily on the developers' experience and yields only small gains. Worse, the workload grows quickly as the number of algorithms rises. To further improve the performance of recommender systems, we design an optimized hybrid recommender system on Spark. Experimental results show that the system improves accuracy, reduces execution time and handles large-scale datasets. Accordingly, the hybrid recommender system balances accuracy and execution time.

## Keywords

Recommender system · Hybrid · Weight · Spark

## 1 Introduction

Along with the popularization of the Internet, a sharp increase in the amount of data has led to information overload [1]. Thus, recommender systems [2] were proposed to relieve the stress of massive data. To improve recommender system performance, researchers put forward the weighted hybrid method. Although the method brings a performance boost, several problems still affect performance, including weight setting and computation load. Hence, we implement a weighted hybrid recommender system on Spark. In the system, we design a new method to compute weights using cluster analysis and user similarity. Besides, execution time can be reduced by deploying the system on Spark.

### 1.1 Hybrid Recommender Systems

Hybrid recommender systems combine two or more recommendation algorithms to overcome the weaknesses of each individual algorithm. They are generally classified as Switching, Mixed, Feature Combination, Meta-Level, and Weighted [3].

The weighted hybrid technique combines different algorithms with different weights [3]. The main idea is that an algorithm with better accuracy receives a higher weight. At present, developers usually set a weight for each algorithm manually and repeat experiments until superior accuracy is achieved. The method therefore depends on the developers' experience to judge the accuracy of an algorithm on different datasets. Because of large-scale datasets, the sparsity of rating data and the number of algorithms, it is generally hard to obtain appropriate weights, so the resulting accuracy improvements are poor.
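As a minimal sketch of the weighted combination itself (the function name and numbers are illustrative, not the paper's implementation):

```python
def hybrid_predict(predictions, weights):
    """Combine per-algorithm predicted ratings using per-algorithm weights."""
    assert len(predictions) == len(weights)
    return sum(w * p for w, p in zip(weights, predictions))

# e.g. User-CF predicts 4.0 and ALS predicts 3.0, with weights 0.7 and 0.3
print(hybrid_predict([4.0, 3.0], [0.7, 0.3]))
```

The difficulty the paper targets is not this combination step but choosing the weights themselves.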

In addition, to improve user experience, the system should return recommendation results efficiently. In other words, it has to quickly locate the information that appeals to users within massive data. Thus, execution time is another evaluation standard of performance. However, since the weighted hybrid technique needs to execute two or more algorithms and then compute hybrid results, it is tough to reduce execution time.

Apart from accuracy and execution time, scalability is also an important consideration. As data scale and algorithm complexity increase, the system requires more storage space and computing resources; it is difficult to meet actual demand by optimizing algorithms alone.

To address the above issues, we design a hybrid recommender system on Spark. In the system, we propose an optimized method to improve accuracy: it computes weights and hybrid results based on cluster analysis and user similarity. Meanwhile, we deploy the system on Spark, a fast and general engine for large-scale data processing [4], to accelerate the training process and improve scalability.

### 1.2 Work of Paper

The rest of this paper is organized as follows. Section 2 reviews recommendation algorithms and introduces Spark. Section 3 describes the design of the optimized method. Section 4 shows how we implement the system on Spark. Section 5 gives experimental results and our analysis. Section 6 presents our conclusions and future work.

## 2 Related Work

In this section, we first review and compare recommendation algorithms and recommender systems. Then, we briefly analyze the predicted ratings of these algorithms. Finally, we introduce the distributed computing platform Spark and compare Hadoop with Spark.

### 2.1 Recommender Systems

Recommendation algorithms are the basis of recommender systems. In this section, we first introduce several representative algorithms.

Collaborative recommendation is probably the most popular algorithm. Based on overlapping ratings, it computes similarities among users and then uses those similarities to predict the rating of the current user on an item [5]. Tapestry [6], Ringo [7] and GroupLens [8] are typical systems using the algorithm.
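A toy illustration of this idea, assuming cosine similarity over co-rated items and a similarity-weighted average (the data and function names are made up, not from the systems cited above):

```python
import math

def cosine_sim(a, b):
    # similarity computed only over items rated by both users
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = math.sqrt(sum(a[i] ** 2 for i in common)) * \
          math.sqrt(sum(b[i] ** 2 for i in common))
    return num / den

def predict(user, others, item):
    # similarity-weighted average of the other users' ratings on the item
    sims = [(cosine_sim(user, o), o) for o in others if item in o]
    num = sum(s * o[item] for s, o in sims)
    den = sum(abs(s) for s, _ in sims)
    return num / den if den else 0.0

u = {"a": 4, "b": 5}                 # current user's ratings
v = {"a": 4, "b": 5, "c": 3}         # a very similar user
w = {"a": 1, "b": 2, "c": 5}         # a less similar user
print(predict(u, [v, w], "c"))
```

Real systems add refinements such as mean-centering and neighborhood selection; the sketch only shows the similarity-then-aggregate shape of the algorithm.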

Content-based recommendation pays attention to connections between items. It analyses descriptions of items that have been rated by users [9] and calculates similarities between items. The representation of an item’s feature and the way to classify a new item are two important sub-problems [9].

Table 1. Strengths and weaknesses of recommendation algorithms

Algorithm | Strength | Weakness
---|---|---
Collaborative | Field independence. No need to understand item descriptions. Helps users discover potential interests | New user problem. New item problem. Sparsity
Content | Improves accuracy by increasing the dimensions of item features | Cold start problem. Similarity measurement is one-sided
Demographic | Historical data not necessary. Wide range of applications. No cold start problem | The algorithm is rough and imprecise

As a simple and effective technique, the weighted hybrid recommender system has been widely used in numerous fields. P-Tango and Pazzani's system are two typical examples. P-Tango is an online news system that combines collaborative and content-based recommendation algorithms; it adjusts the weights of the algorithms during operation and fixes them once the expected accuracy is reached. Pazzani's system is another weighted hybrid recommender; it combines collaborative, content-based and demographic-based recommendation algorithms and uses voting to determine recommendation results.

### 2.2 Weight Analysis

Table 2. Results of statistical analysis on predicted ratings (countH, countL and countE denote the numbers of predicted ratings that are higher than, lower than and equal to the actual ratings, respectively)

Algorithm | countH | countL | countE
---|---|---|---
User-CF | 9181 | 10752 | 11
ALS | 6992 | 12952 | 0

- (1)
For these algorithms, very few predicted ratings exactly equal the actual ratings.

- (2)
Some predicted ratings are greater than the actual ratings, while others are less.

- (3)
Using only a single weight per algorithm may therefore limit accuracy.

Thus, it is essential to optimize weights.
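The statistics in the table above can be reproduced in principle with a few lines of Python (the sample data below is made up; only the counting logic mirrors the analysis):

```python
def count_prediction_errors(ratings, predictions):
    """Count predictions above (countH), below (countL), and equal to (countE)
    the actual ratings."""
    countH = sum(1 for r, p in zip(ratings, predictions) if p > r)
    countL = sum(1 for r, p in zip(ratings, predictions) if p < r)
    countE = sum(1 for r, p in zip(ratings, predictions) if p == r)
    return countH, countL, countE

print(count_prediction_errors([4, 3, 5, 2], [4.5, 2.1, 5, 1.0]))  # -> (1, 2, 1)
```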

### 2.3 Spark

Spark, developed at UC Berkeley, is a fast and general-purpose cluster computing platform for large-scale data processing [4]. The Spark ecosystem includes Spark SQL [10], Spark Streaming [11], MLlib [12], GraphX [13], etc. Based on the resilient distributed dataset (RDD) [14], it achieves memory-based computing, fault tolerance and scalability. Currently, Spark is deployed at Amazon, eBay and Yahoo! to process large-scale datasets.

For a hybrid recommender system, performance is affected by data scale, the number of algorithms and the complexity of the algorithms. Deploying the system on Spark can mitigate these effects.

- (1)
In the system, large-scale datasets can be stored in distributed storage.

- (2)
The algorithms are independent of each other, so they can be executed in parallel.

- (3)
Intermediate values can be cached in memory to decrease execution time.

Therefore, in this paper, we design an optimized hybrid recommender system on Spark.

## 3 Design Overview

The empirical evidence from Sect. 2 suggests that accuracy can still be improved: predicted ratings are higher or lower than the corresponding actual ratings. Thus, we use cluster analysis to obtain more accurate weights. The principle of cluster analysis is to determine the relationships between samples from their properties using mathematical methods, and then to group the samples according to those relationships. Based on cluster analysis, we present an optimized method for calculating personalized weights. We now discuss the method in detail.

### 3.1 Objective Function

- (1)
Assume there are n algorithms in the system and j denotes the j-th algorithm.

- (2)
u denotes a user, i denotes an item, and (u,i) represents the data item of u and i.

- (3)
\(R_{ui}\) is the rating of u on i, and \(r_{ui}^{j}\) is the predicted rating of u on i computed by the j-th algorithm.

- (4)
For the j-th algorithm, the error between the rating and the predicted rating is \(D_{ui}^{j} = R_{ui} - r_{ui}^{j}\). In order to reduce \(\sum _{j=1}^{n} \sum _{u,i}D_{ui}^{j}\), similar errors are expected to receive the same weights. Based on the errors, we divide the pairs (u,i) into k clusters and design \(\varvec{C_{ui}} = (c_{1}, c_{2}, \cdots , c_{k})\) to encode the cluster of (u,i). For the j-th algorithm, \(\varvec{\alpha _{j}} = (\alpha _{j1}, \alpha _{j2}, \cdots , \alpha _{jk})\) holds the k weights of the algorithm, and \(\varvec{\alpha _{j}}\varvec{C_{ui}^{T}}\) finally determines the weight for \(r_{ui}^{j}\).

### 3.2 Weight Calculation

According to \(D_{ui}^{j}\), the optimized method classifies all pairs (u,i) into k clusters. Each (u,i) has a vector \(\varvec{C_{ui}} = (c_{1}, c_{2}, \cdots , c_{k})\), initialized to \(\varvec{C_{ui}} = (0, 0, \cdots , 0)\); the component corresponding to (u,i)'s cluster is set to 1. For instance, if (u,i) belongs to cluster 2, then \(\varvec{C_{ui}} = (0, 1, 0, \cdots , 0)\), and the weight for \(r_{ui}^{j}\) is \(\alpha _{j2}\), computed as \(\varvec{\alpha _{j}}\varvec{C_{ui}}^{T}\). Therefore, \(\varvec{C}\) maps weights to predicted ratings and achieves multiple weights per algorithm. Figure 1 shows the pipeline of the method.
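The steps above can be sketched in plain Python (a toy 1-D k-means with made-up error values; the paper's system clusters on Spark, and the weights \(\varvec{\alpha _{j}}\) here are hypothetical):

```python
def kmeans_1d(errors, k, iters=20):
    # naive 1-D k-means for illustration only
    centers = sorted(errors)[:k]
    clusters = [0] * len(errors)
    for _ in range(iters):
        clusters = [min(range(k), key=lambda c: abs(e - centers[c]))
                    for e in errors]
        for c in range(k):
            members = [e for e, a in zip(errors, clusters) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return clusters

def one_hot(cluster, k):
    # build C_ui: all zeros except the component of (u,i)'s cluster
    C = [0] * k
    C[cluster] = 1
    return C

def weight_for(alpha_j, C_ui):
    # alpha_j · C_ui^T selects the weight of (u,i)'s cluster
    return sum(a * c for a, c in zip(alpha_j, C_ui))

errors = [-1.2, -0.9, 0.1, 0.2, 1.1]   # D_ui^j values for one algorithm
clusters = kmeans_1d(errors, k=3)
alpha_j = [0.6, 0.5, 0.4]              # hypothetical per-cluster weights
C_ui = one_hot(clusters[0], 3)
print(weight_for(alpha_j, C_ui))       # -> 0.6
```

The dot product \(\varvec{\alpha _{j}}\varvec{C_{ui}}^{T}\) is just a table lookup into the algorithm's k weights, which is what lets one algorithm carry multiple weights.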

## 4 Implementation

### 4.1 Modules

The data storage module is the basis of the system. It stores input data, including historical data and ratings. We use HDFS, a distributed file system, to store raw data [17]. The pre-processed data are put in a database such as HBase, Redis or Hive [18, 19, 20], and the upper-level modules read data from the database. The prediction module computes predicted ratings; it runs the recommendation algorithms in parallel, and its outputs are the predicted ratings.

The cluster module concentrates on the errors of (u,i). It uses k-means to classify the pairs (u,i); its output is \(\varvec{C}\). The weight module accepts \(\varvec{C}\) to compute weights: with \(\varvec{C}\) and \(\varvec{\alpha }\), it can obtain a weight for each \(r_{ui}^{j}\). Its output is \(\varvec{\alpha }\).

The model fusion module calculates hybrid results based on the predicted ratings, \(\varvec{C}\), \(\varvec{\alpha }\) and user similarity; from these parameters it determines hybrid results by logistic regression [21]. The recommendation module recommends items to users: based on the hybrid results, it generates recommendation lists and also outputs an evaluation of the results.

### 4.2 Discussion

In the hybrid recommender system on Spark, data are translated into RDDs. Because of memory-based computing and parallel operations, RDDs can be processed in parallel to reduce execution time, and the read-only nature and fault tolerance of RDDs make the system more reliable. Besides, the distributed storage of Spark enables the system to handle large-scale datasets, which improves scalability. Therefore, deploying the hybrid recommender system on Spark can decrease execution time and further improve scalability.

## 5 Performance

### 5.1 Evaluation Index

**Accuracy.** The system accuracy is measured by root mean square error (RMSE) [22]. It is defined as:

$$RMSE = \sqrt{\frac{\sum_{(u,i) \in T}\left(R_{ui} - \hat{r}_{ui}\right)^{2}}{|T|}}$$

where \(T\) is the test set and \(|T|\) denotes the number of predicted ratings \(\hat{r}_{ui}\).
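For reference, RMSE can be computed directly from its definition (a small self-contained sketch with made-up numbers):

```python
import math

def rmse(ratings, predictions):
    # square root of the mean squared error over the test set T
    se = sum((r - p) ** 2 for r, p in zip(ratings, predictions))
    return math.sqrt(se / len(ratings))

print(rmse([4, 3, 5], [3.5, 3, 4]))  # about 0.645
```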

**Execution Time.** The execution time includes the time for running the algorithms, clustering, calculating weights and computing hybrid results. It is measured in minutes.

### 5.2 Experimental Setup

In this experiment, we choose Spark as our platform. All experiments were performed on a local cluster with 7 nodes (1 master and 6 worker nodes); each node has a Xeon(R) dual-core 2.53 GHz processor and 6 GB of memory.

**Dataset.** In Table 3, we list the datasets used in the experiment. Each dataset is randomly divided into two training sets and a test set.

Table 3. Datasets in the experiment

Dataset | Users | Items | Ratings
---|---|---|---
MovieLens-100K | 1000 | 1700 | 100000
MovieLens-200K | 1371 | 10153 | 200000
MovieLens-300K | 2004 | 10850 | 300000
MovieLens-400K | 2661 | 11634 | 400000
MovieLens-500K | 3462 | 13257 | 500000
MovieLens-600K | 4073 | 13488 | 600000
MovieLens-700K | 4753 | 14154 | 700000
MovieLens-800K | 5543 | 14230 | 800000
MovieLens-900K | 6207 | 14963 | 900000
MovieLens-1M | 6000 | 4000 | 1 million
BookCrossing | 71212 | 176272 | 400000

**Algorithms.** We implement three recommendation algorithms: User-based Collaborative Filtering (User-CF), Item-based Collaborative Filtering (Item-CF) and ALS. We run them on the training and test sets to compute predicted ratings, weights and hybrid results.

**Nodes.** We compare the execution time of the stand-alone system and the distributed system. For the former, we use a single server with a Xeon(R) dual-core 2.53 GHz processor and 6 GB of memory. For the latter, we use a local cluster with 7 nodes (1 master and 6 worker nodes), where each node has the same configuration.

### 5.3 Performance Comparison

Figure 3 shows the impact of data scale on accuracy. In the experiment, we perform the combination of User-CF and ALS on four MovieLens datasets. In Fig. 3, RMSE generally decreases as data scale increases. Due to the sparsity of MovieLens-700K, the hybrid recommender system obtains its best result there. Compared with User-CF and ALS, the system improves accuracy by 8.21%.

Figure 5 shows the correlation between accuracy and combinations of algorithms. In the experiment, four combinations of algorithms are performed on MovieLens-100K, MovieLens-400K and BookCrossing respectively. In Fig. 5, the hybrid recommender system obtains better accuracy than any single algorithm; even when the accuracy of a single algorithm is already favorable, the hybrid system still does better.

Figure 6 compares the execution time of stand-alone mode and local cluster mode. The experiment performs the combination of User-CF and ALS on MovieLens-100K through MovieLens-1M. For the stand-alone system, execution time increases sharply as the data scale expands, while the execution time of the local cluster mode remains relatively constant. When the data scale exceeds MovieLens-900K, the stand-alone mode cannot handle it, whereas the local cluster mode can handle MovieLens-10M or larger datasets. From Fig. 6, we can see that the memory-based computing, parallel operations and distributed storage of Spark help decrease execution time and improve scalability.

## 6 Conclusion and Future Work

Improving the performance of recommender systems is a crucial approach to information overload. This paper designs a new weighted hybrid recommender system to address this problem. We are the first to compute weights using cluster analysis, user similarity and error minimization. Besides, we deploy the hybrid recommender system on Spark. The system improves accuracy by optimizing weights and reduces execution time through memory-based computing and parallel operations, while its distributed storage helps improve scalability. The experimental results demonstrate the performance of our hybrid recommender system.

In future work, we will improve and extend the system in three directions: extending the algorithms to handle more complex scenarios, further studying the factors that influence weights to improve accuracy, and optimizing the implementation of the system on Spark.

## References

- 1. Eppler, M.J., Mengis, J.: The concept of information overload: a review of literature from organization science, accounting, marketing, MIS, and related disciplines. Inf. Soc. **20**(5), 325–344 (2004)
- 2. Cosley, D., Lam, S.K., Albert, I., Konstan, J.A., Riedl, J.: Is seeing believing?: how recommender system interfaces affect users' opinions. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 585–592. ACM (2003)
- 3. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adap. Inter. **12**(4), 331–370 (2002)
- 4. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud **10**, 10 (2010)
- 5. Burke, R.: Hybrid systems for personalized recommendations. In: Mobasher, B., Anand, S.S. (eds.) ITWP 2003. LNCS (LNAI), vol. 3169, pp. 133–152. Springer, Heidelberg (2005)
- 6. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Commun. ACM **35**(12), 61–70 (1992)
- 7. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating word of mouth. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217. ACM Press/Addison-Wesley (1995)
- 8. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186. ACM (1994)
- 9. Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev. **13**(5–6), 393–408 (1999)
- 10. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., et al.: Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383–1394. ACM (2015)
- 11. Zaharia, M., Das, T., Li, H., Shenker, S., Stoica, I.: Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: HotCloud (2012)
- 12. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLlib: machine learning in Apache Spark. arXiv preprint arXiv:1505.06807 (2015)
- 13. Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: GraphX: a resilient distributed graph system on Spark. In: First International Workshop on Graph Data Management Experiences and Systems, p. 2. ACM (2013)
- 14. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012)
- 15. Borneas, M.: On a generalization of the Lagrange function. Am. J. Phys. **27**(4), 265–267 (1959)
- 16. Mitra, N.J., Nguyen, A.: Estimating surface normals in noisy point cloud data. In: Proceedings of the Nineteenth Annual Symposium on Computational Geometry, pp. 322–328. ACM (2003)
- 17. Borthakur, D.: The Hadoop distributed file system: architecture and design. Hadoop Project Website **11**(2007), 21 (2007)
- 18. Zhang, D.W., Sun, F.Q., Cheng, X., Liu, C.: Research on Hadoop-based enterprise file cloud storage system. In: 2011 3rd International Conference on Awareness Science and Technology (iCAST), pp. 434–437. IEEE (2011)
- 19. Han, J., Haihong, E., Le, G., Du, J.: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363–366. IEEE (2011)
- 20. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. **2**(2), 1626–1629 (2009)
- 21. Tsukimoto, H.: Logical regression analysis: from mathematical formulas to linguistic rules. In: Chu, W., Lin, T.Y. (eds.) Foundations and Advances in Data Mining. SFSC, vol. 180, pp. 21–61. Springer, Heidelberg (2005)
- 22. Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res. **30**(1), 79 (2005)