Abstract:
Recommendation systems in the scientific information space are essential tools for search and navigation when working with scientific documents. Software code is now regarded as an object of scientific knowledge; consequently, an important task is to build systems that support the software lifecycle, in particular by finding similar software solutions, detecting code borrowing, and analyzing and evaluating code quality.
This paper proposes a content-based recommender system that provides users with a personalized list of code fragments that are functionally equivalent to a query code fragment written in one of a fixed set of supported programming languages.
The core algorithm of the system represents program code as an abstract syntax tree and then constructs a vector space of program codes. The semantic similarity of two programs is determined by the distance between their code vectors in this multidimensional space.
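The idea of mapping code to an abstract syntax tree, then to a vector, and comparing vectors by distance can be illustrated with a minimal sketch. It assumes Python's standard `ast` module as the parser, a simple count of node types as the vector, and cosine distance as the metric. The abstract does not specify the embedding or the metric, so all three are illustrative stand-ins, and the names `ast_vector` and `cosine_distance` are hypothetical.

```python
import ast
import math
from collections import Counter


def ast_vector(source: str) -> Counter:
    """Count AST node types in `source`.

    Assumption: a bag of node types stands in for the code vector;
    the actual system may use a different (e.g. learned) representation.
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Two summation loops that differ only in identifier names,
# and one structurally unrelated dictionary lookup.
loop_sum = "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n"
loop_sum_renamed = (
    "def g(items):\n    acc = 0\n    for i in items:\n        acc += i\n    return acc\n"
)
lookup = "def h(d, k):\n    return d.get(k, None)\n"

d_same = cosine_distance(ast_vector(loop_sum), ast_vector(loop_sum_renamed))
d_diff = cosine_distance(ast_vector(loop_sum), ast_vector(lookup))
```

Because the node-type counts ignore identifier names, the two summation loops map to the same vector and `d_same` is zero up to rounding, while `d_diff` is clearly larger. A representation this coarse cannot capture functional equivalence across languages; it only shows the shape of the pipeline.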
Recommendations are personalized by a filtering module that ranks the retrieved fragments according to the user's profile. The factors considered are the user's programming language preferences and their areas of scientific interest, which are obtained through integration with ORCID.
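One way such a profile-aware ranking step could look is sketched below. The abstract does not give the scoring formula, so everything here is an assumption: the `Candidate` and `UserProfile` types, the additive bonuses, and their weights are hypothetical. The profile is also filled in by hand rather than fetched from ORCID.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    snippet_id: str
    language: str
    distance: float  # distance to the query vector; smaller is more similar
    topics: set = field(default_factory=set)


@dataclass
class UserProfile:
    preferred_languages: set
    interests: set  # assumed to come from the user's ORCID record


def rerank(candidates, profile, lang_bonus=0.2, topic_bonus=0.1):
    """Sort candidates by similarity plus hypothetical profile bonuses."""

    def score(c: Candidate) -> float:
        s = 1.0 - c.distance  # base score from vector similarity
        if c.language in profile.preferred_languages:
            s += lang_bonus
        # one bonus per topic shared with the user's interests
        s += topic_bonus * len(c.topics & profile.interests)
        return s

    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("a", "C++", 0.10),
    Candidate("b", "Python", 0.15, {"bioinformatics"}),
]
profile = UserProfile({"Python"}, {"bioinformatics"})

personalized = [c.snippet_id for c in rerank(candidates, profile)]
neutral = [c.snippet_id for c in rerank(candidates, UserProfile(set(), set()))]
```

With an empty profile the order follows vector distance alone, so fragment `a` comes first. With the example profile, fragment `b` moves ahead because it matches both the preferred language and one area of interest.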
To support the system, a specialized dataset was built from the CodeNet corpus. The paper also solves the problem of automatically detecting the programming language of a submitted code snippet, covering 19 languages that appear in a current ranking of programming languages.