
Queries to Non-Relational Data using Natural Language based on a Large Language Model

Adilbek Omirbekovich Erkimbaev, Vladimir Yurievich Zitserman, George Anatolyevich Kobzev

76-98

Abstract:

The main purpose of this work is to explore new ways of organizing natural language queries to local scientific databases that are not relational. A brief review of recent research shows that natural language queries are being actively introduced into databases of various types, often with the help of machine learning methods such as neural algorithms. Over the last two years, large language models have been widely used for query generation in various language settings and fields of expertise. The study explores the potential of the AllegroGraph graph database for natural language search using large language models. The functionality of the database is examined using a metadata system for thermophysical properties, represented as the "Thermal" domain ontology. Testing search queries in a bilingual (English and Russian) database environment revealed some general problems that can be overcome, which gives good reason to expect that new services based on large language models can be applied in the future.

Keywords: natural language query, large language model, embedding, non-relational databases, graph database, domain ontology.

Creation of Query Expansion Based on the Subject Domain Thesaurus in the Ontology of Knowledge of the Semantic Library

271-291

Abstract: The possibilities of query expansion using a subject area thesaurus are discussed. The context defined by the links between thesaurus terms serves both to refine the query and to increase the size of the result set. Query expansion is particularly important for scientific subject areas, where search is based on specialized terminology. In this case, subject area thesauri must be used to minimize information noise. The proposed approach takes into account the use of similar terminology in different subject areas. Examples using the thesauri of individual sections of the equations of mathematical physics and related fields demonstrate the effectiveness of the chosen approach. By linking to concepts of information resources in other areas of knowledge, the expanded query reaches search fields in distant subject areas and covers various types of data: texts, symbolic data, audio and video archives. The research shows that query expansion based on context semantics improves the quality of search for scientific publications in digital information resources and increases the effectiveness of interdisciplinary scientific research.

Keywords: comparison of scientific texts, semantic search, thesaurus for the ontology of knowledge, information query using the thesaurus, LibMeta.
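
The query expansion idea described in the abstract above can be illustrated with a minimal sketch. It is not the LibMeta implementation: the `THESAURUS` dictionary, its terms, and the `expand_query` function are hypothetical names introduced only for illustration, and thesaurus links are reduced to a flat "term to related terms" mapping.

```python
from typing import Dict, List

# Hypothetical toy thesaurus: each term maps to terms linked to it
# (synonyms, broader or narrower terms). The entries are illustrative only.
THESAURUS: Dict[str, List[str]] = {
    "heat equation": ["diffusion equation", "parabolic equation"],
    "wave equation": ["hyperbolic equation"],
}


def expand_query(query: str, thesaurus: Dict[str, List[str]]) -> List[str]:
    """Return the original query followed by thesaurus terms linked to it.

    A thesaurus term is considered present if it occurs as a substring
    of the lower-cased query (an assumption made to keep the sketch short).
    """
    lowered = query.lower()
    expanded = [query]
    for term, related in thesaurus.items():
        if term in lowered:
            for candidate in related:
                if candidate not in expanded:
                    expanded.append(candidate)
    return expanded
```

Calling `expand_query("Heat equation solutions", THESAURUS)` returns the original query plus the two terms linked to "heat equation"; a query with no thesaurus terms is returned unchanged. Restricting expansion to terms from a subject area thesaurus is what limits information noise in this scheme.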

A Recommendation System for Finding Semantically Similar Fragments of Program Code

Vitaly Ivanovich Zorin, Evgeny Konstantinovich Lipachev

751-781

Abstract:

Recommendation systems in the scientific information space serve as essential tools for search and navigation when working with scientific documents. Software code is now regarded as an object of scientific knowledge; as a result, an important task is to create software lifecycle support systems, in particular systems that find similar software solutions, detect code borrowing, and analyze and evaluate code quality.

This paper proposes a content-based recommender system that provides users with a personalized list of code fragments that are functionally equivalent to the input query code presented in one of the programming languages from the established set.

The basic algorithm of the system is based on the representation of the program code in the form of an abstract syntax tree followed by the construction of a vector space of program codes. The semantic similarity of program codes is determined by the distance between code vectors in a multidimensional space.

The personalization of recommendations is achieved through a filtering module that ranks the retrieved fragments according to the user's profile. The factors considered are the user's language preferences and their areas of scientific interest, extracted through integration with ORCID.

To ensure the system's operation, a specialized dataset was created based on the CodeNet corpus. The problem of automated language detection from a snippet of the presented code in one of the 19 languages included in the current rating list of programming languages has also been solved.

Keywords: abstract syntax tree, code embedding, content-based filtering, cross-language clone, cross language code search, code similarity, recommender system.
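
The pipeline outlined in the abstract above (abstract syntax tree, then code vector, then distance between vectors) can be sketched as follows. This is a simplified stand-in and not the authors' algorithm: the abstract does not say how the vector space is built, so the sketch assumes a plain count of AST node types as the vector and cosine similarity as the closeness measure. It handles only Python input through the standard `ast` module; the function names `ast_vector` and `cosine_similarity` are illustrative.

```python
import ast
import math
from collections import Counter


def ast_vector(source: str) -> Counter:
    """Represent a Python fragment as counts of its AST node types."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two fragments with the same structure but different identifiers,
# and a third fragment with a different structure.
loop_a = "total = 0\nfor x in items:\n    total += x\n"
loop_b = "s = 0\nfor v in data:\n    s += v\n"
func_c = "def f():\n    return 1\n"

same_structure = cosine_similarity(ast_vector(loop_a), ast_vector(loop_b))
different_structure = cosine_similarity(ast_vector(loop_a), ast_vector(func_c))
```

Because identifiers are discarded, `loop_a` and `loop_b` map to the same vector, while `func_c` scores lower against either of them. Node-type counts capture structure rather than semantics, so a real cross-language system would need a language-independent tree representation and a learned embedding.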

International Virtual Observatory: 10 years after

O.Yu. Malkov, O.B. Dluzhnevskaya, O.S. Bartunov, I.Yu. Zolotukhin

Abstract: The International Virtual Observatory (IVO) is a collection of integrated astronomical data archives and software tools that use computer networks to create an environment in which research can be conducted. Several countries have initiated national virtual observatory programs that will combine existing databases from ground-based and orbiting observatories and make them easily accessible to researchers. As a result, data from all the world's major observatories will be available to all users and to the public. This is significant not only because of the immense volume of astronomical data but also because the data on stars and galaxies have been compiled from observations in a variety of wavelengths: optical, radio, infrared, gamma ray, X-ray and more. Each wavelength can provide different information about a celestial event or object, but also requires special expertise to interpret. In a virtual observatory environment, all of these data are integrated so that they can be synthesized and used in a given study. The International Virtual Observatory Alliance (IVOA) represents 17 international projects working in coordination to realize the essential technologies and interoperability standards necessary to create a new research infrastructure. The Russian Virtual Observatory is one of the founders and an important member of the IVOA. The International Virtual Observatory project was launched about ten years ago, and major IVO achievements in science and technology in recent years are discussed in this presentation. Standards for accessing large astronomical data sets have been developed. Such data sets can accommodate the full range of wavelengths and observational techniques for all types of astronomical data: catalogues, images, spectra and time series. The described standards include standards for metadata, data formats, query language, etc.
Services for the federation of massive, distributed data sets, regardless of the wavelength, resolution and type of data, have been developed. Effective mechanisms for publishing huge data sets and data products, as well as data analysis toolkits and services, are provided. The services include source extraction, parameter measurement and classification from databases, data mining in the image, spectra and catalogue domains, multivariate statistical tools and multidimensional visualization techniques. The development of prototype VO services and capabilities implemented within existing data centers, surveys and observatories is also discussed. We show that the VO has evolved beyond the demonstration level to become a real research tool. Scientific results based on end-to-end use of VO tools are discussed in the presentation.

Keywords: virtual observatory, e-science, astronomical data.
