Abstract:
The effectiveness of the ImageNet diffusion model and the CLIP model for generating images from textual descriptions was investigated. Two experiments were conducted with varied textual prompts and parameter settings to identify the optimal configuration for text-to-image generation. The results showed that while the ImageNet diffusion model produced high-quality images, CLIP provided stronger alignment between the textual prompts and the generated images. These findings highlight the potential of combining the two models to produce high-quality, contextually relevant images from textual descriptions.