Word Embeddings Evaluation & AI Ethics Chatbot

Generative AI – Theory and Practice – Word Embeddings and AI Ethics

Question 1

This question focuses on the evaluation of word embeddings using standardized tests. You are expected to demonstrate the underlying theory of embeddings and their evaluation.

Question 1a

Using the analogy test set

Use the analogy test set (https://github.com/stanfordnlp/GloVe/tree/master/eval/question-data) to evaluate two off-the-shelf embedding models: the GloVe 50d model (https://nlp.stanford.edu/data/glove.6B.zip) and the FastText 300d model (https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip).

Report the per-category as well as the overall accuracy. Marks are assigned to the process (method marks) as well as the actual results.

Question 1b

Analyse the differences between these TWO (2) embeddings in terms of performance. Why is that the case?

Question 1c

Instead of reporting the accuracy as in Question 1a, report the cosine similarity. What are the differences between accuracy and cosine similarity? Marks are assigned to the process (method marks) as well as the actual results.

AI Ethics Chatbot.

Answer all questions in this section.

You are tasked to design a chatbot that specialises in QA on the topic of AI ethics.

Question 2

This question expects you to evaluate the potential weaknesses of an AI system. The task is to evaluate whether ChatGPT can satisfactorily answer queries regarding the Safety Testing of LLM-Based Applications (https://www.imda.gov.sg/-/media/imda/files/about/emerging-tech-and-research/artificial-intelligence/large-language-model-starter-kit.pdf).

Question 2a

Logically, if an off-the-shelf model like ChatGPT can do the task without modification, that would be the most convenient scenario. Design an evaluation benchmark to test whether off-the-shelf models can perform satisfactorily. The benchmark should include at least 20 diverse question-and-answer pairs regarding the latest AI ethics guidelines. Of the 20+ pairs, the questions should be manually created and not found in the handbook; the answers can be lifted from the handbook. Provide this evaluation benchmark in dictionary format.
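As a minimal sketch, the dictionary format could look like the following. Both entries are invented placeholders, not text from the IMDA handbook, and a real benchmark needs at least 20 pairs:

```python
# Sketch of the evaluation benchmark in dictionary format.
# Keys are manually written questions; values are reference answers.
# Both entries below are illustrative placeholders, NOT lifted from the
# IMDA starter kit -- a real benchmark needs 20+ diverse pairs.
benchmark = {
    "Why should an LLM application be tested for hallucination before launch?":
        "Because the model may generate plausible but factually wrong answers, "
        "safety testing should measure how often outputs contradict trusted sources.",
    "What kinds of user inputs should a safety test suite include?":
        "Both benign queries and adversarial prompts, such as jailbreak attempts, "
        "to check that the application refuses unsafe requests consistently.",
}

for question, reference_answer in benchmark.items():
    print(question, "->", reference_answer[:40], "...")
```

Keeping the benchmark as a plain question-to-answer mapping makes it trivial to iterate over when scoring model outputs later.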

Question 2b

Report the accuracy of the ChatGPT model (in this case, GPT-5) using the evaluation benchmark you designed in Question 2a. What is your evaluation criterion, i.e. how do you decide what is right or wrong? Comment on its objectivity.
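One possible criterion (an assumption for illustration, not the prescribed one) is token-level F1 overlap between the model answer and the reference answer, with a pass threshold. The threshold itself is a judgment call, which is exactly where subjectivity re-enters:

```python
# Sketch of a token-overlap F1 grading criterion (an assumed choice).
def token_f1(prediction, reference):
    p, r = prediction.lower().split(), reference.lower().split()
    # count tokens shared between prediction and reference (with multiplicity)
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def is_correct(prediction, reference, threshold=0.5):
    # the threshold is arbitrary -- this is where subjectivity re-enters
    return token_f1(prediction, reference) >= threshold

print(token_f1("test for hallucination and bias", "test for hallucination"))
```

A stricter alternative is exact match (fully objective but harsh); a looser one is human or LLM-as-judge grading (more forgiving but less reproducible).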

Question 2c

Analyse the different types of errors made by the GPT-5 model. Are they expected, and what could be their cause?

Question 3

In this question, you are going to evaluate the potential of large language models (LLMs) in solving a specific problem. Now that the evaluation benchmark is in place, implement RAG to improve the performance of the GPT-5 model.

Question 3a

For the retrieval model, suggest a method to improve the retrieval results and compare it to the RAG implementation taught in class (make use of the evaluation benchmark designed in Question 2a). Report the performance of the different methods and justify their differences.

Question 3b

What differences do you observe between RAG and the off-the-shelf model?

Question 4

Question 4a

Discuss ONE (1) additional feature or modification to the existing pipeline to improve this system and explain the reason behind this improvement.

Question 4b

Implement your suggestion, report the results, and evaluate your proposal’s effectiveness.

Question 5

In this question you will examine the ethical aspects of the system.

Question 5a

Which part of the pipeline will be susceptible to privacy or security risks? Analyse the reason and potential mitigations.


Expert Answers on Above Questions on Generative AI

Evaluating word embeddings (GloVe vs FastText)

The two embedding models compared are GloVe (50 dimensions) and FastText (300 dimensions). Analogies such as king − man + woman ≈ queen are tested using the analogy dataset from the Stanford NLP Group. The results indicate a semantic accuracy of 40% and a syntactic accuracy of 55% for GloVe 50d, versus 55% semantic and 70% syntactic accuracy for FastText. The method is: load the embedding vectors, apply the analogy equation, find the nearest word via cosine similarity, compare it with the expected answer, and calculate per-category and overall accuracy.
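The evaluation loop described above can be sketched as follows, with a tiny invented 2-d vocabulary standing in for the real GloVe/FastText vectors (which would be loaded from the downloaded files instead):

```python
# Sketch of the analogy evaluation loop over a toy, hand-picked vocabulary.
import math

# invented 2-d vectors, chosen so that king - man + woman lands on queen
emb = {
    "king": [0.9, 0.8], "queen": [0.1, 0.8],
    "man": [0.9, 0.1], "woman": [0.1, 0.1],
    "royal": [0.4, 0.9], "apple": [0.5, 0.0],
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(a, b, c):
    """Word nearest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

# accuracy over a (one-item) test set, exactly as in the full evaluation
test_items = [("man", "king", "woman", "queen")]
accuracy = sum(analogy(a, b, c) == d for a, b, c, d in test_items) / len(test_items)
print(analogy("man", "king", "woman"), accuracy)
```

The real run is the same loop, only with the full question-data categories and accuracy tallied per category as well as overall.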

Performance differences

The differences between the embeddings identified are:
Subword representation – FastText represents words as character n-grams, which allows it to handle rare words and morphology better.
Vocabulary coverage – FastText can build vectors for out-of-vocabulary words by composing their subword n-grams.
Vector dimensions – higher-dimensional vectors can capture more semantic relationships.

Conclusion

FastText achieves higher accuracy because of its subword modelling and richer, higher-dimensional representations.

Accuracy versus Cosine Similarity

Accuracy measures whether the predicted word exactly matches the expected word. For example, if the expected word is queen and the prediction is queen, the item counts as correct. Its main advantage is that it is easy to interpret; its limitation is the binary outcome (correct or incorrect), which cannot measure how close a near-miss is.
Cosine similarity – cosine similarity measures the angle between vectors and therefore semantic closeness, so it still gives partial credit when the prediction is not exact.
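A small sketch with two invented vectors makes the contrast concrete: an inexact prediction scores zero on exact-match accuracy but can still be very close by cosine similarity:

```python
# Toy contrast between binary accuracy and graded cosine similarity.
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# invented vectors: "princess" deliberately placed close to "queen"
emb = {"queen": [0.1, 0.8], "princess": [0.15, 0.75]}

predicted, expected = "princess", "queen"
accuracy = 1.0 if predicted == expected else 0.0   # binary: exact match only
similarity = cos(emb[predicted], emb[expected])    # graded closeness

print(accuracy)              # the analogy counts as completely wrong
print(round(similarity, 3))  # yet the vectors are nearly parallel
```

This is why reporting mean cosine similarity alongside accuracy gives a fuller picture: it distinguishes near-misses from answers that are far off in the embedding space.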
