Appendix B — LLMs in programming
Large Language Models (LLMs) like ChatGPT, Claude, and GitHub Copilot are powerful tools when performing programming tasks. LLMs can be used for tasks like:
- Understanding code: LLMs can be used to explain what the code does and how.
- Creating scripts: LLMs can create a starting point for when you need to accomplish a certain task via code.
- Debugging code: LLMs are excellent for debugging code.
LLMs can make errors and sometimes make stuff up (hallucinations). Therefore, programming is still an important skill for statistics such that one can verify the code actually does what is expected.
B.1 Best practices and limitations
B.1.1 Do’s
- Always verify LLM-generated code.
- Specify what you want to achieve and the program you are using.
- Request explanations of code functions.
- Specify your exact package versions when encountering errors.
B.1.2 Don’ts
- Don’t run code without understanding it.
- Don’t assume the first solution is optimal.
- Don’t forget to validate statistical calculations.
- Don’t share sensitive data in prompts.
B.1.3 Common Issues
- Outdated syntax: LLMs may suggest old library versions.
- Statistical validity: Code may work but use inappropriate methods.
B.2 How to prompt
Prompting is the process of providing input text (called a “prompt”) to an LLM to get a desired response. When you type a question or instruction to an LLM, you are creating a prompt. The LLM processes this text and generates a response based on patterns it learned during training. The quality and specificity of your prompt directly influences the usefulness of the response.
Below are some tips and tricks on how to write effective prompts.
B.2.1 Generating code with LLMs
Be specific and contextual
Always mention important specifics such as:
- The programming language and packages you are using.
- How your data is structured.
- The specific function/analysis you need.
Poor prompt
Write some code to so I can analyze my data
Better prompt
I have a CSV file with columns ‘age’, ‘income’, and ‘education_level’. I need R code using tidyverse to calculate mean income by education level and create a bar chart using ggplot2.
Start simple and build on complexity
- Start with a simple code written by the LLM or you.
- This tends to get rid of most hallucinations.
First prompt
I have a dataframe in R with the columns ‘miles-per-gallon’ and ‘price’. Create a simple scatterplot using ggplot2 with the ‘miles-per-gallon’ on the x-axis and ‘price’ on the y-axis.
Second prompt
Add a regression line, confidence intervals, and custom labels to the scatterplot.