Table of Contents
Generating new variables is a foundational task in data manipulation and analysis using SAS. Whether you are importing raw data or transforming an existing dataset, the ability to define, calculate, and format variables accurately is paramount for successful statistical programming. The process involves leveraging core components of the DATA statement, which serves as the execution environment for data creation and modification in SAS. Understanding how variable attributes—such as name, type (character or numeric), length, and format—are established is essential for producing clean, predictable results.
To define a new variable explicitly, several specialized statements are available. The LENGTH statement is used to control the storage size of the variable, which is particularly important for managing memory efficiency and preventing truncation of character strings. Furthermore, the FORMAT statement ensures that the output presentation of the data is consistent and readable, a crucial step for reporting and visualization. Beyond basic definition, variables often require computed values; this is achieved using the powerful assignment operator in the DATA statement.
Advanced creation methods allow programmers to handle complex data logic efficiently. For instance, the RETAIN statement is invaluable for iterative processing, allowing variables to hold their values across subsequent observations, often used for cumulative calculations or sequence numbering. The ARRAY statement simplifies the manipulation of numerous related variables simultaneously, treating them as a single group. Finally, for time-series or lagged analyses, the LAG function enables the retention and retrieval of values from previous observations within the dataset, providing a critical tool for sequential data transformation.
Core Methods for Variable Initialization in SAS
When working within the SAS environment, there are two primary methodologies employed to initialize or populate new variables within a dataset. The choice between these methods depends fundamentally on whether the source data is external (requiring direct input) or already exists within a SAS library (requiring transformation).
These two fundamental approaches govern nearly all variable creation tasks in the DATA statement. Understanding the difference between creating variables from scratch (data generation) versus deriving them from existing data (data transformation) is key to writing efficient and error-free SAS code.
Method 1: Create Variables from Scratch
This method utilizes the DATA statement paired with the INPUT and DATALINES statements. This combination is essential when you need to manually enter data or read raw data directly into the SAS session. The variables are defined implicitly by their appearance in the INPUT statement, and their values are explicitly listed under DATALINES, concluding with a semicolon to signal the end of the input block. This technique is often used for creating small sample datasets or for reading simple fixed-format data.
data original_data;
input var1 $ var2 var3;
datalines;
A 12 6
B 19 5
C 23 4
D 40 4
;
run;
Method 2: Create Variables from Existing Variables
The second, and perhaps more frequent, method involves data transformation. Here, the DATA statement is coupled with the SET statement. The SET statement reads observations from an existing SAS dataset, making all its variables available for processing in the current data step. New variables are then created through assignment statements, where they are calculated as functions of the existing variables. This method forms the backbone of data cleaning, calculation, and preparation steps in any large-scale SAS project.
data new_data;
set original_data;
new_var4 = var2 / 5;
new_var5 = (var2 + var3) * 2;
run;The following detailed examples illustrate how to implement these two critical methods effectively in a practical SAS environment.
Method 1: Generating Variables via Direct Data Input (INPUT/DATALINES)
When generating a dataset from raw input, the structure of the variable definitions must closely align with the physical arrangement of the data provided in the DATALINES section. This approach is most effective when data is simple, delimited by spaces, and relatively small in volume. The order in which variables are listed in the INPUT statement determines the order in which SAS reads the corresponding values from the data lines.
It is important to understand how SAS automatically handles variable typing during the input phase. By default, if a variable name appears in the INPUT statement without modification, SAS assumes it is a numeric variable. Numeric variables can only store numerical values and are used for mathematical operations. If the data to be read consists of letters, symbols, or mixed alphanumeric strings, the variable must be explicitly defined as a character variable.
To define a variable as a character type, a dollar sign ($) must immediately follow the variable name in the INPUT statement. This simple notation instructs the SAS compiler to allocate memory for a character string, ensuring that non-numeric data, such as team names or identification codes, are stored correctly. If a character variable is not explicitly defined with the dollar sign, SAS will attempt to read the data as numeric, resulting in errors or missing values if non-numeric characters are encountered.
Detailed Example 1: Creating a New Dataset with Character and Numeric Variables
The following code shows how to create a dataset with three variables: team (character), points (numeric), and reboundss (numeric). This example utilizes the INPUT and DATALINES statements to define and populate the dataset simultaneously.
/*create dataset*/
data original_data;
input team $ points rebounds;
datalines;
Warriors 25 8
Wizards 18 12
Rockets 22 6
Celtics 24 11
Thunder 27 14
Spurs 33 19
Nets 31 20
;
run;
/*view dataset*/
proc print data=original_data;
The code confirms that variable names are listed directly after the input statement. Note that the team variable is immediately followed by a dollar sign ($) to designate it as a character variable. The values for all variables are created from scratch following the datalines statement. This is the simplest method for manually bootstrapping a dataset for immediate use or testing within SAS.
Note: SAS assumes each new variable listed in the INPUT statement is numeric by default. To create a character variable, you must explicitly type a dollar sign “$” after the variable name, as demonstrated for the team variable in this example.
Method 2: Deriving Variables Through Data Transformation (SET Statement)
The most common scenario in data analysis is not creating data from scratch but performing calculations and transformations on existing datasets. This is accomplished using the SET statement within a new DATA statement. The SET statement tells SAS to read the input dataset observation by observation, making all its variables available for mathematical or logical manipulation.
Creating a new variable derived from existing ones is straightforward, requiring only an assignment statement. The general syntax is New_Variable = Expression;, where the expression can involve arithmetic operations, functions (like SUM, MEAN, LOG), or conditional logic (using IF/THEN statements). SAS handles the creation of the new variable automatically; if the result of the expression is numerical, the new variable is numeric; if the result is a character string, the new variable is character.
This method is highly flexible and efficient for tasks such as normalization, standardization, calculating ratios, or aggregating scores. It allows the analyst to maintain the integrity of the original dataset while adding calculated fields necessary for subsequent analysis steps.
Detailed Example 2: Performing Calculations on Existing Data
The following code shows how to use the set function to create a new dataset whose variables are calculated from existing variables in the original_data dataset. We will derive two new variables: half_points and avg_pts_rebs.
/*create new dataset*/
data new_data;
set original_data;
half_points = points / 2;
avg_pts_rebs = (points + rebounds) / 2;
run;
/*view new dataset*/
proc print data=new_data;The SET original_data; command ensures that all variables from the source dataset are available in new_data. The subsequent assignment statements calculate the values for the two new numeric variables based on the values of points and reboundss.

Controlling Variable Attributes: LENGTH and FORMAT Statements
While SAS often handles variable attributes implicitly, explicitly controlling them ensures data integrity and storage efficiency. Two crucial statements are used for this purpose: LENGTH and FORMAT.
The LENGTH statement defines the number of bytes used to store a variable. For numeric variables, the default length is usually 8 bytes, which is standard for double-precision floating-point numbers. For character variables, however, the length must be specified based on the maximum expected string size. Defining a character variable length that is too short will result in data truncation, leading to data loss. Conversely, defining an excessively long length wastes memory. The LENGTH statement must appear before any assignment or reference to the variable in the DATA statement.
The FORMAT statement controls how the variable values are displayed in output (e.g., using PROC PRINT or PROC REPORT). It does not change the actual stored value of the variable, only its presentation. Standard formats include currency formats (e.g., DOLLARw.d), date formats (e.g., DATE9.), or custom formats defined via PROC FORMAT. Applying a format ensures readability and consistency across all outputs generated from the dataset.
Advanced Variable Creation Techniques: RETAIN, ARRAY, and LAG
For more complex data manipulation tasks, SAS offers specialized statements that manage variable initialization and flow control across observations.
The RETAIN Statement: By default, SAS initializes variables created in the DATA statement to missing values at the beginning of each iteration (i.e., for each new observation). The RETAIN statement overrides this behavior, instructing SAS to keep the variable’s value from the previous observation. This is fundamental for calculating running totals, sequence numbers, or flags that depend on prior row values.
The ARRAY Statement: When dealing with dozens or hundreds of similarly structured variables (e.g.,
score_1throughscore_100), manipulating them individually is cumbersome. The ARRAY statement allows the programmer to define a temporary array that references these variables collectively. This enables the use of iterativeDOloops to perform identical calculations or transformations across all variables in the array efficiently, greatly reducing code length and complexity.The LAG Function: Crucial for time-series analysis or sequential data processing, the LAG function returns the value of a variable from a previous observation. For instance,
LAG(variable, 1)retrieves the value of the variable from the immediately preceding row. This function is essential for calculating differences between consecutive observations, analyzing growth rates, or comparing current values against historical data points.
Best Practices for Defining New SAS Variables
Adhering to best practices ensures maintainable, readable, and efficient SAS code:
Use Descriptive Names: Variable names should clearly indicate the content and purpose of the data (e.g., total_revenue instead of TR). This significantly improves code readability for future users or when revisiting the program.
Explicitly Define Lengths: Always use the LENGTH statement for character variables to prevent truncation errors, especially when dealing with data imported from external sources like databases or spreadsheets.
Document Intent: Use SAS comments (
/* comment */) liberally to explain complex logic, especially when using advanced features like the RETAIN statement or array processing. This aids in debugging and knowledge transfer.
Mastering these variable creation techniques is foundational to effective data processing in SAS. By correctly defining variables, managing their attributes, and utilizing the assignment operator, users can confidently prepare their data for sophisticated statistical analysis.
The following tutorials explain how to perform other common tasks in SAS:
Cite this article
stats writer (2025). How to Easily Create New Variables in SAS: A Step-by-Step Guide. PSYCHOLOGICAL SCALES. Retrieved from https://scales.arabpsychology.com/stats/how-to-create-new-variables-in-sas-with-examples/
stats writer. "How to Easily Create New Variables in SAS: A Step-by-Step Guide." PSYCHOLOGICAL SCALES, 1 Dec. 2025, https://scales.arabpsychology.com/stats/how-to-create-new-variables-in-sas-with-examples/.
stats writer. "How to Easily Create New Variables in SAS: A Step-by-Step Guide." PSYCHOLOGICAL SCALES, 2025. https://scales.arabpsychology.com/stats/how-to-create-new-variables-in-sas-with-examples/.
stats writer (2025) 'How to Easily Create New Variables in SAS: A Step-by-Step Guide', PSYCHOLOGICAL SCALES. Available at: https://scales.arabpsychology.com/stats/how-to-create-new-variables-in-sas-with-examples/.
[1] stats writer, "How to Easily Create New Variables in SAS: A Step-by-Step Guide," PSYCHOLOGICAL SCALES, vol. X, no. Y, ص Z-Z, December, 2025.
stats writer. How to Easily Create New Variables in SAS: A Step-by-Step Guide. PSYCHOLOGICAL SCALES. 2025;vol(issue):pages.