---
title: GLM3 API Usage Guide
createTime: 2026/04/02 20:26:36
---

## FastAPI docs

The API [documentation page](http://yuany3721.site:6017/docs) is generated automatically by FastAPI from the comments in the [API server code](https://github.com/THUDM/ChatGLM3/blob/main/openai_api_demo/api_server.py), which follows the [OpenAI API](https://platform.openai.com/docs/api-reference/chat) format.

API endpoints (minimal client sketches for each appear under the API call example below):

- "/health": reports the API's running status; a 200 response means the server is running normally
- "/v1/chat/completions": handles chat completion requests, with optional streaming output
- "/v1/embeddings": handles embedding requests for a list of text inputs

Further notes from the code:

> This script implements an API for the ChatGLM3-6B model,
> formatted similarly to OpenAI's API (https://platform.openai.com/docs/api-reference/chat).
> It's designed to be run as a web server using FastAPI and uvicorn,
> making the ChatGLM3-6B model accessible through the OpenAI Client.
>
> Key Components and Features:
>
> - Model and Tokenizer Setup: Configures the model and tokenizer paths and loads them.
> - FastAPI Configuration: Sets up a FastAPI application with CORS middleware for handling cross-origin requests.
> - API Endpoints:
>   - "/v1/models": Lists the available models, specifically ChatGLM3-6B.
>   - "/v1/chat/completions": Processes chat completion requests with options for streaming and regular responses.
>   - "/v1/embeddings": Processes embedding requests for a list of text inputs.
> - Token Limit Caution: In the OpenAI API, 'max_tokens' is equivalent to HuggingFace's 'max_new_tokens', not 'max_length'.
>   For instance, setting 'max_tokens' to 8192 for a 6B model would result in an error due to the model's inability to output
>   that many tokens after accounting for the history and prompt tokens.
> - Stream Handling and Custom Functions: Manages streaming responses and custom function calls within chat responses.
> - Pydantic Models: Defines structured models for requests and responses, enhancing API documentation and type safety.
> - Main Execution: Initializes the model and tokenizer, and starts the FastAPI app on the designated host and port.

## API Call Example

See the [api-demo](https://github.com/THUDM/ChatGLM3/blob/main/openai_api_demo/openai_api_request.py) script; change `base_url` on line 15 to the target URL.
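
Before issuing real requests, `/health` can be used as a quick liveness check. A minimal sketch, assuming the server from `api_server.py` is listening on `http://localhost:8000` (the host and port here are placeholders; substitute your deployment's URL):

```python
import requests

# Placeholder deployment URL; replace with your own server's address.
BASE = "http://localhost:8000"

# GET /health returns HTTP 200 when the API server is running normally.
r = requests.get(f"{BASE}/health")
print(r.status_code)  # 200 -> healthy
```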
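
Because the server mimics the OpenAI API, the official `openai` Python client (v1+) can be pointed at `/v1/chat/completions` directly. A sketch under the same placeholder URL, assuming the deployed model is named `chatglm3-6b` (check `/v1/models` for the actual name) and that the demo server performs no API-key check, so a dummy key suffices:

```python
from openai import OpenAI

# Placeholder base_url, dummy api_key, and assumed model name; adjust all
# three for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Regular (non-streaming) chat completion. Note: max_tokens maps to
# HuggingFace's max_new_tokens, so keep it well below the context limit.
response = client.chat.completions.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming variant: set stream=True and print chunks as they arrive.
stream = client.chat.completions.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```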
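
Similarly, `/v1/embeddings` accepts a list of texts in a single request; again a sketch under the same placeholder URL and model-name assumptions:

```python
from openai import OpenAI

# Same placeholder URL and dummy key as above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# One request embeds every string in the input list.
resp = client.embeddings.create(
    model="chatglm3-6b",  # assumed name; check /v1/models on your server
    input=["first text", "second text"],
)
for item in resp.data:
    print(item.index, len(item.embedding))  # item index and vector dimension
```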