Database Ingestion From Spreadsheets: A Developer's Guide
Hey guys! In this article, we will walk through the process of ingesting data from spreadsheets into a database using a practical approach. We'll cover everything from setting up your environment to handling various data types and ensuring data integrity. This guide is designed to help you streamline your data management processes and build robust applications.
Summary
This sprint focuses on implementing and extending spreadsheet upload functionality to send parsed data to our backend API endpoint, which will then insert it into the database. The main goal here is data ingestion, not displaying or styling the data. We're diving deep into the backend, ensuring that the data makes its way into the database smoothly and efficiently.
Goals of this sprint:
- Parse and validate spreadsheet data (from the sample Excel sheet).
- Make a POST request to an API endpoint to insert data into the database.
- Handle response states.
- Use Drizzle ORM within that API route to insert the data into the Neon database.
⚙️ Setup
Before we jump into the code, let's get our development environment ready. Follow these steps to ensure everything is set up correctly:
- Pull the latest changes from `main`:

  ```bash
  git checkout main
  git pull
  ```

- Create a new branch for this sprint:

  ```bash
  git checkout -b sprint2-data-ingestion
  ```

- Install dependencies:

  ```bash
  npm install # or npm i
  ```

- Run the dev server and verify the site builds:

  ```bash
  npm run dev
  ```

- Visit http://localhost:3000 to confirm everything runs.

- Get the `.env` file from the Google Drive folder “.env - ingestion.”
Why Setup Matters
Setting up your environment correctly is crucial for a smooth development process. Imagine building a house on a shaky foundation – it's bound to collapse! Similarly, if your environment isn't properly configured, you'll likely encounter unexpected errors and roadblocks. By following these steps, you ensure that you're working with the latest code, have all the necessary dependencies, and can run the application without any issues. This not only saves time but also reduces frustration, allowing you to focus on the actual implementation.
Moreover, using Git for version control is essential. Pulling the latest changes from the main branch ensures you're working with the most up-to-date codebase, preventing conflicts and ensuring compatibility with other team members' work. Creating a new branch for your sprint keeps your changes isolated, making it easier to manage and review the code before merging it back into the main branch. Trust me, guys, these practices are lifesavers in collaborative projects!
🧠 Implementation
Step 1: Create an Upload Component
The first step in our journey is to create a component that allows users to upload spreadsheet files. This component will be responsible for parsing the file, extracting relevant data, and converting it into a format suitable for our database.
The component should:
- Accept an Excel file upload.
- Parse the file.
- Extract only the columns highlighted in green from the sample spreadsheet.
- Convert any necessary data types to match the Neon database schema. See `src/lib/schema.ts` for the correct data type for each column, or check them in Neon as well.
Important Note: You don't need to display the data in a table for this sprint. Just confirm the parsing works in the console or as plain JSON output for testing and then populate the database.
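The extraction step above can be sketched in TypeScript. This assumes the sheet has already been read into an array of row objects (for example via `xlsx`'s `sheet_to_json`); the column names here are hypothetical stand-ins for the green-highlighted columns in the sample spreadsheet:

```typescript
// Hypothetical shape of a raw parsed row. The real green-column names
// come from the sample spreadsheet, not from this sketch.
type RawRow = Record<string, unknown>;

// The columns we want to keep (assumed names, for illustration only).
const WANTED_COLUMNS = ["School Name", "Town", "Teacher First Name"] as const;

// Keep only the wanted columns from each parsed row, dropping everything else.
function extractColumns(rows: RawRow[]): Record<string, unknown>[] {
  return rows.map((row) => {
    const picked: Record<string, unknown> = {};
    for (const col of WANTED_COLUMNS) {
      picked[col] = row[col] ?? null; // missing cells become null
    }
    return picked;
  });
}

// For this sprint, confirming the output in the console is enough:
// console.log(JSON.stringify(extractColumns(parsedRows), null, 2));
```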
Emphasis on Cleanliness: Clean data matters more than how much data you store! That also means documenting how you clean it is just as important.
Diving Deep into the Upload Component
Creating a robust upload component involves several key considerations. First, you'll need to choose a library for handling Excel file parsing. There are many options available, such as xlsx or exceljs. Pick one that suits your needs and integrate it into your component. Next, you'll need to define the structure of the data you expect from the spreadsheet. This involves identifying the columns you need and their corresponding data types.
When parsing the file, make sure to handle potential errors gracefully. For example, the file might be corrupted, or it might not contain the expected columns. Displaying informative error messages to the user can greatly improve the user experience. Also, remember to validate the data as you parse it. This ensures that you're only working with clean, accurate data. Data validation can include checking for missing values, ensuring data types are correct, and verifying that data falls within acceptable ranges.
Data type conversion is another crucial aspect. Spreadsheets often store data as strings, even if they represent numbers or dates. You'll need to convert these values into the appropriate data types before sending them to the database. For example, you might need to convert a string representation of a date into a JavaScript Date object or a string representation of a number into a number. Using the correct data types ensures that your data is stored accurately and can be queried efficiently.
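As a minimal sketch of that conversion step (the field names and target types here are assumptions; the real ones live in `src/lib/schema.ts`):

```typescript
// Convert string cell values into the types the database schema expects.
// The fields below are illustrative, not the actual schema.
interface ProjectRow {
  entryId: number;
  title: string;
  isGroup: boolean;
  submittedAt: Date;
}

function convertRow(raw: Record<string, string>): ProjectRow {
  const entryId = Number(raw.entryId);
  if (Number.isNaN(entryId)) {
    throw new Error(`Invalid entry ID: ${raw.entryId}`);
  }
  const submittedAt = new Date(raw.submittedAt);
  if (Number.isNaN(submittedAt.getTime())) {
    throw new Error(`Invalid date: ${raw.submittedAt}`);
  }
  return {
    entryId,
    title: raw.title.trim(),
    // Spreadsheets often encode booleans as "Yes"/"No" or "TRUE"/"FALSE".
    isGroup: ["yes", "true", "1"].includes(raw.isGroup.toLowerCase()),
    submittedAt,
  };
}
```

Throwing on bad cells (rather than silently coercing) makes validation failures visible early, which fits the "cleanliness over quantity" rule above.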
Finally, document your code thoroughly. Explain the purpose of each function, the expected inputs and outputs, and any assumptions you've made. Good documentation makes it easier for others (and your future self) to understand and maintain the code. Remember, clean code with good documentation is always better than clever code that no one can understand!
Step 2: Prepare & Send the POST Request, Populate Database
Now that we can successfully parse and validate the spreadsheet data, the next step is to send it to our backend API endpoint and populate the database. This involves preparing the data, making a POST request, and handling the response.
Notes about the spreadsheet:
- Each row represents a student. For some tables (like the students table), this is a 1:1 addition to the table and is pretty easy. For others, not so much. More on this below.
- The spreadsheet doesn’t have a year. This is something that you will have to track with state on the upload page. See Figma for details.
General notes about populating database:
- Please ensure that when you populate a table with an entry, you check that it does not already exist in the database. It will save a ton of time down the line. More info on how to identify a duplicate entry below. Some tables will not need this, but the ones that do will be explicitly mentioned below.
- Each insertion into the database should return the ID of the row that was just entered. These IDs will be used as foreign keys to connect tables. Not every table's returned ID will end up being used, but as a general rule of thumb, it's a good idea to return them all.
- The foreign keys that connect the tables will force you to insert data into some tables before others. These insertions should return IDs so they can be used when inserting dependent rows. You should insert into tables in the order they are specified below.
Approach:
- After converting the data to the correct types, send it in the body of a POST request to the endpoint `/api/import`.
- Once the data is retrieved from the body on the backend, you should iterate through each row. For each row, you will extract certain data depending on the table you are inserting into, as follows:
- schools table:
- This should be the first table that you insert into.
- You will only have to extract the school name and town as of now.
- Make sure to return the ID.
- Pay extra attention to check if a school already exists, since schools often persist between years. If it already exists, you just need to add an entry to the yearly school participation data, not this table. To check whether a school exists, use its name to check for equality.
- teachers table:
- You will only have to extract the teacher's first name, last name, and email from the row of the data.
- Make sure to return the ID.
- Pay extra attention to check if a teacher already exists, since they often persist between years. If they already exist, you just need to add an entry to the yearly teacher participation data, not this table. To check whether a teacher exists, use their first name, last name, and email to check for equality.
- projects table:
- You will have to extract the entry ID/project ID, project title, division, category, year, and group (as a boolean).
- You will have to use the school and teacher ID from the previous two tables.
- Projects should be unique to a year but you should definitely still check if they exist in case a duplicate spreadsheet is uploaded. Check for equality using the entry ID.
- students table:
- You won’t have to get any student data from the spreadsheet but will have to put the project and school ID for a student.
- Check for student equality using project ID and school ID.
- yearly-teacher-participation table:
- The only data you will need for this table is the year. You will also need the teacher and school IDs.
- Check for equality with other entries using the teacher ID, school ID, and year.
- yearly-school-participation table:
- The only data you will need for this table is the year. You will also need the school ID.
- Check for equality with other entries using the school ID and year.
- Here is the MHD Database schema to look at the relations between tables: MHD Database.pdf.
- Make sure to handle all possible outcomes:
- Success: show a confirmation message
- Error: display an appropriate error message to the user
- Optionally, you can add a loading animation to give the user feedback that the data is being processed
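A minimal sketch of mapping those outcomes to user-facing messages on the upload page (the message strings and status buckets are assumptions, not the Figma copy):

```typescript
// Translate the /api/import response into a status message for the UI.
type UploadStatus = { ok: boolean; message: string };

function statusFromResponse(status: number, errorDetail?: string): UploadStatus {
  if (status >= 200 && status < 300) {
    return { ok: true, message: "Upload complete. Data saved to the database." };
  }
  if (status >= 500) {
    return { ok: false, message: "Server error. Please try again later." };
  }
  // 4xx: the backend rejected the data; surface its detail if provided.
  return { ok: false, message: `Upload failed: ${errorDetail ?? "invalid data"}` };
}
```

Keeping this mapping in one small function makes it easy to show a spinner while the request is in flight and swap in the final message when it resolves.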
Crafting the POST Request and Database Population
Creating a POST request and populating the database requires careful attention to detail. First, structure the data in a format that the backend API expects, typically JSON. Use a library like axios or the built-in fetch API to send the POST request to the /api/import endpoint. Make sure to set the Content-Type header to application/json.
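Concretely, the request might be built like this (the payload shape, including the `year` tracked in page state, is an assumption):

```typescript
// Build the fetch options for POSTing parsed rows to /api/import.
// The { year, rows } payload shape is hypothetical; match whatever the
// backend route actually reads from the body.
function buildImportRequest(rows: object[], year: number) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ year, rows }),
  };
}

// Usage in the upload handler:
// const res = await fetch("/api/import", buildImportRequest(rows, 2025));
```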
On the backend, use Drizzle ORM to interact with the Neon database. Drizzle ORM provides a type-safe way to query and manipulate data in the database. Define your database schema using Drizzle ORM's schema definition language, and then use the generated types to ensure that your queries are correct. When inserting data, always check for duplicates to avoid creating redundant entries. This can be done by querying the database for existing entries that match the data you're about to insert.
Data Validation on the Backend:
While we've already validated the data on the frontend, it's crucial to perform additional validation on the backend. This ensures that the data is consistent and accurate, even if the frontend validation is bypassed. Backend validation can include checking data types, verifying that required fields are present, and ensuring that data falls within acceptable ranges. If validation fails, return an appropriate error message to the user.
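A sketch of that backend check, collecting every problem for one row so the error message can name them all (the field names, required set, and ranges here are hypothetical):

```typescript
// Validate one parsed row on the server before inserting it.
// Returns a list of problems; an empty list means the row is acceptable.
function validateRow(row: Record<string, unknown>): string[] {
  const errors: string[] = [];
  if (typeof row.schoolName !== "string" || row.schoolName.trim() === "") {
    errors.push("schoolName is required");
  }
  if (typeof row.year !== "number" || !Number.isInteger(row.year)) {
    errors.push("year must be an integer");
  } else if (row.year < 1990 || row.year > 2100) {
    errors.push("year is out of range");
  }
  if (typeof row.teacherEmail !== "string" || !row.teacherEmail.includes("@")) {
    errors.push("teacherEmail must be an email address");
  }
  return errors;
}
```

If any row comes back with errors, the route can respond with a 400 and include the messages in the body so the frontend has something useful to display.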
Handling Different Outcomes:
Handling different outcomes is essential for providing a good user experience. If the POST request is successful, display a confirmation message to the user. If an error occurs, display an informative error message that helps the user understand what went wrong and how to fix it. For example, if the database connection fails, display a message indicating that the server is temporarily unavailable. You can also add a loading animation to provide feedback that the data is being processed.
By following these steps, you can create a robust data ingestion process that ensures data integrity and provides a good user experience.
Acceptance Criteria
To ensure that we're on the right track, let's define some acceptance criteria. These are the conditions that must be met for the sprint to be considered complete:
- [ ] A file can be uploaded and its data reaches the database
- [ ] Each field is validated to be the correct type for its column
- [ ] The parsed data is sent to the database and a corresponding status message is displayed afterward
- [ ] UI matches Figma
- [ ] Code runs locally without errors and builds successfully
- [ ] Merge via PR
Questions
If you get stuck or have questions:
- DM Dan or Shayne on Slack
- Come to office hours or ask during a hack night
- Ask questions during team meetings – we’re here to help!
Resources
Here are some resources that you might find helpful:
- Figma: https://www.figma.com/design/hxC6tNqUqfygC7GOz8fcPG/JumboCode-MHD-Website
- React `useState`: https://react.dev/reference/react/useState
- React `useEffect`: https://react.dev/reference/react/useEffect
- Next.js Routing: https://nextjs.org/docs/app/building-your-application/routing
- Tailwind CSS Docs: https://tailwindcss.com/docs
- shadcn: https://ui.shadcn.com/docs/components/data-table
- npm: https://www.npmjs.com/
- Sample data: in Google Drive > Resources
- Drizzle ORM: https://orm.drizzle.team/