The rapid growth of web content makes it challenging to extract and summarize relevant information efficiently. This tutorial shows how to use Firecrawl for web scraping and how to process the extracted data with AI models such as Google Gemini. By integrating these tools in Google Colab, you create an end-to-end workflow that scrapes web pages, retrieves meaningful content, and generates concise summaries using state-of-the-art language models. Whether you want to automate research, extract insights from articles, or build an AI-powered application, this tutorial offers a robust and adaptable solution.
!pip install google-generativeai firecrawl-py
First, install google-generativeai and firecrawl-py, the two libraries this tutorial depends on. google-generativeai provides access to Google's Gemini API for AI-powered text generation, while firecrawl-py enables web scraping by retrieving content from web pages in a structured format.
import os
from getpass import getpass
# Enter your API key (it will be hidden as you type)
os.environ["FIRECRAWL_API_KEY"] = getpass("Enter your Firecrawl API key: ")
Next, securely set the Firecrawl API key as an environment variable in Google Colab. getpass() prompts the user without echoing the key to the screen, which keeps it confidential. Storing the key in os.environ makes it available for authenticating Firecrawl's web scraping calls throughout the session.
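If the notebook cell is re-run, prompting again each time is unnecessary. The sketch below is one possible way to prompt only when the variable is missing. The helper name `ensure_api_key` and its `prompt` parameter are hypothetical and not part of Firecrawl or the original notebook.

```python
import os
from getpass import getpass


def ensure_api_key(env_name, prompt=getpass):
    """Return the key stored in os.environ[env_name], prompting only if it is missing.

    `ensure_api_key` is an illustrative helper, not a Firecrawl or Gemini API.
    `prompt` defaults to getpass so the key is not echoed while typing.
    """
    key = os.environ.get(env_name, "").strip()
    if not key:
        key = prompt(f"Enter your {env_name}: ").strip()
        if not key:
            raise ValueError(f"{env_name} must not be empty")
        os.environ[env_name] = key  # cache it for the rest of the session
    return key
```

Calling `ensure_api_key("FIRECRAWL_API_KEY")` in a cell would then prompt once and reuse the stored value on later runs.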
from firecrawl import FirecrawlApp
firecrawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
target_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
result = firecrawl_app.scrape_url(target_url)
page_content = result.get("markdown", "")
print("Scraped content length:", len(page_content))
Initialize Firecrawl by creating a FirecrawlApp instance with the stored API key. The code then scrapes the specified web page (in this case, Wikipedia's Python programming language page) and extracts its content in markdown format. Finally, it prints the length of the scraped content so you can confirm the scrape succeeded before further processing.
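The shape of the scrape result can vary. As an assumption based on differences between firecrawl-py releases, it may be a plain dict or an object with a `markdown` attribute. The helpers below, `extract_markdown` and `check_scrape`, are hypothetical names for a small defensive sketch that handles both shapes and fails early on an empty scrape. The stand-in values make no network call.

```python
from types import SimpleNamespace


def extract_markdown(result):
    """Return the markdown text from a scrape result, or "" if none is present.

    Assumption: the result is either a dict with a "markdown" key or an
    object exposing a `markdown` attribute, depending on the library version.
    """
    if isinstance(result, dict):
        text = result.get("markdown")
    else:
        text = getattr(result, "markdown", None)
    return text or ""


def check_scrape(text, min_chars=200):
    """Raise early if the scrape looks empty; min_chars is an arbitrary threshold."""
    if len(text) < min_chars:
        raise RuntimeError(f"Scrape returned only {len(text)} characters")
    return text


# Stand-ins for the two result shapes (no request is sent here).
as_dict = {"markdown": "# Title\n\nBody"}
as_object = SimpleNamespace(markdown="# Title\n\nBody")
print(extract_markdown(as_dict) == extract_markdown(as_object))
```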
import google.generativeai as genai
from getpass import getpass
# Securely enter your Gemini API Key
GEMINI_API_KEY = getpass("Enter your Google Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)
Next, initialize the Google Gemini API, using getpass() to capture the API key securely so that it never appears in plain text. The genai.configure(api_key=GEMINI_API_KEY) call sets up the API client for Google's Gemini models, which handle the text generation and summarization tasks. This ensures authentication is in place before any request is sent to the model.
for model in genai.list_models():
    print(model.name)
Iterate over the models available in the Google Gemini API using genai.list_models() and print each model's name. This shows which models are accessible with your API key so you can pick one suited to tasks such as text generation and summarization. If the model you want is not listed, this step helps you debug and choose an alternative.
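Not every listed model supports text generation. The sketch below filters on the `supported_generation_methods` attribute of each listed model and then picks the first preferred name that is available. The function names `usable_models` and `pick_model`, the preference order, and the stand-in objects are illustrative assumptions; in the notebook you would pass `genai.list_models()` instead of the stand-ins.

```python
from types import SimpleNamespace


def usable_models(models, method="generateContent"):
    """Return the names of models that advertise the given generation method."""
    return [
        m.name
        for m in models
        if method in getattr(m, "supported_generation_methods", [])
    ]


def pick_model(names, preferred):
    """Return the first preferred model that is available, else the first usable one."""
    for wanted in preferred:
        for name in names:
            # Listed names usually carry a "models/" prefix.
            if name in (wanted, f"models/{wanted}"):
                return name
    return names[0] if names else None


# Stand-in objects shaped like list_models() entries (hypothetical data).
listed = [
    SimpleNamespace(name="models/embedding-001",
                    supported_generation_methods=["embedContent"]),
    SimpleNamespace(name="models/gemini-1.5-flash",
                    supported_generation_methods=["generateContent"]),
    SimpleNamespace(name="models/gemini-1.5-pro",
                    supported_generation_methods=["generateContent"]),
]
names = usable_models(listed)
chosen = pick_model(names, preferred=["gemini-1.5-pro", "gemini-1.5-flash"])
print(chosen)
```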
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(f"Summarize this:\n\n{page_content[:4000]}")
print("Summary:\n", response.text)
Finally, we initialize the Gemini 1.5 Pro model using genai.GenerativeModel("gemini-1.5-pro") and send a request to summarize the scraped content. The input text is limited to the first 4,000 characters to stay within API constraints, so anything beyond that point is not summarized. The model processes the request and returns a concise summary, giving a structured, AI-generated overview of the extracted web page content.
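Slicing to 4,000 characters discards the rest of the page. One possible alternative, which the original notebook does not use, is to split the text into chunks, summarize each chunk, and then summarize the combined partial summaries. In the sketch below, `chunk_text` and `summarize_long` are hypothetical helpers, and `summarize` is any callable that maps text to a summary, for example a small wrapper around the model's generate_content call.

```python
def chunk_text(text, max_chars=4000):
    """Split text into pieces of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Hard-split any single paragraph that exceeds the limit on its own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text, summarize, max_chars=4000):
    """Summarize each chunk, then summarize the joined partial summaries.

    Note: the joined partial summaries are not re-checked against max_chars.
    """
    partials = [summarize(chunk) for chunk in chunk_text(text, max_chars)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return summarize("\n\n".join(partials))
```

In the notebook, `summarize` could be `lambda t: model.generate_content(f"Summarize this:\n\n{t}").text`, at the cost of one API request per chunk plus one for the final pass.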
In conclusion, by combining Firecrawl and Google Gemini, we created an automated pipeline that scrapes web content and generates meaningful summaries with minimal effort. The tutorial also showed how to list the available models, which allows flexibility based on API availability and quota constraints. Whether you are working on NLP applications, research automation, or content aggregation, this approach supports data extraction and summarization at scale.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easy for a wide audience to understand. The platform has over 2 million monthly views, illustrating its popularity among readers.

